Research  Open  Published:
Fast halfsibling population reconstruction: theory and algorithms
Algorithms for Molecular Biologyvolume 8, Article number: 20 (2013)
Abstract
Background
Kinship inference is the task of identifying genealogically relatedindividuals. Kinship information is important for determining matingstructures, notably in endangered populations. Although many solutions existfor reconstructing full sibling relationships, few exist forhalfsiblings.
Results
We consider the problem of determining whether a proposed halfsiblingpopulation reconstruction is valid under Mendelian inheritance assumptions.We show that this problem is NPcomplete and provide a 0/1 integerprogram that identifies the minimum number of individuals that must beremoved from a population in order for the reconstruction to become valid.We also present SibJoin, a heuristicbased clustering approach based onMendelian genetics, which is strikingly fast. The software is available at http://github.com/ddexter/SibJoin.git+.
Conclusions
Our SibJoin algorithm is reasonably accurate and thousands of times fasterthan existing algorithms. The heuristic is used to infer a halfsiblingstructure for a population which was, until recently, too large toevaluate.
Background
Conservation biologists and molecular ecologists use pedigree analysis to gaininsight into the mating habits of populations. For example, knowing the reproductionmechanics of a population helps biologists make important ecological decisions abouta region [1, 2]. The information may also be used to assist in reproduction andconservation of endangered or threatened species [3, 4]. A subfield of pedigree analysis focus on relationships amongsamegeneration individuals. Identifying related full sibling individuals, orindividuals who share both a common mother and common father, is well studied andmany algorithms exist for inferring relationships in such populations [5]. A similar, but much more difficult, task involves discoveringindividuals who are related by a single parent, also known as halfsiblings. Theability to infer halfsibling relationships extends to being able to understandfullsibling relationships, but the converse of this is not true. Correcthalfsibling reconstruction also allows biologists to develop a more completeunderstanding of how species mate than is possible with fullsibling reconstructionalone.
Knowing halfsibling relationships has important realworld applications and answersquestions that full sibling reconstruction cannot. For example, knowing whichindividuals share a single common parent allows biologists to measure the degree ofpolygamy within a population [6]. Halfsibling reconstruction gives insight about pollination patterns, asmothers are pollinated by potentially distant fathers. The diversity of pollinatorscan be used to measure the degree of isolation, due to deforestation, whichthreatens many forests [1].
For diploid species, children inherit one maternal and one paternal chromosome ateach locus from their parents. Mendelian genetic properties can identify unrelatedindividuals, but they also allow us to make predictions about related individuals.Unfortunately, we show that, unlike for fullsiblings, determining whether aproposed halfsibling relationship structure obeys Mendelian inheritance rules isNPcomplete.
The NPhardness result indicates that any algorithm that attempts to providevalid Mendelian family relationships to polygamous populations will likely require arunning time that is exponential in the size of the population, unless the objectivebeing optimized is trivial. However, we provide an extremely fast heuristic, calledSibJoin, which creates reasonably accurate population reconstructions in polynomialtime. We also describe a 0/1 integer program that identifies the minimum number ofindividuals that must be removed in order to make a proposed populationreconstruction valid under Mendelian inheritance rules. SibJoin uses the principlethat if the genotypes of two individuals are very similar, we can be more confidentthat they are related than we can of two individuals with much different genotypes.The accuracy and speed of our algorithm allows us to infer halfsiblingrelationships for previously inaccessible population sizes. We reconstructhalfsibship partitionings for a real population of 672 kelp rockfish that previoushalfsibling reconstruction algorithms fail to solve [7]. SibJoin is written in Python 2.7 and may be checked out from the masterbranch of its git repository at http://github.com/ddexter/SibJoin.git+.
Related work
Many groups have produced algorithms for constructing fullsibling partitions.Current sibling reconstruction techniques fall into three categories: likelihoodestimation, combinatorial objective optimization, and heuristics. While fullsibling reconstruction is a well studied problem with many very accuratealgorithms, halfsibling reconstructions algorithms are relatively few.
Likelihood estimation
Likelihood methods estimate the probability of the data under differentpartitionings of a population. An optimal solution maximizes thisprobability. For population reconstruction these strategies are often veryslow, even with local search heuristics, which makes them ill suited forsibling reconstruction on large data sets. On the other hand, because thisclass of algorithm establishes a probabilistic model, it is often possibleto directly incorporate error handling and prior assumptions about thepopulation to increase accuracy. For a more detailed discussion oflikelihood methods, see Jones and Wang [5].
COLONY [8] and COLONY 2.0 [9] are likelihood methods which construct halfsibling families.COLONY reconstructs full sibling families with high accuracy, but allows forpolygamy in only one sex. COLONY 2.0 performs half sibship reconstructionwhen both sexes are polygamous. Both of these programs use a likelihoodfunction and simulated annealing to find an optimal sibling structure for apopulation. However, results by Sheikh et al.[7], as well as our own results, show that COLONY and COLONY 2.0become prohibitively slow for even mediumsized populations. Additionally,as demonstrated in Almudevar and Anderson [10], COLONY 2.0 often splits true sibgroups into smaller groups,leading to an incomplete reconstruction.
Combinatorial optimization
Combinatorial optimization solutions seek to provide a sibship partitioningwhich minimizes or maximizes some objective function, such as number offamilies, matings, or parents. As with likelihood methods, finding globaloptima for large populations can be computationally demanding. However, manyoptimization techniques are easily parallelizable.
KINALYZER [11] seeks a minimum set cover, by using an integer programming (IP)formulation where each set is subject to the restrictions of Mendeliancompatibility for full siblings. KINALYZER yields decent results [12]; however, like the COLONY programs, does not scale well withpopulation size. The minimum set cover objective used by KINALYZER isNPhard [12]. Recent work has included halfsibling IP strategies that aresimilar to the full sibling strategies in KINALYZER, but they areunsuccessful for large populations [7]. The most viable of these is the halfsibling minimum set cover(HSMSC) IP. Both COLONY and the HSMSC cannot estimate halfsibling groupsfor large populations due to slow runtimes. Additionally, there is noevidence that minimizing the number of sibgroups is the right thing to do inall instances [10].
Fast heuristics
By making simplifying observations, heuristics can produce reasonablyaccurate results thousands of times faster than pure likelihood orcombinatorial methods. Brown and BergerWolf propose a clustering algorithmwhich joins two individuals based on the number of genetically compatiblethird parties [13]. The assumption is, if two individuals form a large number ofcompatible full sibling triplets, then they are likely to be full siblings,alongside the recognition that any incompatible proposed family includes anincompatible triplet, which Brown and BergerWolf also prove. For apopulation of n individuals with m loci, this algorithmhas an O(n^{3}m) runtime and gives accurate results for modest numbers of allelesand loci.
Another heuristic, employed in PRT 2 [10], enumerates a list of maximal sibgroups: sibgroups for which noadditional population may be added. PRT makes the assumption that it isunlikely to find unrelated individuals in a large sibgroup of this form. Aset cover of the maximal sibgroups is then selected using a likelihoodfunction. Although the authors claim that PRT supports halfsiblings,halfsibling groups are never directly computed. Instead, full siblinggroups are presented with a list of which pairs of groups can form validhalfsiblings. This is problematic in instances where both sexes are highlypolygamous because there will be many pairs of full sibling families thatare also halfsibling compatible, and PRT does not indicate which of theseare true halfsibling groups nor divide the halfsibling groups into thematernal and paternal groups. In fact, determining valid halfsiblingfamilies is NPcomplete, as we show below, though for small probleminstances or special cases, this may not be a major concern.
Notation
Information about individuals’ genotypes are collected and expressedthrough the measurement of microsatellites, sequences of repeating DNAbase pairs, such as ATATATAT, on a chromosome. The number of repeats gives aninteger value denoting the allele for an individual. Microsatellitesare collected from both chromosomes, though it is impossible to distinguish thetwo chromosomes with inexpensive technology. Each measurement site is called amicrosatellite locus. In practice, scientists identify and reportalleles at multiple loci in a population, typically to a maximum of one locus oneach autosomal chromosome, to avoid linkage effects and recombination.
SibJoin requires that each individual be diploid, meaning that population memberspossess two of each type of chromosome. Exactly one chromosome is inherited fromeach of the individual’s parents; therefore, each locus will have amaternal and paternal allele. Let m be the number of measured loci fora population. Each locus will have a variable number of alleles, k,which we represent as A_{ l } = {a_{0},a_{1},…,a_{k−1}}.
When the inherited maternal and paternal alleles are combined, they give anindividual’s genotype, which is unordered: (a_{ i },a_{ j }) is equivalent to (a_{ j },a_{ i }). Unfortunately, it is not always possible to reconstruct anindividual’s alleles for a given locus. Allelic dropout is a termthat refers to a common error in genotyping where information about a locuscannot be confidently determined and is omitted. We express sites with allelicdropouts as (∗,∗). When the same allele is inherited from bothparents, the genotype is homozygous; when they differ, it isheterozygous.
The halfsibling problem is, given a population of n offspring, toreconstruct a maternal and paternal partitioning $\mathcal{\mathcal{M}}$ and $\mathcal{P}$ of the population that are consistent with Mendelian laws. foreach maternal halfsibgroup $M\in \mathcal{\mathcal{M}}$ and for each paternal halfsibgroup $P\in \mathcal{P}$, there must be a genotype for that sibgroup such that theindividuals in F := M∩P must bevalid offspring of those two genotypes. Further, the genotype chosen for a groupM or P must be the same in all families that derive fromthat common parent. To avoid triviality, we typically seek to optimize someobjective function when choosing the two partitions, as otherwise, one couldsimply assign each individual to a unique pair of parents. Our heuristic SibJoinrelies on measurements of similarity between individuals. We denote thesimilarity between individuals x and y as s_{ x y } and the similarity between clusters C_{ i } and C_{ j } as s i m(C_{ i },C_{ j }).
Mendelian compatibility
BergerWolf et al.[14] give two Mendelian properties of diploid full siblings. Refer totheir article for the concrete mathematical expression; here, we give a shortexegesis. In any full sibgroup, at all alleles, at most four alleles appearsince there are two parents each with at most two alleles. BergerWolf etal. refer to this rule as the 4allele property. The 2allele propertyenforces the rule that for each full sibling group, there is a partitioning ofthe alleles at each locus into a maternal and paternal group, such that eachindividual obtains exactly one allele from the maternal set and one from thepaternal set, at each locus. Sheikh et al.[7] extend these rules to halfsiblings. The halfsibshipproperty states that for each locus in a halfsibling family, thereexists two alleles {a_{ i },a_{ j }}, which are the alleles of the shared parent, such that each individualpossesses at least one copy of either a_{ i } or a_{ j } at each respective locus; this describes the requirement that thehalfsibling group inherits from the common parent.
Forced allele incompatibilities
When populations are completely monogamous, determining whether each family in apopulation structure has valid parent genotypes is trivial, as decisions made inreconstructing the parents of one sibgroup are independent of all other families.However, when reconstructing halfsibling populations, determining whether there isa set of parents and matings that can explain a collection of identified sibgroupsunder Mendelian inheritance assumptions is much more difficult. For any individual,choosing its father’s genotype uniquely determines which allele must have beeninherited from, and hence present in, the mother, unless the offspring genotypeexactly matches the paternal one. Thus, the decision affects the maternal genotype.In polygamous populations, this influence on the choice of maternal genotype bypaternal genotype also indirectly affects the choice of genotype for each otherfather that the common mother has mated with. For example, if both maternal allelesat a locus are fixed by one sibgroup’s reconstruction, then any other allelesfound in offspring from a different mating of the same mother must have beeninherited from the father.
For highly polygamous populations with many indirect inheritance relationships ofthis sort, it can be difficult to determine whether a proposed population structureis valid. We show that deciding whether there is a valid parental genotype for eachhalfsibling partition in a candidate halfsibling population reconstruction isNPComplete. Thus, we cannot expect polynomial time algorithms toproduce valid parental genotypes, even when they exist. We instead propose aninteger program, which identifies the minimum number of individuals that must beremoved from a candidate population reconstruction so that the resulting populationis valid under Mendelian inheritance, and give experimental results of using it.
Complexity of the valid halfsibling partitioning decision problem
Given maternal and paternal halfsibling partitionings, with each individualbelonging to exactly one maternal and one paternal partition, is it possible toassign genotypes to the parents of each halfsibling family in a way thatrespects the property that every individual must inherit one of exactly twoalleles from each parent at each locus? We will call this problem VALIDHALFSIBLING PARTITIONING.
Theorem 1. VALID HALFSIBLING PARTITIONING is NP complete.
Proof. We first show that VALID HALFSIBLINGPARTITIONING ∈ NP. Given an instance of theproblem and an assignment of genotypes to the parents of each halfsiblingfamily, we can verify in polynomial time whether or not the population structureis valid. The algorithm is straightforward: for each heterozygotic child, checkthat the allele inherited from the mother is not the same as the alleleinherited from the father. When the parent and child genotypes differ, we saythat the child is forced to inherit a specific allele, e.g. child(a,b) is forced to inherit allele a from mother(a,c). If the parent does not force an allele because theparent genotype is identical to the child’s genotype, then that childcannot cause an incompatibility: the inherited allele from the identical parentis whichever allele was not inherited from the child’s other parent.
Next, we give a polynomialtime reduction from the NPcomplete MONOTONEONEINTHREE SAT problem to VALID HALFSIBLING PARTITIONING. The MONOTONEONEINTHREE SAT problem is: given a set of boolean clauses, each containingthree nonnegated literals, determine whether a configuration of literals existssuch that exactly one literal in each clause is set true. This problem is alsocalled EXACTCOVERBY3SETS (X3C), which was used in the first proof of theNPhardness of parsimony phylogeny [15]. The reduction requires three gadgets that translate literals andclauses in a MONOTONE ONEINTHREESAT instance into alleles and families in aVALID HALFSIBLING PARTITIONING instance, respectively.
The first gadget translates picking a literal in a clause to picking a parent fora family. The second gadget defines paternal families that introduce additionalnecessary offspring. From the MONOTONE ONEINTHREE SAT perspective, the thirdgadget enforces the rule that if a literal is chosen to be set true in oneclause, it must be chosen to be true in all of the clauses it belongs to.
We begin by defining a onetoone function f :x→y which assigns each SAT literal to a uniqueinteger allele value. Assume also that there are m clauses.

1.
The selection gadget creates a maternal family with three possible mothers and six offspring for each clause in the SAT instance by mapping literals in a clause to alleles in a family. For the clause (x _{ i }∨x _{ j }∨x _{ k }), the corresponding y _{ i }, y _{ j }, and y _{ k } will be the alleles present in the created family. Six children are created by making two copies of each pairwise grouping of the y alleles: for this clause, we would create the maternal groups in Table 1.
In this halfsibgroup, there are three choices for the maternalgenotype: (y_{ i },y_{ j }), (y_{ i },y_{ k }), and (y_{ j },y_{ k }). Choosing, for example, maternal genotype (y_{ j },y_{ k }) corresponds to setting literal x_{ i } to true in the MONOTONE ONEINTHREE SAT instance: the rule is that theallele not present in the maternal genotype is the true literal. There are totalof m of these maternal families, each with six members.

2.
The mapping gadget creates two paternal families for each potential mother, for a total of six paternal families per maternal selection gadget family. The gadget highlights which literal is set to true in the clause corresponding to the selection gadget.
The 6m mapping families introduce new alleles s_{0}…s_{m−1}, one for each clause, and another distinct allelez common to all of the mapping families. The s_{ i } alleles are used in the third gadget to enforce that, once a literal isset to true in one clause, it is true in every clause.
We now show how to construct the paternal families using the(y_{ i },y_{ j }) children from the selection gadget. Let k_{ i } be the number of clauses that contain variable x_{ i }. Each paternal family p containing the y_{ i } allele must have $\left(\genfrac{}{}{0ex}{}{{k}_{i}}{2}\right)$ copies of the offspring (s_{ p },y_{ i })_{0} and (s_{ p },y_{ i })_{0}. For the clause (x_{ i }∨x_{ j }∨x_{ k }), we create the six families in Table 2 (thoughhere we only show one copy of (s_{ p },y_{ i })_{0} and (s_{ p },y_{ i })_{1}.
Consider the set of six mapping gadget families with alleles{y_{ i },y_{ j },y_{ k }}, corresponding to the clause c_{ p } = (x_{ i }∨x_{ j }∨x_{ k }). If, for example, the y_{ i } allele for all copies of offspring (s_{ p },y_{ i })_{0} and of (s_{ p },y_{ i })_{1} is inherited from its father, then the correspondingmaternal selection parent must be (y_{ j },y_{ k }), indicating that variable x_{ i } is set to true. Again without loss of generality, if the s_{ p } allele is inherited from the father, then the maternal parent from theselection gadget cannot possibly be (y_{ j },y_{ k }).

3.
Lastly, we construct a gadget that maintains consistency in each use of the variables, from the SAT instance perspective. That is, it forces each true literal to be true and each false literal to be false in every clause in which that literal appears. The enforcement gadget constructs a constraining maternal family for each pair of clauses in which a common literal occurs. For instance, if literal x _{ i } appears in clauses c _{ p } and c _{ q }, then a c _{ p }/c _{ q } enforcment family will be constructed so that either y _{ i } is forced to be maternal or s _{ p } and s _{ q } are forced to be maternal. The gadget makes use of the (s _{ p },y _{ i })_{∗} and (s _{ q },y _{ i })_{∗} individuals created by the mapping gadget. The families created for a literal x _{ i } that appears in clauses c _{ p }, c _{ q }, and c _{ r } are found in Table 3.
Each enforcement gadget family for y_{ i } has two copies of (s_{ p },y_{ i }) which are the two children from the mapping gadgets containing(y_{ i },y_{ j }) and (y_{ i },y_{ k }). If s_{ p } is forced, then s_{ q } must also be forced to avoid an incompatibility. As a result, y_{ i } is forced in both paternal mapping gadget families, indicating thatx_{ i } is kept true in both cases. As stated in the mapping gadget, this forceswhich maternal parent must be chosen in the selection gadget. However, ify_{ i } is forced in the enforcement gadget, then s_{ p } and s_{ q } are forced to be true in the paternal mapping gadget families, whichexcludes the parents corresponding to x_{ i } in the SAT instance from being chosen in the respective selection gadgetfamilies.
Finally, for all individuals with only a single assigned parent, which is truefor all individuals with the allele z, we assign them to asingleelement family corresponding to their unassigned parent, thus enforcingno further restrictions on parental genotypes.
In the MONOTONE ONEINTHREE SAT problem, a literal from each clause must be setto true. The selection gadget translates the task of choosing an allele topicking the parents of maternal families. Each selection gadget family containsthree distinct alleles {y_{ i },y_{ j },y_{ k }}. Choosing maternal parent (y_{ i },y_{ j }) is equivalent to setting x_{ k } true and the x_{ i } and x_{ j } literals to false.
Finally, let n be the number of literals and m be the number ofclauses. Constructing the VALID HALFSIBLING PARTITIONING instance requiresO(m) children for the first gadget, O(m^{2}·n) additional children for the second gadget, andO(m^{2}) additional children for the third gadget. Sincem ≤ n, the resulting transformation ispolynomial in the number of literals. □
As an example of this reduction, consider the MONOTONE ONEINTHREE SAT instance(x_{1} ∨ x_{2} ∨ x_{3}) ∧ (x_{1} ∨ x_{4} ∨ x_{5}). Its maternal selection and enforcement gadget families are inTable 4 and its paternal mapping gadget families are inTable 5.
There are several feasible solutions to the M13SAT instance, but the exampleillustrates the case where literals x_{2} and x_{4} are set true in the M13SAT instance. The inherited allele foreach individual in each family is bolded to represent the corresponding VHSPsolution where mothers (1,3) and (1,5) are chosen in the selection gadget.
Identifying allele incompatibilities
Unfortunately, the NPcompleteness of the VHSP problem makes it veryunlikely that a polynomial time algorithm exists for verifying that givenmaternal and paternal partitions have valid parental genotype assignments.Therefore, we present a 0/1 integer program that identifies individuals toremove to obey Mendelian inheritance. We use parsimony and select the fewestindividuals to remove.
We now present a 0/1 integer program to solve this problem.
For a population with n individuals, genotyped at m loci, letx_{ i } = 1 denote the decision to remove individual i fromthe population. The variable ${y}_{j,k}^{l}$ represents the binary choice of having the k^{th} allele in the common parent of the family j at locus l.Denote the multiset which contains the maternal and paternal families as $\mathcal{C}$, and let π_{0} and π_{1} be functions that map an individual to its maternal and paternalindex in $\mathcal{C}$ respectively. Finally, let λ_{0} and λ_{1} map the first and second allele of an individual to an index inK, the set of all alleles.
The first constraint enforces that no parent can have more than two alleles. Thesecond and third constraints enforce Mendelian mating requirements onindividuals: an individual is invalid, and hence must be removed if it does notreceive one allele from its mother and its other allele from its father. Thereare two possible ways to satisfy this constraint. Either the child received itsfirst listed allele from its mother and the second allele from its father orvice versa. The two constraints are the logical or of these two possibilities,and the fourth constraint identifies incompatible loci as those that areunsatisfied by either of these possibilities. Finally, the last constraintenforces the requirement that x_{ i } must be 1, corresponding to individual i being selected forremoval, if the individual has any incompatible loci.
We note that the minimization objective forces x_{ i } and ${x}_{i}^{l}$ to be 0 as often as possible. Since there can never be morefamilies than individuals, the integer program has a total ofO(m·n) constraints.
Clustering halfsibling families
Sibship reconstruction finds a population clustering which obeys Mendelian genetics.In the full sibling clustering $\mathcal{F}$, each individual appears only once. For halfsiblings, analgorithm must construct $\mathcal{\mathcal{M}}$ and $\mathcal{P}$ when both sexes are polygamous or only one of the two partitionswhen one sex is monogamous. Here, we describe the approach of SibJoin, our programfor identifying halfsibling populations.
Measuring similarity
For a given halfsibling input, SibJoin relies on an n×nsimilarity matrix which expresses the similarity between each pair ofindividuals in the offspring population set. Brown and BergerWolf [13] used a similarity measure which, for each pair of individuals, is thecount of third individuals in the population that form a compatible full siblingtriplet with the pair. They proved that any incompatible candidate full siblinggroup must contain an incompatible triplet and give a probabilistic argumentthat pairs of individuals with large numbers of compatible triplets are likelysiblings. Unfortunately, the halfsibling property is much weaker because itonly operates on one parent. Ruling out a potential halfsibling group can takeas many as six individuals, compared to the three that is required for fullsiblings.
Theorem 2. There exist incompatible halfsibling groups for which their smallestincompatible subgroup has six members
Proof. Consider the sextet of individuals with one locus and fouralleles {[ (1,2)],[ (1,3)],[ (1,4)],[ (2,3)], [ (2,4)],[ (3,4)]}. Any five ofthe individuals form a valid halfsibship under the halfsibling property, butthe incompatibility appears when all six individuals are examined together: thecommon halfsibling parent would need three alleles at the locus. □
The lower bound suggests that examining triplets for halfsiblings could yield afalsely high count when individuals are not actually related. Additionally, theprobability that three random individuals form a valid halfsibship is muchhigher than that of three individuals forming a valid full sibship. Byenumerating all possible triplets with a pool of five alleles, we see that96.62% of all triplets are compatible under the halfsiblingproperty, but only 56.61% of all triplets are compatible under thefull sibling properties: that is, incompatibilities are not nearly so much of awarning of unrelatedness for halfsibling reconstruction as for fullsiblingreconstruction. If the number of alleles is set to ten, then 75.46%of halfsibling triplets are compatible (most often, these result when any twoindividuals in the triple share an allele), while the number of full siblingcompatible triplets is just 20.94%.
In the place of a tripletbased similarity function, SibJoin uses a pairwisemeasure. Given two individuals, each with m loci, we count sharedalleles at each locus independently, between the two individuals. For example,the pair of individuals x = [ (1,2),(2,2),(1,3)] andy =[ (1,1),(2,2),(2,3)] has similarity s_{ x y } = 4. The pairwise similarity matrix for this simple measuremay be computed in O(n^{2}m) time, as opposed to the O(n^{3}m) time that is required by the bruteforce triplet method.
Let X be the random variable that represents the number of sharedalleles between two individuals at a single locus. If we assume an even alleledistribution, then the expected number of shared alleles at a single locus isgiven in Eq. 1, 2, and 3.
Assuming the parents of two full siblings are each heterozygotic, two siblingshave a 50% chance of inheriting the same allele from each parent: if eitherparent is homozygotic, then the offspring are guaranteed to inherit the sameallele from that parent. Similarly, halfsiblings have a 50% chance ofinheriting the same allele from their heterozygotic common parent. For unrelatedindividuals, the expected number of shared alleles approaches 0 as the number ofdistinct alleles at a locus grows. When there are many possible alleles, it isunlikely that two unrelated individuals will inherit the same alleles. So, ask grows without bound E[ Xfull siblings]→1, $\mathbf{\text{E}}\left[\phantom{\rule{0.3em}{0ex}}X\right\text{half siblings}]\to \frac{1}{2}$, and E[ X unrelated]→0.
Additionally, we may apply Hoeffding’s inequality to show that theprobability that a pairwise allele similarity deviates far from its meandecreases exponentially as the number of loci increases. Let X be arandom variable as described above. For independent loci, the allele similarityx_{ i } is the allele similarity of the i’th locus with0 ≤ X_{ i } ≤ 2 for1 ≤ i ≤ m. Byapplication of Hoeffding’s inequality to the mean allele similarity $\overline{X}=\sum _{i=0}^{m}{X}_{i}/m$,
Therefore, the pairwise similarity measure will perform well when either thenumber of alleles or loci is sufficiently large: it easily separates unrelatedindividuals from half siblings and full siblings. Our results also indicate thatpairwise similarity method performs well, even with modest numbers of allelesand loci.
Joining individuals
SibJoin’s halfsibling clustering uses the observation that individualswith high allele similarity are very likely half or full siblings. SibJoinbegins with 2n clusters, each of which contains a single individual.Every individual appears in exactly two clusters, representing its maternal andpaternal halfsib groups. SibJoin uses a variant of single linkage clustering tojoin clusters. Single linkage clustering is a form of agglomerative clusteringthat determines the similarity of two clusters C_{ i } and C_{ j } by computing s i m(C_{ i },C_{ j }) = maxx∈C_{ i },y∈C_{ j }s_{ x y }, and then joins the two compatible groups with highest similarity. Asample join is demonstrated in Figure 1. Ties insimilarity are broken by joining the groups with the highest combined number ofmembers first since large compatible halfsibling groups are more likely to berelated than small groups. SibJoin’s success comes from two observations.First, in order for bad joins to occur between any pair of individualsi and j, the similarity between i and jwould need to be larger than the similarity between i and each ofi’s real halfsiblings, and likewise for j.Secondly, as clusters grow, the odds that two unrelated clusters form acompatible halfsibship rapidly diminishes.
Joining must only occur if two clusters form a valid halfsibship. At theinitialization of the algorithm, each individual is assigned a feasible parentset with size at most O(k) per locus. Each join results in aparent set which is the intersection of the parent sets from the two clusters.If the intersection produces the null set, then there is no parent which canexplain the new cluster and the join is rejected. Therefore, testing whether ornot a join is valid takes O(k m) time. When a site experiences allelic dropout, SibJoin makes noassumptions about its parental restrictions; however, sites with(∗,∗) are never counted toward allele similarity betweenindividuals.
Unlike crisp clustering methods which mandate that each individual appear inexactly one cluster, the halfsibling problem contains both a maternal andpaternal group for each individual. We enforce the restriction that any set ofindividuals sharing both a maternal and paternal cluster must be compatible fullsiblings under the 4allele and 2allele properties by maintaining a clusteringof full siblings. Because incompatible full sibling groups are less likely thanincompatible halfsibling groups of the same size, at each similarity stepSibJoin joins clusters which form valid full sibships first.
Microsatellites give no information about which alleles are maternal and whichare paternal. Since SibJoin constructs families in an iterative manner, part ofa maternal family could be reconstructed in the maternal partitioning, while theother part of the family is constructed in the paternal partitioning. If we arestrict about which sets we call maternal and paternal, then the two halves willnever be joined and the half on the paternal side will likely force incorrectfuture joins. Our solution is to implement a bipartite graphG = (V,E) where each cluster is avertex and edges exist between clusters which share an individual. Let a joinbetween clusters C_{ i } and C_{ j } be an event which combines C_{ j } into C_{ i } and let E(v) be the set of edges that touch v.In our graph, j o i n(C_{ i },C_{ j }) results in $E\left({v}_{i}\right):=E\left({v}_{i}\right)\bigcup E\left({v}_{j}\right)$ followed by the removal of v_{ j } and all edges in E(v_{ j }). Enforcing bipartiteness as a postcondition of the join operationallows flexibility while ensuring that the solution results in each individualhaving one parent of each sex.
Results and discussion
We evaluated SibJoin’s performance with simulated and real population data. Theexperimental results from real populations are contrasted with the the HSMSC andCOLONY halfsibling approaches.
Accuracy measure
Partition distance is a metric which measures the distance between two partitionsas the minimum number of individuals that must be removed from a populationuntil the two clusterings are identical. This metric is widely used in sibshipreconstruction literature and in bioinformatics in general [16, 17]. When true partitionings are known, partition distance may be used tocompute the true accuracy of an algorithm; however, it may also be used toassess changes between candidate sibships for which ground truth is not known [18].
Despite its prominence in the kinship analysis literature, partition distanceoffers only a coarse estimate of correctness because it disregards how theexcluded individuals are constructed within the partitioning. For example,failing to join two related partial families is less destructive thanincorrectly joining one of the partial families to an unrelated family. However,in both instances, the partition distance is identical. A concrete example isgiven in Meilă [19].
Instead, we use an informationtheoretic metric called variation of information(VI) [19]. The VI measures how much the information given by two clustersdiffer and is preferable because it quantifies the amount of uncertaintyintroduced by misplaced individuals.
The VI between two partitionings is 0 if and only if the two partitionings areidentical. Smaller VI corresponds to more similar partitionings. Like entropy,the VI is always nonnegative. It also has a tight upper bound of logn[19]; therefore, we normalize VI to a value in [0,1] before reporting thescore for each of our trials.
For halfsiblings, calculating the VI is not straightforward because we have twopartitionings, maternal and paternal, instead of the single partitioning that iscommon in most clustering problems. Since there are two clusterings, we computethe average variation of information between the two maternal clusters, $\mathcal{\mathcal{M}}$ and ${\mathcal{\mathcal{M}}}^{\prime}$, and the two paternal clusters $\mathcal{P}$ and ${\mathcal{P}}^{\prime}$, where $\mathcal{\mathcal{M}}$ and $\mathcal{P}$ are the true partitionings. Since it is usually impossible todetermine the sex of the common parent, we calculate two different VI values andchoose the sex assignment that minimize the VI.
Simulated data set results
We constructed simulation sets to test various parameters. Our model generatesindividuals from an equal number of mothers and fathers. Parents are chosenrandomly, and children are generated from motherfather pairs according to aneven allele distribution. Simulated data had default parameter values of 6alleles, 6 loci, halfsibling family sizes of 5 individuals, and a populationsize of 40 individuals. The results are an average of ten trials per parametervalue. Trials which failed to complete in 1 day are reported as ’’.The population size was increased to 80 individuals for family size trials sothat the partitionings did not become trivial. The loci count was increased to10 and family size to 20 when testing population sizes above 200 individuals. Asummary of our parameter tests and their results may be found in Table 6. Testing occurred on a 2.66 GHz machine, containing 8 GBof RAM, and running Python 2.7.
In most cases, the reported VI score approximates the ratio of partition distanceto population size. Overall, COLONY 2.0 was more accurate, but took thousands oftimes longer, often with only small gains in accuracy. SibJoin does much worsethan COLONY 2.0 on the 10 allele test set, but the discrepancy is due to asingle trial for which SibJoin receives a VI of 0.084 while COLONY 2.0 producesa perfect reconstruction. For the 10 loci test set, SibJoin’s VI is againhigher, but in practice the false positive difference between it and COLONY 2.0is about one individual per trial.
SibJoin does worst when the population size is large and the family size issmall. For instance, when tested with a 100 individual population and familiesof 5 individuals, SibJoin rendered a VI of 0.201 compared to COLONY 2.0’sVI of 0.086. When family sizes are small and population sizes are large, it ismuch more likely for two unrelated individuals to be mistakenly labeled ashalfsiblings due to the explanations given in section 4.3. However,SibJoin’s accuracy rapidly improves with modest increases in family size.In fact, SibJoin is more accurate than COLONY 2.0 in trials with familiescontaining 20 individuals. Unsurprisingly, both methods poorly reconstructpopulations where only two alleles are present. With only two alleles, allindividuals can be full or halfsiblings.
We may also use SibJoin to explore populations with extreme numbers ofindividuals. SibJoin was able to reconstruct populations of 500, 1000, and 2000individuals in under 10 minutes, yet problems of this magnitude are intractablefor the HSMSC and both of the COLONY programs.
Biological data set results
SibJoin was tested on two biological data sets. The first data set is apopulation of 112 field crickets with 7 mothers and 6 sampled loci [20]. The second data set is a population of 672 kelp rockfish with 7mothers and 7 sampled loci [21]. Results are shown in Table 7. Neither COLONY2.0 nor the HSMSC produced a solution for the 672 rockfish population, sosamples from three of the parents were taken to reduce the population size to288 individuals. In both populations, only maternal parentage was available. Forall trials, SibJoin was run in a configuration that only attempts to reconstructthe maternal sex.
Our results are compared to the HSMSC results in [7] and to our own benchmarks on COLONY 2.0. Because the HSMSC is notpublicly available, we could not assess runtime information for the program.However, the authors do note that the HSMSC IP finished in under one day. Thedifference between the two runtimes is not explained merely by CPU speedincreases across a small number of years. Additionally, neither COLONY 2.0 northe HSMSC’s halfsibling minimum set cover approach constructed afeasible answer for the 672 rockfish data set: COLONY 2.0 was stopped afterrunning for three days. SibJoin constructs an accurate solution in under 10seconds.
The HSMSC ILP does not enforce that individuals must have one parent of each sexand both partition distance and variation of information are illdefined whenthe result is not a true partitioning. In the population of 112 crickets, theHSMSC had two false positives and was otherwise correct. In the test setcontaining 288 rockfish, HSMSC had 4 false positives and was otherwise correct.COLONY 2.0 was correct in all instances. SibJoin correctly reconstructed thehalfsibship for the 288 rockfish and only misplaced one individual in thecricket test. SibJoin was the only algorithm to complete for the population of672 rockfish. Overall, SibJoin is as accurate as the HSMSC and nearly asaccurate as COLONY 2.0, but is much faster than either: SibJoin solves the smallrockfish instance over 42,000 times faster than COLONY 2.0.
Minimum population removal IP results
Using the integer program outlined previously, we can identify the minimumsizeset of individuals which must be removed in order to make a SibJoin solutionfeasible. We assume that this set is small, so finding the minimum individualsto remove should capture many incorrect individuals.
Although the IP generally solves quickly, it struggles to find a global optimumfor populations of hundreds of individuals. In these cases the IP gets veryclose, often with integrality gaps below 3%, but never reaches an optimalinteger solution since it runs out of memory. In our experiments, we enforce a 5minute time limit on the IP. We report the percent of trials that failed toreach integrality in Table 8. An approximate solution isacceptable as long as there is a reliable way to correctly readd identifiedindividuals into the population; also, of course, there is no reason to assumethat the smallest set to remove consists of those that are causing trouble.
Table 8 reports the recall and precision of the IP: thepercentage of all incorrect individuals that are identified by the IP and thepercentage of individuals that are actually incorrect among the individualsidentified by the IP. We find that the integer program can have a poor recall,finding only 30% of the false positives in some situations; however, theprecision is relatively high. For individuals in the minimum removal set, thenumber of incorrectly placed individuals is consistently above 50%. Theprecision is significant since SibJoin’s total error rate is often farbelow 50%. If we found a way to correctly reintroduce the set of individualsidentified by the IP, then the overall SibJoin error rate would decreasesignificantly.
The IP does worst when there are only two alleles or two loci. This isunsurprising since there will be no incompatibilities when each locus containsless than three alleles and data with few loci have smaller risk of forbiddenallele structures with bad joins. However, both recall and precision tend toincrease with population size as demonstrated by the 100 and 200 population sizetest cases. For populations with 200 individuals, the IP did not reachintegrality within 5 minutes, but still produced solutions with high recall andprecision relative to the other tests, indicating that the IP is still useful atidentifying misplaced individuals in large populations.
Conclusions
It is important to be able to determine whether or not a proposed populationstructure is valid under Mendelian inheritance assumptions. For halfsiblings, wehave proved that even determining if such a structure obeys Mendelian laws isNPcomplete, which is surprising since the same determination inmonogomous populations is trivial. This result has important implications forhalfsibling algorithms in general since most existing algorithms do notspecifically enforce which allele is inherited from which parent and those that dovery likely have running times which are exponential in the size of the population.We have also provided an integer program that solves an optimization variant of theproblem: what is the minimum number of individuals that must be removed from apopulation in order for the population structure to be valid. The IP was run againstSibJoin’s population reconstructions. Although the IP only had a recall of 30to 40 percent when run against SibJoin’s population reconstructions, theprecision was high: 55 to 78 percent of the individuals identified for removal wereactually incorrect.
We have also demonstrated an application of allele similarity with our fast SibJoinheuristic. SibJoin is a bottomup algorithm based on single linkage clustering. Ourexperiments show that despite being a heuristic, the algorithm competes in accuracywith existing likelihoodbased algorithms, but is thousands of times faster inpractice. The speed of our algorithm is important since existing algorithms fail toreconstruct halfsibling families when the population size is above a few hundredindividuals. SibJoin can reconstruct these populations in seconds. SibJoin is ableto reconstruct real biological populations that existing algorithms fail toreconstruct, and it does so with high accuracy.
References
 1.
Isagi Y, Kanazashi T, Suzuki W, Tanaka H, Abe T: Highly variable pollination patterns in Magnolia obovata revealed bymicrosatellite paternity analysis. Int J Plant Sci. 2004, 165 (6): 10471053. 10.1086/423883.
 2.
Sezen U, Chazdon R, Holsinger K: Genetic consequences of tropical secondgrowth forest regeneration. Science. 2005, 307 (5711): 891.
 3.
Caughley G: Directions in conservation biology. J Anim Ecol. 1994, 63 (2): 215244. 10.2307/5542.
 4.
Hedrick P, Lacy R, Allendorf F, Soule M: Directions in conservation Biology: Comments on Caughley. Conserv Biol. 1996, 10 (5): 13121320. 10.1046/j.15231739.1996.10051312.
 5.
Jones OR, Wang J: Molecular markerbased pedigrees for animal conservation biologists. Anim Cons. 2010, 13: 2634. 10.1111/j.14691795.2009.00324.
 6.
Painter I: Sibship reconstruction without parental information. J Agric Biol Envir S. 1997, 2 (2): 212229. 10.2307/1400404.
 7.
Sheikh SI, BergerWolf TY, Khokhar AA, Caballero IC, Ashley MV, Chaovalitwongse W, Chou C, Dasgupta B: Combinatorial reconstruction of halfsibling groups from microsatellitedata. J Bioinf Comput Biol. 2010, 8 (2): 337356. 10.1142/S0219720010004793.
 8.
Wang J: Sibship reconstruction from genetic data with typing errors. Genet. 2004, 166 (4): 19631979. 10.1534/genetics.166.4.1963.
 9.
Wang J, Santure AW: Parentage and sibship inference from multilocus genotype data underpolygamy. Genet. 2009, 181 (4): 15791594. 10.1534/genetics.108.100214.
 10.
Almudevar A, Anderson E: A new version of PRT software for sibling groups reconstruction with commentsregarding several issues in the sibling reconstruction problem. Mol Ecol Resour. 2011, 12: 164178.
 11.
Ashley M, Caballero I, Chaovalitwongse W, Dasgupta B, Govindan P, Sheikh SI, BergerWolf TY: KINALYZER, a computer program for reconstructing sibling groups. Mol Ecol Resour. 2009, 9 (4): 11271131.
 12.
Chaovalitwongse WA, Chou CA, BergerWolf TY, DasGupta B, Sheikh S, Ashley MV, Caballero IC: New optimization model and algorithm for sibling reconstruction from geneticmarkers. INFORMS J Comp. 2009, 22 (2): 180194.
 13.
Brown D, BergerWolf TY: Discovering kinship through small subsets. Proc 10th WABI10. 2010, 6293 in LNCS: 111123.
 14.
BergerWolf TY, DasGupta B: Combinatorial reconstruction of sibling relationships. Proc CGBI. 2005, 36.
 15.
Foulds L, Graham R: The Steiner problem in Phylogeny is NPComplete. Adv Appl Math. 1982, 3: 4349. 10.1016/S01968858(82)800043.
 16.
Gusfield D: Partitiondistance : A problem and class of perfect graphs arising inclustering. Info Proc Lett. 2002, 82 (3): 159164. 10.1016/S00200190(01)002630.
 17.
Lin Y, Rajan V, Moret B: A metric for phylogenetic trees based on matching. IEEE/ACM Trans Comp Bio Bioinf. 2011, 9 (4): 10141022.
 18.
Coombs JA, Letcher BH, Nislow KH: PedAgree: software to quantify error and assess accuracy and congruence forgenetically reconstructed pedigree relationships. Conserv Genet Resour. 2010, 2: 147150. 10.1007/s1268601092029.
 19.
Meila M: Comparing clusterings by the variation of information. Proc COLT. 2003, 2777: 173187.
 20.
Bretman A, Tregenza T: Measuring polyandry in wild populations: a case study using promiscuouscrickets. Mol Ecol. 2005, 14 (7): 21692179.
 21.
Sogard SM, GilbertHorvath E, Anderson EC, Fisher R, Berkeley SA, Garza JC: Multiple paternity in viviparous kelp rockfish, Sebastes atrovirens. Environ Biol Fishes. 2008, 81: 713.
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Councilof Canada and the Government of Ontario. We would like to thank TanyaBergerWolf and Saad Sheikh for providing real population data, and also thankTanya BergerWolf, Marina Meilă, and Rita Ackerman for helpfuldiscussions.
Author information
Additional information
Competing interests
The authors declare they have no competing interests.
Authors’ contributions
DD and DGB jointly developed the algorithms and proofs. DD implemented the algorithmsand conducted the experiments. DGB directed the project. Both authors wrote, read,and approve the manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Kinship discovery
 Halfsibling
 Population genetics
 Conservation biology
 Heuristics