Fast half-sibling population reconstruction: theory and algorithms
- Daniel Dexter^{1} and
- Daniel G Brown^{1}Email author
https://doi.org/10.1186/1748-7188-8-20
© Dexter and Brown; licensee BioMed Central Ltd. 2013
Received: 13 December 2012
Accepted: 29 June 2013
Published: 12 July 2013
Abstract
Background
Kinship inference is the task of identifying genealogically relatedindividuals. Kinship information is important for determining matingstructures, notably in endangered populations. Although many solutions existfor reconstructing full sibling relationships, few exist forhalf-siblings.
Results
We consider the problem of determining whether a proposed half-siblingpopulation reconstruction is valid under Mendelian inheritance assumptions.We show that this problem is NP-complete and provide a 0/1 integerprogram that identifies the minimum number of individuals that must beremoved from a population in order for the reconstruction to become valid.We also present SibJoin, a heuristic-based clustering approach based onMendelian genetics, which is strikingly fast. The software is available at http://github.com/ddexter/SibJoin.git+.
Conclusions
Our SibJoin algorithm is reasonably accurate and thousands of times fasterthan existing algorithms. The heuristic is used to infer a half-siblingstructure for a population which was, until recently, too large toevaluate.
Keywords
Kinship discovery Half-sibling Population genetics Conservation biology HeuristicsBackground
Conservation biologists and molecular ecologists use pedigree analysis to gaininsight into the mating habits of populations. For example, knowing the reproductionmechanics of a population helps biologists make important ecological decisions abouta region [1, 2]. The information may also be used to assist in reproduction andconservation of endangered or threatened species [3, 4]. A sub-field of pedigree analysis focus on relationships amongsame-generation individuals. Identifying related full sibling individuals, orindividuals who share both a common mother and common father, is well studied andmany algorithms exist for inferring relationships in such populations [5]. A similar, but much more difficult, task involves discoveringindividuals who are related by a single parent, also known as half-siblings. Theability to infer half-sibling relationships extends to being able to understandfull-sibling relationships, but the converse of this is not true. Correcthalf-sibling reconstruction also allows biologists to develop a more completeunderstanding of how species mate than is possible with full-sibling reconstructionalone.
Knowing half-sibling relationships has important real-world applications and answersquestions that full sibling reconstruction cannot. For example, knowing whichindividuals share a single common parent allows biologists to measure the degree ofpolygamy within a population [6]. Half-sibling reconstruction gives insight about pollination patterns, asmothers are pollinated by potentially distant fathers. The diversity of pollinatorscan be used to measure the degree of isolation, due to deforestation, whichthreatens many forests [1].
For diploid species, children inherit one maternal and one paternal chromosome ateach locus from their parents. Mendelian genetic properties can identify unrelatedindividuals, but they also allow us to make predictions about related individuals.Unfortunately, we show that, unlike for full-siblings, determining whether aproposed half-sibling relationship structure obeys Mendelian inheritance rules isNP-complete.
The NP-hardness result indicates that any algorithm that attempts to providevalid Mendelian family relationships to polygamous populations will likely require arunning time that is exponential in the size of the population, unless the objectivebeing optimized is trivial. However, we provide an extremely fast heuristic, calledSibJoin, which creates reasonably accurate population reconstructions in polynomialtime. We also describe a 0/1 integer program that identifies the minimum number ofindividuals that must be removed in order to make a proposed populationreconstruction valid under Mendelian inheritance rules. SibJoin uses the principlethat if the genotypes of two individuals are very similar, we can be more confidentthat they are related than we can of two individuals with much different genotypes.The accuracy and speed of our algorithm allows us to infer half-siblingrelationships for previously inaccessible population sizes. We reconstructhalf-sibship partitionings for a real population of 672 kelp rockfish that previoushalf-sibling reconstruction algorithms fail to solve [7]. SibJoin is written in Python 2.7 and may be checked out from the masterbranch of its git repository at http://github.com/ddexter/SibJoin.git+.
Related work
Many groups have produced algorithms for constructing full-sibling partitions.Current sibling reconstruction techniques fall into three categories: likelihoodestimation, combinatorial objective optimization, and heuristics. While fullsibling reconstruction is a well studied problem with many very accuratealgorithms, half-sibling reconstructions algorithms are relatively few.
Likelihood estimation
Likelihood methods estimate the probability of the data under differentpartitionings of a population. An optimal solution maximizes thisprobability. For population reconstruction these strategies are often veryslow, even with local search heuristics, which makes them ill suited forsibling reconstruction on large data sets. On the other hand, because thisclass of algorithm establishes a probabilistic model, it is often possibleto directly incorporate error handling and prior assumptions about thepopulation to increase accuracy. For a more detailed discussion oflikelihood methods, see Jones and Wang [5].
COLONY [8] and COLONY 2.0 [9] are likelihood methods which construct half-sibling families.COLONY reconstructs full sibling families with high accuracy, but allows forpolygamy in only one sex. COLONY 2.0 performs half sibship reconstructionwhen both sexes are polygamous. Both of these programs use a likelihoodfunction and simulated annealing to find an optimal sibling structure for apopulation. However, results by Sheikh et al.[7], as well as our own results, show that COLONY and COLONY 2.0become prohibitively slow for even medium-sized populations. Additionally,as demonstrated in Almudevar and Anderson [10], COLONY 2.0 often splits true sibgroups into smaller groups,leading to an incomplete reconstruction.
Combinatorial optimization
Combinatorial optimization solutions seek to provide a sibship partitioningwhich minimizes or maximizes some objective function, such as number offamilies, matings, or parents. As with likelihood methods, finding globaloptima for large populations can be computationally demanding. However, manyoptimization techniques are easily parallelizable.
KINALYZER [11] seeks a minimum set cover, by using an integer programming (IP)formulation where each set is subject to the restrictions of Mendeliancompatibility for full siblings. KINALYZER yields decent results [12]; however, like the COLONY programs, does not scale well withpopulation size. The minimum set cover objective used by KINALYZER isNP-hard [12]. Recent work has included half-sibling IP strategies that aresimilar to the full sibling strategies in KINALYZER, but they areunsuccessful for large populations [7]. The most viable of these is the half-sibling minimum set cover(HS-MSC) IP. Both COLONY and the HS-MSC cannot estimate half-sibling groupsfor large populations due to slow runtimes. Additionally, there is noevidence that minimizing the number of sibgroups is the right thing to do inall instances [10].
Fast heuristics
By making simplifying observations, heuristics can produce reasonablyaccurate results thousands of times faster than pure likelihood orcombinatorial methods. Brown and Berger-Wolf propose a clustering algorithmwhich joins two individuals based on the number of genetically compatiblethird parties [13]. The assumption is, if two individuals form a large number ofcompatible full sibling triplets, then they are likely to be full siblings,alongside the recognition that any incompatible proposed family includes anincompatible triplet, which Brown and Berger-Wolf also prove. For apopulation of n individuals with m loci, this algorithmhas an O(n^{3}m) runtime and gives accurate results for modest numbers of allelesand loci.
Another heuristic, employed in PRT 2 [10], enumerates a list of maximal sibgroups: sibgroups for which noadditional population may be added. PRT makes the assumption that it isunlikely to find unrelated individuals in a large sibgroup of this form. Aset cover of the maximal sibgroups is then selected using a likelihoodfunction. Although the authors claim that PRT supports half-siblings,half-sibling groups are never directly computed. Instead, full siblinggroups are presented with a list of which pairs of groups can form validhalf-siblings. This is problematic in instances where both sexes are highlypolygamous because there will be many pairs of full sibling families thatare also half-sibling compatible, and PRT does not indicate which of theseare true half-sibling groups nor divide the half-sibling groups into thematernal and paternal groups. In fact, determining valid half-siblingfamilies is NP-complete, as we show below, though for small probleminstances or special cases, this may not be a major concern.
Notation
Information about individuals’ genotypes are collected and expressedthrough the measurement of microsatellites, sequences of repeating DNAbase pairs, such as ATATATAT, on a chromosome. The number of repeats gives aninteger value denoting the allele for an individual. Microsatellitesare collected from both chromosomes, though it is impossible to distinguish thetwo chromosomes with inexpensive technology. Each measurement site is called amicrosatellite locus. In practice, scientists identify and reportalleles at multiple loci in a population, typically to a maximum of one locus oneach autosomal chromosome, to avoid linkage effects and recombination.
SibJoin requires that each individual be diploid, meaning that population memberspossess two of each type of chromosome. Exactly one chromosome is inherited fromeach of the individual’s parents; therefore, each locus will have amaternal and paternal allele. Let m be the number of measured loci fora population. Each locus will have a variable number of alleles, k,which we represent as A_{ l } = {a_{0},a_{1},…,a_{k−1}}.
When the inherited maternal and paternal alleles are combined, they give anindividual’s genotype, which is unordered: (a_{ i },a_{ j }) is equivalent to (a_{ j },a_{ i }). Unfortunately, it is not always possible to reconstruct anindividual’s alleles for a given locus. Allelic dropout is a termthat refers to a common error in genotyping where information about a locuscannot be confidently determined and is omitted. We express sites with allelicdropouts as (∗,∗). When the same allele is inherited from bothparents, the genotype is homozygous; when they differ, it isheterozygous.
The half-sibling problem is, given a population of n offspring, toreconstruct a maternal and paternal partitioning $\mathcal{\mathcal{M}}$ and $\mathcal{P}$ of the population that are consistent with Mendelian laws. foreach maternal half-sibgroup $M\in \mathcal{\mathcal{M}}$ and for each paternal half-sibgroup $P\in \mathcal{P}$, there must be a genotype for that sibgroup such that theindividuals in F := M∩P must bevalid offspring of those two genotypes. Further, the genotype chosen for a groupM or P must be the same in all families that derive fromthat common parent. To avoid triviality, we typically seek to optimize someobjective function when choosing the two partitions, as otherwise, one couldsimply assign each individual to a unique pair of parents. Our heuristic SibJoinrelies on measurements of similarity between individuals. We denote thesimilarity between individuals x and y as s_{ x y } and the similarity between clusters C_{ i } and C_{ j } as s i m(C_{ i },C_{ j }).
Mendelian compatibility
Berger-Wolf et al.[14] give two Mendelian properties of diploid full siblings. Refer totheir article for the concrete mathematical expression; here, we give a shortexegesis. In any full sibgroup, at all alleles, at most four alleles appearsince there are two parents each with at most two alleles. Berger-Wolf etal. refer to this rule as the 4-allele property. The 2-allele propertyenforces the rule that for each full sibling group, there is a partitioning ofthe alleles at each locus into a maternal and paternal group, such that eachindividual obtains exactly one allele from the maternal set and one from thepaternal set, at each locus. Sheikh et al.[7] extend these rules to half-siblings. The half-sibshipproperty states that for each locus in a half-sibling family, thereexists two alleles {a_{ i },a_{ j }}, which are the alleles of the shared parent, such that each individualpossesses at least one copy of either a_{ i } or a_{ j } at each respective locus; this describes the requirement that thehalf-sibling group inherits from the common parent.
Forced allele incompatibilities
When populations are completely monogamous, determining whether each family in apopulation structure has valid parent genotypes is trivial, as decisions made inreconstructing the parents of one sibgroup are independent of all other families.However, when reconstructing half-sibling populations, determining whether there isa set of parents and matings that can explain a collection of identified sibgroupsunder Mendelian inheritance assumptions is much more difficult. For any individual,choosing its father’s genotype uniquely determines which allele must have beeninherited from, and hence present in, the mother, unless the offspring genotypeexactly matches the paternal one. Thus, the decision affects the maternal genotype.In polygamous populations, this influence on the choice of maternal genotype bypaternal genotype also indirectly affects the choice of genotype for each otherfather that the common mother has mated with. For example, if both maternal allelesat a locus are fixed by one sibgroup’s reconstruction, then any other allelesfound in offspring from a different mating of the same mother must have beeninherited from the father.
For highly polygamous populations with many indirect inheritance relationships ofthis sort, it can be difficult to determine whether a proposed population structureis valid. We show that deciding whether there is a valid parental genotype for eachhalf-sibling partition in a candidate half-sibling population reconstruction isNP-Complete. Thus, we cannot expect polynomial time algorithms toproduce valid parental genotypes, even when they exist. We instead propose aninteger program, which identifies the minimum number of individuals that must beremoved from a candidate population reconstruction so that the resulting populationis valid under Mendelian inheritance, and give experimental results of using it.
Complexity of the valid half-sibling partitioning decision problem
Given maternal and paternal half-sibling partitionings, with each individualbelonging to exactly one maternal and one paternal partition, is it possible toassign genotypes to the parents of each half-sibling family in a way thatrespects the property that every individual must inherit one of exactly twoalleles from each parent at each locus? We will call this problem VALIDHALF-SIBLING PARTITIONING.
Theorem 1. VALID HALF-SIBLING PARTITIONING is NP -complete.
Proof. We first show that VALID HALF-SIBLINGPARTITIONING ∈ NP. Given an instance of theproblem and an assignment of genotypes to the parents of each half-siblingfamily, we can verify in polynomial time whether or not the population structureis valid. The algorithm is straight-forward: for each heterozygotic child, checkthat the allele inherited from the mother is not the same as the alleleinherited from the father. When the parent and child genotypes differ, we saythat the child is forced to inherit a specific allele, e.g. child(a,b) is forced to inherit allele a from mother(a,c). If the parent does not force an allele because theparent genotype is identical to the child’s genotype, then that childcannot cause an incompatibility: the inherited allele from the identical parentis whichever allele was not inherited from the child’s other parent.
Next, we give a polynomial-time reduction from the NP-complete MONOTONEONE-IN-THREE SAT problem to VALID HALF-SIBLING PARTITIONING. The MONOTONEONE-IN-THREE SAT problem is: given a set of boolean clauses, each containingthree non-negated literals, determine whether a configuration of literals existssuch that exactly one literal in each clause is set true. This problem is alsocalled EXACT-COVER-BY-3-SETS (X3C), which was used in the first proof of theNP-hardness of parsimony phylogeny [15]. The reduction requires three gadgets that translate literals andclauses in a MONOTONE ONE-IN-THREE-SAT instance into alleles and families in aVALID HALF-SIBLING PARTITIONING instance, respectively.
The first gadget translates picking a literal in a clause to picking a parent fora family. The second gadget defines paternal families that introduce additionalnecessary offspring. From the MONOTONE ONE-IN-THREE SAT perspective, the thirdgadget enforces the rule that if a literal is chosen to be set true in oneclause, it must be chosen to be true in all of the clauses it belongs to.
- 1.
The selection gadget creates a maternal family with three possible mothers and six offspring for each clause in the SAT instance by mapping literals in a clause to alleles in a family. For the clause (x _{ i }∨x _{ j }∨x _{ k }), the corresponding y _{ i }, y _{ j }, and y _{ k } will be the alleles present in the created family. Six children are created by making two copies of each pairwise grouping of the y alleles: for this clause, we would create the maternal groups in Table 1.
Selection gadget for clause ( x _{ i } ∨ x _{ j } ∨ x _{ k } )
Genotype of | ||
---|---|---|
Literals | Family | possible shared parent |
x _{ i } | (y_{ j },y_{ k })_{0} | (y_{ j },y_{ k }) |
(y_{ j },y_{ k })_{1} | ||
x _{ j } | (y_{ i },y_{ k })_{0} | (y_{ i },y_{ k }) |
(y_{ i },y_{ k })_{1} | ||
x _{ k } | (y_{ i },y_{ j })_{0} | (y_{ i },y_{ j }) |
(y_{ i },y_{ j })_{1} |
- 2.
The mapping gadget creates two paternal families for each potential mother, for a total of six paternal families per maternal selection gadget family. The gadget highlights which literal is set to true in the clause corresponding to the selection gadget.
The 6m mapping families introduce new alleles s_{0}…s_{m−1}, one for each clause, and another distinct allelez common to all of the mapping families. The s_{ i } alleles are used in the third gadget to enforce that, once a literal isset to true in one clause, it is true in every clause.
Paternal families for clause ( x _{ i } ∨ x _{ j } ∨ x _{ k } )
y_{ i }fam. | y_{ i }fam. | y_{ j }fam. | y_{ j }fam. | y_{ k }fam. | y_{ k }fam. |
---|---|---|---|---|---|
(y_{ i },y_{ j })_{0} | (y_{ i },y_{ k })_{0} | (y_{ i },y_{ j })_{1} | (y_{ j },y_{ k })_{0} | (y_{ i },y_{ k })_{1} | (y_{ j },y_{ k })_{1} |
(s_{ p },y_{ i })_{0} | (s_{ p },y_{ i })_{1} | (s_{ p },y_{ j })_{0} | (s_{ p },y_{ j })_{1} | (s_{ p },y_{ k })_{0} | (s_{ p },y_{ k })_{1} |
(s_{ p },z)_{0} | (s_{ p },z)_{1} | (s_{ p },z)_{2} | (s_{ p },z)_{3} | (s_{ p },z)_{4} | (s_{ p },z)_{5} |
(y_{ j },z)_{0} | (y_{ k },z)_{0} | (y_{ i },z)_{0} | (y_{ k },z)_{1} | (y_{ i },z)_{1} | (y_{ j },z)_{1} |
- 3.
Lastly, we construct a gadget that maintains consistency in each use of the variables, from the SAT instance perspective. That is, it forces each true literal to be true and each false literal to be false in every clause in which that literal appears. The enforcement gadget constructs a constraining maternal family for each pair of clauses in which a common literal occurs. For instance, if literal x _{ i } appears in clauses c _{ p } and c _{ q }, then a c _{ p }/c _{ q } enforcment family will be constructed so that either y _{ i } is forced to be maternal or s _{ p } and s _{ q } are forced to be maternal. The gadget makes use of the (s _{ p },y _{ i })_{∗} and (s _{ q },y _{ i })_{∗} individuals created by the mapping gadget. The families created for a literal x _{ i } that appears in clauses c _{ p }, c _{ q }, and c _{ r } are found in Table 3.
Enforcement gadget for variable x _{ i } , appearing in c _{ p } , c _{ q } and c _{ r }
c_{ p }/c_{ q } | c_{ p }/c_{ r } | c_{ q }/c_{ r } |
---|---|---|
(s_{ p },y_{ i })_{0} | (s_{ p },y_{ i })_{0} | (s_{ q },y_{ i })_{0} |
(s_{ p },y_{ i })_{1} | (s_{ p },y_{ i })_{1} | (s_{ q },y_{ i })_{1} |
(s_{ q },y_{ i })_{0} | (s_{ r },y_{ i })_{0} | (s_{ r },y_{ i })_{0} |
(s_{ q },y_{ i })_{1} | (s_{ r },y_{ i })_{1} | (s_{ r },y_{ i })_{1} |
(s_{ p },z)_{0} | (s_{ p },z)_{1} | (s_{ q },z)_{1} |
(s_{ q },z)_{0} | (s_{ r },z)_{0} | (s_{ r },z)_{1} |
Each enforcement gadget family for y_{ i } has two copies of (s_{ p },y_{ i }) which are the two children from the mapping gadgets containing(y_{ i },y_{ j }) and (y_{ i },y_{ k }). If s_{ p } is forced, then s_{ q } must also be forced to avoid an incompatibility. As a result, y_{ i } is forced in both paternal mapping gadget families, indicating thatx_{ i } is kept true in both cases. As stated in the mapping gadget, this forceswhich maternal parent must be chosen in the selection gadget. However, ify_{ i } is forced in the enforcement gadget, then s_{ p } and s_{ q } are forced to be true in the paternal mapping gadget families, whichexcludes the parents corresponding to x_{ i } in the SAT instance from being chosen in the respective selection gadgetfamilies.
Finally, for all individuals with only a single assigned parent, which is truefor all individuals with the allele z, we assign them to asingle-element family corresponding to their unassigned parent, thus enforcingno further restrictions on parental genotypes.
In the MONOTONE ONE-IN-THREE SAT problem, a literal from each clause must be setto true. The selection gadget translates the task of choosing an allele topicking the parents of maternal families. Each selection gadget family containsthree distinct alleles {y_{ i },y_{ j },y_{ k }}. Choosing maternal parent (y_{ i },y_{ j }) is equivalent to setting x_{ k } true and the x_{ i } and x_{ j } literals to false.
Finally, let n be the number of literals and m be the number ofclauses. Constructing the VALID HALF-SIBLING PARTITIONING instance requiresO(m) children for the first gadget, O(m^{2}·n) additional children for the second gadget, andO(m^{2}) additional children for the third gadget. Sincem ≤ n, the resulting transformation ispolynomial in the number of literals. □
Maternal selection and enforcement gadgets for example SATinstance
M _{0} | M _{1} | M _{2} |
---|---|---|
(1,2)_{0} | (1,4)_{0} | (s_{0},1)_{0} |
(1,2)_{1} | (1,4)_{1} | (s_{0},1)_{1} |
(1,3)_{0} | (1,5)_{0} | (s_{0},1)_{1} |
(1,3)_{1} | (1,5)_{1} | (s_{1},1)_{1} |
(2,3)_{0} | (4,5)_{0} | (s_{0},z)_{6} |
(2,3)_{1} | (4,5)_{1} | (s_{1},z)_{6} |
Paternal mapping gadget families for example SAT instance
P _{0} | P _{1} | P _{2} | P _{3} | P _{4} | P _{5} |
---|---|---|---|---|---|
(1,2)_{0} | (1,3)_{0} | (2,3)_{0} | (1,2)_{1} | (1,3)_{1} | (2,3)_{1} |
(s_{0},1)_{0} | (s_{0},1)_{1} | (s_{0},2)_{0} | (s_{0},2)_{1} | (s_{0},3)_{0} | (s_{0},3)_{1} |
(s_{0},z)_{0} | (s_{0},z)_{1} | (s_{0},z)_{2} | (s_{0},z)_{3} | (s_{0},z)_{4} | (s_{0},z)_{5} |
(2,z)_{0} | (3,z)_{0} | (3,z)_{1} | (1,z)_{0} | (1,z)_{1} | (2,z)_{1} |
P _{ 6 } | P _{ 7 } | P _{ 8 } | P _{ 9 } | P _{ 10 } | P _{ 11 } |
(1,4)_{0} | (1,5)_{0} | (4,5)_{0} | (1,4)_{1} | (1,5)_{1} | (4,5)_{1} |
(s_{1},1)_{0} | (s_{1},1)_{1} | (s_{1},4)_{0} | (s_{1},4)_{1} | (s_{1},5)_{1} | (s_{1},5)_{1} |
(s_{1},z)_{0} | (s_{1},z)_{1} | (s_{1},z)_{2} | (s_{1},z)_{3} | (s_{1},z)_{4} | (s_{1},z)_{5} |
(4,z)_{0} | (5,z)_{0} | (5,z)_{1} | (1,z)_{2} | (1,z)_{3} | (4,z)_{1} |
There are several feasible solutions to the M-1-3-SAT instance, but the exampleillustrates the case where literals x_{2} and x_{4} are set true in the M-1-3-SAT instance. The inherited allele foreach individual in each family is bolded to represent the corresponding VHSPsolution where mothers (1,3) and (1,5) are chosen in the selection gadget.
Identifying allele incompatibilities
Unfortunately, the NP-completeness of the VHSP problem makes it veryunlikely that a polynomial time algorithm exists for verifying that givenmaternal and paternal partitions have valid parental genotype assignments.Therefore, we present a 0/1 integer program that identifies individuals toremove to obey Mendelian inheritance. We use parsimony and select the fewestindividuals to remove.
For a population with n individuals, genotyped at m loci, letx_{ i } = 1 denote the decision to remove individual i fromthe population. The variable ${y}_{j,k}^{l}$ represents the binary choice of having the k^{ t h } allele in the common parent of the family j at locus l.Denote the multi-set which contains the maternal and paternal families as $\mathcal{C}$, and let π_{0} and π_{1} be functions that map an individual to its maternal and paternalindex in $\mathcal{C}$ respectively. Finally, let λ_{0} and λ_{1} map the first and second allele of an individual to an index inK, the set of all alleles.
The first constraint enforces that no parent can have more than two alleles. Thesecond and third constraints enforce Mendelian mating requirements onindividuals: an individual is invalid, and hence must be removed if it does notreceive one allele from its mother and its other allele from its father. Thereare two possible ways to satisfy this constraint. Either the child received itsfirst listed allele from its mother and the second allele from its father orvice versa. The two constraints are the logical or of these two possibilities,and the fourth constraint identifies incompatible loci as those that areunsatisfied by either of these possibilities. Finally, the last constraintenforces the requirement that x_{ i } must be 1, corresponding to individual i being selected forremoval, if the individual has any incompatible loci.
We note that the minimization objective forces x_{ i } and ${x}_{i}^{l}$ to be 0 as often as possible. Since there can never be morefamilies than individuals, the integer program has a total ofO(m·n) constraints.
Clustering half-sibling families
Sibship reconstruction finds a population clustering which obeys Mendelian genetics.In the full sibling clustering $\mathcal{F}$, each individual appears only once. For half-siblings, analgorithm must construct $\mathcal{\mathcal{M}}$ and $\mathcal{P}$ when both sexes are polygamous or only one of the two partitionswhen one sex is monogamous. Here, we describe the approach of SibJoin, our programfor identifying half-sibling populations.
Measuring similarity
For a given half-sibling input, SibJoin relies on an n×nsimilarity matrix which expresses the similarity between each pair ofindividuals in the offspring population set. Brown and Berger-Wolf [13] used a similarity measure which, for each pair of individuals, is thecount of third individuals in the population that form a compatible full siblingtriplet with the pair. They proved that any incompatible candidate full siblinggroup must contain an incompatible triplet and give a probabilistic argumentthat pairs of individuals with large numbers of compatible triplets are likelysiblings. Unfortunately, the half-sibling property is much weaker because itonly operates on one parent. Ruling out a potential half-sibling group can takeas many as six individuals, compared to the three that is required for fullsiblings.
Theorem 2. There exist incompatible half-sibling groups for which their smallestincompatible subgroup has six members
Proof. Consider the sextet of individuals with one locus and fouralleles {[ (1,2)],[ (1,3)],[ (1,4)],[ (2,3)], [ (2,4)],[ (3,4)]}. Any five ofthe individuals form a valid half-sibship under the half-sibling property, butthe incompatibility appears when all six individuals are examined together: thecommon half-sibling parent would need three alleles at the locus. □
The lower bound suggests that examining triplets for half-siblings could yield afalsely high count when individuals are not actually related. Additionally, theprobability that three random individuals form a valid half-sibship is muchhigher than that of three individuals forming a valid full sibship. Byenumerating all possible triplets with a pool of five alleles, we see that96.62% of all triplets are compatible under the half-siblingproperty, but only 56.61% of all triplets are compatible under thefull sibling properties: that is, incompatibilities are not nearly so much of awarning of unrelatedness for half-sibling reconstruction as for full-siblingreconstruction. If the number of alleles is set to ten, then 75.46%of half-sibling triplets are compatible (most often, these result when any twoindividuals in the triple share an allele), while the number of full siblingcompatible triplets is just 20.94%.
In the place of a triplet-based similarity function, SibJoin uses a pairwisemeasure. Given two individuals, each with m loci, we count sharedalleles at each locus independently, between the two individuals. For example,the pair of individuals x = [ (1,2),(2,2),(1,3)] andy =[ (1,1),(2,2),(2,3)] has similarity s_{ x y } = 4. The pairwise similarity matrix for this simple measuremay be computed in O(n^{2}m) time, as opposed to the O(n^{3}m) time that is required by the brute-force triplet method.
Assuming the parents of two full siblings are each heterozygotic, two siblingshave a 50% chance of inheriting the same allele from each parent: if eitherparent is homozygotic, then the offspring are guaranteed to inherit the sameallele from that parent. Similarly, half-siblings have a 50% chance ofinheriting the same allele from their heterozygotic common parent. For unrelatedindividuals, the expected number of shared alleles approaches 0 as the number ofdistinct alleles at a locus grows. When there are many possible alleles, it isunlikely that two unrelated individuals will inherit the same alleles. So, ask grows without bound E[ X|full siblings]→1, $\mathbf{\text{E}}\left[\phantom{\rule{0.3em}{0ex}}X\right|\text{half siblings}]\to \frac{1}{2}$, and E[ X unrelated]→0.
Therefore, the pairwise similarity measure will perform well when either thenumber of alleles or loci is sufficiently large: it easily separates unrelatedindividuals from half siblings and full siblings. Our results also indicate thatpairwise similarity method performs well, even with modest numbers of allelesand loci.
Joining individuals
Joining must only occur if two clusters form a valid half-sibship. At theinitialization of the algorithm, each individual is assigned a feasible parentset with size at most O(k) per locus. Each join results in aparent set which is the intersection of the parent sets from the two clusters.If the intersection produces the null set, then there is no parent which canexplain the new cluster and the join is rejected. Therefore, testing whether ornot a join is valid takes O(k m) time. When a site experiences allelic dropout, SibJoin makes noassumptions about its parental restrictions; however, sites with(∗,∗) are never counted toward allele similarity betweenindividuals.
Unlike crisp clustering methods which mandate that each individual appear inexactly one cluster, the half-sibling problem contains both a maternal andpaternal group for each individual. We enforce the restriction that any set ofindividuals sharing both a maternal and paternal cluster must be compatible fullsiblings under the 4-allele and 2-allele properties by maintaining a clusteringof full siblings. Because incompatible full sibling groups are less likely thanincompatible half-sibling groups of the same size, at each similarity stepSibJoin joins clusters which form valid full sibships first.
Microsatellites give no information about which alleles are maternal and whichare paternal. Since SibJoin constructs families in an iterative manner, part ofa maternal family could be reconstructed in the maternal partitioning, while theother part of the family is constructed in the paternal partitioning. If we arestrict about which sets we call maternal and paternal, then the two halves willnever be joined and the half on the paternal side will likely force incorrectfuture joins. Our solution is to implement a bipartite graphG = (V,E) where each cluster is avertex and edges exist between clusters which share an individual. Let a joinbetween clusters C_{ i } and C_{ j } be an event which combines C_{ j } into C_{ i } and let E(v) be the set of edges that touch v.In our graph, j o i n(C_{ i },C_{ j }) results in $E\left({v}_{i}\right):=E\left({v}_{i}\right)\bigcup E\left({v}_{j}\right)$ followed by the removal of v_{ j } and all edges in E(v_{ j }). Enforcing bipartiteness as a post-condition of the join operationallows flexibility while ensuring that the solution results in each individualhaving one parent of each sex.
Results and discussion
We evaluated SibJoin’s performance with simulated and real population data. Theexperimental results from real populations are contrasted with the the HS-MSC andCOLONY half-sibling approaches.
Accuracy measure
Partition distance is a metric which measures the distance between two partitionsas the minimum number of individuals that must be removed from a populationuntil the two clusterings are identical. This metric is widely used in sibshipreconstruction literature and in bioinformatics in general [16, 17]. When true partitionings are known, partition distance may be used tocompute the true accuracy of an algorithm; however, it may also be used toassess changes between candidate sibships for which ground truth is not known [18].
Despite its prominence in the kinship analysis literature, partition distanceoffers only a coarse estimate of correctness because it disregards how theexcluded individuals are constructed within the partitioning. For example,failing to join two related partial families is less destructive thanincorrectly joining one of the partial families to an unrelated family. However,in both instances, the partition distance is identical. A concrete example isgiven in Meilă [19].
Instead, we use an information-theoretic metric called variation of information(VI) [19]. The VI measures how much the information given by two clustersdiffer and is preferable because it quantifies the amount of uncertaintyintroduced by misplaced individuals.
The VI between two partitionings is 0 if and only if the two partitionings areidentical. Smaller VI corresponds to more similar partitionings. Like entropy,the VI is always non-negative. It also has a tight upper bound of logn[19]; therefore, we normalize VI to a value in [0,1] before reporting thescore for each of our trials.
Simulated data set results
Simulated test results for SibJoin and COLONY 2.0 averaged over 10trials
Fixed | Parameter | SibJoin | COLONY 2.0 | ||
---|---|---|---|---|---|
parameter | settings | Runtime | VI (normalized) | Runtime | VI (normalized) |
2 | 2.8 ms | 0.396 | 48.9 min | 0.553 | |
5 | 13.2 ms | 0.222 | 19.7 min | 0.110 | |
k: number of alleles | 10 | 6.7 ms | 0.014 | 12.8 min | 0.004 |
15 | 5.1 ms | 0.014 | 10.2 min | 0.006 | |
20 | 5.7 ms | 0.003 | 10.0 min | 0.000 | |
2 | 8.7 ms | 0.469 | 10.7 min | 0.524 | |
5 | 10.1 ms | 0.156 | 17.2 min | 0.130 | |
m: number of loci | 10 | 11.1 ms | 0.035 | 14.2 min | 0.001 |
15 | 12.7 ms | 0.002 | 20.4 min | 0.000 | |
20 | 12.1 ms | 0.000 | 21.3 min | 0.000 | |
10 | 0.4 ms | 0.042 | 2.29 min | 0.343 | |
n: population size | 50 | 16.8 ms | 0.104 | 17.1 min | 0.078 |
100 | 82.5 ms | 0.201 | 73.5 min | 0.086 | |
200 | 3.31 sec | 0.230 | - | - | |
500 | 34.68 sec | 0.013 | - | - | |
1000 | 2.84 min | 0.015 | - | - | |
2000 | 12.43 min | 0.018 | - | - | |
1 | 51.9 ms | 0.546 | - | - | |
f: family size | 5 | 51.1 ms | 0.183 | 29.6 min | 0.051 |
10 | 46.2 ms | 0.040 | 19.6 min | 0.017 | |
20 | 58.4 ms | 0.009 | 21.7 min | 0.042 |
In most cases, the reported VI score approximates the ratio of partition distanceto population size. Overall, COLONY 2.0 was more accurate, but took thousands oftimes longer, often with only small gains in accuracy. SibJoin does much worsethan COLONY 2.0 on the 10 allele test set, but the discrepancy is due to asingle trial for which SibJoin receives a VI of 0.084 while COLONY 2.0 producesa perfect reconstruction. For the 10 loci test set, SibJoin’s VI is againhigher, but in practice the false positive difference between it and COLONY 2.0is about one individual per trial.
SibJoin does worst when the population size is large and the family size issmall. For instance, when tested with a 100 individual population and familiesof 5 individuals, SibJoin rendered a VI of 0.201 compared to COLONY 2.0’sVI of 0.086. When family sizes are small and population sizes are large, it ismuch more likely for two unrelated individuals to be mistakenly labeled ashalf-siblings due to the explanations given in section 4.3. However,SibJoin’s accuracy rapidly improves with modest increases in family size.In fact, SibJoin is more accurate than COLONY 2.0 in trials with familiescontaining 20 individuals. Unsurprisingly, both methods poorly reconstructpopulations where only two alleles are present. With only two alleles, allindividuals can be full or half-siblings.
We may also use SibJoin to explore populations with extreme numbers ofindividuals. SibJoin was able to reconstruct populations of 500, 1000, and 2000individuals in under 10 minutes, yet problems of this magnitude are intractablefor the HS-MSC and both of the COLONY programs.
Biological data set results
Tests for biological data
Data set | Algorithm | Runtime | VI (normalized) | False positives |
---|---|---|---|---|
112 crickets | COLONY 2.0 | 35.7 min | 0.000 | 0 |
HS-MSC | - | n/a (see caption) | 2 | |
SibJoin | 19.3 ms | 0.014 | 1 | |
288 kelp rockfish | COLONY 2.0 | 624.5 min | 0.000 | 0 |
HS-MSC | - | n/a (see caption) | 0 | |
SibJoin | 87.5 ms | 0.000 | 0 | |
672 kelp rockfish | COLONY 2.0 | - | - | - |
HS-MSC | - | - | - | |
SibJoin | 5.02 sec | 0.108 | 78 |
Our results are compared to the HS-MSC results in [7] and to our own benchmarks on COLONY 2.0. Because the HS-MSC is notpublicly available, we could not assess runtime information for the program.However, the authors do note that the HS-MSC IP finished in under one day. Thedifference between the two runtimes is not explained merely by CPU speedincreases across a small number of years. Additionally, neither COLONY 2.0 northe HS-MSC’s half-sibling minimum set cover approach constructed afeasible answer for the 672 rockfish data set: COLONY 2.0 was stopped afterrunning for three days. SibJoin constructs an accurate solution in under 10seconds.
The HS-MSC ILP does not enforce that individuals must have one parent of each sexand both partition distance and variation of information are ill-defined whenthe result is not a true partitioning. In the population of 112 crickets, theHS-MSC had two false positives and was otherwise correct. In the test setcontaining 288 rockfish, HS-MSC had 4 false positives and was otherwise correct.COLONY 2.0 was correct in all instances. SibJoin correctly reconstructed thehalf-sibship for the 288 rockfish and only misplaced one individual in thecricket test. SibJoin was the only algorithm to complete for the population of672 rockfish. Overall, SibJoin is as accurate as the HS-MSC and nearly asaccurate as COLONY 2.0, but is much faster than either: SibJoin solves the smallrockfish instance over 42,000 times faster than COLONY 2.0.
Minimum population removal IP results
Using the integer program outlined previously, we can identify the minimum-sizeset of individuals which must be removed in order to make a SibJoin solutionfeasible. We assume that this set is small, so finding the minimum individualsto remove should capture many incorrect individuals.
SibJoin trials with forbidden allele detection
Fixed parameter | Parameter | Norm. VI | FP | Recall | Precision | Timeout rate |
---|---|---|---|---|---|---|
k: number of alleles | 2 | 0.396 | 25.9 | 0.000 | 0.000 | 0.0 |
5 | 0.225 | 12.3 | 0.300 | 0.694 | 0.0 | |
10 | 0.013 | 0.2 | 0.000 | 0.000 | 0.0 | |
15 | 0.014 | 0.0 | - | - | 0.0 | |
20 | 0.003 | 0.0 | - | - | 0.0 | |
m: number of loci | 2 | 0.491 | 23.7 | 0.109 | 0.563 | 0.0 |
5 | 0.150 | 6.6 | 0.355 | 0.537 | 0.0 | |
10 | 0.032 | 1.2 | 0.62 | 0.650 | 0.0 | |
15 | 0.002 | 0.0 | - | - | 0.0 | |
20 | 0.000 | 0.0 | - | - | 0.0 | |
n: population size | 10 | 0.042 | 0.5 | 0.2 | 1 | 0.0 |
50 | 0.098 | 10.2 | 0.340 | 0.679 | 0.0 | |
100 | 0.201 | 41.0 | 0.400 | 0.765 | 0.1 | |
200 | 0.220 | 88.9 | 0.408 | 0.780 | 1.0 | |
f: family size | 1 | 0.527 | 58.5 | 0.317 | 0.778 | 0.7 |
5 | 0.181 | 22.8 | 0.439 | 0.756 | 0.0 | |
10 | 0.038 | 3.6 | 0.313 | 0.477 | 0.0 | |
20 | 0.009 | 1.4 | 0.000 | 0.000 | 0.0 |
Table 8 reports the recall and precision of the IP: thepercentage of all incorrect individuals that are identified by the IP and thepercentage of individuals that are actually incorrect among the individualsidentified by the IP. We find that the integer program can have a poor recall,finding only 30% of the false positives in some situations; however, theprecision is relatively high. For individuals in the minimum removal set, thenumber of incorrectly placed individuals is consistently above 50%. Theprecision is significant since SibJoin’s total error rate is often farbelow 50%. If we found a way to correctly reintroduce the set of individualsidentified by the IP, then the overall SibJoin error rate would decreasesignificantly.
The IP does worst when there are only two alleles or two loci. This isunsurprising since there will be no incompatibilities when each locus containsless than three alleles and data with few loci have smaller risk of forbiddenallele structures with bad joins. However, both recall and precision tend toincrease with population size as demonstrated by the 100 and 200 population sizetest cases. For populations with 200 individuals, the IP did not reachintegrality within 5 minutes, but still produced solutions with high recall andprecision relative to the other tests, indicating that the IP is still useful atidentifying mis-placed individuals in large populations.
Conclusions
It is important to be able to determine whether or not a proposed populationstructure is valid under Mendelian inheritance assumptions. For half-siblings, wehave proved that even determining if such a structure obeys Mendelian laws isNP-complete, which is surprising since the same determination inmonogomous populations is trivial. This result has important implications forhalf-sibling algorithms in general since most existing algorithms do notspecifically enforce which allele is inherited from which parent and those that dovery likely have running times which are exponential in the size of the population.We have also provided an integer program that solves an optimization variant of theproblem: what is the minimum number of individuals that must be removed from apopulation in order for the population structure to be valid. The IP was run againstSibJoin’s population reconstructions. Although the IP only had a recall of 30to 40 percent when run against SibJoin’s population reconstructions, theprecision was high: 55 to 78 percent of the individuals identified for removal wereactually incorrect.
We have also demonstrated an application of allele similarity with our fast SibJoinheuristic. SibJoin is a bottom-up algorithm based on single linkage clustering. Ourexperiments show that despite being a heuristic, the algorithm competes in accuracywith existing likelihood-based algorithms, but is thousands of times faster inpractice. The speed of our algorithm is important since existing algorithms fail toreconstruct half-sibling families when the population size is above a few hundredindividuals. SibJoin can reconstruct these populations in seconds. SibJoin is ableto reconstruct real biological populations that existing algorithms fail toreconstruct, and it does so with high accuracy.
Declarations
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Councilof Canada and the Government of Ontario. We would like to thank TanyaBerger-Wolf and Saad Sheikh for providing real population data, and also thankTanya Berger-Wolf, Marina Meilă, and Rita Ackerman for helpfuldiscussions.
Authors’ Affiliations
References
- Isagi Y, Kanazashi T, Suzuki W, Tanaka H, Abe T: Highly variable pollination patterns in Magnolia obovata revealed bymicrosatellite paternity analysis. Int J Plant Sci. 2004, 165 (6): 1047-1053. 10.1086/423883.View ArticleGoogle Scholar
- Sezen U, Chazdon R, Holsinger K: Genetic consequences of tropical second-growth forest regeneration. Science. 2005, 307 (5711): 891.View ArticlePubMedGoogle Scholar
- Caughley G: Directions in conservation biology. J Anim Ecol. 1994, 63 (2): 215-244. 10.2307/5542.View ArticleGoogle Scholar
- Hedrick P, Lacy R, Allendorf F, Soule M: Directions in conservation Biology: Comments on Caughley. Conserv Biol. 1996, 10 (5): 1312-1320. 10.1046/j.1523-1739.1996.10051312.View ArticleGoogle Scholar
- Jones OR, Wang J: Molecular marker-based pedigrees for animal conservation biologists. Anim Cons. 2010, 13: 26-34. 10.1111/j.1469-1795.2009.00324.View ArticleGoogle Scholar
- Painter I: Sibship reconstruction without parental information. J Agric Biol Envir S. 1997, 2 (2): 212-229. 10.2307/1400404.View ArticleGoogle Scholar
- Sheikh SI, Berger-Wolf TY, Khokhar AA, Caballero IC, Ashley MV, Chaovalitwongse W, Chou C, Dasgupta B: Combinatorial reconstruction of half-sibling groups from microsatellitedata. J Bioinf Comput Biol. 2010, 8 (2): 337-356. 10.1142/S0219720010004793.View ArticleGoogle Scholar
- Wang J: Sibship reconstruction from genetic data with typing errors. Genet. 2004, 166 (4): 1963-1979. 10.1534/genetics.166.4.1963.View ArticleGoogle Scholar
- Wang J, Santure AW: Parentage and sibship inference from multilocus genotype data underpolygamy. Genet. 2009, 181 (4): 1579-1594. 10.1534/genetics.108.100214.View ArticleGoogle Scholar
- Almudevar A, Anderson E: A new version of PRT software for sibling groups reconstruction with commentsregarding several issues in the sibling reconstruction problem. Mol Ecol Resour. 2011, 12: 164-178.View ArticlePubMedGoogle Scholar
- Ashley M, Caballero I, Chaovalitwongse W, Dasgupta B, Govindan P, Sheikh SI, Berger-Wolf TY: KINALYZER, a computer program for reconstructing sibling groups. Mol Ecol Resour. 2009, 9 (4): 1127-1131.View ArticlePubMedGoogle Scholar
- Chaovalitwongse WA, Chou CA, Berger-Wolf TY, DasGupta B, Sheikh S, Ashley MV, Caballero IC: New optimization model and algorithm for sibling reconstruction from geneticmarkers. INFORMS J Comp. 2009, 22 (2): 180-194.View ArticleGoogle Scholar
- Brown D, Berger-Wolf TY: Discovering kinship through small subsets. Proc 10th WABI-10. 2010, 6293 in LNCS: 111-123.Google Scholar
- Berger-Wolf TY, DasGupta B: Combinatorial reconstruction of sibling relationships. Proc CGBI. 2005, 3-6.Google Scholar
- Foulds L, Graham R: The Steiner problem in Phylogeny is NP-Complete. Adv Appl Math. 1982, 3: 43-49. 10.1016/S0196-8858(82)80004-3.View ArticleGoogle Scholar
- Gusfield D: Partition-distance : A problem and class of perfect graphs arising inclustering. Info Proc Lett. 2002, 82 (3): 159-164. 10.1016/S0020-0190(01)00263-0.View ArticleGoogle Scholar
- Lin Y, Rajan V, Moret B: A metric for phylogenetic trees based on matching. IEEE/ACM Trans Comp Bio Bioinf. 2011, 9 (4): 1014-1022.View ArticleGoogle Scholar
- Coombs JA, Letcher BH, Nislow KH: PedAgree: software to quantify error and assess accuracy and congruence forgenetically reconstructed pedigree relationships. Conserv Genet Resour. 2010, 2: 147-150. 10.1007/s12686-010-9202-9.View ArticleGoogle Scholar
- Meila M: Comparing clusterings by the variation of information. Proc COLT. 2003, 2777: 173-187.Google Scholar
- Bretman A, Tregenza T: Measuring polyandry in wild populations: a case study using promiscuouscrickets. Mol Ecol. 2005, 14 (7): 2169-2179.View ArticlePubMedGoogle Scholar
- Sogard SM, Gilbert-Horvath E, Anderson EC, Fisher R, Berkeley SA, Garza JC: Multiple paternity in viviparous kelp rockfish, Sebastes atrovirens. Environ Biol Fishes. 2008, 81: 7-13.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), whichpermits unrestricted use, distribution, and reproduction in any medium, provided theoriginal work is properly cited.