Gene tree parsimony for incomplete gene trees: addressing true biological loss

Motivation Species tree estimation from gene trees can be complicated by gene duplication and loss, and “gene tree parsimony” (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some incomplete gene trees (i.e., gene trees lacking one or more of the species). Results We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the “standard” calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup.


Background
The estimation of species trees is often performed by estimating multiple sequence alignments for some collection of genes, concatenating these alignments into one supermatrix, and then estimating a tree (often using maximum likelihood or a Bayesian technique) on the resultant supermatrix. However, this approach cannot be used when the species' genomes contain multiple copies of some gene, which can result from gene duplication. Since gene duplication and loss is a common phenomenon, the estimation of species trees requires a different type of approach in this case.
The most powerful approaches for species tree estimation for multi-copy gene families are likely to be methods such as Phyldog [1], which co-estimate gene trees and species trees under parametric models of gene evolution that include duplications and losses. Another type of approach uses initial assignments of orthology and paralogy to inform gene tree and species tree estimation [2]. However, by far the most common approach for estimating species trees uses gene tree parsimony, which we now describe.
Gene tree parsimony (GTP) is an optimization problem for estimating species trees from a set of gene trees (estimated from individual gene sequence alignments). In its most typical formulations, only gene duplication and loss are considered, so that GTP depends upon two parameters: c d (the cost for a duplication) and c l (the cost for a loss). The two most popular versions of GTP are MGD (minimize gene duplication), for which c d = 1 and c l = 0 , and MGDL (minimize gene duplication and loss), for which c d = c l = 1. The version of GTP that seeks the tree minimizing the total number of losses has also been studied; for this, c d = 0 and c l = 1. These variants of GTP are NP-hard optimization problems [3], but software such as DupTree [4] and iGTP [5] for GTP are in wide use.
Basic to all these problems is the ability to compute the number of duplications and losses implied by a species tree and gene tree. This problem is called the "reconciliation problem", surveyed in [6], and intensively studied in the literature (see, for example, [3,[7][8][9][10][11][12][13][14][15][16][17]). The mathematical formulation of the reconciliation problem was derived for the case where the gene tree and the species tree have the same set of taxa, and then extended to be able to be used on incomplete gene trees, i.e., trees that can miss some taxa.
Incomplete gene trees are quite common, and can arise for two different reasons: (1) taxon sampling: the gene may be available in the species' genome, but was not included for some reason in the dataset for that gene, or (2) gene birth/death: as a result of gene birth and death (true biological gene loss), the species does not have the gene in its genome.
Given a gene tree gt and a species tree ST, two formulations for the number of losses have been defined. The most commonly used one computes the number of losses by first computing the "homeomorphic subtree" ST(gt) of ST induced by gt, and then computing the number of losses required to reconcile gt with ST(gt) (see, for example, [3,8,17]). Although this second formulation is in wide use (and is the basis of both iGTP [5] and Duptree [4], two popular methods for "solving" GTP), we will show that this can be incorrect when incompleteness is due to true biological loss. We refer to this formulation as the "standard" approach because of this widespread use in both software and the theoretical literature on GTP. The other, described in [18,19], correctly computes the number of losses when incompleteness is a result of true gene loss, as we will prove. This paper addresses the GTP problem for the case where some of the input gene trees may be incomplete due to either sampling or true biological loss. The main results are as follows: • We formalize the duploss reconciliation problem when gene trees are incomplete due to taxon sampling as the "optimal completion of a gene tree", and we prove (Theorem 1) that the standard calculation correctly computes losses for this case. • We show by example that the standard calculation for losses in GTP can be incorrect when incompleteness is due to true biological loss. • We show how to compute the number of losses implied by a gene tree and species tree, when incompleteness is due to true biological loss. • We formulate variants of the GTP problem (when gene tree incompleteness is due to true biological loss) as minimum weight maximum clique problems (see Theorem 11 for one duploss variant), and show how to solve these problems efficiently using dynamic programming. We show that these optimal cliques can be found in polynomial time in the number of vertices of the graph, because of the special structure of the graphs. We also show that a con-strained version of these problems, where the subtree-bipartitions of the species tree are drawn from the subtree-bipartitions of the input gene trees, can be solved in time that is polynomial in the number of gene trees and taxa.

Notation and terminology
We now define some general terminology we will use throughout this paper; other terminology will be introduced as needed. Throughout this paper we will assume that gene trees and species trees are rooted binary trees, with leaves drawn from the set X of n taxa, and we allow the gene trees to have multiple copies of the taxa, and even to miss some taxa. We orient each tree so that the root is on top and the leaves are at the bottom; hence, we also say that a node v is above node w if the path from w to the root of the tree goes through the node v (similarly we say that w is below v).
We let gt denote a gene tree and ST denote a species tree. We let L(t) denote the set of taxa at the leaves of the tree t, and require that L(gt) ⊆ L(ST ). If L(gt) = L(ST ) we say that gt is complete, and otherwise we say that gt is incomplete.
Let T be a rooted binary tree. We denote the set of vertices of T by V(T), the set of edges of T by E(T), the root by r(T), the internal nodes by V int (T ), and the set of taxa that appear at the leaves by L(T). Note that if T is a gene tree, it can be incomplete, and so it is possible for |L(T)| to be smaller than the number of leaves in T. A clade in T is a subtree of T rooted at a node in T, and the set of leaves of the clade is called a cluster. Given a node v in T, the cluster of leaves below v is denoted by c T (v), and the subtree of T rooted at v is denoted by T v . The most recent common ancestor (MRCA) of a set A of leaves in T is denoted by MRCA T (A). Given a gene tree gt and a species tree ST, we define M : . Finally, given a node u in a rooted binary tree, we let r denote the right child of u and l denote the left child of u.
For a rooted gene tree gt and a rooted species tree ST, and otherwise v is a speciation node [3,8,17,20].
ST(gt) is the homeomorphic subtree of ST induced by the taxon set of gt, and is produced as follows: ST is restricted to the taxon set of gt, and then nodes with indegree and out-degree 1 are suppressed. ST * (gt) is the tree obtained by restricting ST to the taxon set of gt, but not suppressing nodes of in-degree and out-degree 1.
We say that clade cl in ST is a missing clade with respect to gt if L(gt) ∩ L(cl) = ∅, and a maximal missing clade if it is not contained in any other missing clade. Maximal missing clades that are descendants of M(r(gt)) are called the "lower" maximal missing clades, and those that are not descendants of M(r(gt)) are called the "upper" maximal missing clades. We denote by LMMC(gt, ST) (or LMMC), the set of lower maximal missing clades, and UMMC(gt, ST) (or UMMC), the set of upper maximal missing clades. Note UMMC(gt, ST ) = ∅ if M(r(gt)) = r(ST ).

The standard formula for computing losses
The standard formula (see, for example, [3,8,16,17,20]) for computing the minimum number of losses of a (potentially incomplete) gene tree gt with respect to a species tree ST is denoted L std (gt, ST ), and is defined to be L std (gt, ST ) = u∈V int (gt) F (u, ST (gt)), where F(u, T) is defined for internal nodes u with children l and r (which can be interchanged in the formula below) by: where d(s, s ′ ) is the number of internal nodes in T on the path from s to s ′ . When gt is complete, then ST (gt) = ST, and this formula follows from [18].
Optimal completion of a gene tree: • Input rooted binary gene tree gt and rooted binary species tree with L(gt) ⊆ L(ST ). • Output complete gene tree T samp (gt, ST ) that is an extension of gt such that T samp (gt, ST ) implies a minimum number of losses with respect to ST.
In other words, we add all the missing taxa into gt (each taxon added at least once, but perhaps several times) so as to produce a complete binary gene tree that has a minimum number of losses with respect to ST. Let L samp (gt, ST ) = L std (T samp (gt, ST ), ST ). Thus, L samp (gt, ST ) denotes the total number of losses needed for an optimal completion of gt. Similarly, we can define DL samp (gt, ST ) to be the total number of duplications and losses needed for a completion of gt that minimizes the duploss score.
The following theorem shows that the standard formula correctly computes the number of losses, when we treat incompleteness as due to taxon sampling hence also the total number of duplications and losses, where we treat losses as due to sampling. Furthermore, L std (gt, ST ) = L samp (gt, ST ), which means the standard formula correctly computes the number of losses when we treat incompleteness as due to sampling.
Proof Consider ST(gt), the homeomorphic subtree of ST defined by the taxon set of gt. Since gt is complete with respect to ST(gt), the optimal reconciliation that minimizes duplications, losses, and their sum, is defined by M, the MRCA mapping from gt to ST, and the standard formula correctly computes the number of losses for this reconciliation [18]. Note that for any completion t of gt, L std (t, ST ) ≥ L std (gt, ST ); in other words, the number of losses cannot decrease by making gt complete. Similarly, the number of duplications for t with respect to ST cannot be less than the number of duplications of gt with respect to ST. We will show that we can add all the remaining taxa into gt to produce a complete gene tree t * such that L std (t * , ST ) = L std (gt, ST ). Therefore, t * will be an optimal completion with respect to the DL samp problem. Furthermore, this will also imply that L std (gt, ST ) = L samp (gt, ST ), as desired.
Recall the definition of the sets UMMC and LMMC, the upper and lower maximal missing clades, respectively. Since gt is not complete, there must be at least one missing taxon, and hence at least one maximal missing clade. If M(r(gt)) = r(ST ) then UMMC = ∅ and we set gt ′ = gt. Otherwise, M(r(gt)) � = r(ST ) and UMMC � = ∅. Consider the path in ST from r(ST) down to M(r(gt)), and the m ≥ 1 subtrees that hang off that path before we reach M(r(gt)). Note that each of these subtrees is an upper maximal missing clade. Let gt ′ be the tree created by starting with ST and replacing the subtree of ST rooted at M(r(gt)) by gt. Note also that the number of duplications has not changed, and that L std (gt ′ , ST ) = L std (gt, ST ).
If LMMC = ∅ we are done; otherwise, we now add the lower maximal missing clades to gt ′ one at a time. Let LMMC = {t 1 , t 2 , . . . , t p }, so that p ≥ 1. We will define a sequence of gene trees gt 1 , gt 2 , . . . , gt p = t * , so that gt 1 is the result of adding clade t 1 to gt ′ , and gt i is the result of adding clade t i to gt i−1 for p ≥ i ≥ 2. We will show that L std (gt i , ST ) = L std (gt ′ , ST ) for p ≥ i ≥ 1, and that the number of duplications in gt i is the same as the number of duplications in gt ′ . Since gt p = t * is a completion of gt, our theorem will be proven.
So consider t = t 1 , the first lower maximal missing clade, and let q be the node in ST that is the parent of r(t) [i.e., q = p(r(t))]. Consider the edges (x, y) in gt ′ with y = p(x), such that q lies in the path between M(x) and M(y). Subdivide each such edge (creating a new node), and add t to gt ′ by making its root the child of each such newly created node. Note that there must be at least one such edge in gt ′ but there can be several such edges, and hence this step adds t at least once (and perhaps several times) to gt ′ . Note that when we add t 1 to gt ′ , we do not change the image under the MRCA mapping for any node v that is in gt ′ .
We now show that t = t 1 has the same number of duplications as gt with respect to ST. Clearly, any node in t is a speciation node (since t is a subtree of ST, which only has speciation nodes). Now consider a node u created by subdividing an edge (x, y), where y is the parent of x in gt ′ . One child of u is the root of t and the other child has an entirely disjoint leaf set; thus u is a speciation node. When we subdivide edge (x, y) we make y the parent of u. Therefore, M(u) � = M(y). Thus, y is a duplication node in gt 1 if and only if M(z) = M(y) where z is the other child of y in gt ′ . But then y is a duplication node in gt ′ if and only if y is a duplication node in gt 1 , since the MRCA mapping does not change. Hence, no node in gt ′ changes duplication/speciation status, and the newly added nodes are always speciation nodes. Therefore the number of duplication nodes does not change.
We now show that the number of losses does not change, i.e., L std (gt ′ , ST ) = L std (gt 1 , ST ). Now consider an edge (x, y) that is subdivided through the addition of a node u that is made the parent of the subtree t 1 . Then x, y, and u all map (under M) to different vertices in ST (gt 1 ). Also, a simple case analysis (using the standard formula) verifies that F (y, ST (gt ′ )) = F (y, ST (gt 1 )) + F (u, ST (gt 1 )) .
Since F (z, ST (gt ′ )) = F (z, ST (gt 1 )) for all other vertices z ∈ V (gt ′ ), this means that the total number of losses does not change.
Therefore, the result of adding each lower maximal missing clade to gt ′ does not imply any additional losses nor any additional duplications, and so also the total number of duplications and losses does not change. Let t * = t p be the tree obtained after adding in all the missing maximal clades, and return t * . The result then follows by induction on p.

Incompleteness due to gene birth and death
As we will see, while the MRCA mapping is still an optimal reconciliation when gene trees are incomplete due to gene birth and death (implied from [18,21]), the standard formula does not correctly compute the number of losses. We begin by summarizing some results that have already been established: Proof Chauve et al. [18] proved that the MRCA mapping minimizes the losses required to reconcile gt with ST(gt) for complete gene trees, but the proof also applies to incomplete gene trees, treating incompleteness as due to gene birth and death. Górecki and Tiuryn [21] showed that the MRCA mapping minimizes the number of duplications required to reconcile gt with ST(gt), treating incompleteness as due to gene birth and death. Therefore, the MRCA mapping is optimal for all three scores (number of duplications, number of losses, and number of duplications plus losses), when treating incompleteness as due to gene birth and death.
It is easy to see that the duplication nodes in gt are those nodes that have M(v) = M(l) or M(v) = M(r) (where l and r are the two children of v, and M is the MRCA mapping), and all other nodes are speciation nodes. Since the MRCA mapping M can be computed in O(n + n ′ ), where ST has n leaves and gt has n ′ leaves, it follows that all these can be computed in O(n + n ′ ) time.
However, the standard calculation for the number of losses can be incorrect when incompleteness is due to true biological loss! Consider the simple example gt = ((a, b), c) and ST = ((a, (b, d)), c). Under the standard formula, L std (gt, ST ) = 0, since ST (gt) = gt . Under the assumption that incompleteness is due to true biological loss, the genome for d does not have the gene. Because d is sister to b and all the other taxa have the gene, the gene must have been present in the parent of d, and lost on the branch leading to d. Therefore, the standard formula for the number of losses can be incorrect when gene trees are incomplete due to gene birth and death (i.e., true biological loss).

How to calculate losses
We now show how to calculate the number of losses for an incomplete gene tree gt and species tree ST, treating incomplete gene trees as due to gene birth and death. How this is defined will depend upon whether one assumes, a priori, that the gene is present in the genome of the common ancestor of the species in ST (i.e., at the root of ST). This needs to be taken into account as gene birth can happen only once whereas loss can happen repeatedly. Thus, this section shows how to calculate the following values: , the minimum number of losses, under the assumption the gene is present in the common ancestor of the species in ST (DL * bd (gt, ST ) is defined similarly for the total number of duplications and losses), and • L bd (gt, ST ) the minimum number of losses without assuming the gene is present in the common ancestor of the species in ST (DL bd (gt, ST ) is defined similarly for duplications and losses).
We now show how to compute the number of losses (i.e., L bd (gt, ST ) and L * bd (gt, ST )), using the fact that the MRCA mapping defines an optimal reconciliation.

Theorem 3 Let gt be a gene tree and ST a species tree such that
where ST has n leaves and gt has n ′ leaves.
Proof Note that we use a modification of the standard formula, F(u, ST), so that we do not replace ST by ST(gt) as was done in [18,19].
Derivation of L bd (gt, ST) Recall that L bd (gt, ST ) does not assume that the most recent common ancestor of the species in ST has the gene. Since gene birth can happen only once (although loss can happen repeatedly), we begin by determining the location of the gene birth. If M(r(gt)) = r(ST ), then the gene is born before r(ST), and is present at the root of ST. Otherwise, it is easy to see that the location of the gene birth that minimizes the number of losses is the edge above M(r(gt)). Now consider the modification of the standard formula (i.e., using ST instead of ST(gt)): It is easy to see that this correctly returns the number of inserted subtrees, and hence the number of losses.
Derivation of L * bd (gt, ST) By definition of L * bd (gt, ST ) , the gene is assumed to be present at the root of the species tree ST. If M(r(gt)) = r(ST ), then UMMC(gt, ST ) = ∅, and the result follows. However, if M(r(gt)) � = r(ST ), the gene must be present on the path between r(ST) and M(r(gt)). Since the gene is not present in any leaf that is not below M(r(gt)), to minimize losses, the gene must be lost on every edge off that path, since such edges lead to subtrees that do not have the gene present in any leaf. Note that if M(r(gt)) � = r(ST ), then the number of edges that lead off that path is |UMMC(gt, ST )| = d(M(r(gt)), r(ST )) + 1 . Since the gene must be lost on each of those edges, and the total number of losses is the sum of this value and the number of losses that occur within the subtree rooted at M(r(gt)), it follows that L * bd (gt, ST ) = L bd (gt, ST ) + |UMMC(gt, ST )|. Figure 1 illustrates an example distinguishing L bd (gt, ST ) and L * bd (gt, ST ). The running time follows easily from the fact that the MRCA mapping can be computed in linear time [22]. Now one of the most important questions in terms of estimating the optimal species tree is -given a set G of (possibly incomplete) gene trees, is the species tree that minimizes gt∈G L * bd (gt, ST ) or gt∈G L bd (gt, ST ) different than the one that minimizes gt∈G L std (gt, ST )? If the same species tree optimizes both ways of calculating losses, then defining loss differently is not that important in the context of phylogenomic analyses. But this is not necessarily true, as we will show in the following theorem. Proof We will present an input of 14 gene trees and a species tree ST 1 that optimizes the L std criterion and that provably is not optimal for the L bd criterion. Consider the two gene tree topologies tp 1 = ((a, b), c) and tp 2 = (b, (f , (e, (d, (c, a))))) as shown in Fig. 2a, b. Let G be a set of 14 gene trees, with eight gene trees having topology tp 1 and the remaining six gene trees having topology tp 2 . It is easy to verify that the species tree ST 1 with topology tp 2 minimizes gt∈G L std (gt, ST 1 ) . Here, gt∈G L std (gt, ST 1 ) = 8 * 3 + 6 * 0 = 24. Any other species tree will result into more than 24 losses, is not identical to tp 1 = ((a, b), c) requires 8 * 3 = 24 losses to reconcile the eight gene trees having topology tp 1 with T. Therefore, to achieve less than 24 losses, T (tp 1 ) should be identical to tp 1 . We now calculate the number of losses required to reconcile tp 2 with a species tree T such that T (tp 1 ) = ((a, b), c). Note that, tp 2 (tp 1 ) = ((a, c), b). Reconciling ((a, c), b) with ((a, b), c) requires three losses. Then taking {d, e, f } into consideration, it is quite easy to verify that it requires more than three losses to reconcile tp 2 with a species tree T such that T (tp 1 ) = ((a, b), c). Hence, there is no species tree T so that gt∈G L std (gt, T ) < 24. Therefore, ST 1 = tp 2 minimizes gt∈G L std (gt, ST 1 ). However, the species tree ST 2 = (( (((a, b), c), d), e), f ) minimizes gt∈G L bd (gt, ST 2 ). Here gt∈G L bd (gt, ST 2 ) = 8 * 3 + 6 * 6 = 60, which is less than gt∈G L bd (gt, ST 1 = tp 2 ) = 8 * 9 + 6 * 0 = 72. Therefore, ST std is not necessarily same as ST bd . Then the fact that ST std is not necessarily identical to ST * bd immediately follows.

Algorithms to find species trees
Here we address the problem of finding a species tree that has a minimum total number of duplications and losses, treating incompleteness as due to true biological loss. Prior results on GTP include a branch-andbound algorithm in [23] based on techniques from [18], a randomized hill-climbing heuristic presented in [4], a probabilistic and computationally expensive method for co-estimating gene and species trees [1], and dynamic programming based solutions by Hallett and Lagergren [15], Bayzid et al. [20] and Chang et al. [24]. However, none of these studies takes the reasons of incompleteness into account, and we have already shown that the standard calculation for losses can be incorrect when incompleteness is due to true biological loss.
In this section, we derive a different approach for the GTP problems, treating incomplete gene trees as due to true biological loss (i.e., minimizing L bd (gt, ST ) or L * bd (gt, ST )). The techniques we propose can be used to solve GTP exactly for small datasets, or approximately (though without any guaranteed error bounds) on larger datasets. The approach we take here is based on [20] (see also [15,25,26], which use very similar techniques). Bayzid et al. [20] provided a graph-theoretic formulation for MGDL std , whereby an optimal solution to MGDL std corresponded to finding a minimum weight maximum clique inside a graph called the "Compatibility Graph". The nodes of the compatibility graph correspond to "subtree-bipartitions", a concept Bayzid et al. [20] introduced and we will also use. Bayzid et al. [20] showed how to find a minimum weight max clique using a dynamic programming approach. We will use the same graph-theoretic formulation as in [20], but modify the weights appropriately, to show that the optimal solution to MGDL * bd still corresponds to a minimum weight max clique. The DP algorithm in [20] can then be used directly to find the optimal solution to MGDL * bd . To achieve this, we first derive an efficient formula for L bd (gt, ST ) (and L * bd (gt, ST ), similar to the one derived in [17] for L std (gt, ST ), but somewhat more involved.
We will let D gt,ST denote the set of duplication nodes in gt with respect to ST and S gt,ST denote the set of speciation nodes in gt with respect to ST. When gt and ST are known, we may write these as D and S. The calculation for the number of losses depends on how we interpret incompleteness in gene trees. Therefore, rather than having a single optimization problem like MGDL, we have variants of this problem depending on how we treat incompleteness. As shown in Theorem 1, the term MGDL in the literature refers to MGDL std , which (by Theorem 1) is identical to MGDL samp . Here, we consider the optimization problems MGDL * bd , where we treat incompleteness as due to gene birth and death. And therefore, we also consider MGDL bd , MGL * bd , and MGL bd .

Subtree-bipartitions
Let T be a rooted binary tree and u an internal node in T. The subtree-bipartition of u, denoted by SBP T (u), is the unordered pair (c T (l)|c T (r)), where l and r are the two children of u. Note that subtree-bipartitions are not defined for leaf nodes. The set of subtree-bipartitions of a tree T is denoted by SBP T = {SBP T (u) : u ∈ V int (T )} . Furthermore, any pair A and B of disjoint subsets of X also define a subtree-bipartition (though we may refer to these as candidate subtree-bipartitions to emphasize this). Subtree-bipartition domination Let BP i = (P i 1 |P i 2 ) and BP j = (P j 1 |P j 2 ) be two subtree-bipartitions. We say that BP i is dominated by BP j (and conversely that BP j dominates BP i ) if either of the following two conditions holds: (1) P i 1 ⊆ P j 1 and P i 2 ⊆ P j 2 , or (2) P i 1 ⊆ P j 2 and P i 2 ⊆ P j 1 . We say that subtree-bipartition (A|B) is dominated by a species tree T if one of T's subtree-bipartitions dominates (A|B). Bayzid et al. showed that an internal node u in a gene tree gt is a duplication node with respect to a species tree ST if SBP gt (u) is dominated by ST [20]. Finally, for a set G of gene trees on taxon set X and for any candidate subtree-bipartition (A|B), we let W dom (A|B) be the total number of subtree-bipartitions in G that are dominated by (A|B).
Subtree-bipartition containment and compatibility We say that BP i contains BP j if P j 1 ∪ P j 2 ⊆ P i 1 or P j 1 ∪ P j 2 ⊆ P i 2 , and that BP i and BP j are disjoint if [P i 1 ∪ P i 2 ] ∩ [P j 1 ∪ P j 2 ] = ∅. We say that two subtree bipartitions are compatible if they are disjoint or one contains the other.
The compatibility graph CG(G) Let G be a set of rooted binary gene trees on the set X of n taxa. The compatibility graph CG(G) has a vertex for a subtree-bipartition defined on X, and there is an edge between two vertices if and only if the associated subtree-bipartitions are compatible.

Deep coalescence and the MDC problem
Deep coalescence (also called incomplete lineage sorting, or ILS) refers to the failure of alleles to coalesce (looking backwards in time) into a common ancestral allele until deeper than the most recent speciation events [27]. One of the measures for incongruence between a gene tree and a species tree under ILS is XL(gt, ST), the number of extra lineages defined for the pair ST and gt [27].

XL(gt, e ′ ),
This problem is also NP-hard [17], and software for the problem exists in Phylonet [28] and iGTP [5], among others. We now describe theoretical material leading to the algorithmic approach in Phylonet [26].
Definition 5 (From [26]) For B ⊆ X and gene tree gt, we set k B (gt) to be the number of B-maximal clusters in gt, where a B-maximal cluster is a cluster Y ⊆ L(gt) such that Y ⊆ B but no other cluster of gt containing Y is a subset of B.

Definition 6
We define W xl (x, gt) for x either a subtree-bipartition or a subset of X, as follows. If x ⊆ X, For a set G of gene trees and ST a species tree, we set W 0 = gt∈G x∈X W xl ({x}, gt).
Yu et al. [26] showed that for any edge e in ST, where B is the cluster below e, then k B (gt) is the number of lineages going through edge e, and so k B (gt) − 1 is the number of extra lineages going through e. They defined weights on potential species tree clusters B by W mdc (B, gt) = 0 if B ∩ L(gt) = ∅ and otherwise W mdc (B, gt) = k B (gt) − 1 (i.e., W mdc is defined for clusters while W xl is defined for subtree-bipartitions), and extended this to a set G of gene trees by W ′ mdc (B) = gt∈G W mdc (B, gt), and then to a set C of clusters by W ′′ mdc (C) = B∈C W ′ mdc (B). From this, it follows easily that a set C of n − 1 compatible clusters minimizing W ′′ mdc (C) defines a rooted binary species tree with a minimum MDC score.

Theorem 7 [From [17]] Let gt be a rooted binary gene tree, ST a rooted binary species tree and D the set of duplication nodes in gt with respect to ST. Then
We now derive formulas for L bd (gt, ST ) and L * bd (gt, ST ) ; to obtain formulas for DL bd (gt, ST ) and DL * bd (gt, ST ), simply add |D gt,ST |. Recall that in the definition of F(u, T) given in Eq. 1, losses are associated with internal nodes, and the total number of losses is defined as the sum of losses associated to each internal node. However, the definition of the number of losses corresponding to a node can be rewritten in terms of edges, as we now show. Let D(s, s ′ ) be the number of edges in the path in ST between s and s ′ . Therefore, D(s, s ′ ) can be defined as follows. Bayzid and Warnow Algorithms Mol Biol (2018) 13:1 Then, for a vertex u in gt with children r and l, we can rewrite Eq. 1 as follows: It is easy to see that in all three branches of the equation above, the two terms of the sum correspond to the edges connecting u to its two children l and r. (The second term in the first branch and both terms in the third branch are 0, but we wrote them in terms of the function D(., .) for convenience.) Let p(x) be the parent of x in a tree T. Therefore, we can associate gene losses to edges e = (x, p(x)) instead of nodes, as follows: We use the subscript ST in edgeloss ST (e) to emphasize the fact that the distance is taken within the tree ST and not within ST(gt). Therefore, Proof We partition all the non-root nodes in gt into two sets: CD (children of duplications), consisting of those nodes  whose parents are duplication nodes, and CS (children of speciations), consisting of those nodes whose parents are speciation nodes. Note that every edge (x, p(x)) ∈ E(gt) can be associated with the set containing x. Therefore, Since each internal node has two children, clearly the number of vertices x for which p(x) is a speciation node is twice the number |S| of speciation nodes; therefore L bd (gt, ST ) =
Let L(gt, e) be the number of lineages that go through edge e ∈ E(ST ); thus, XL(gt, e) = L(gt, e) − 1, and so  M(p(x)), as x ranges over the internal vertices in gt. It is easy to see that the number of occurrences of an edge e ′ ∈ E(ST * (gt)) in these lists is L(gt, e ′ ) (the number of lineages through e ′ ). Also, the edges e ∈ E(ST ) − E(ST * (gt)) will not be present in these lists, since these are the edges incident on the missing clades in ST with respect to gt. Therefore, the sum of the lengths of these lists is equal to e∈E(gt) MD(e) and also equal to e∈ST * (gt) L(gt, e).

Assigning weights to subtree-bipartitions
To use the graph-theoretic formulation of MGDL * bd , we have to assign weights to each node in the compatibility graph, CG(G), where G is the input set of gene trees, so that a minimum weight clique of n − 1 vertices defines an optimal solution to MGDL * bd (G). We will define weights W xl (v), W dom (v), W EC (v), and W MMC (v) to each subtreebipartition (i.e., node in the compatibility graph), and set We then prove (see Theorem 11) that a set of n − 1 compatible subtree-bipartitions that has minimum total weight defines a species tree that optimizes MGDL * bd . Note that weights W xl (v) and W dom (v) have already been defined. Hence, all that remains is to define W EC (v) and W MMC (v), and then to prove Theorem 11.
Calculating W EC (v) and |E(ST * (gt))|: We now show how to define weight W EC (v, gt) for every vertex v in the compatibility graph CG(G) so that for all species trees ST, |E(ST * (gt))| is the sum of the vertex weights for the n − 1 clique C in CG(G) corresponding to ST. To count the number of edges in E(ST * (gt)), we need to exclude those edges from E(ST) that are incident on a clade that is missing in gt. For a vertex v associated with the subtreebipartition (p|q), we define W EC (v, gt) as follows (swapping p and q as needed): Then, |E(ST * (gt))| = u∈SBP ST W EC (u, gt). We set , then a set of subtree-bipartitions in an (n − 1)-clique of minimum weight in CG(G) defines a binary species tree ST that minimizes L * bd (G, ST ).
Proof We prove (a), since (b) follows directly from (a). Let C be a clique of size n − 1 in CG(G) and ST the associated species tree. Let SBP dom (gt, ST ) be the set of subtree-bipartitions in gt that are dominated by a subtree-bipartition in ST. Note that |SBP dom (gt, ST )| is the number of speciation nodes in gt with respect to ST [20]. Therefore, the total number of speciation nodes in G is Note that W 0 does not depend on the topology of the species tree. Hence, the (n − 1)-clique C with minimum weight defines a tree ST that minimizes DL * bd (G, ST , c d ). The proof for (b) follows trivially.

Dynamic programming algorithm
Let SBP be a set of subtree-bipartitions, with SBP equal to all possible subtree-bipartitions if an exact solution is desired, and otherwise a proper subset if a faster algorithm is desired or necessary. We present the DP algorithm for the MGDL * bd (G, c d ) problem. We compute score(A) in order, from the smallest cluster to the largest cluster X .
we set score(A) to −∞, signifying that A cannot be further resolved. At the end of the algorithm, if SBP includes at least one clique of size n − 1, we have computed score(X ) as well as sufficient information to construct the optimal set of compatible clusters and hence the optimal species tree (subject to the constraint that all the subtree bipartitions in the output tree are in SBP). If subtree bipartitions in SBP are not sufficient for building a fully resolved tree on X , then score(X ) will be −∞, and our algorithm returns FAIL.
The optimal number of duplications and losses is given by score(X ) + c d (N − k), by Theorem 11. If SBP contains all possible subtree-bipartitions, we have an exact but exponential time algorithm. However, if SBP contains only those subtree-bipartitions from the input gene trees, then the algorithm finds the optimal constrained species tree in time that is polynomial in the number of gene trees and taxa.

Theorem 12
Let G be a set of rooted binary gene trees, SBP a set of subtree-bipartitions which contains only the subtree-bipartitions, or a subset of the subtree-bipartitions from the input gene trees. The dynamic programming (DP) algorithm finds the species tree ST minimizing the weighted duploss score, treating incomplete gene trees as resulting from gene birth and death, subject to the constraint that SBP ST ⊆ SBP in O(n|SBP| 2 ) time. Therefore, if SBP is all possible subtree-bipartitions, we have an exact but exponential time algorithm. However, if SBP contains only those subtree-bipartitions from the input gene trees, then the DP algorithm finds the optimal constrained species tree in O(d 2 n 3 k 2 ) time, where n is the number of species, k is the number of gene trees, and d the maximum number of times that any taxon appears in any gene tree.
Proof The proof of correctness is given above. The running time analysis for an arbitrary SBP follows the same argument as given in [20], since W xl (v) and W dom (v) can be computed in O(1) time after the preprocessing (as described in [20]). Finally, suppose Q is the set of subtree bipartitions from the input gene trees, and we use Q as the constraint set. Note that |Q| is O(dkn) (every internal node in every gene tree corresponds to subtree bipartition, and there are at most a total of dkn internal nodes across all the gene trees). Hence, when the constraint set is just the set Q of subtree bipartitions from the input set of gene trees, then the algorithm runs in O(n|Q| 2 ) = O(d 2 k 2 n 3 ) time.

Extensions
It is trivial to extend the theory for MGDL * bd and MGL * bd to MGDL bd and MGL bd , as we now show. Recall that the only difference between L * bd (gt, ST ) and L bd (gt, ST ) is whether the gene is assumed to be present at the ancestral species: in L bd (gt, ST ) it is not assumed to be present there, but in L bd (gt, ST ) it is. Therefore, L bd (gt, ST ) = L * bd (gt, ST ) − |UMMC(gt, ST )| and that DL bd (gt, ST ) = DL * bd (gt, ST ) − |UMMC(gt, ST )| . Therefore, to extend the algorithmic approach to solve MGL bd and MGDL bd , we define W MGL bd (v, gt) = W MGL * bd (v, gt) − W MMC (v, gt) and W MGDL bd (v, gt) = W MGDL * bd (v, gt) − W MMC (v, gt), and then seek a minimum weight maximum clique in the compatibility graph with these modified weights.

Conclusion
The calculation of reconciliation costs between gene trees and species trees is a standard step in many bioinformatics analyses, including the estimation of species trees from a set of gene trees. This paper showed that different interpretations of incompleteness (i.e., species missing from genes) can impact the way that these reconciliation costs should be calculated, and need to be taken into account when using Gene Tree Parsimony to construct species trees from gene trees.
To address this issue, we presented a dynamic programming algorithm that provably finds an optimal species tree given a set of gene trees under the (weighted) GDL model within a constrained search space, treating incompleteness as due to true biological loss. This technique can be used on any input on which other gene tree parsimony is used. The use of dynamic programming to find provably optimal solutions within a constrained search space is also how ASTRAL (a coalescent-based species tree estimation method) [29][30][31] and FastRFS (a supertree method) [32] achieve good performance. For those methods, the constraints are based on bipartitions rather than subtree bipartitions, but the dynamic programming algorithm is nearly identical to the one we use here. As noted in [32], although setting the constraint set to just the bipartitions in the input source trees produced good results, expanding the set to include bipartitions from computed supertrees improved the topological accuracy of the resultant FastRFS supertree, without greatly increasing the running time. This suggests that expanding the subtree bipartition set to include estimated species trees based on GDL would be similarly beneficial for the dynamic programming method we present in this paper. In addition, changing the technique for defining subtree bipartitions is necessary when all the gene trees are incomplete, since in that case none of the gene trees can contribute any valid subtree bipartitions.
The results and the methods presented here are based on the assumption that the gene trees are discordant due to gene duplication and loss. However, these methods can be applied to both orthologous genes (in which case the gene trees could be single copy) and gene families that by definition will include both paralogs and orthologs. In cases where the genes are expected to be orthologs, all discordance between gene trees and species trees should be due to processes other than duplication and loss (e.g., incomplete lineage sorting), which could make approaches that attempt to minimize the total GDL cost potentially less accurate. Nevertheless, it is also possible that these GDL-based methods could be reasonably accurate under conditions where ILS rather than GDL is operating, and so these methods should be explored in that context.
Another natural source of discordance is gene tree estimation error, which is likely to occur with most biological datasets (see discussion in [33]). Therefore, the most accurate estimations of the number of duplications and losses (or weighted versions of these numbers) will only be obtained when the estimated gene trees and species trees are highly accurate, so that every attempt should be made to estimate these trees carefully. Gene trees, especially of multi-copy genes spanning large numbers of species, can be extremely large (i.e., greater than 100,000 leaves), and thus present enormous analytical and computational challenges (e.g., multiple sequence alignment and likelihood-based tree estimation are both difficult for datasets with more than about 1000 sequences, let alone 100,000) [34,35]. Since completely accurate gene trees are not likely to be reliably obtained, these reconciliation methods and associated species tree estimation methods should be modified to take gene tree uncertainty into account. Methods such as NOTUNG [36] and ProfileNJ [37] are examples of methods that do this, but more work is needed.