Gene tree correction for reconciliation and species tree inference
Krister M Swenson^{1, 2}, Andrea Doroftei^{1} and Nadia El-Mabrouk^{1}
DOI: 10.1186/1748-7188-7-31
© Swenson et al.; licensee BioMed Central Ltd. 2012
Received: 24 December 2011
Accepted: 26 July 2012
Published: 20 November 2012
Abstract
Background
Reconciliation is the commonly used method for inferring the evolutionary scenario for a gene family. It consists in “embedding” inferred gene trees into a known species tree, revealing the evolution of the gene family by duplications and losses. When a species tree is not known, a natural algorithmic problem is to infer a species tree from a set of gene trees, such that the corresponding reconciliation minimizes the number of duplications and/or losses. The main drawback of reconciliation is that the inferred evolutionary scenario is strongly dependent on the considered gene trees, as few misplaced leaves may lead to a completely different history, with significantly more duplications and losses.
Results
In this paper, we take advantage of certain gene trees’ properties in order to preprocess them for reconciliation or species tree inference. We flag certain duplication vertices of a gene tree, the “non-apparent duplication” (NAD) vertices, as resulting from the misplacement of leaves. In the case of species tree inference, we develop a polynomial-time heuristic for removing the minimum number of species leading to a set of gene trees that exhibit no NAD vertices with respect to at least one species tree. In the case of reconciliation, we consider the optimization problem of removing the minimum number of leaves or species leading to a tree without any NAD vertex. We develop a polynomial-time algorithm that is exact for two special classes of gene trees, and show a good performance on simulated data sets in the general case.
Keywords
Gene tree; Species tree; Reconciliation; Error correction; Maximum agreement subtree (MAST)
Background
Almost all genomes which have been studied contain genes that are present in two or more copies. Duplicated genes account for about 15% of the protein coding genes in the human genome, for example [1]. In practice, homologous gene copies (i.e. copies in one genome or amongst different genomes that are descended from the same ancestral gene) are identified through sequence similarity; using a BLAST-like method, all gene copies with a similarity score above a certain threshold would be grouped into the same gene family. Using a classical phylogenetic method, a gene tree, representing the evolution of the gene family by local mutations, can then be constructed based on the similarity scores. However, macro-evolutionary events (duplications, losses, horizontal gene transfer) affecting the number and distribution of genes among genomes [2] are not explicitly reflected by this gene tree. Having a clear picture of the speciation, duplication and loss mechanisms that have shaped a gene family is, however, crucial to the study of gene function. Indeed, following a duplication, the most common occurrence is for only one of the two gene copies to maintain the parental function, while the other becomes non-functional (pseudogenization) or acquires a new function (neofunctionalization) [3].
The most commonly used methods to infer evolutionary scenarios for gene families are based on the reconciliation approach that compares the species tree S (describing the relationships among taxa) to the gene tree T. Assuming no sequencing errors and a “correct” gene tree, the incongruence between the two trees can be seen as a footprint of the evolution of the gene family through processes other than speciation, such as duplication and loss. The concept of reconciling a gene tree to a species tree under the duplication-loss model was pioneered by Goodman [4] and then widely accepted, utilized and also generalized to models of other processes such as horizontal gene transfer [5–7]. Several definitions of reconciliation exist in the literature, one of them expressed in terms of “tree extension” [8]. More precisely, a reconciliation R between T and S is an extension of T (obtained by grafting new subtrees onto existing edges of T) consistent with the species tree (i.e. reflecting the same phylogeny). A duplication and loss history for the gene family is then directly deduced from R. As many reconciliations exist, a natural approach is to select the one that optimizes a given criterion. Natural combinatorial criteria are the number of duplications (duplication cost), losses (loss cost) or both combined (mutation cost) [9, 10]. The so-called Lowest Common Ancestor (LCA) mapping between a gene tree and a species tree, formulated in [11, 12] and widely used [2, 10, 12–16], defines a reconciliation that minimizes both the duplication and mutation costs. It has been shown in [8] that minimizing duplications follows from minimizing losses (i.e. a reconciliation minimizing losses also minimizes duplications, but the converse is false). When no preliminary knowledge on the species tree is given, a natural problem, known as the species tree inference problem, is to infer, from a set of gene trees, a species tree leading to a parsimonious evolution scenario, for a given cost.
Similar to the case of a known species tree, methods have been developed for the duplication and mutation costs [9, 10, 17]. For both criteria, the problem of inferring an optimal species tree given a set of gene trees is NP-hard [10].
The main criticism of reconciliation is that the inferred duplication and loss history for a gene family is strongly dependent on the gene tree considered for this family. Indeed, a few misplaced leaves in the gene tree can lead to a completely different history, possibly with significantly more duplications and losses [18]. Reconciliation can therefore inspire confidence only in the case of a well-supported gene tree. Typically, bootstrap values are used as a measure of confidence in each edge of a phylogeny. How should the weak edges of a gene tree be handled? One reasonable answer is to transform the binary gene tree into an unresolved gene tree by removing each weak edge and collapsing its two incident vertices into one. Extensions of the duplication-loss model to non-binary gene trees have been considered [19, 20]. Another strategy adopted in [9] is to explore the space of gene trees obtained from the original gene tree T by performing Nearest Neighbor Interchanges (NNIs) around weakly-supported edges. The problem is then to select, from this space, the tree giving rise to the minimum reconciliation cost.
In this paper, we explore a different strategy for “correcting” or preprocessing a gene tree T, prior to reconciliation or species tree inference. Criteria for identifying potentially misplaced leaves were given in [8]. The duplication vertices of T with respect to a species tree S can be subdivided into apparent and non-apparent duplication (NAD) vertices, where the latter class has been flagged as potentially resulting from the misplacement of leaves in the gene tree. The reason is that each one of the NAD vertices reflects a phylogenetic contradiction with the species tree that is not due to the presence of duplicated gene copies. In the case of an unknown species tree, we showed in [8] that deciding whether T can be explained using only apparent duplications (we say that T is an MD-tree) can be done in polynomial time, as well as inferring an appropriate species tree. Here, we present algorithmic results for removing, for a given gene tree (or a forest of gene trees), the minimum number of leaves or leaf labels (species) leading to a tree without any NAD vertex, in both cases of a known or an unknown species tree. The minimum leaf removal problem in the case of a known species tree has recently been proved to be APX-hard [21].
In the next section, we begin by formally introducing our concepts. We then motivate and state our problems in Section “Motivations and problem statements”. Section “Minimum species removal inference and reconciliation” gives a greedy heuristic for the minimum species removal problem in the case of an unknown species tree, and shows that any algorithm for this case can be applied to the case where the species tree is known. Section “Algorithms for the minimum removal reconciliation problems” is dedicated to the algorithmic developments in the case of a known species tree. We first describe two special classes of gene trees which lead to an exact polynomial-time algorithm. We then present a heuristic algorithm for the general case. In Section “Empirical results”, we test the optimality of our algorithm for the minimum leaf-removal problem in the case of a known species tree, and the ability of the presented approach to identify misplaced genes. This paper is an extended version of [22].
Definitions
Trees
In this paper, we only consider rooted trees. Let G={1,2,⋯,g} be a set of integers representing g different species (genomes). A species tree on G is a rooted binary tree with exactly g leaves, where each i ∈ G is the label of a single leaf (Figure 1a). A gene tree on G is a rooted binary tree where each leaf is labelled by an integer from G, with possibly repeated leaf labels (Figure 1b). A gene tree represents a gene family, where each leaf labelled i represents a gene copy located on genome i. In the case of a species tree or a uniquely leaf-labelled gene tree (i.e. no leaf label occurring more than once) we will make no distinction between a leaf and its label.
Given a tree U, the size of U, denoted by |U|, is the number of leaves of U, and the genome set of U, denoted by 𝒢(U), is the subset of G defined by the labels of the leaves of U. Given a vertex x of U, U_{x} is the subtree of U rooted at x. The genome set of U_{x} is denoted by 𝒢(x) (see the tree of Figure 1a for an example). If x is not a leaf, we denote by x_{ℓ} and x_{r} the two children of x. Finally, if x is not the root, any vertex y ≠ x on a path from x to the root is an ancestor of x.
Given a tree U on a genome set G, a leaf removal consists of removing a given leaf from U, along with its parental node x, and, if x is not the root, joining the parent of x and the remaining child by a new edge. A tree obtained from U through a sequence of leaf removals is said to be included in U. The tree U restricted to a subset G' of G is the tree obtained from U through a sequence of leaf removals that removes all the leaves with labels in G ∖ G'.
Finally, a subtree U _{ x }of U, for a given vertex x, is said to be a maximum subtree of U verifying a given property P if and only if U _{ x } verifies property P and, for any vertex y that is an ancestor of x, U _{ y }does not verify property P.
Reconciliation
Applying a classical phylogenetic method to the gene sequences of a given gene family leads to a gene tree T that is different from the species tree S, mainly due to the presence of multiple gene copies in T, and that may reflect a divergence history different from S. The reconciliation approach consists in “embedding” the gene tree into the species tree, revealing the evolution of the gene family by duplications and losses.

A subtree insertion in a tree T grafts a new subtree onto an existing edge of T. Formally, inserting a subtree onto an edge linking two nodes x and y (y is a child of x) consists in creating a new node z with parent x and two children being y and the root of the inserted subtree.

A tree is said to be an extension of T if it can be obtained from T by subtree insertions on the edges of T.

The gene tree T is said to be DS-consistent with S (DS standing for Duplication/Speciation) if T reflects a history with no loss, i.e. if for every vertex t of T whose genome set satisfies |𝒢(t)| ≥ 2, there exists a vertex s of S such that 𝒢(t) = 𝒢(s) and one of the two following conditions holds:
 (D) either 𝒢(t_{ℓ}) = 𝒢(t_{r}) (indicating a Duplication),
 (S) or 𝒢(t_{ℓ}) = 𝒢(s_{ℓ}) and 𝒢(t_{r}) = 𝒢(s_{r}) (indicating a Speciation).
Definition 1
A reconciliation between a gene tree T and a species tree S on G is an extension R(T,S) of T that is DS-consistent with S.
For example, the tree of Figure 1c is a reconciliation between the gene tree T of Figure 1b and the species tree of Figure 1a. Such a reconciliation between T and S implies an unambiguous evolution scenario for the gene family, where a vertex of R(T,S) that satisfies property (D) represents a duplication (duplication vertex), a vertex that satisfies property (S) represents a speciation (speciation vertex), and an inserted subtree represents a gene loss. The number of duplication vertices of R(T,S) is called the duplication cost of R(T,S).
The notion of reconciliation can naturally be extended to the case of a set, or forest, of gene trees 𝓕 = {T_{1},…,T_{f}}: a reconciliation between 𝓕 and S is a set of reconciliations {R_{1}(T_{1},S),…,R_{f}(T_{f},S)}, respectively for T_{1},…,T_{f}, such that each R_{i}(T_{i},S) is DS-consistent with S.
LCA Mapping
The LCA mapping between a gene tree T and a species tree S, denoted by M, maps every vertex t of T to the Lowest Common Ancestor (LCA) in S of the genome set 𝒢(t) of the subtree rooted at t. A vertex t of T is called a duplication vertex of T with respect to S if and only if M(t_{ℓ})=M(t) and/or M(t_{r})=M(t) (see Figure 1b). We denote by d(T,S) the number of duplication vertices of T with respect to S. Any vertex of T that is not a duplication vertex is a speciation vertex with respect to S.
The LCA mapping induces a reconciliation M(T,S) between T and S, where an internal vertex t of T leads to a duplication vertex in M(T,S) if and only if t is a duplication vertex of T with respect to S. In other words, the duplication cost of M(T,S) is d(T,S) (see for example [10, 13, 15] for more details on the construction of a reconciliation based on the LCA mapping). Moreover, M(T,S) is a reconciliation that minimizes the duplication, loss, and mutation costs [8, 14].
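As an illustration of the LCA mapping, the sketch below counts duplication vertices on small trees. It assumes a nested-tuple encoding (a leaf is an integer label, an internal vertex is a pair of subtrees) and identifies each vertex of S with the set of leaves below it; the names `clades`, `lca` and `count_duplications` are hypothetical. It is a direct quadratic-time transcription of the definition, not the linear-time method cited in the text.

```python
def clades(species_tree):
    """Leaf sets of all subtrees of S; the clade of the root comes last."""
    if not isinstance(species_tree, tuple):
        return [frozenset([species_tree])]
    left, right = clades(species_tree[0]), clades(species_tree[1])
    return left + right + [left[-1] | right[-1]]

def lca(genomes, all_clades):
    """The LCA of `genomes` in S is the smallest clade containing them."""
    return min((c for c in all_clades if genomes <= c), key=len)

def count_duplications(gene_tree, species_tree):
    """Number of vertices t of T with M(t_l) = M(t) or M(t_r) = M(t)."""
    all_clades = clades(species_tree)
    dups = 0

    def visit(t):                            # returns the genome set below t
        nonlocal dups
        if not isinstance(t, tuple):
            return frozenset([t])
        gl, gr = visit(t[0]), visit(t[1])
        m = lca(gl | gr, all_clades)
        if lca(gl, all_clades) == m or lca(gr, all_clades) == m:
            dups += 1
        return gl | gr

    visit(gene_tree)
    return dups

S = ((1, 2), (3, 4))
print(count_duplications(((1, 2), ((1, 3), 4)), S))   # 2
```

In this example the root and the vertex above the leaves 1, 3 and 4 both map to the root of S together with one of their children, so both are duplication vertices.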
Duplication vertices and MD-trees
Let T be a gene tree. As noticed in [8], any vertex t of T such that 𝒢(t_{ℓ}) ∩ 𝒢(t_{r}) ≠ ∅ (i.e. the left and right subtrees rooted at t contain a gene copy in the same genome) will be a duplication vertex in any reconciliation between T and any species tree S, in particular in M(T,S). Such a vertex is called an Apparent Duplication vertex (AD vertex for short) of T. In the tree of Figure 1b, the root is an AD vertex as its left and right subtrees both contain a gene copy in genome 1. Following our notation in [8], a gene tree T is said to be a Minimum Duplication Tree (henceforth called an MD-tree) if there exists a species tree S such that d(T,S) is exactly the number of apparent duplications present in T. In this case, T is said to be MD-consistent with S.
However, not every gene tree is MD-consistent with a given species tree: a duplication vertex of T with respect to a species tree S is not necessarily an AD vertex. We call such a duplication vertex a Non-Apparent Duplication vertex, or simply a NAD vertex. For example, the tree of Figure 1b contains one NAD vertex, indicated by a square, and thus T is not MD-consistent with S.
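The distinction between AD and NAD vertices can be sketched as follows. The nested-tuple encoding (a leaf is an integer label, an internal vertex is a pair of subtrees) and the function names are assumptions of this illustration; a tree is MD-consistent with S exactly when the second returned count is zero.

```python
def clades(species_tree):
    """Leaf sets of all subtrees of S; the clade of the root comes last."""
    if not isinstance(species_tree, tuple):
        return [frozenset([species_tree])]
    left, right = clades(species_tree[0]), clades(species_tree[1])
    return left + right + [left[-1] | right[-1]]

def classify(gene_tree, species_tree):
    """Return (number of AD vertices, number of NAD vertices) of T w.r.t. S."""
    all_clades = clades(species_tree)

    def lca(genomes):                        # smallest clade of S containing genomes
        return min((c for c in all_clades if genomes <= c), key=len)

    counts = {"AD": 0, "NAD": 0}

    def visit(t):                            # returns the genome set below t
        if not isinstance(t, tuple):
            return frozenset([t])
        gl, gr = visit(t[0]), visit(t[1])
        if gl & gr:                          # a genome appears on both sides
            counts["AD"] += 1
        elif lca(gl | gr) in (lca(gl), lca(gr)):
            counts["NAD"] += 1               # duplication without a shared genome
        return gl | gr

    visit(gene_tree)
    return counts["AD"], counts["NAD"]

S = ((1, 2), (3, 4))
print(classify(((1, 2), ((1, 3), 4)), S))    # (1, 1)
```

Here the root is an AD vertex (genome 1 occurs on both sides), while the vertex above the leaves 1, 3 and 4 is a NAD vertex: its two sides share no genome, yet it maps to the root of S together with its left child.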
Motivations and problem statements
Observations made in [8] tend to support the hypothesis that NAD vertices result from the misplacement of leaves in the gene tree. In particular, using simulated datasets based on the species tree of 12 Drosophila species given in [24] and a birth-and-death process, starting from a single ancestral gene, and with different gene gain/loss rates, it has been found that 95% of gene duplications lead to an AD vertex.
Following the latter observations, we exploit the properties of NAD vertices for gene tree correction. For generality, we consider a forest of gene trees 𝓕 = {T_{1},…,T_{f}}. If 𝓕 is not MD-consistent with a given species tree S (i.e. there is at least one tree in 𝓕 that is not MD-consistent with S) then an MD-consistent forest can always be obtained from 𝓕 by performing a certain number of leaf removals. Indeed, a gene tree with only two leaves is always MD-consistent with any species tree. Our first optimization problem is the following, where the size of 𝓕 is just the sum of the sizes of all the trees of 𝓕.
MINIMUM LEAF REMOVAL RECONCILIATION (MINLRR): Input: A genome set G, a forest 𝓕 = {T_{1},…,T_{f}} of gene trees on G, and a species tree S for G; Output: A forest of gene trees 𝓕' = {T'_{1},…,T'_{f}} of maximum size (i.e. obtained from 𝓕 by a minimum number of leaf removals) which is MD-consistent with S, where each T'_{i} is included in T_{i}.
In the case of an unknown species tree, we have shown in [8] that deciding whether a forest of gene trees is an MD-forest (i.e. a set of MD-trees) can be done in polynomial time and space, as well as computing a parsimonious species tree. For a forest which is not an MD-forest, a natural generalization of the MINLRR problem is the following:
MINIMUM LEAF REMOVAL INFERENCE (MINLRI): Input: A genome set G and a forest 𝓕 = {T_{1},…,T_{f}} of gene trees on G; Output: An MD-forest 𝓕' = {T'_{1},…,T'_{f}} of maximum size (i.e. obtained from 𝓕 by a minimum number of leaf removals), where each T'_{i} is included in T_{i}.
A more conservative strategy that can be used to reduce the risk of inferring a wrong species tree is to remove the minimum number of species from G such that the forest restricted to the new genome set is an MD-forest. Removing the minimum number of species instead of leaves can also be considered in the case of reconciliation, a scenario that may be applicable when full confidence is not put in the species tree.
MINIMUM SPECIES REMOVAL RECONCILIATION (MINSRR): Input: A genome set G, a forest 𝓕 = {T_{1},…,T_{f}} of gene trees on G and a species tree S for G; Output: A maximum subset G' of G such that the forest 𝓕 restricted to G' (i.e. the set of trees T_{i} restricted to G') is MD-consistent with the species tree S restricted to G'.
MINIMUM SPECIES REMOVAL INFERENCE (MINSRI): Input: A genome set G and a forest 𝓕 = {T_{1},…,T_{f}} of gene trees on G; Output: A maximum subset G' of G such that the forest 𝓕 restricted to G' is an MD-forest.
The latter two optimization problems (MINSRR and MINSRI) are the subject of the next section. Section “Algorithms for the minimum removal reconciliation problems” focuses on the two optimization problems related to reconciliation (MINLRR and MINSRR).
Minimum species removal inference and reconciliation
By linking the species tree inference problem to a supertree problem we have been able to prove that deciding whether a gene tree T is an MD-tree can be done in polynomial time [8]. We used a constructive proof based on a min-cut strategy, which has been largely considered in the context of supertrees [25–27]. In this section, we develop a greedy heuristic for MINSRI based on a minimum vertex cut strategy.
Let 𝓕 = {T_{1},…,T_{f}} be a forest of gene trees on a genome set G. Define level_{0}(𝓕) to be the set of highest (i.e. closest to the root) vertices of all T_{i}s that are not AD vertices. level_{j}(𝓕) is then the set of vertices of all T_{i}s that are closest non-AD descendants of the vertices of level_{j−1}(𝓕), for j ≥ 1. For a given level j, forest 𝓕, and vertex x ∈ level_{j}(𝓕), consider the bipartition B(x) = {𝒢(x_{ℓ}), 𝒢(x_{r})} of the genome set of the subtree rooted at x. Then H_{j}(𝓕) = (V,E) is the corresponding hypergraph [28] where V=G, and 𝒢(x_{ℓ}), 𝒢(x_{r}) ∈ E for x ∈ level_{j}(𝓕).
In order for 𝓕 to be an MD-forest, all the vertices of level_{j}(𝓕), for any j, should represent speciation vertices with respect to some species tree S (as otherwise they would represent additional non-apparent duplication vertices, preventing the forest from being an MD-forest). In other words, the bipartitions B(x) for all x ∈ level_{0}(𝓕) should reveal a first speciation event, which is possible if and only if the graph H_{0}(𝓕) contains at least two connected components. Indeed, in this case for any species tree S with a root r splitting G into two disconnected subsets, all the vertices of level_{0}(𝓕) would be speciation vertices. Conversely, if H_{0}(𝓕) contains a single connected component, then for any species tree S, at least one node of level_{0}(𝓕) would be a NAD node. The same reasoning applies to any level_{j}(𝓕) and H_{j}(𝓕).
On the other hand, if H_{j}(𝓕) is connected for some j, there exists no species tree such that all the vertices of level_{j}(𝓕) represent speciation events. In this case, some number of species must be removed to make H_{j}(𝓕) disconnected. This corresponds exactly to a vertex cut in H_{j}(𝓕). These observations lead to the following heuristic for the MINSRI problem.
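 1. G' = G; j = 0; Compute level_{0}(𝓕);
 2. WHILE level_{j}(𝓕) is not empty DO
 3. Construct the hypergraph H_{j}(𝓕);
 4. IF H_{j}(𝓕) is connected THEN
 5. G' = G' ∖ MINIMUMVERTEXCUT(H_{j}(𝓕));
 6. Restrict 𝓕 to G';
 7. END IF
 8. j = j + 1;
 9. Compute level_{j}(𝓕);
 10. END WHILE
 11. RETURN (G')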
MINIMUMVERTEXCUT in a hypergraph can be computed using the minimum vertex cut algorithm for simple graphs: each hyperedge corresponds to a clique in the simple graph. It is easy to confirm that a set of vertices disconnects the hypergraph if and only if it disconnects the corresponding simple graph. Vertex cut on a simple graph with n vertices can be implemented with 2n−2 calls to the standard s-t vertex cut algorithm (based on minimum s-t edge cut). By reusing computation, Hao and Orlin [29] showed how to do all 2n−2 calls to the s-t cut algorithm in the same time it takes to do a single call. Thus, MINIMUMVERTEXCUT can be solved within the running time of a single s-t cut computation. Since we call MINIMUMVERTEXCUT O(|V|) times in the worst case, Algorithm Minimum Species Removal Inference runs in at most O(|V|) times that bound.
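The connectivity test and the vertex cut used by the heuristic can be illustrated on a toy hypergraph. The sketch below is a brute-force search over candidate cuts, exponential in the worst case and meant only for very small genome sets; it is not the Hao and Orlin procedure referred to above. The function names and the representation of hyperedges as Python sets are assumptions of this illustration.

```python
from itertools import combinations

def is_connected(vertices, hyperedges):
    """True if the hypergraph restricted to `vertices` has one component."""
    vertices = set(vertices)
    if not vertices:
        return True
    parent = {v: v for v in vertices}        # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v

    for edge in hyperedges:
        # Surviving members of a hyperedge stay mutually connected,
        # exactly as in the clique expansion of the hyperedge.
        members = [v for v in edge if v in vertices]
        for v in members[1:]:
            parent[find(v)] = find(members[0])
    return len({find(v) for v in vertices}) == 1

def minimum_vertex_cut(vertices, hyperedges):
    """Smallest set of vertices whose removal disconnects the hypergraph."""
    vertices = set(vertices)
    for k in range(len(vertices) - 1):       # at least two vertices must remain
        for cut in combinations(sorted(vertices), k):
            if not is_connected(vertices - set(cut), hyperedges):
                return set(cut)
    return None                              # no vertex cut exists

species = {1, 2, 3, 4, 5}
edges = [{1, 2, 3}, {3, 4}, {4, 5}]
print(minimum_vertex_cut(species, edges))    # {3}
```

Removing species 3 separates {1, 2} from {4, 5}, so a single species removal suffices in this example.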
In the next section we give algorithms for MINSRR and MINLRR. We conclude this section by highlighting the relationship between MINSRR and MINSRI.
Remark 1
MINSRR reduces to MINSRI.
This is easy to see; take the species tree S given by the instance of MINSRR and add it to the forest for the MINSRI problem. The solution to MINSRI gives a species tree that must be a subtree of S. Thus, any algorithm for MINSRI can be used to solve MINSRR.
Algorithms for the minimum removal reconciliation problems
In this section, we assume that a species tree S is known for the genome set G. For simplicity, we present the algorithms in the case of a single gene tree T, although it is straightforward to generalize them to the case of a forest of gene trees.
Let T be a gene tree for a gene family on G. We suppose that T is not an MD-tree consistent with S (i.e. there is at least one duplication vertex of T that is a NAD vertex). We begin by describing special classes of gene trees for which exact polynomial-time algorithms have been developed for the MINLRR and MINSRR problems.
Uniquely leaf-labelled gene trees
When the considered gene family contains at most one gene per genome, the gene tree T is uniquely leaf-labelled. In this case, minimizing the number of leaves, or equivalently species, that should be removed from T to obtain an MD-tree consistent with S is equivalent to finding the maximum number of genomes that lead to the same phylogeny in T and S. In other words, the MINLRR problem reduces, in this case, to the MAST problem given below.
MAXIMUM AGREEMENT SUBTREE (MAST): Input: A uniquely leaf-labelled gene tree T on G and a species tree S for G; Output: A tree T^{MAX} included in T such that it is MD-consistent with S and of maximum size.
A more general definition is given in the literature, where the MAST problem is defined on a set of uniquely leaf-labelled trees as the largest tree included in each tree of the set. This definition is equivalent to ours in the case of a gene tree T and a species tree S.
The MAST problem arises naturally in biology and linguistics as a measure of consistency between two evolutionary trees over species or languages [30]. In the evolutionary study of genomes, different methods and different gene families are used to infer a phylogenetic tree for a set of species, usually yielding different trees. In such a context, one has to find a consensus of the various obtained trees. Considering the MAST problem, introduced by Finden and Gordon [31], is one way to obtain such a consensus. Amir et al. [32] showed that computing a MAST of three trees with unbounded degree is NP-hard. However, in the case of two binary trees, the problem is polynomial. The first polynomial-time algorithm for this problem was given by Steel and Warnow [33]. It is a dynamic programming algorithm considering the solution for all pairs of subtrees of T and S; it has a running time of O(n^{2}), where n is the number of leaves in the trees. Later, Cole et al. [30] developed an O(n log n) time algorithm, which, as far as we know, is the most efficient algorithm for solving the MAST problem on two binary trees. We use this result in the MINLRR version of our algorithms. In the case of k binary trees, the current fastest known algorithms run in O(k n^{3}) time [34, 35]. We use this result in the MINSRR version of our algorithms.
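The dynamic programming over all pairs of subtrees can be sketched with the textbook recurrence for two rooted binary trees. This is an illustrative implementation under assumed conventions (nested tuples, integer leaf labels, hypothetical function names `leaves` and `mast`); it is neither the exact procedure of [33] nor the faster algorithm of [30], and it returns only the size of a maximum agreement subtree.

```python
from functools import lru_cache

def leaves(t):
    """Leaf set of a nested-tuple tree."""
    if not isinstance(t, tuple):
        return frozenset([t])
    return leaves(t[0]) | leaves(t[1])

@lru_cache(maxsize=None)
def mast(a, b):
    """Size of a maximum agreement subtree of the rooted binary trees a and b."""
    if not isinstance(a, tuple):             # a is a leaf
        return 1 if a in leaves(b) else 0
    if not isinstance(b, tuple):             # b is a leaf
        return 1 if b in leaves(a) else 0
    (al, ar), (bl, br) = a, b
    return max(
        mast(al, b), mast(ar, b),            # drop one side of a
        mast(a, bl), mast(a, br),            # drop one side of b
        mast(al, bl) + mast(ar, br),         # match the roots, children in order
        mast(al, br) + mast(ar, bl),         # match the roots, children swapped
    )

T = (((1, 2), 3), 4)
S = (((1, 2), 4), 3)
print(mast(T, S))                            # 3
```

For these two trees, keeping the genomes {1, 2, 3} (or {1, 2, 4}) yields the same phylogeny in T and S, so a single leaf removal makes T agree with S.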
No AD above NAD
In this section, we consider a tree T containing no AD vertex above a NAD vertex (Figure 3a). More precisely, T satisfies Constraint C below:
CONSTRAINT C: For each NAD vertex x of T, if y is an ancestor of x that is a duplication vertex, then y is a NAD vertex.
We show that the MINSRR problem reduces, in this case, to the MAST problem, while the MINLRR problem reduces to a “generalization” of the MAST problem to weighted trees, where a weighted tree is a uniquely leaf-labelled tree with weighted leaves.
Definition 2
Let U be a tree on G. The weighted tree induced by (U,S) is the tree included in S obtained from S by removing all leaves that are not in the genome set 𝒢(U) of U, with a weight attributed to each leaf s, representing the number of occurrences of s in U (i.e. the number of leaves of U labelled s).
Let T_{1},T_{2},⋯,T_{m} be the maximum subtrees of T rooted at an AD vertex (i.e. subtrees of T rooted at the highest AD vertices). Then, the tree T^{I} obtained by replacing each T_{i}, for 1 ≤ i ≤ m, by the weighted tree induced by (T_{i},S), is a weighted uniquely leaf-labelled tree. An example is given in Figure 3a, b and c. Let ρ_{s} be the operation of removing the weighted leaf s from T^{I}. Then the corresponding removals in T consist of removing from T all leaves labelled s.
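The construction of the weighted tree induced by (U,S) can be sketched as follows. The nested-tuple encoding and the choice to return the weights as a separate dictionary are assumptions of this illustration, as are the function names.

```python
from collections import Counter

def leaf_labels(tree):
    """Yield the leaf labels of a nested-tuple tree, with repetitions."""
    if isinstance(tree, tuple):
        yield from leaf_labels(tree[0])
        yield from leaf_labels(tree[1])
    else:
        yield tree

def induced_weighted_tree(u, s):
    """Return (S restricted to the genome set of U, weight of each kept leaf)."""
    weights = Counter(leaf_labels(u))        # occurrences of each genome in U

    def restrict(t):
        if not isinstance(t, tuple):
            return t if t in weights else None
        left, right = restrict(t[0]), restrict(t[1])
        if left is None:
            return right
        if right is None:
            return left
        return (left, right)

    return restrict(s), dict(weights)

U = ((1, 3), (1, (3, 3)))                    # subtree rooted at an AD vertex
S = ((1, 2), (3, 4))
print(induced_weighted_tree(U, S))           # ((1, 3), {1: 2, 3: 3})
```

The subtree U, whatever its internal shape, is replaced by the phylogeny that S induces on its genomes, and each leaf remembers how many gene copies it stands for.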
Finally, we formulate the generalization of the MAST problem to weighted trees as follows, where the value v(W) of a weighted tree W is the sum of its leaves’ weights.
WEIGHTED MAXIMUM AGREEMENT SUBTREE (WMAST): Input: A weighted tree W on G and a species tree S for G; Output: A weighted tree W^{MAX} included in W such that it is MD-consistent with S and of maximum value.
We are now ready for the main theorem.
Theorem 1
Let T be a gene tree satisfying CONSTRAINT C. Let W ^{ MAX }be a solution of the WMAST problem on T ^{ I }and S, and T ^{ MAX }be the subtree included in T obtained by removing from T all the leaves that are not leaves of W ^{ MAX }. Then T ^{ MAX }is a solution of the MINLRR problem on T and S.
In other words, solving the MINLRR problem on T is equivalent to solving the WMAST problem on T ^{ I }. We show in the proof of Theorem 2 that WMAST can be solved by the traditional MAST algorithms with no change in the asymptotic running time.
We now provide a proof of Theorem 1, subdivided into the two following lemmas.
Lemma 1
The tree T^{MAX} is MD-consistent with S.
Proof
 1.
x is a NAD vertex in T. Then the genome sets of and are disjoint. Moreover, the genome set of (resp. ) is a subset of the genome set of (resp. ). On the other hand, as x is not a duplication vertex in W ^{ MAX }, one of the three genes a, b and c should be absent in . And thus, {a,b,c} can not be a subset of the genome set of , a contradiction.
 2.
x is an AD vertex in T. Then the subtree T _{ x }of T rooted at x contains at least two leaves labelled with the same label d (different from a, b and c), one in and one in . Moreover the leaf labelled d in S should belong to the subtree of S rooted at s, and thus to the subtree S _{ i }rooted at the left or right child of s. Such subtree S _{ i }contains at least one leaf labelled a or b or c.
On the other hand, let y be the parent of x in T ^{ I }. As an optimal solution of the WMAST problem on T ^{ I }removes leaves from the subtree , such an operation should result in removing the duplication vertex y. In other words, x and y should map to the same vertex s in S. Moreover the result of the leaf removal from should result in a different LCA mapping for x and y. Indeed, removing leaves from the corresponding subtree in T ^{ I } does not contribute to eliminating any NAD from T ^{ I }. It follows that S should exhibit the phylogeny ((a,b,c),d), which is a contradiction with the result of the last paragraph. □
Lemma 2
Let T' be a tree included in T that is MD-consistent with S. Then |T'| ≤ |T^{MAX}|.
Proof
We will show that, for any s ∈ G, if a leaf i labelled s is removed from T (i.e. i is not a leaf in T'), then all leaves of T labelled s are removed from T.
Suppose this is not the case. Let y be the vertex of T representing the least common ancestor of all leaves labelled s in T. Then y is an AD node. As a leaf i labelled s is removed from T, such removal should contribute to resolving a NAD vertex x of T. From CONSTRAINT C, such a vertex should be outside the subtree of T rooted at y. Moreover, it should clearly be an ancestor of y (otherwise removing i will have no effect on x).
As x is a NAD vertex, it maps to the same vertex s of S as one of its children, say the left child. Then, there exist two leaves of the left subtree of x labelled a and b, and one leaf of the right subtree of x labelled c such that the triplet {a,b,c} exhibits a wrong phylogeny. Moreover, as removing leaf i labelled s contributes to resolving x, we can assume that a=s. However, from our assumption, there remains a leaf labelled s in T'. Thus: either (1) at least one leaf labelled b and one leaf labelled c remain in T', or (2) all leaves labelled b or all leaves labelled c are removed. In case (1), the wrong phylogeny exhibited by the triplet {a,b,c} is still present, preventing vertex x from being a non-duplication vertex. In case (2), as all copies of b (or equivalently c) are removed, there is no need to remove leaf i labelled s to correct the wrong phylogeny exhibited by the triplet {a,b,c}.
Therefore, the weighted tree W' induced by T' is obtained from T^{I} through a sequence of leaf removals. Now, as W^{MAX} is the solution of the WMAST problem on T^{I}, then v(W') ≤ v(W^{MAX}), and thus |T'| ≤ |T^{MAX}|. □
Finally, the following corollary makes the link between the MINSRR problem and the MAST problem.
Corollary 1
Let T be a gene tree satisfying CONSTRAINT C. Let W ^{ MAX }be a solution to the MAST problem on T ^{ I }and S (ignoring weights), and T ^{ MAX }be the subtree of T induced by W ^{ MAX }. Then T ^{ MAX }is a solution to the MINSRR problem on T and S.
To apply the algorithm to MINSRR with a forest 𝓕 of k gene trees, all trees must simultaneously agree with S, so the O(k n^{3}) MAST algorithm [34, 35] must be used.
An Algorithm for the general case

Stop condition (Lines 2 to 4): If T is MD-consistent with S, then no leaf removal is performed, and the algorithm terminates.

Recurrence loop (Lines 6 to 13): Resolve all maximum subtrees of T verifying CONSTRAINT C as described in Section “No AD above NAD”, that is:
 1. Construct the weighted tree T^{I} (Lines 6–8);
 2. For each root x of a maximum subtree of T^{I} satisfying CONSTRAINT C (Line 9), solve the WMAST problem on the subtree of T^{I} rooted at x and S, which leads to the weighted tree W^{MAX}_{x} (Line 10), compute the induced tree T_{x} (Line 11) and store the number of performed leaf removals (Line 12).
Algorithm CorrectTree (T, S)
 1. LeafRemoval = 0;
 2. IF T is a tree MD-consistent with S THEN
 3. RETURN (LeafRemoval)
 4. END IF
 5. T^{I} = T;
 6. FOR ALL x ∈ ADborder(T) DO
 7. Replace T^{I}_{x} by its induced weighted tree;
 8. END FOR
 9. FOR ALL x ∈ NADborder(T^{I}) DO
 10. W^{MAX}_{x} = WMAST(T^{I}_{x}, S);
 11. Replace T_{x} by the subtree induced by W^{MAX}_{x};
 12. LeafRemoval = LeafRemoval + (number of leaves removed from T_{x});
 13. END FOR
 14. RETURN (LeafRemoval + CORRECTTREE(T, S))
If T is a uniquely leaf-labelled tree then T^{I}=T, NADborder(T^{I}) is reduced to the root of the tree, and thus loop 9–13 is executed just once. Moreover, as T^{I} is unweighted (all weights are equal to 1), WMAST is reduced to MAST. The whole algorithm thus reduces to one resolution of the MAST problem.
If T satisfies CONSTRAINT C, then NADborder(T^{I}) is also reduced to the root of the whole tree, and thus loop 9–13 is executed just once. In this case, the methodology is the one following Theorem 1, and illustrated in Example 3.
In the general case, NADborder(T^{I}) is not restricted to a single vertex, and the body of loop 9–13 can be executed many times. Moreover, at the end of loop 9–13, the resulting tree is not guaranteed to be MD-consistent with S, as NAD vertices higher than those in NADborder(T^{I}) may exist. Algorithm CorrectTree may therefore be applied many times.
Theorem 2
Algorithm CorrectTree has time complexity O(n^{2} log n), where n is the size of T.
Proof
Let n be the size of T. Lines 2–4 require the LCA mapping between T and S, and the identification of AD and NAD vertices. As the LCA mapping can be computed in linear time [8, 36], whether a tree T is MD-consistent with S can be tested in time O(n). Clearly, loop 6–8 can be executed in time O(n). As for loop 9–13, it has the time complexity O(C) of WMAST. Therefore, the complexity of one execution of the recursive Algorithm CorrectTree is O(C). As in the worst case the algorithm can be executed Ω(n) times, the total worst-case running time is O(nC).
Let us consider the complexity of WMAST. The O(n^{2}) algorithm of Steel and Warnow [33] naturally generalizes to the case of weighted trees, and leads to the same complexity, O(n^{2}). However, we show in the rest of this proof that an instance of WMAST can be transformed into an instance of MAST in linear time, which allows us to consider C as being the best complexity found for MAST, namely the O(n log n) running time of the algorithm given in [30].
Let G be a genome set, W be a weighted tree on G and S be a species tree for G. Then consider the expanded genome set G_{exp} obtained from G by replacing each genome g with weight c in W by a set of genomes {g_{1}, …, g_{c}}, the expanded gene tree W_{exp} obtained from W by replacing each leaf g with weight c > 1 by an expanded leaf, i.e. a caterpillar tree of size c containing the leaves g_{1}, …, g_{c} (i.e. the tree (g_{c}, (… (g_{3}, (g_{2}, g_{1})) …))), and the expanded species tree S_{exp} obtained from S by replacing each leaf g with weight c > 1 in W by a caterpillar tree of size c containing the leaves g_{1}, …, g_{c}. Then W_{exp} and S_{exp} are uniquely leaf-labelled trees. It is easy to see that a solution of MAST will contain, for any g ∈ G, either c or 0 leaves labelled g (i.e. either 0 or all leaves labelled g are removed from W_{exp}). Therefore the compressed tree, obtained by recovering a single weighted leaf from each expanded leaf of W_{exp} that was not removed by MAST, is a solution to WMAST. Further, since we add at most a constant number of genes per leaf, we affect the running time of MAST by at most a constant factor. □
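The expansion used in this proof can be sketched as follows. This is a minimal illustration, assuming trees are nested tuples with string leaves and that `weight` is a hypothetical dictionary giving the weight c of each genome (default 1). The naming scheme `g_1, …, g_c` for the expanded leaves, and the choice to rename weight-1 leaves to `g_1` as well, are conventions of the sketch.

```python
def caterpillar(label, count):
    """Build the caterpillar (g_c, (..., (g_3, (g_2, g_1)))) on `count` leaves."""
    tree = f"{label}_1"
    for i in range(2, count + 1):
        tree = (f"{label}_{i}", tree)
    return tree


def expand(tree, weight):
    """Replace every leaf g of weight c by a caterpillar on g_1, ..., g_c."""
    if isinstance(tree, str):
        return caterpillar(tree, weight.get(tree, 1))
    return tuple(expand(child, weight) for child in tree)
```

Applying `expand` with the same weights to both W and S gives two uniquely leaf-labelled trees on the same expanded genome set, which can then be passed to any MAST routine.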
Empirical results
We test the optimality of Algorithm CorrectTree in the case of a gene tree satisfying Property AD-above-NAD (i.e. containing at least one AD vertex above a NAD vertex). Indeed, the algorithm is guaranteed to give the optimal solution otherwise (i.e. for trees satisfying the constraints of Section “Uniquely leaf-labelled gene trees” or Section “No AD above NAD”). We compared the number N of leaf removals obtained by Algorithm CorrectTree with the number N_{opt} obtained by the exact algorithm that tries all possible leaf-subset removals. More precisely, if the minimum number of leaf removals output by Algorithm CorrectTree is r, we try all subsets of r−1, r−2, …, r−i leaf removals, and stop as soon as a tree that is MD-consistent with S is obtained. As the naive algorithm clearly has exponential time complexity, tests are performed on trees of limited size.
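The exhaustive comparison can be sketched as below. This is a variant under assumptions rather than the procedure used in the experiments: it searches subset sizes in increasing order up to a bound, trees are nested tuples with string leaves identified by their left-to-right position, and `is_consistent` is a caller-supplied predicate standing in for the MD-consistency test. All names are hypothetical.

```python
from itertools import combinations


def count_leaves(tree):
    """Number of leaves of a nested-tuple tree."""
    return 1 if isinstance(tree, str) else sum(count_leaves(c) for c in tree)


def prune(tree, removed, counter=None):
    """Delete the leaves whose left-to-right index is in `removed`,
    suppressing vertices left with a single child."""
    counter = counter if counter is not None else [0]
    if isinstance(tree, str):
        index = counter[0]
        counter[0] += 1
        return None if index in removed else tree
    kept = [p for p in (prune(c, removed, counter) for c in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def exact_min_removals(tree, is_consistent, upper_bound):
    """Smallest k <= upper_bound such that removing some k leaves yields a
    tree accepted by `is_consistent`; None if no such k exists."""
    n = count_leaves(tree)
    for k in range(upper_bound + 1):
        for removed in combinations(range(n), k):
            pruned = prune(tree, set(removed))
            if pruned is not None and is_consistent(pruned):
                return k
    return None
```

Calling `exact_min_removals(T, test, r - 1)` with r the value returned by the heuristic gives either a strictly better optimum or `None`, in which case r is optimal. The number of subsets grows exponentially, which is why such a check is only practical on small trees.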
Conclusion
Based on observations pointing to NAD vertices of a gene tree as indicating potentially misplaced genes, we developed a polynomial-time algorithm for inferring the minimum number of leaf removals required to transform a gene tree into an MD-tree, i.e. a tree with no NAD vertices. The algorithm is exact in the case of a uniquely leaf-labelled gene tree, or in the case of a gene tree that does not contain any AD vertex above a NAD vertex. In the general case, our algorithm exhibited near-optimal results under our simulation parameters. Unfortunately, NAD vertices can only reveal a subset of misplaced genes, as a randomly placed gene does not necessarily lead to a NAD vertex. Our experiments show that, on average, we are able to infer 40% of misplaced genes. However, the additional damage caused by a misplaced leaf leading to a NAD is an excessive increase of the real mutation cost of the tree. Therefore, removing NADs can be seen as a preprocessing of the gene tree preceding a reconciliation approach, in order to obtain a better view of the duplication-loss history of the gene family.
Another use of our method would be to choose, among a set of equally supported gene trees output by a given phylogenetic method, the one that can be transformed into an MD-consistent tree by a minimum number of leaf removals.
A limitation of our approach is that a NAD resulting from a wrong bipartition {a,b;c} can, a priori, be resolved by removing any gene from this bipartition. Our present approach is able to detect a number of misplaced genes but, in general, it is insufficient to identify precisely which genes have been erroneously placed in the tree. An extension would be to infer all optimal subsets of leaf removals, and to use bootstrap values on the edges of the tree for a judicious choice of the genes to be removed.
Funding
Research supported by grants to N.E.M. from the Natural Sciences and Engineering Research Council of Canada, and “Fonds de Recherche Nature et Technologie” of Quebec.
References
 1. Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analysis of the human genome. Nature 2001, 409:847–849.
 2. Durand D, Halldórsson BV, Vernot B: A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol 2006, 13:320–335.
 3. Zhang J: Evolution by gene duplication: an update. Trends Ecol Evol 2003, 18(6):292–298.
 4. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 1979, 28:132–163.
 5. Doyon JP, Scornavacca C, Gorbunov K, Szöllősi G, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. J Comp Biol 2010, 6398:93–108.
 6. Hallett M, Lagergren J, Tofigh A: Simultaneous identification of duplications and lateral transfers. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Bourne PE, Gusfield D. New York: ACM; 2004:347–356.
 7. Tofigh A, Hallett M, Lagergren J: Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinf 2011, 8:517–535.
 8. Chauve C, El-Mabrouk N: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. In Proceedings of the Thirteenth Annual International Conference on Computational Molecular Biology, RECOMB, volume 5541 of LNCS. Edited by: Batzoglou S. Springer; 2009:46–58.
 9. Chen K, Durand D, Farach-Colton M: Notung: Dating gene duplications using gene family trees. J Comput Biol 2000, 7:429–447.
 10. Ma B, Li M, Zhang L: From gene trees to species trees. SIAM J Comput 2000, 30:729–752.
 11. Guigó R, Muchnik I, Smith TF: Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 1996, 6:189–213.
 12. Page RDM, Charleston MA: Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci 1997, 37:57–70.
 13. Bonizzoni P, Della Vedova G, Dondi R: Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci 2005, 347:36–53.
 14. Gorecki P, Tiuryn J: DLS-trees: a model of evolutionary scenarios. Theor Comput Sci 2006, 359:378–399.
 15. Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 1994, 43:58–77.
 16. Page RDM: GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 1998, 14:819–820.
 17. Hallett MT, Lagergren J: New algorithms for the duplication-loss model. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Shamir R, Miyano S, Istrail S, Pevzner P, Waterman MS. New York: ACM; 2000:138–146.
 18. Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol 2007, 8:R141.
 19. Chang WC, Eulenstein O: Reconciling gene trees with apparent polytomies. In Proceedings of the 12th Conference on Computing and Combinatorics (COCOON), volume 4112 of Lecture Notes in Computer Science. Edited by: Chen DZ, Lee DT. Taipei, Taiwan; 2006:235–244.
 20. Lafond M, Swenson KM, El-Mabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. WABI 2012, volume 7534 of LNCS/LNBI:106–122.
 21. Dondi R, El-Mabrouk N: Minimum leaf removal for reconciliation: complexity and algorithms. Combinatorial Pattern Matching (CPM) 2012. Accepted.
 22. Doroftei A, El-Mabrouk N: Removing noise from gene trees. WABI 2011, volume 6833 of LNCS/LNBI:76–91.
 23. Chauve C, Doyon JP, El-Mabrouk N: Gene family evolution by duplication, speciation and loss. J Comput Biol 2008, 15:1043–1062.
 24. Hahn MW, Han MV, Han SG: Gene family evolution across 12 Drosophila genomes. PLoS Genet 2007, 3:e197.
 25. Page RDM: Modified mincut supertrees. WABI 2002, volume 2452 of LNCS:537–551.
 26. Semple C, Steel M: A supertree method for rooted trees. Discrete Appl Math 2000, 105:147–158.
 27. Snir S, Rao S: Using max cut to enhance rooted trees consistency. IEEE/ACM Trans Comput Biol Bioinf 2006, 3:323–333.
 28. Berge C: Hypergraphs: Combinatorics of Finite Sets, volume 45. Amsterdam: North-Holland; 1989.
 29. Hao J, Orlin JB: A faster algorithm for finding the minimum cut in a graph. Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms 1992, 165–174.
 30. Cole R, Farach-Colton M, Hariharan R, Przytycka T, Thorup M: An O(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput 2000, 30(5):1385–1404.
 31. Finden CR, Gordon AD: Obtaining common pruned trees. J Classif 1985, 2:255–276.
 32. Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: metrics and efficient algorithms. SIAM J Comput 1997, 26:1656–1669.
 33. Steel M, Warnow T: Kaikoura tree theorems: computing the maximum agreement subtree. Inform Process Lett 1993, 48:77–82.
 34. Bryant D: Building trees, hunting for trees and comparing trees: Theory and method in phylogenetic analysis. PhD dissertation, Department of Mathematics, University of Canterbury, New Zealand; 1997.
 35. Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297–301.
 36. Zhang LX: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 1997, 4:177–188.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.