 Research
 Open Access
 Published:
Gene tree correction for reconciliation and species tree inference
Algorithms for Molecular Biology volume 7, Article number: 31 (2012)
Abstract
Background
Reconciliation is the commonly used method for inferring the evolutionary scenario for a gene family. It consists in “embedding” inferred gene trees into a known species tree, revealing the evolution of the gene family by duplications and losses. When a species tree is not known, a natural algorithmic problem is to infer a species tree from a set of gene trees, such that the corresponding reconciliation minimizes the number of duplications and/or losses. The main drawback of reconciliation is that the inferred evolutionary scenario is strongly dependent on the considered gene trees, as few misplaced leaves may lead to a completely different history, with significantly more duplications and losses.
Results
In this paper, we take advantage of certain gene trees’ properties in order to preprocess them for reconciliation or species tree inference. We flag certain duplication vertices of a gene tree, the “nonapparent duplication” (NAD) vertices, as resulting from the misplacement of leaves. In the case of species tree inference, we develop a polynomialtime heuristic for removing the minimum number of species leading to a set of gene trees that exhibit no NAD vertices with respect to at least one species tree. In the case of reconciliation, we consider the optimization problem of removing the minimum number of leaves or species leading to a tree without any NAD vertex. We develop a polynomialtime algorithm that is exact for two special classes of gene trees, and show a good performance on simulated data sets in the general case.
Background
Almost all genomes which have been studied contain genes that are present in two or more copies. Duplicated genes account for about 15% of the protein coding genes in the human genome, for example[1]. In practise, homologous gene copies (e.g. copies in one genome or amongst different genomes that are descended from the same ancestral gene) are identified through sequence similarity; using a BLASTlike method, all gene copies with a similarity score above a certain threshold would be grouped into the same gene family. Using a classical phylogenetic method, a gene tree, representing the evolution of the gene family by local mutations, can then be constructed based on the similarity scores. However, macroevolutionary events (duplications, losses, horizontal gene transfer) affecting the number and distribution of genes among genomes[2], are not explicitly reflected by this gene tree. Having a clear picture of the speciation, duplication and loss mechanisms that have shaped a gene family is however crucial to the study of gene function. Indeed, following a duplication, the most common occurrence is for only one of the two gene copies to maintain the parental function, while the other becomes nonfunctional (pseudogenization) or acquires a new function (neofunctionalization)[3].
The most commonly used methods to infer evolutionary scenarios for gene families are based on the reconciliation approach that compares the species tree S (describing the relationships among taxa) to the gene tree T. Assuming no sequencing errors and a “correct” gene tree, the incongruence between the two trees can be seen as a footprint of the evolution of the gene family through processes other than speciation, such as duplication and loss. The concept of reconciling a gene tree to a species tree under the duplicationloss model was pioneered by Goodman[4] and then widely accepted, utilized and also generalized to models of other processes such as horizontal gene transfer[5–7]. Several definitions of reconciliation exist in the literature, one of them expressed in term of “tree extension”[8]. More precisely, a reconciliation R between T and S is an extension of T (obtained by grafting new subtrees onto existing edges of T) consistent with the species tree (i.e. reflecting the same phylogeny). A duplication and loss history for the gene family is then directly deduced from R. As many reconciliations exist, a natural approach is to select the one that optimizes a given criterion. Natural combinatorial criteria are the number of duplications (duplication cost), losses (loss cost) or both combined (mutation cost)[9, 10]. The so called Lowest Common Ancestor (LCA) mapping between a gene tree and a species tree, formulated in[11, 12] and widely used[2, 10, 12–16], defines a reconciliation that minimizes both the duplication and mutation costs. It has been shown in[8] that minimizing duplications follows from minimizing losses (i.e. a reconciliation minimizing losses also minimizes duplications, but the converse is false). When no preliminary knowledge on the species tree is given, a natural problem, known as the species tree inference problem, is to infer, from a set of gene trees, a species tree leading to a parsimonious evolution scenario, for a given cost. Similar to the case of a known species tree, methods have been developed for the duplication and mutation costs[9, 10, 17]. For both criteria, the problem of inferring an optimal species tree given a set of gene trees is NPhard[10].
The main criticism of reconciliation is that the inferred duplication and loss history for a gene family is strongly dependent on the gene tree considered for this family. Indeed, a few misplaced leaves in the gene tree can lead to a completely different history, possibly with significantly more duplications and losses[18]. Reconciliation can therefore inspire confidence only in the case of a wellsupported gene tree. Typically bootstrapping values are used as a measure of confidence in each edge of a phylogeny. How should the weak edges of a gene tree be handled? One reasonable answer is to transform the binary gene tree into an unresolved gene tree by removing each weak edge and collapsing its two incident vertices into one. Extensions of the duplicationloss model to nonbinary gene trees have been considered[19, 20]. Another strategy adopted in[9] is to explore the space of gene trees obtained from the original gene tree T by performing Nearest Neighbor Interchanges (NNIs) around weaklysupported edges. The problem is then to select, from this space, the tree giving rise to the minimum reconciliation cost.
In this paper, we explore a different strategy for “correcting” or preprocessing a gene tree T, prior to reconciliation or species tree inference. Criteria for identifying potentially misplaced leaves were given in[8]. The duplication vertices of T with respect to a species tree S can be subdivided into apparent and nonapparent duplication (NAD) vertices, where the latter class has been flagged as potentially resulting from the misplacement of leaves in the gene tree. The reason is that each one of the NAD vertices reflects a phylogenetic contradiction with the species tree that is not due to the presence of duplicated gene copies. In the case of an unknown species tree, we showed in[8] that deciding whether T can be explained using only apparent duplications (we say that T is an MDtree) can be done in polynomial time, as well as inferring an appropriate species tree. Here, we present algorithmic results for removing, for a given gene tree (or a forest of gene trees), the minimum number of leaves or leaflabels (species) leading to a tree without any NAD vertex, in both cases of a known or an unknown species tree. The minimum leaf removal problem in case of a known species tree has been recently proved to be APXhard[21].
In the next section, we begin by formally introducing our concepts. We then motivate and state our problems in Section “Motivations and problem statements”. Section “Minimum species removal inference and reconciliation” gives a greedy heuristic for the minimum species removal problem in the case of an unknown species tree, and shows that any algorithm for this case can be applied to the case where the species tree is known. Section “Algorithms for the minimum removal reconciliation problems” is dedicated to the algorithmic developments in the case of a known species tree. We first describe two special classes of gene trees which lead to an exact polynomialtime algorithm. We then present a heuristic algorithm for the general case. In Section “Empirical results”, we test the optimality of our algorithm for the minimum leafremoval problem in the case of a known species tree, and the ability of the presented approach to identify misplaced genes. This paper is an extended version of[22].
Definitions
Trees
In this paper, we only consider rooted trees. Let G={1,2,⋯,g} be a set of integers representing g different species (genomes). A species tree on G is a rooted binary tree with exactly g leaves, where each i ∈ G is the label of a single leaf (Figure1a). A gene tree on G is a rooted binary tree where each leaf is labelled by an integer from G, with possibly repeated leaves (Figure1b). A gene tree represents a gene family, where each leaf labelled i represents a gene copy located on genome i. In the case of a species tree or a uniquely leaflabelled gene tree (i.e. no leaflabel occurring more than once) we will make no distinction between a leaf and its label.
Given a tree U, the size of U, denoted by U, is the number of leaves of U, and the genome set of U, denoted by$\mathcal{L}\left(U\right)$, is the subset of G defined by the labels of the leaves of U. Given a vertex x of U, U_{ x } is the subtree of U rooted at x. The genome set of U_{ x } is denoted by$\mathcal{L}\left(x\right)$ (for example, in the tree of Figure1a,$\mathcal{L}\left(B\right)=\{1,2\}$). If x is not a leaf, we denote by x_{ l } and x_{ r } the two children of x. Finally, if x is not the root, any vertex y≠x on a path from x to the root is an ancestor of x.
Given a tree U on a genome set G, a leaf removal consists of removing a given leaf from U, along with its parental node x, and if x is not the root joining the parent of x and the remaining child by a new edge. A tree${U}^{\prime}$ obtained from U through a sequence of leaf removals is said to be included in U. The tree U restricted to a subset${G}^{\prime}$ of G is the tree${U}^{\prime}$ obtained from U through a sequence of leaf removals that removes all the leaves with labels in$G\setminus {G}^{\prime}$.
Finally, a subtree U_{ x }of U, for a given vertex x, is said to be a maximum subtree of U verifying a given property P if and only if U_{ x } verifies property P and, for any vertex y that is an ancestor of x, U_{ y }does not verify property P.
Reconciliation
Applying a classical phylogenetic method to the gene sequences of a given gene family leads to a gene tree T that is different from the species tree S, mainly due to the presence of multiple gene copies in T, and that may reflect a divergence history different from S. The reconciliation approach consists in “embedding” the gene tree into the species tree, revealing the evolution of the gene family by duplications and losses.
There are several definitions of reconciliation between a gene tree and a species tree[2, 10–15]. Here we define reconciliation in terms of subtree insertions, following the notation used in[14, 23]. We begin with some definitions:
A subtree insertion in a tree T grafts a new subtree onto an existing edge of T. Formally, inserting a subtree onto an edge linking two nodes x and y (y is a child of x) consists in creating a new node z with parent x and two children being y and the root of the inserted subtree.
A tree${T}^{\prime}$ is said to be an extension of T if it can be obtained from T by subtree insertions on the edges of T.
The gene tree T is said to be DSconsistent with S (DS standing for Duplication/Speciation) if T reflects a history with no loss, i.e. if for every vertex t of T such that$\left\mathcal{L}\right(t\left)\right\ge 2$, there exists a vertex s of S such that$\mathcal{L}\left(t\right)=\mathcal{L}\left(s\right)$
and one of the two following conditions holds: either
$\mathcal{L}\left({t}_{r}\right)=\mathcal{L}\left({t}_{\ell}\right)$ (indicating a Duplication),
or$\mathcal{L}\left({t}_{r}\right)=\mathcal{L}\left({s}_{r}\right)$ and$\mathcal{L}\left({t}_{\ell}\right)=\mathcal{L}\left({s}_{\ell}\right)$ (indicating a Speciation).
Definition 1
A reconciliation between a gene tree T and a species tree S on G is an extension R(T,S) of T that is DSconsistent with S.
For example, the tree of Figure1c is a reconciliation between the gene tree T of Figure1b and the species tree of Figure1a. Such a reconciliation between T and S implies an unambiguous evolution scenario for the gene family, where a vertex of R(T,S) that satisfies property (D) represents a duplication (duplication vertex), a vertex that satisfies property (S) represents a speciation (speciation vertex), and an inserted subtree represents a gene loss. The number of duplication vertices of R(T,S) is called the duplication cost of R(T,S).
The notion of reconciliation can naturally be extended to the case of a set, or forest, of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$: a reconciliation between$\mathcal{F}$ and S is a set$\mathcal{R}(\mathcal{F},S)=\left\{{R}_{1}\right({T}_{1},S),\dots ,{R}_{f}({T}_{f},S\left)\right\}$ of reconciliations, respectively for T_{1},…,T_{ f }, such that each R_{ i }(T_{ i },S) is DSconsistent with S.
LCA Mapping
The LCA mapping between a gene tree T and a species tree S, denoted by M, maps every vertex t of T to the Lowest Common Ancestor (LCA) of$\mathcal{L}\left(t\right)$ in S. A vertex t of T is called a duplication vertex of T with respect to S if and only if M(t_{ ℓ })=M(t) and/or M(t_{ r })=M(t) (see Figure1b). We denote by d(T,S) the number of duplication vertices of T with respect to S. Any vertex of T that is not a duplication vertex is a speciation vertex with respect to S.
The LCA mapping induces a reconciliation M(T S) between T and S, where an internal vertex t of T leads to a duplication vertex in M(T S) if and only if t is a duplication vertex of T with respect to S. In other words, the duplication cost of M(T S) is d(T S) (see for example[10, 13, 15] for more details on the construction of a reconciliation based on the LCA mapping). Moreover, M(T S) is a reconciliation that minimizes the duplication, loss, and mutation costs[8, 14].
Duplication vertices and MDtrees
Let T be a gene tree. As noticed in[8], any vertex t of T such that$\mathcal{L}\left({t}_{\ell}\right)\cap \mathcal{L}\left({t}_{r}\right)\ne \varnothing $ (i.e. the left and right subtrees rooted at t contain a gene copy in the same genome) will be a duplication vertex in any reconciliation between T and any species tree S, in particular in M(T S). Such a vertex is called an Apparent Duplication vertex (AD vertex for short) of T. In the tree of Figure1b, the root is an AD vertex as its left and right subtree both contain a gene copy in genome 1. Following our notation in[8], a gene tree T is said to be a Minimum Duplication Tree (henceforth called an MDtree) if there exists a species tree S such that d(T S) is exactly the number of apparent duplications present in T. In which case, T is said to be MDconsistent with S.
However, this is not always true, in other words, a duplication vertex of T with respect to a species tree S is not necessarily an AD vertex. We call such a duplication vertex a NonApparent Duplication vertex, or simply a NAD vertex. For example, the tree of Figure1b contains one NAD vertex, indicated by a square, and thus T is not MDconsistent with S.
Motivations and problem statements
Nonapparent duplication vertices point to disagreement between a gene tree T and a species tree S that are not due to the presence of repeated leaf labels (i.e. multiple copies in the same genome). More precisely, we say that a vertex x of T splits three species {a b c} into {a b;c} if the genome set of one of its children contains a and b but not c, and the genome set of its other child contains c but neither a nor b. Then for any NAD x of T, there is a triplet of species {a b c} that are split differently by x and by the LCA mapping of x in S (see proof of Theorem 5 in[8]). We will say that such a triplet exhibits a wrong phylogeny. For example, in Figure1, {1,2,3} is split into {1,3;2} by the NAD vertex of T, and into {1,2;3} by the vertex A in S. It has therefore been suggested that NAD vertices may point at gene copies that are erroneously placed in the gene tree.
Observations made in[8] tend to support this hypothesis. In particular, using simulated datasets based on the species tree of 12 Drosophila species given in[24] and a birthanddeath process, starting from a single ancestral gene, and with different gene gain/loss rates, it has been found that 95% of gene duplications lead to an AD vertex.
Notice however that a misplaced gene in a gene tree T does not necessarily lead to a NAD vertex. In other words, NAD vertices can only point to a subset of misplaced leaves. However, in the context of reconciliation, the damage caused by a misplaced leaf leading to a NAD vertex is to significantly increases the real mutationcost of the tree, as shown in Figure2.
Following the later observations, we exploit the properties of NAD vertices for gene tree correction. For generality, we consider a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$. If$\mathcal{F}$ is not MDconsistent with a given species tree S (i.e. there is at least one tree in$\mathcal{F}$ that is not MDconsistent with S) then an MDconsistent forest can always be obtained from$\mathcal{F}$ by performing a certain number of leaf removals. Indeed, a gene tree with only two leaves is always MDconsistent with any species tree. Our first optimization problem is the following, where the size of$\mathcal{F}$ is just the sum of sizes of all the trees of$\mathcal{F}$.
MINIMUM LEAF REMOVAL RECONCILIATION(MINLRR):Input: A genome set G, a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G, and a species tree S for G; Output: A forest of gene trees${\mathcal{F}}^{\mathrm{MAX}}=\{{T}_{1}^{\mathrm{MAX}},\dots ,$${T}_{f}^{\mathrm{MAX}}\}$ of maximum size (i.e. obtained from$\mathcal{F}$ by a minimum number of leaf removals) which is MDconsistent with S, where each${T}_{i}^{\mathrm{MAX}}$ is included in T_{ i }.
In the case of an unknown species tree, we have shown in[8] that deciding whether a forest of gene trees$\mathcal{F}$ is an MDforest (i.e. a set of MDtrees) can be done in polynomial time and space, as well as computing a parsimonious species tree. For a forest which is not an MDforest, a natural generalization of the MINLRR problem is the following:
MINIMUM LEAF REMOVAL INFERENCE (MINLRI):Input: A genome set G and a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G; Output: An MDforest${\mathcal{F}}^{\mathrm{MAX}}=\{{T}_{1}^{\mathrm{MAX}},\dots ,{T}_{f}^{\mathrm{MAX}}\}$ of maximum size (i.e. obtained from$\mathcal{F}$ by a minimum number of leaf removals), where each${T}_{i}^{\mathrm{MAX}}$ is included in T_{ i }.
A more conservative strategy that can be used to reduce the risk of inferring a wrong species tree, is to remove the minimum number of species from G such that the forest$\mathcal{F}$ restricted to the new genome set is an MDforest. Removing the minimum number of species instead of leaves can also be considered in the case of reconciliation, a scenario that may be applicable when full confidence is not put in the species tree.
MINIMUM SPECIES REMOVAL RECONCILIATION (MINSRR):Input: A genome set G, a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G and a species tree S for G; Output: A maximum subset${G}^{\prime}$ of G such that forest$\mathcal{F}$ restricted to${G}^{\prime}$ (i.e. the set of trees T_{ i }restricted to${G}^{\prime}$) is MDconsistent with the species tree S restricted to${G}^{\prime}$.
MINIMUM SPECIES REMOVAL INFERENCE(MINSRI):Input: A genome set G and a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G; Output: A maximum subset${G}^{\prime}$ of G such that the forest$\mathcal{F}$ restricted to${G}^{\prime}$ is an MDforest.
The latter two optimization problems (MINSRR and MINSRI)) are the subject of the next section. Section “Algorithms for the minimum removal reconciliation problems” focuses on the two optimization problems related to reconciliation (MINLRR and MINSRR).
Minimum species removal inference and reconciliation
By linking the species tree inference problem to a supertree problem we have been able to prove that deciding whether a gene tree T is an MDtree can be done in polynomialtime[8]. We used a constructive proof based on a mincut strategy, which has been largely considered in the context of supertrees[25–27]. In this section, we develop a greedy heuristic for MINSRI based on a minimum vertex cut strategy.
Let$\mathcal{F}=\{{T}_{1},{T}_{2},\dots ,{T}_{f}\}$ be a forest of gene trees on a genome set G. Define$\mathrm{leve}{l}_{0}\left(\mathcal{F}\right)$ to be the set of highest (i.e. closest to the root) vertices of all T_{ i }s that are not ADvertices.$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ is then the set of vertices of all T_{ i }s that are closest nonAD descendants of the vertices for$\mathrm{leve}{l}_{j1}\left(\mathcal{F}\right)$. For a given level j, forest$\mathcal{F}$, and vertex$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, consider the bipartition$B\left(x\right)=\left(\mathcal{L}\right({x}_{l}),\mathcal{L}({x}_{r}\left)\right)$. Then${\mathcal{G}}_{j}=(V,E)$ is the corresponding hypergraph[28] where V=G, and$\mathcal{L}\left({x}_{l}\right),\mathcal{L}\left({x}_{r}\right)\in E$ for$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$.
In order for$\mathcal{F}$ to be an MDforest, all the vertices of$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, for any j, should represent speciation vertices with respect to some species tree S (as otherwise they would represent additional nonapparent duplication vertices, preventing the forest from being an MDforest). In other words, the bipartitions B(x) for all x ∈ leve l_{0}(T) should reveal a first speciation event, which is possible if and only if the graph${\mathcal{G}}_{0}$ contains at least two connected components. Indeed, in this case for any species tree S with a root r splitting G into two disconnected subsets, all the vertices of$\mathrm{leve}{l}_{0}\left(\mathcal{F}\right)$ would be speciation vertices. Conversely, if${\mathcal{G}}_{0}$ contains a single connected component, then for any species tree S, at least one node of leve l_{0}(T) would be a NAD node. The same reasoning applies to any$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ and${\mathcal{G}}_{j}$.
On the other hand, if${\mathcal{G}}_{j}$ is connected for some$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, there exists no species tree so that all$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ represent speciation events. In this case, some number of species must be removed to make${\mathcal{G}}_{j}$ disconnected. This corresponds exactly to a vertex cut in${\mathcal{G}}_{j}$. These observations leads to the following heuristic for the MINSRI problem.
ALGORITHMMINIMUM SPECIESREMOVAL INFERENCE($\mathcal{F}$)

1.
${G}^{\prime}=G$; j=0; Compute $\mathrm{leve}{l}_{0}\left(\mathcal{F}\right)$;

2.
WHILE $\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ is not empty DO

3.
Construct the hypergraph ${\mathcal{G}}_{j}$;

4.
IF ${\mathcal{G}}_{j}$ is connected THEN

5.
${G}^{\prime}={G}^{\prime}\setminus \mathrm{\text{MINIMUMVERTEXCUT}}\left({\mathcal{G}}_{j}\right)$;

6.
Restrict $\mathcal{F}$ to ${G}^{\prime}$;

7.
END IF

8.
j=j + 1;

9.
Compute $\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$;

10.
END WHILE

11.
RETURN ( ${G}^{\prime}$)
MINIMUMVERTEXCUT in a hypergraph can be computed using the minimum vertex cut algorithm for simple graphs: each hyperedge corresponds to a clique in the simple graph. It is easy to confirm that a set of vertices disconnects the hypergraph if and only if it disconnects the corresponding simple graph. Vertex cut on a simple graph can be implemented with 2n−2 calls to the standard s t vertex cut algorithm (based on minimum s t edge cut). By reusing computation, Hao and Orlin[29] showed how to do all 2n−2 calls to the s t cut algorithm in the same time it takes to do a single call. Thus, MINIMUMVERTEXCUT can be solved in$O\left(\rightV\left\rightElg(V{}^{2}/E\left\right)$ time. Since we call MINIMUMVERTEXCUTO(V) times in the worst case, Algorithm Minimum Species Removal Inference runs in$O\left(\rightV{}^{2}\leftE\rightlg\left(\rightV{}^{2}/\leftE\right)$ time.
In the next section we give algorithms for MINSRR and MINLRR.. We conclude this section by highlighting the relationship between MINSRR and MINSRI.
Remark 1
MINSRR reduces to MINSRI.
This is easy to see; take the species tree S given by the instance of MINSRR and add it to the forest$\mathcal{F}$ for the MINSRI problem. The solution to MINSRI gives a species tree that must be a subtree of S. Thus, any algorithm for MINSRI can be used to solve MINSRR.
Algorithms for the minimum removal reconciliation problems
In this section, we assume that a species tree S is known for the genome set G. For simplicity, we present the algorithms in the case of a single gene tree T, although it is straightforward to generalize them to the case of a forest of gene trees.
Let T be a gene tree for a gene family on G. We suppose that T is not an MDtree consistent with S (i.e. there is at least one duplication vertex of T that is a NAD vertex). We begin by describing special classes of gene trees for which exact polynomialtime algorithms have been developed for the MINLRR and MINSRR problems.
Uniquely leaflabelled gene trees
When the considered gene family contains at most one gene per genome, the gene tree T is uniquely leaflabelled. In this case, minimizing the number of leaves, or equivalently species, that should be removed from T to obtain an MDtree consistent with S is equivalent to finding the maximum number of genomes that lead to the same phylogeny in T and S. In other words, it is immediate to see that the MINLRR problem reduces, in this case, to the MAST problem given below.
MAXIMUM AGREEMENT SUBTREE (MAST):Input: A uniquely leaflabelled gene tree T on G and a species tree S for G; Output: A tree T^{MAX}included in T such that it is MDconsistent with S and of maximum size.
A more general definition is given in the literature, where the MAST problem is defined on a set of uniquely leaflabelled trees as the largest tree included in each tree of the set. This definition is equivalent to ours in the case of a gene tree T and a species tree S.
The MAST problem arises naturally in biology and linguistics as a measure of consistency between two evolutionary trees over species or languages[30]. In the evolutionary study of genomes, different methods and different gene families are used to infer a phylogenetic tree for a set of species, usually yielding different trees. In such a context, one has to find a consensus of the various obtained trees. Considering the MAST problem, introduced by Finden and Gordon[31], is one way to obtain such a consensus. Amir et al.[32] showed that computing a MAST of three trees with unbounded degree is NPhard. However, in the case of two binary trees, the problem is polynomial. The first polynomialtime algorithm for this problem was given by Steel and Warnow[33]. It is a dynamic programming algorithm considering the solution for all pairs of subtrees of T and S; it has a running time of O(n^{2}), where n is the number of leaves in the trees. Later, Cole et al.[30] developed an$O(nlgn)$ time algorithm, which, as far as we know, is the most efficient algorithm for solving the MAST problem on two binary trees. We use this result in the MINLRR version of our algorithms. In the case of k binary trees, the current fastest known algorithms run in O(k n^{3}) time[34, 35]. We use this result in the MINSRR version of our algorithms.
No AD above NAD
In this section, we consider a tree T containing no AD vertex above a NAD vertex (Figure3a). More precisely, T satisfies Constraint C below:
CONSTRAINT C: For each NAD vertex x of T, if y is an ancestor of x that is a duplication vertex, then y is a NAD vertex.
We show that the MINSRR problem reduces, in this case, to the MAST problem, while the MINLRR problem reduces to a “generalization” of the MAST problem to weighted trees, where a weighted tree is a uniquely leaflabelled tree with weighted leaves.
Definition 2
Let U be a tree on G. The weighted tree induced by (U,S) is the tree included in S obtained from S by removing all leaves that are not in$\mathcal{L}\left(U\right)$, with a weight attributed to each leaf s, representing the number of occurrences of s in U (i.e. the number of leaves of U labelled s).
Let T_{1},T_{2},⋯T_{ m } be the maximum subtrees of T rooted at an AD vertex (i.e. subtrees of T rooted at the highest AD vertices). Then, the tree T^{I} obtained by replacing each T_{ i }, for 1 ≤ i ≤ m, by the weighted tree induced by (T_{ i },S), is a weighted uniquely leaflabelled tree. An example is given in Figure3a,b and c. Let ρ_{ s } be the operation of removing the weighted leaf s from T^{I}. Then the corresponding removals in T consist of removing from T all leaves labelled s.
Finally, we formulate the generalization of the MAST problem to weighted trees as follows, where the value v(W) of a weighted tree W is the sum of its leaves’ weights.
WEIGHTED MAXIMUM AGREEMENT SUBTREE(WMAST):Input: A weighted tree W on G and a species tree S for G; Output: A weighted tree W^{MAX}included in W such that it is MDconsistent with S and of maximum value.
We are now ready for the main theorem.
Theorem 1
Let T be a gene tree satisfying CONSTRAINT C. Let W^{MAX}be a solution of the WMAST problem on T^{I}and S, and T^{MAX}be the subtree included in T obtained by removing from T all the leaves that are not leaves of W^{MAX}. Then T^{MAX}is a solution of the MINLRR problem on T and S.
In other words, solving the MINLRR problem on T is equivalent to solving the WMAST problem on T^{I}. We show in the proof of Theorem 2 that WMAST can be solved by the traditional MAST algorithms with no change in the asymptotic running time.
A complete example of the algorithmic methodology used for solving the MINLRR problem on T and S is given in Figure3. The algorithm will be detailed in the next section.
We now provide a proof of Theorem 1, subdivided into the two following lemmas.
Lemma 1
The tree T^{MAX}is MDconsistent with S.
Proof
We show, by contradiction, that T^{MAX}does not contain any NAD vertex. Suppose that T^{MAX}contains a NAD vertex x. Then x maps to the same vertex s of S than one of its children, lets say the left child. Then there exists two leaves of${T}_{{x}_{l}}^{\mathrm{MAX}}$, labelled a and b, and one leaf of${T}_{{x}_{r}}^{\mathrm{MAX}}$ labelled c such that the triplet {a,b,c} exhibits a wrong phylogeny. As a nonduplication vertex in T cannot become a duplication vertex after leaf removals, we have only two possibilities for x in T:

1.
x is a NAD vertex in T. Then the genome sets of ${T}_{{x}_{l}}$ and ${T}_{{x}_{r}}$ are disjoint. Moreover, the genome set of $\left(\right)close="">{W}_{{x}_{l}}^{\mathrm{MAX}}$ (resp. $\left(\right)close="">{W}_{{x}_{r}}^{\mathrm{MAX}}$) is a subset of the genome set of ${T}_{{x}_{l}}$ (resp. ${T}_{{x}_{r}}$). On the other hand, as x is not a duplication vertex in W ^{MAX}, one of the three genes a, b and c should be absent in ${W}_{x}^{\mathrm{MAX}}$. And thus, {a,b,c} can not be a subset of the genome set of ${T}_{x}^{\mathrm{MAX}}$, a contradiction.

2.
x is an AD vertex in T. Then the subtree T _{ x }of T rooted at x contains at least two leaves labelled with the same label d (different from a, b and c), one in ${T}_{{x}_{l}}$ and one in ${T}_{{x}_{r}}$. Moreover the leaf labelled d in S should belong to the subtree of S rooted at s, and thus to the subtree S _{ i }rooted at the left or right child of s. Such subtree S _{ i }contains at least one leaf labelled a or b or c.
On the other hand, let y be the parent of x in T^{I}. As an optimal solution of the WMAST problem on T^{I}removes leaves from the subtree$\left(\right)close="">{T}_{x}^{I}$, such an operation should result in removing the duplication vertex y. In other words, x and y should map to the same vertex s in S. Moreover the result of the leaf removal from$\left(\right)close="">{T}_{x}^{I}$ should result in a different LCA mapping for x and y. Indeed, removing leaves from the corresponding subtree in T^{I} does not contribute to eliminating any NAD from T^{I}. It follows that S should exhibit the phylogeny ((a,b,c),d), which is a contradiction with the result of the last paragraph. □
Lemma 2
Let${T}^{\prime}$ be a tree included in T that is MDconsistent with S. Then$\left{T}^{\prime}\right\le \left{T}^{\mathrm{MAX}}\right$.
Proof
We will show that, for any s ∈ G, if a leaf i labelled s is removed from T (i.e. i is not a leaf in${T}^{\prime}$), then all leaves of T labelled s are removed from T.
Suppose this is not the case. Let y be the vertex of T representing the least common ancestor of all leaves labelled s in T. Then y is an AD node. As a leaf i labelled s is removed from T, such removal should contribute to resolving a NAD vertex x of T. From CONSTRAINT C, such a vertex should be outside the subtree of T rooted at y. Moreover, it should clearly be an ancestor of y (otherwise removing i will have no effect on x).
As x is a NAD vertex, it maps to the same vertex s of S as one of its children, say the left child. Then, there exist two leaves of${T}_{{x}_{l}}$ labelled a and b, and one leaf of${T}_{{x}_{r}}$ labelled c such that the triplet {a,b,c} exhibits a wrong phylogeny. Moreover, as removing leaf i labelled s contributes to solving x, we can assume that a=s. However, from our assumption, there remains a leaf labelled s in${T}^{\prime}$. Thus: either (1) at least one leaf labelled b and one leaf labelled c remains in${T}^{\prime}$, or (2) all leaves labelled b or all leaves labelled c are removed. In case (1), the wrong phylogeny exhibited by the triplet {a,b,c} is still present, preventing vertex x from being a nonduplication vertex. In case (2), as all copies of b (or equivalently c) are removed, there is no need to remove leaf i labelled s to correct the wrong phylogeny exhibited by the triple {a,b,c}.
Therefore, the weighted tree${W}^{\prime}$ induced by${T}^{\prime}$ is obtained from T^{I}through a sequence of leaf removals. Now, as W^{MAX}is the solution of the WMAST problem on T^{I}, then$v\left({W}^{\mathrm{MAX}}\right)\ge v\left({W}^{\prime}\right)$, and thus$\left{T}^{\mathrm{MAX}}\right\ge \left{T}^{\prime}\right$. □
Finally, the following corollary makes the link between the MINSRR problem and the MAST problem.
Corollary 1
Let T be a gene tree satisfying CONSTRAINT C. Let W^{MAX}be a solution to the MAST problem on T^{I}and S (ignoring weights), and T^{MAX}be the subtree of T induced by W^{MAX}. Then T^{MAX}is a solution to the MINSRR problem on T and S.
To apply the algorithm to MINSRR with the forest$\mathcal{F}=\{{T}_{1},{T}_{2},\dots ,{T}_{k}\}$, all trees${T}_{i}^{I}$ must simultaneously agree with S, so the O(k n^{3})MAST algorithm[34, 35] must be used.
An Algorithm for the general case
In this section, we present a general algorithm, that is exact in the case of a uniquely leaflabelled gene tree (Section “Uniquely leaflabelled gene trees”) or a gene tree satisfying CONSTRAINT C (Section “No AD above NAD”), and a heuristic in the general case. We first introduce preliminary definitions. For a given tree U (weighted or not), consider the two following properties on U:
Property ONLYNAD: U has no AD vertices; Property ONLYAD: U is rooted at an AD vertex and contains no NAD vertex.
We define the NADborder of U as the set of roots of the maximum subtrees of U verifying Property ONLYNAD, and the ADborder of U as the set of roots of the maximum subtrees of U verifying Property ONLYAD.
ALGORITHM CORRECTTREE is a recursive algorithm that takes as input a gene tree T and a species tree S, and outputs a number of leaf removals transforming T into a tree that is MDconsistent with S. It proceeds as follows:
Stop condition  Lines 2 to 4: If T is MDconsistent with S, then no leaf removal is performed, and the algorithm terminates.
Recurrence Loop  Lines 6 to 13: Resolve all maximum subtrees of T verifying CONSTRAINT C as described in Section “No AD above NAD”, that is:

1.
Construct the weighted tree T ^{I}(Lines 68);

2.
For each root x of a maximum subtree ${T}_{x}^{I}$ of T ^{I}satisfying CONSTRAINT C (Line 9), solve the WMAST problem on ${T}_{x}^{I}$, which leads to the weighted tree ${W}_{x}^{\mathrm{MAX}}$ (Line 10), compute the induced tree T _{ x }(Line 11) and store the number of performed leaf removals (Line 12).
Algorithm CorrectTree (T, S)

1.
LeafRemoval=0;

2.
IF T is a tree MDconsistent with S THEN

3.
RETURN (LeafRemoval)

4.
END IF

5.
T ^{I}=T;

6.
FOR ALL x ∈ ADborder(T) DO

7.
Replace ${T}_{x}^{I}$ by its induced weighted tree;

8.
END FOR

9.
FOR ALL x ∈ NADborder(T ^{I}) DO

10.
$\left(\right)close="">{W}_{x}^{\mathrm{MAX}}=\mathrm{\text{WMAST}}\left({T}_{x}^{I}\right)$;
Replace T _{ x }by the subtree induced by $\left(\right)close="">{W}_{x}^{\mathrm{MAX}}$;
LeafRemoval = LeafRemoval + $\left(\right)close="">\left(v\right({T}_{x}^{I})v({W}_{x}^{\mathrm{MAX}}\left)\right)$;

11.
END FOR

12.
RETURN (LeafRemoval+CORRECTTREE(T, S))
If T is a uniquely leaflabelled tree then T^{I}=T, NADborder(T^{I}) is reduced to the root of the tree, and thus loop 9–13 is just executed once. Moreover, as T^{I} is unweighted (all labels are equal to 1), WMAST is reduced to MAST. The whole algorithm thus reduces to one resolution of the MAST problem.
If T satisfies CONSTRAINT C, then NADborder(T^{I}) is also reduced to the root of the whole tree, and thus loop 9–13 is just executed once. In this case, the methodology is the one following Theorem 1, and illustrated in Example 3.
In the general case, NADborder(T^{I}) is not restricted to a single vertex, and loop 9–13 can be executed many times. Moreover, at the end of loop 9–13, the resulting tree is not guaranteed to be MDconsistent with S, as NAD vertices higher than those in NADborder(T^{I}) may exist. Algorithm CorrectTree may therefore be applied many times.
Theorem 2
Algorithm Correcttree has timecomplexity$O({n}^{2}lgn)$, where n is the size of T.
Proof
Let n be the size of T. Loop 2–4 requires the LCA mapping between T and S, and the identification of AD and NAD vertices. As the LCA mapping can be computed in linear time[8, 36], testing whether a tree T is MDconsistent with S can be tested in time O(n). Clearly, Loop 6–8 can be executed in time O(n). As for Loop 9–13, it has the time complexity O(C) of WMAST. Therefore, the complexity for one execution of the recursive ALGORITHM CORRECTTREE is O(C). As in the worst case the algorithm can be executed Ω(n) times, the total worst case running time is O(nC).
Let us consider the complexity of WMAST. The O(n^{2}) algorithm of Steel and Warnow[33] naturally generalizes to the case of weighted trees, and leads to the same complexity, O(n^{2}). However, we show in the rest of this proof that an instance of WMAST can be transformed into an instance of MAST in lineartime, which allows us to consider C as being the best complexity found for MAST, namely the$O(nlog(n\left)\right)$ runningtime of the algorithm given in[30].
Let G be a genome set, W be a weighted tree on G and S be a species tree for G. Then consider the expanded genome set G_{ exp }obtained from G by replacing each genome g with weight c in W by a set of genomes {g_{1},⋯g_{ c }}, the expanded gene tree W_{ exp }obtained from W by replacing each leaf g with weight c>1 by an expanded leaf , i.e. a caterpillar tree of size c containing the leaves g_{1},⋯g_{ c }(i.e. the tree (g_{ c },(⋯g_{3},(g_{2},g_{1})⋯ )), and the expanded species tree S_{ exp }obtained from S by replacing each leaf g with weight c>1 in W by a caterpillar tree of size c containing the leaves g_{1},⋯g_{ c }. Then W_{ exp }and S_{ exp }are uniquely leaflabelled trees. It is easy to see that a solution${W}_{\mathrm{exp}}^{\mathrm{MAX}}$ of MAST will contain, for any g ∈ G, either c or 0 leaves labelled g (i.e. either 0 or all leaves labelled g removed from W_{ exp }). Therefore the compressed tree${W}_{\mathrm{exp}}^{\mathrm{MAX}}$, obtained by recovering a single weighted leaf from each expanded leaf of W_{ exp }that was not removed by MAST, is a solution to WMAST. Further, since we add at most a constant number of genes per leaf, we will affect the running time of MAST by at most a constant factor. □
Empirical results
We test the optimality of Algorithm CorrectTree in the case of a gene tree satisfying Property ADaboveNAD (i.e. containing at least one AD vertex above a NAD vertex). Indeed, the algorithm is guaranteed to give the optimal solution otherwise (i.e. for trees satisfying the constraints of Section “Uniquely leaflabelled gene trees” or Section “No AD above NAD”). We compared the number N of leafremovals obtained by Algorithm CorrectTree with the number N_{ opt }obtained by the exact algorithm that tries all possible leafsubset removals. More precisely, if the minimum number of leafremovals output by Algorithm CorrectTree is r, we try all subsets of r−1,r−2,…,r−i leaf removals, and stop as soon as a tree that is MDconsistent with S is obtained. As the naive algorithm has clearly an exponentialtime complexity, tests are performed on trees of limited size.
We considered a genome set of fixed size 5, and gene trees with 6 to 24 leaves. For each size s (from 6 to 24, with steps of 2), we generated 500 random gene trees of s leaves, and kept only those satisfying Property ADaboveNAD. The left diagram of Figure4 shows that Algorithm CorrectTree gives an exact solution for more than 65% of the trees (among all of those satisfying Property ADaboveNAD). Moreover, when N_{ opt } differs from N_{ opt }, in most cases the difference is 1. The right diagram of Figure4 is obtained by averaging, for each size s, the results obtained for all the gene trees of that size. We can see that the errorrate, computed as (N−N_{ opt }/N, is independent from the size of the tree, and did not exceed 0.15, based on our simulation parameters. After testing other dependency factors (nonshown results), it appears that the errorrate only depends on the number of times the loop 9–13 of Algorithm CorrectTree is executed, which is not directly related to the number of NADs or ADs in the tree.
Finally, we tested the ability of the approach to identify misplaced genes. To do so, we considered a genome set of fixed size 10, and gene trees of size s varying from 10 to 100 (with a step of 10). For a random species tree S and a random tree T of size s that is MDconsistent with S, we incorporate randomly NbAdded=s/10 leaves with randomly chosen labels. We then test how many “misplaced” leaves our method is able to detect. For each size s, results are averaged over 100 trees. Figure5 shows the detection percentage of Algorithm CorrectTree, which is computed as (N/NbAdded)×100. This detection percentage decreases with increasing size of the gene tree. This is mainly due to the fact that as an MDconsistent tree needs no leaf removal, its detection percentage is always 100%, and that the more leaves we add (1 for a gene tree of size 10, but 10 for a gene tree of size 100) the less chance we have to end up with an MDconsistent tree. Removing the cases of MDconsistent trees lead to a detection percentage around 40%.
Conclusion
Based on observations pointing to NAD vertices of a gene tree as indicating potentially misplaced genes, we developed a polynomialtime algorithm for inferring the minimum number of leafremovals required to transform a gene tree into an MDtree, i.e. a tree with no NAD vertices. The algorithm is exact in the case of a uniquely leaflabelled gene tree, or in the case of a gene tree that does not contain any AD vertex above a NAD vertex. In the general case, our algorithm exhibited nearoptimal results under our simulation parameters. Unfortunately, NAD vertices can only reveal a subset of misplaced genes, as a randomly placed gene does not necessarily lead to a NAD vertex. Our experiments show that, on average, we are able to infer 40% of misplaced genes. However, the additional damage caused by a misplaced leaf leading to a NAD is an excessive increase of the real mutationcost of the tree. Therefore, removing NADs can be seen as a preprocessing of the gene tree preceding a reconciliation approach, in order to obtain a better view of the duplicationloss history of the gene family.
Another use of our method would be to choose, among a set of equally supported gene trees output by a given phylogenetic method, the one that can be transformed to an MDconsistent tree by a minimum number of leaf removals.
A limitation of our approach is that a NAD resulting from a wrong bipartition {a,b;c} can be, a priori, solved by removing any gene from this bipartition. Our present approach is able to detect a number of misplaced genes but, in general, it is insufficient to detect precisely the genes that have been erroneously added in the tree. An extension would be to infer all optimal subsets of leaf removals, and to use bootstrapping values on the edges of the tree for a judicious choice of the genes to be removed.
Funding
Research supported by grants to N.E.M. from the Natural Sciences and Engineering Research Council of Canada, and “Fonds de Recherche Nature et Technologie” of Quebec.
References
 1.
Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analysis of the human genome. Nature. 2001, 409: 847849. 10.1038/35057039
 2.
Durand D, Haldórsson BV, Vernot B: A hybrid micromacroevolutionary approach to gene tree reconstruction. J Comput Biol. 2006, 13: 320335. 10.1089/cmb.2006.13.320
 3.
Zhang J: Evolution by gene duplication: an update. TRENDS Ecol Evol. 2003, 18 (6): 292298. 10.1016/S01695347(03)000338
 4.
Goodman M, Czelusniak J, Moore GW, RomeroHerrera AE, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979, 28: 132163. 10.2307/2412519
 5.
Doyon JP, Scornavacca C, Gorbunov K, Szolloso G, Ranwez V, Berry V: An effi. algo. for gene/species trees parsim. reconc. with losses, dup. and transf. J Comp Biol. 2010, 6398: 93108.
 6.
Hallett M, Lagergren J, Tofigh A: Simultaneous identification of duplications and lateral transfers. Proceedings of the Eight Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Bourne PE, Gusfield D. New York: ACM, 2004, 347356.
 7.
Tofigh A, Hallett M, Lagergren J: Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinf. 2011, 8: 517535.
 8.
Chauve C, ElMabrouk N: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. Proceedings of the Thirteenth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Batzoglou S S. volume 5541 of LNCS, Springer, 2009, 4658.
 9.
Chen K, Durand D, FarachColton M: Notung: Dating gene duplications using gene family trees. J Comput Biol. 2000, 7: 429447. 10.1089/106652700750050871
 10.
Ma B, Li M, Zhang L: From gene trees to species trees. SIAM J Comput. 2000, 30: 729752. 10.1137/S0097539798343362
 11.
Guigó R, Muchnik I, Smith TF: Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996, 6: 189213. 10.1006/mpev.1996.0071
 12.
Page RDM, Charleston MA: Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci. 1997, 37: 5770.
 13.
Bonizzoni P, Vedova Della G, Dondi R: Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci. 2005, 347: 3653. 10.1016/j.tcs.2005.05.016
 14.
Gorecki P, Tiuryn J: DLStrees: a model of evolutionary scenarios. Theor Comput Sci. 2006, 359: 378399. 10.1016/j.tcs.2006.05.019
 15.
Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol. 1994, 43: 5877.
 16.
Page RDM: Genetree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998, 14: 819820. 10.1093/bioinformatics/14.9.819
 17.
Hallett MT, Lagergren J: New algorithms for the duplicationloss model. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Shamir R, Miyano S, Istrail S, Pevzner P, Waterman MS. New York: ACM, 2000, 138146.
 18.
Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 2007, 8: R141. 10.1186/gb200787r141
 19.
Chang WC, Eulenstein O: Reconciling gene trees with apparent polytomies. Proceedings of the 12th Conference on Computing and Combinatorics (COCOON), volume 4112 of Lecture Notes in Computer Science. Edited by: Chen DZ, Leepages DT. Taipei, Taiwan, 2006, 235244.
 20.
Lafond M, Swenson KM, ElMabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. WABI, volume 7534 of LNBI/LNBI. 2012, 106122.
 21.
Dondi R, ElMabrouk N: Minimum leaf removal for reconciliation: complexity and algorithms. Combinatorial Pattern Matching (CPM). accepted 2012.
 22.
Doroftei A, ElMabrouk N: Removing noise from gene trees. WABI, volume 6833 of LNBI/LNBI. 2011, 7691.
 23.
Chauve C, Doyon JP, ElMabrouk N: Gene family evolution by duplication, speciation and loss. J Comput Biol. 2008, 15: 10431062. 10.1089/cmb.2008.0054
 24.
Hahn MW, Han MV, Han SG: Gene family evolution across 12 drosophilia genomes. PLoS Genet. 2007, 3: e197. 10.1371/journal.pgen.0030197
 25.
Modified mincut supertrees. LNCS, volume 2452 of WABI. 2002, 537551.
 26.
Semple C, Steel M: A supertree method for rooted trees. Discrete Appl Math. 2000, 105: 147158. 10.1016/S0166218X(00)00202X
 27.
Snir S, Rao S: Using max cut to enhance rooted trees consistency. IEEE/ACM Trans Comput Biol Bioinf. 2006, 3: 323333.
 28.
Berge C: Hypergraphs:Combinatorics of Finite Sets, volume 45. Amsterdam, NorthHolland, 1989.
 29.
Hao J, Orlin JB: A faster algorithm for finding the minimum cut in a graph. Proceedings of the Third Annual ACMSIAM Symposium on Discrete Algorithms. 1992, 165174.
 30.
Cole R, FarachColton M, Hariharan R, Przytycka T, Thorup M: An o(n\log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput. 2000, 30 (5): 13851404. 10.1137/S0097539796313477
 31.
Finden CR, Gordon AD: Obtaining common pruned trees. J Classif. 1995, 2: 255276.
 32.
Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: matrics and efficient algorithms. SIAM J Comput. 1997, 26: 16561669. 10.1137/S0097539794269461
 33.
Steel M, Warnow T: Kaikoura tree theorems:computing the maximum agreement subtree. Inform Process Lett. 1993, 48: 7782. 10.1016/00200190(93)901818
 34.
Bryant D: Building trees, hunting for trees and comparing trees: Theory and method in phylogenetic analysis. PhD dissertation, Department of Mathematics, University of Canterbury, UK; 1997.
 35.
Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett. 1995, 55 (6): 297301. 10.1016/00200190(95)00110X
 36.
Zhang LX: On MirkinMuchnikSmith conjecture for comparing molecular phylogenies. J Comput Biol. 1997, 4: 177188. 10.1089/cmb.1997.4.177
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Swenson, K.M., Doroftei, A. & ElMabrouk, N. Gene tree correction for reconciliation and species tree inference. Algorithms Mol Biol 7, 31 (2012) doi:10.1186/17487188731
Received
Accepted
Published
DOI
Keywords
 Gene tree
 Species tree
 Reconciliation
 Error correction
 Maximum agreement subtree (MAST)