Reconciliation is the commonly used method for inferring the evolutionary scenario for a gene family. It consists in “embedding” inferred gene trees into a known species tree, revealing the evolution of the gene family by duplications and losses. When a species tree is not known, a natural algorithmic problem is to infer a species tree from a set of gene trees, such that the corresponding reconciliation minimizes the number of duplications and/or losses. The main drawback of reconciliation is that the inferred evolutionary scenario is strongly dependent on the considered gene trees, as few misplaced leaves may lead to a completely different history, with significantly more duplications and losses.

Results

In this paper, we take advantage of certain gene trees’ properties in order to preprocess them for reconciliation or species tree inference. We flag certain duplication vertices of a gene tree, the “non-apparent duplication” (NAD) vertices, as resulting from the misplacement of leaves. In the case of species tree inference, we develop a polynomial-time heuristic for removing the minimum number of species leading to a set of gene trees that exhibit no NAD vertices with respect to at least one species tree. In the case of reconciliation, we consider the optimization problem of removing the minimum number of leaves or species leading to a tree without any NAD vertex. We develop a polynomial-time algorithm that is exact for two special classes of gene trees, and show a good performance on simulated data sets in the general case.

Almost all genomes which have been studied contain genes that are present in two or more copies. Duplicated genes account for about 15% of the protein coding genes in the human genome, for example[1]. In practise, homologous gene copies (e.g. copies in one genome or amongst different genomes that are descended from the same ancestral gene) are identified through sequence similarity; using a BLAST-like method, all gene copies with a similarity score above a certain threshold would be grouped into the same gene family. Using a classical phylogenetic method, a gene tree, representing the evolution of the gene family by local mutations, can then be constructed based on the similarity scores. However, macroevolutionary events (duplications, losses, horizontal gene transfer) affecting the number and distribution of genes among genomes[2], are not explicitly reflected by this gene tree. Having a clear picture of the speciation, duplication and loss mechanisms that have shaped a gene family is however crucial to the study of gene function. Indeed, following a duplication, the most common occurrence is for only one of the two gene copies to maintain the parental function, while the other becomes non-functional (pseudogenization) or acquires a new function (neofunctionalization)[3].

The most commonly used methods to infer evolutionary scenarios for gene families are based on the reconciliation approach that compares the species tree S (describing the relationships among taxa) to the gene tree T. Assuming no sequencing errors and a “correct” gene tree, the incongruence between the two trees can be seen as a footprint of the evolution of the gene family through processes other than speciation, such as duplication and loss. The concept of reconciling a gene tree to a species tree under the duplication-loss model was pioneered by Goodman[4] and then widely accepted, utilized and also generalized to models of other processes such as horizontal gene transfer[5–7]. Several definitions of reconciliation exist in the literature, one of them expressed in term of “tree extension”[8]. More precisely, a reconciliationR between T and S is an extension of T (obtained by grafting new subtrees onto existing edges of T) consistent with the species tree (i.e. reflecting the same phylogeny). A duplication and loss history for the gene family is then directly deduced from R. As many reconciliations exist, a natural approach is to select the one that optimizes a given criterion. Natural combinatorial criteria are the number of duplications (duplication cost), losses (loss cost) or both combined (mutation cost)[9, 10]. The so called Lowest Common Ancestor (LCA) mapping between a gene tree and a species tree, formulated in[11, 12] and widely used[2, 10, 12–16], defines a reconciliation that minimizes both the duplication and mutation costs. It has been shown in[8] that minimizing duplications follows from minimizing losses (i.e. a reconciliation minimizing losses also minimizes duplications, but the converse is false). When no preliminary knowledge on the species tree is given, a natural problem, known as the species tree inference problem, is to infer, from a set of gene trees, a species tree leading to a parsimonious evolution scenario, for a given cost. Similar to the case of a known species tree, methods have been developed for the duplication and mutation costs[9, 10, 17]. For both criteria, the problem of inferring an optimal species tree given a set of gene trees is NP-hard[10].

The main criticism of reconciliation is that the inferred duplication and loss history for a gene family is strongly dependent on the gene tree considered for this family. Indeed, a few misplaced leaves in the gene tree can lead to a completely different history, possibly with significantly more duplications and losses[18]. Reconciliation can therefore inspire confidence only in the case of a well-supported gene tree. Typically bootstrapping values are used as a measure of confidence in each edge of a phylogeny. How should the weak edges of a gene tree be handled? One reasonable answer is to transform the binary gene tree into an unresolved gene tree by removing each weak edge and collapsing its two incident vertices into one. Extensions of the duplication-loss model to non-binary gene trees have been considered[19, 20]. Another strategy adopted in[9] is to explore the space of gene trees obtained from the original gene tree T by performing Nearest Neighbor Interchanges (NNIs) around weakly-supported edges. The problem is then to select, from this space, the tree giving rise to the minimum reconciliation cost.

In this paper, we explore a different strategy for “correcting” or preprocessing a gene tree T, prior to reconciliation or species tree inference. Criteria for identifying potentially misplaced leaves were given in[8]. The duplication vertices of T with respect to a species tree S can be subdivided into apparent and non-apparent duplication (NAD) vertices, where the latter class has been flagged as potentially resulting from the misplacement of leaves in the gene tree. The reason is that each one of the NAD vertices reflects a phylogenetic contradiction with the species tree that is not due to the presence of duplicated gene copies. In the case of an unknown species tree, we showed in[8] that deciding whether T can be explained using only apparent duplications (we say that T is an MD-tree) can be done in polynomial time, as well as inferring an appropriate species tree. Here, we present algorithmic results for removing, for a given gene tree (or a forest of gene trees), the minimum number of leaves or leaf-labels (species) leading to a tree without any NAD vertex, in both cases of a known or an unknown species tree. The minimum leaf removal problem in case of a known species tree has been recently proved to be APX-hard[21].

In the next section, we begin by formally introducing our concepts. We then motivate and state our problems in Section “Motivations and problem statements”. Section “Minimum species removal inference and recon-ciliation” gives a greedy heuristic for the minimum species removal problem in the case of an unknown species tree, and shows that any algorithm for this case can be applied to the case where the species tree is known. Section “Algo-rithms for the minimum removal reconciliation problems” is dedicated to the algorithmic developments in the case of a known species tree. We first describe two special classes of gene trees which lead to an exact polynomial-time algorithm. We then present a heuristic algorithm for the general case. In Section “Empirical results”, we test the optimality of our algorithm for the minimum leaf-removal problem in the case of a known species tree, and the ability of the presented approach to identify misplaced genes. This paper is an extended version of[22].

Definitions

Trees

In this paper, we only consider rooted trees. Let G={1,2,⋯,g} be a set of integers representing g different species (genomes). A species tree on G is a rooted binary tree with exactly g leaves, where each i ∈ G is the label of a single leaf (Figure1a). A gene tree on G is a rooted binary tree where each leaf is labelled by an integer from G, with possibly repeated leaves (Figure1b). A gene tree represents a gene family, where each leaf labelled i represents a gene copy located on genome i. In the case of a species tree or a uniquely leaf-labelled gene tree (i.e. no leaf-label occurring more than once) we will make no distinction between a leaf and its label.

Given a tree U, the size ofU, denoted by |U|, is the number of leaves of U, and the genome set ofU, denoted by$\mathcal{L}\left(U\right)$, is the subset of G defined by the labels of the leaves of U. Given a vertex x of U, U_{
x
} is the subtree of U rooted at x. The genome set of U_{
x
} is denoted by$\mathcal{L}\left(x\right)$ (for example, in the tree of Figure1a,$\mathcal{L}\left(B\right)=\{1,2\}$). If x is not a leaf, we denote by x_{
l
} and x_{
r
} the two children of x. Finally, if x is not the root, any vertex y≠x on a path from x to the root is an ancestor of x.

Given a tree U on a genome set G, a leaf removal consists of removing a given leaf from U, along with its parental node x, and if x is not the root joining the parent of x and the remaining child by a new edge. A tree${U}^{\prime}$ obtained from U through a sequence of leaf removals is said to be included inU. The tree Urestricted to a subset${G}^{\prime}$ of G is the tree${U}^{\prime}$ obtained from U through a sequence of leaf removals that removes all the leaves with labels in$G\setminus {G}^{\prime}$.

Finally, a subtree U_{
x
}of U, for a given vertex x, is said to be a maximum subtree of U verifying a given property P if and only if U_{
x
} verifies property P and, for any vertex y that is an ancestor of x, U_{
y
}does not verify property P.

Reconciliation

Applying a classical phylogenetic method to the gene sequences of a given gene family leads to a gene tree T that is different from the species tree S, mainly due to the presence of multiple gene copies in T, and that may reflect a divergence history different from S. The reconciliation approach consists in “embedding” the gene tree into the species tree, revealing the evolution of the gene family by duplications and losses.

There are several definitions of reconciliation between a gene tree and a species tree[2, 10–15]. Here we define reconciliation in terms of subtree insertions, following the notation used in[14, 23]. We begin with some definitions:

A subtree insertion in a tree T grafts a new subtree onto an existing edge of T. Formally, inserting a subtree onto an edge linking two nodes x and y (y is a child of x) consists in creating a new node z with parent x and two children being y and the root of the inserted subtree.

A tree${T}^{\prime}$ is said to be an extension of T if it can be obtained from T by subtree insertions on the edges of T.

The gene tree T is said to be DS-consistent withS (DS standing for Duplication/Speciation) if T reflects a history with no loss, i.e. if for every vertex t of T such that$\left|\mathcal{L}\right(t\left)\right|\ge 2$, there exists a vertex s of S such that$\mathcal{L}\left(t\right)=\mathcal{L}\left(s\right)$

and one of the two following conditions holds: either

$\mathcal{L}\left({t}_{r}\right)=\mathcal{L}\left({t}_{\ell}\right)$ (indicating a Duplication),

or$\mathcal{L}\left({t}_{r}\right)=\mathcal{L}\left({s}_{r}\right)$ and$\mathcal{L}\left({t}_{\ell}\right)=\mathcal{L}\left({s}_{\ell}\right)$ (indicating a Speciation).

Definition 1

A reconciliation between a gene tree T and a species tree S on G is an extension R(T,S) of T that is DS-consistent with S.

For example, the tree of Figure1c is a reconciliation between the gene tree T of Figure1b and the species tree of Figure1a. Such a reconciliation between T and S implies an unambiguous evolution scenario for the gene family, where a vertex of R(T,S) that satisfies property (D) represents a duplication (duplication vertex), a vertex that satisfies property (S) represents a speciation (speciation vertex), and an inserted subtree represents a gene loss. The number of duplication vertices of R(T,S) is called the duplication cost of R(T,S).

The notion of reconciliation can naturally be extended to the case of a set, or forest, of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$: a reconciliation between$\mathcal{F}$ and S is a set$\mathcal{R}(\mathcal{F},S)=\left\{{R}_{1}\right({T}_{1},S),\dots ,{R}_{f}({T}_{f},S\left)\right\}$ of reconciliations, respectively for T_{1},…,T_{
f
}, such that each R_{
i
}(T_{
i
},S) is DS-consistent with S.

LCA Mapping

The LCA mapping between a gene tree T and a species tree S, denoted by M, maps every vertex t of T to the Lowest Common Ancestor (LCA) of$\mathcal{L}\left(t\right)$ in S. A vertex t of T is called a duplication vertex of T with respect to S if and only if M(t_{
ℓ
})=M(t) and/or M(t_{
r
})=M(t) (see Figure1b). We denote by d(T,S) the number of duplication vertices of T with respect to S. Any vertex of T that is not a duplication vertex is a speciation vertex with respect to S.

The LCA mapping induces a reconciliation M(TS) between T and S, where an internal vertex t of T leads to a duplication vertex in M(TS) if and only if t is a duplication vertex of T with respect to S. In other words, the duplication cost of M(TS) is d(TS) (see for example[10, 13, 15] for more details on the construction of a reconciliation based on the LCA mapping). Moreover, M(TS) is a reconciliation that minimizes the duplication, loss, and mutation costs[8, 14].

Duplication vertices and MD-trees

Let T be a gene tree. As noticed in[8], any vertex t of T such that$\mathcal{L}\left({t}_{\ell}\right)\cap \mathcal{L}\left({t}_{r}\right)\ne \varnothing $ (i.e. the left and right subtrees rooted at t contain a gene copy in the same genome) will be a duplication vertex in any reconciliation between T and any species tree S, in particular in M(TS). Such a vertex is called an Apparent Duplication vertex (AD vertex for short) of T. In the tree of Figure1b, the root is an AD vertex as its left and right subtree both contain a gene copy in genome 1. Following our notation in[8], a gene tree T is said to be a Minimum Duplication Tree (henceforth called an MD-tree) if there exists a species tree S such that d(TS) is exactly the number of apparent duplications present in T. In which case, T is said to be MD-consistent with S.

However, this is not always true, in other words, a duplication vertex of T with respect to a species tree S is not necessarily an AD vertex. We call such a duplication vertex a Non-Apparent Duplication vertex, or simply a NAD vertex. For example, the tree of Figure1b contains one NAD vertex, indicated by a square, and thus T is not MD-consistent with S.

Motivations and problem statements

Non-apparent duplication vertices point to disagreement between a gene tree T and a species tree S that are not due to the presence of repeated leaf labels (i.e. multiple copies in the same genome). More precisely, we say that a vertex x of Tsplits three species {abc} into {ab;c} if the genome set of one of its children contains a and b but not c, and the genome set of its other child contains c but neither a nor b. Then for any NAD x of T, there is a triplet of species {abc} that are split differently by x and by the LCA mapping of x in S (see proof of Theorem 5 in[8]). We will say that such a triplet exhibits a wrong phylogeny. For example, in Figure1, {1,2,3} is split into {1,3;2} by the NAD vertex of T, and into {1,2;3} by the vertex A in S. It has therefore been suggested that NAD vertices may point at gene copies that are erroneously placed in the gene tree.

Observations made in[8] tend to support this hypothesis. In particular, using simulated data-sets based on the species tree of 12 Drosophila species given in[24] and a birth-and-death process, starting from a single ancestral gene, and with different gene gain/loss rates, it has been found that 95% of gene duplications lead to an AD vertex.

Notice however that a misplaced gene in a gene tree T does not necessarily lead to a NAD vertex. In other words, NAD vertices can only point to a subset of misplaced leaves. However, in the context of reconciliation, the damage caused by a misplaced leaf leading to a NAD vertex is to significantly increases the real mutation-cost of the tree, as shown in Figure2.

Following the later observations, we exploit the properties of NAD vertices for gene tree correction. For generality, we consider a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$. If$\mathcal{F}$ is not MD-consistent with a given species tree S (i.e. there is at least one tree in$\mathcal{F}$ that is not MD-consistent with S) then an MD-consistent forest can always be obtained from$\mathcal{F}$ by performing a certain number of leaf removals. Indeed, a gene tree with only two leaves is always MD-consistent with any species tree. Our first optimization problem is the following, where the size of$\mathcal{F}$ is just the sum of sizes of all the trees of$\mathcal{F}$.

MINIMUM LEAF REMOVAL RECONCILIATION(MINLRR):Input: A genome set G, a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G, and a species tree S for G; Output: A forest of gene trees${\mathcal{F}}^{\mathrm{MAX}}=\{{T}_{1}^{\mathrm{MAX}},\dots ,$${T}_{f}^{\mathrm{MAX}}\}$ of maximum size (i.e. obtained from$\mathcal{F}$ by a minimum number of leaf removals) which is MD-consistent with S, where each${T}_{i}^{\mathrm{MAX}}$ is included in T_{
i
}.

In the case of an unknown species tree, we have shown in[8] that deciding whether a forest of gene trees$\mathcal{F}$ is an MD-forest (i.e. a set of MD-trees) can be done in polynomial time and space, as well as computing a parsimonious species tree. For a forest which is not an MD-forest, a natural generalization of the MINLRR problem is the following:

MINIMUM LEAF REMOVAL INFERENCE (MINLRI):Input: A genome set G and a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G; Output: An MD-forest${\mathcal{F}}^{\mathrm{MAX}}=\{{T}_{1}^{\mathrm{MAX}},\dots ,{T}_{f}^{\mathrm{MAX}}\}$ of maximum size (i.e. obtained from$\mathcal{F}$ by a minimum number of leaf removals), where each${T}_{i}^{\mathrm{MAX}}$ is included in T_{
i
}.

A more conservative strategy that can be used to reduce the risk of inferring a wrong species tree, is to remove the minimum number of species from G such that the forest$\mathcal{F}$ restricted to the new genome set is an MD-forest. Removing the minimum number of species instead of leaves can also be considered in the case of reconciliation, a scenario that may be applicable when full confidence is not put in the species tree.

MINIMUM SPECIES REMOVAL RECONCILIATION (MINSRR):Input: A genome set G, a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G and a species tree S for G; Output: A maximum subset${G}^{\prime}$ of G such that forest$\mathcal{F}$ restricted to${G}^{\prime}$ (i.e. the set of trees T_{
i
}restricted to${G}^{\prime}$) is MD-consistent with the species tree S restricted to${G}^{\prime}$.

MINIMUM SPECIES REMOVAL INFERENCE(MINSRI):Input: A genome set G and a forest of gene trees$\mathcal{F}=\{{T}_{1},\dots ,{T}_{f}\}$ on G; Output: A maximum subset${G}^{\prime}$ of G such that the forest$\mathcal{F}$ restricted to${G}^{\prime}$ is an MD-forest.

The latter two optimization problems (MINSRR and MINSRI)) are the subject of the next section. Section “Algorithms for the minimum removal reconcilia-tion problems” focuses on the two optimization problems related to reconciliation (MINLRR and MINSRR).

Minimum species removal inference and reconciliation

By linking the species tree inference problem to a supertree problem we have been able to prove that deciding whether a gene tree T is an MD-tree can be done in polynomial-time[8]. We used a constructive proof based on a min-cut strategy, which has been largely considered in the context of supertrees[25–27]. In this section, we develop a greedy heuristic for MINSRI based on a minimum vertex cut strategy.

Let$\mathcal{F}=\{{T}_{1},{T}_{2},\dots ,{T}_{f}\}$ be a forest of gene trees on a genome set G. Define$\mathrm{leve}{l}_{0}\left(\mathcal{F}\right)$ to be the set of highest (i.e. closest to the root) vertices of all T_{
i
}s that are not AD-vertices.$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ is then the set of vertices of all T_{
i
}s that are closest non-AD descendants of the vertices for$\mathrm{leve}{l}_{j-1}\left(\mathcal{F}\right)$. For a given level j, forest$\mathcal{F}$, and vertex$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, consider the bipartition$B\left(x\right)=\left(\mathcal{L}\right({x}_{l}),\mathcal{L}({x}_{r}\left)\right)$. Then${\mathcal{G}}_{j}=(V,E)$ is the corresponding hypergraph[28] where V=G, and$\mathcal{L}\left({x}_{l}\right),\mathcal{L}\left({x}_{r}\right)\in E$ for$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$.

In order for$\mathcal{F}$ to be an MD-forest, all the vertices of$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, for any j, should represent speciation vertices with respect to some species tree S (as otherwise they would represent additional non-apparent duplication vertices, preventing the forest from being an MD-forest). In other words, the bipartitions B(x) for all x ∈ level_{0}(T) should reveal a first speciation event, which is possible if and only if the graph${\mathcal{G}}_{0}$ contains at least two connected components. Indeed, in this case for any species tree S with a root r splitting G into two disconnected subsets, all the vertices of$\mathrm{leve}{l}_{0}\left(\mathcal{F}\right)$ would be speciation vertices. Conversely, if${\mathcal{G}}_{0}$ contains a single connected component, then for any species tree S, at least one node of level_{0}(T) would be a NAD node. The same reasoning applies to any$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ and${\mathcal{G}}_{j}$.

On the other hand, if${\mathcal{G}}_{j}$ is connected for some$\mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$, there exists no species tree so that all$x\in \mathrm{leve}{l}_{j}\left(\mathcal{F}\right)$ represent speciation events. In this case, some number of species must be removed to make${\mathcal{G}}_{j}$ disconnected. This corresponds exactly to a vertex cut in${\mathcal{G}}_{j}$. These observations leads to the following heuristic for the MINSRI problem.

MINIMUM-VERTEX-CUT in a hypergraph can be computed using the minimum vertex cut algorithm for simple graphs: each hyperedge corresponds to a clique in the simple graph. It is easy to confirm that a set of vertices disconnects the hypergraph if and only if it disconnects the corresponding simple graph. Vertex cut on a simple graph can be implemented with 2n−2 calls to the standard st vertex cut algorithm (based on minimum st edge cut). By reusing computation, Hao and Orlin[29] showed how to do all 2n−2 calls to the st cut algorithm in the same time it takes to do a single call. Thus, MINIMUM-VERTEX-CUT can be solved in$O\left(\right|V\left|\right|E|lg(|V{|}^{2}/|E\left|\right)$ time. Since we call MINIMUM-VERTEX-CUTO(|V|) times in the worst case, Algorithm Minimum Species Removal Inference runs in$O\left(\right|V{|}^{2}\left|E\right|lg\left(\right|V{|}^{2}/\left|E\right|)$ time.

In the next section we give algorithms for MINSRR and MINLRR.. We conclude this section by highlighting the relationship between MINSRR and MINSRI.

Remark 1

MINSRR reduces to MINSRI.

This is easy to see; take the species tree S given by the instance of MINSRR and add it to the forest$\mathcal{F}$ for the MINSRI problem. The solution to MINSRI gives a species tree that must be a subtree of S. Thus, any algorithm for MINSRI can be used to solve MINSRR.

Algorithms for the minimum removal reconciliation problems

In this section, we assume that a species tree S is known for the genome set G. For simplicity, we present the algorithms in the case of a single gene tree T, although it is straightforward to generalize them to the case of a forest of gene trees.

Let T be a gene tree for a gene family on G. We suppose that T is not an MD-tree consistent with S (i.e. there is at least one duplication vertex of T that is a NAD vertex). We begin by describing special classes of gene trees for which exact polynomial-time algorithms have been developed for the MINLRR and MINSRR problems.

Uniquely leaf-labelled gene trees

When the considered gene family contains at most one gene per genome, the gene tree T is uniquely leaf-labelled. In this case, minimizing the number of leaves, or equivalently species, that should be removed from T to obtain an MD-tree consistent with S is equivalent to finding the maximum number of genomes that lead to the same phylogeny in T and S. In other words, it is immediate to see that the MINLRR problem reduces, in this case, to the MAST problem given below.

MAXIMUM AGREEMENT SUBTREE (MAST):Input: A uniquely leaf-labelled gene tree T on G and a species tree S for G; Output: A tree T^{
MAX
}included in T such that it is MD-consistent with S and of maximum size.

A more general definition is given in the literature, where the MAST problem is defined on a set of uniquely leaf-labelled trees as the largest tree included in each tree of the set. This definition is equivalent to ours in the case of a gene tree T and a species tree S.

The MAST problem arises naturally in biology and linguistics as a measure of consistency between two evolutionary trees over species or languages[30]. In the evolutionary study of genomes, different methods and different gene families are used to infer a phylogenetic tree for a set of species, usually yielding different trees. In such a context, one has to find a consensus of the various obtained trees. Considering the MAST problem, introduced by Finden and Gordon[31], is one way to obtain such a consensus. Amir et al.[32] showed that computing a MAST of three trees with unbounded degree is NP-hard. However, in the case of two binary trees, the problem is polynomial. The first polynomial-time algorithm for this problem was given by Steel and Warnow[33]. It is a dynamic programming algorithm considering the solution for all pairs of subtrees of T and S; it has a running time of O(n^{2}), where n is the number of leaves in the trees. Later, Cole et al.[30] developed an$O(nlgn)$ time algorithm, which, as far as we know, is the most efficient algorithm for solving the MAST problem on two binary trees. We use this result in the MINLRR version of our algorithms. In the case of k binary trees, the current fastest known algorithms run in O(kn^{3}) time[34, 35]. We use this result in the MINSRR version of our algorithms.

No AD above NAD

In this section, we consider a tree T containing no AD vertex above a NAD vertex (Figure3a). More precisely, T satisfies Constraint C below:

CONSTRAINT C: For each NAD vertex x of T, if y is an ancestor of x that is a duplication vertex, then y is a NAD vertex.

We show that the MINSRR problem reduces, in this case, to the MAST problem, while the MINLRR problem reduces to a “generalization” of the MAST problem to weighted trees, where a weighted tree is a uniquely leaf-labelled tree with weighted leaves.

Definition 2

Let U be a tree on G. The weighted tree induced by (U,S) is the tree included in S obtained from S by removing all leaves that are not in$\mathcal{L}\left(U\right)$, with a weight attributed to each leaf s, representing the number of occurrences of s in U (i.e. the number of leaves of U labelled s).

Let T_{1},T_{2},⋯T_{
m
} be the maximum subtrees of T rooted at an AD vertex (i.e. subtrees of T rooted at the highest AD vertices). Then, the tree T^{
I
} obtained by replacing each T_{
i
}, for 1 ≤ i ≤ m, by the weighted tree induced by (T_{
i
},S), is a weighted uniquely leaf-labelled tree. An example is given in Figure3a,b and c. Let ρ_{
s
} be the operation of removing the weighted leaf s from T^{
I
}. Then the corresponding removals in T consist of removing from T all leaves labelled s.

Finally, we formulate the generalization of the MAST problem to weighted trees as follows, where the value v(W) of a weighted tree W is the sum of its leaves’ weights.

WEIGHTED MAXIMUM AGREEMENT SUBTREE(WMAST):Input: A weighted tree W on G and a species tree S for G; Output: A weighted tree W^{
MAX
}included in W such that it is MD-consistent with S and of maximum value.

We are now ready for the main theorem.

Theorem 1

Let T be a gene tree satisfying CONSTRAINT C. Let W^{
MAX
}be a solution of the WMAST problem on T^{
I
}and S, and T^{
MAX
}be the subtree included in T obtained by removing from T all the leaves that are not leaves of W^{
MAX
}. Then T^{
MAX
}is a solution of the MINLRR problem on T and S.

In other words, solving the MINLRR problem on T is equivalent to solving the WMAST problem on T^{
I
}. We show in the proof of Theorem 2 that WMAST can be solved by the traditional MAST algorithms with no change in the asymptotic running time.

A complete example of the algorithmic methodology used for solving the MINLRR problem on T and S is given in Figure3. The algorithm will be detailed in the next section.

We now provide a proof of Theorem 1, subdivided into the two following lemmas.

Lemma 1

The tree T^{
MAX
}is MD-consistent with S.

Proof

We show, by contradiction, that T^{
MAX
}does not contain any NAD vertex. Suppose that T^{
MAX
}contains a NAD vertex x. Then x maps to the same vertex s of S than one of its children, lets say the left child. Then there exists two leaves of${T}_{{x}_{l}}^{\mathrm{MAX}}$, labelled a and b, and one leaf of${T}_{{x}_{r}}^{\mathrm{MAX}}$ labelled c such that the triplet {a,b,c} exhibits a wrong phylogeny. As a non-duplication vertex in T cannot become a duplication vertex after leaf removals, we have only two possibilities for x in T:

1.

x is a NAD vertex in T. Then the genome sets of ${T}_{{x}_{l}}$ and ${T}_{{x}_{r}}$ are disjoint. Moreover, the genome set of ${W}_{{x}_{l}}^{\mathrm{MAX}}$ (resp. ${W}_{{x}_{r}}^{\mathrm{MAX}}$) is a subset of the genome set of ${T}_{{x}_{l}}$ (resp. ${T}_{{x}_{r}}$). On the other hand, as x is not a duplication vertex in W^{
MAX
}, one of the three genes a, b and c should be absent in ${W}_{x}^{\mathrm{MAX}}$. And thus, {a,b,c} can not be a subset of the genome set of ${T}_{x}^{\mathrm{MAX}}$, a contradiction.

2.

x is an AD vertex in T. Then the subtree T_{
x
}of T rooted at x contains at least two leaves labelled with the same label d (different from a, b and c), one in ${T}_{{x}_{l}}$ and one in ${T}_{{x}_{r}}$. Moreover the leaf labelled d in S should belong to the subtree of S rooted at s, and thus to the subtree S_{
i
}rooted at the left or right child of s. Such subtree S_{
i
}contains at least one leaf labelled a or b or c.

On the other hand, let y be the parent of x in T^{
I
}. As an optimal solution of the WMAST problem on T^{
I
}removes leaves from the subtree${T}_{x}^{I}$, such an operation should result in removing the duplication vertex y. In other words, x and y should map to the same vertex s in S. Moreover the result of the leaf removal from${T}_{x}^{I}$ should result in a different LCA mapping for x and y. Indeed, removing leaves from the corresponding subtree in T^{
I
} does not contribute to eliminating any NAD from T^{
I
}. It follows that S should exhibit the phylogeny ((a,b,c),d), which is a contradiction with the result of the last paragraph. □

Lemma 2

Let${T}^{\prime}$ be a tree included in T that is MD-consistent with S. Then$\left|{T}^{\prime}\right|\le \left|{T}^{\mathrm{MAX}}\right|$.

Proof

We will show that, for any s ∈ G, if a leaf i labelled s is removed from T (i.e.i is not a leaf in${T}^{\prime}$), then all leaves of T labelled s are removed from T.

Suppose this is not the case. Let y be the vertex of T representing the least common ancestor of all leaves labelled s in T. Then y is an AD node. As a leaf i labelled s is removed from T, such removal should contribute to resolving a NAD vertex x of T. From CONSTRAINT C, such a vertex should be outside the subtree of T rooted at y. Moreover, it should clearly be an ancestor of y (otherwise removing i will have no effect on x).

As x is a NAD vertex, it maps to the same vertex s of S as one of its children, say the left child. Then, there exist two leaves of${T}_{{x}_{l}}$ labelled a and b, and one leaf of${T}_{{x}_{r}}$ labelled c such that the triplet {a,b,c} exhibits a wrong phylogeny. Moreover, as removing leaf i labelled s contributes to solving x, we can assume that a=s. However, from our assumption, there remains a leaf labelled s in${T}^{\prime}$. Thus: either (1) at least one leaf labelled b and one leaf labelled c remains in${T}^{\prime}$, or (2) all leaves labelled b or all leaves labelled c are removed. In case (1), the wrong phylogeny exhibited by the triplet {a,b,c} is still present, preventing vertex x from being a non-duplication vertex. In case (2), as all copies of b (or equivalently c) are removed, there is no need to remove leaf i labelled s to correct the wrong phylogeny exhibited by the triple {a,b,c}.

Therefore, the weighted tree${W}^{\prime}$ induced by${T}^{\prime}$ is obtained from T^{
I
}through a sequence of leaf removals. Now, as W^{
MAX
}is the solution of the WMAST problem on T^{
I
}, then$v\left({W}^{\mathrm{MAX}}\right)\ge v\left({W}^{\prime}\right)$, and thus$\left|{T}^{\mathrm{MAX}}\right|\ge \left|{T}^{\prime}\right|$. □

Finally, the following corollary makes the link between the MINSRR problem and the MAST problem.

Corollary 1

Let T be a gene tree satisfying CONSTRAINT C. Let W^{
MAX
}be a solution to the MAST problem on T^{
I
}and S (ignoring weights), and T^{
MAX
}be the subtree of T induced by W^{
MAX
}. Then T^{
MAX
}is a solution to the MINSRR problem on T and S.

To apply the algorithm to MINSRR with the forest$\mathcal{F}=\{{T}_{1},{T}_{2},\dots ,{T}_{k}\}$, all trees${T}_{i}^{I}$ must simultaneously agree with S, so the O(kn^{3})MAST algorithm[34, 35] must be used.

An Algorithm for the general case

In this section, we present a general algorithm, that is exact in the case of a uniquely leaf-labelled gene tree (Section “Uniquely leaf-labelled gene trees”) or a gene tree satisfying CONSTRAINT C (Section “No AD above NAD”), and a heuristic in the general case. We first introduce preliminary definitions. For a given tree U (weighted or not), consider the two following properties on U:

Property ONLY-NAD:U has no AD vertices; Property ONLY-AD:U is rooted at an AD vertex and contains no NAD vertex.

We define the NAD-border of U as the set of roots of the maximum subtrees of U verifying Property ONLY-NAD, and the AD-border of U as the set of roots of the maximum subtrees of U verifying Property ONLY-AD.

ALGORITHM CORRECT-TREE is a recursive algorithm that takes as input a gene tree T and a species tree S, and outputs a number of leaf removals transforming T into a tree that is MD-consistent with S. It proceeds as follows:

Stop condition - Lines 2 to 4: If T is MD-consistent with S, then no leaf removal is performed, and the algorithm terminates.

Recurrence Loop - Lines 6 to 13: Resolve all maximum subtrees of T verifying CONSTRAINT C as described in Section “No AD above NAD”, that is:

1.

Construct the weighted tree T^{
I
}(Lines 6-8);

2.

For each root x of a maximum subtree ${T}_{x}^{I}$ of T^{
I
}satisfying CONSTRAINT C (Line 9), solve the WMAST problem on ${T}_{x}^{I}$, which leads to the weighted tree ${W}_{x}^{\mathrm{MAX}}$ (Line 10), compute the induced tree T_{
x
}(Line 11) and store the number of performed leaf removals (Line 12).

Algorithm Correct-Tree (T, S)

1.

LeafRemoval=0;

2.

IF T is a tree MD-consistent with S THEN

3.

RETURN (LeafRemoval)

4.

END IF

5.

T^{
I
}=T;

6.

FOR ALL x ∈ AD-border(T) DO

7.

Replace ${T}_{x}^{I}$ by its induced weighted tree;

If T is a uniquely leaf-labelled tree then T^{
I
}=T, NAD-border(T^{
I
}) is reduced to the root of the tree, and thus loop 9–13 is just executed once. Moreover, as T^{
I
} is unweighted (all labels are equal to 1), WMAST is reduced to MAST. The whole algorithm thus reduces to one resolution of the MAST problem.

If T satisfies CONSTRAINT C, then NAD-border(T^{
I
}) is also reduced to the root of the whole tree, and thus loop 9–13 is just executed once. In this case, the methodology is the one following Theorem 1, and illustrated in Example 3.

In the general case, NAD-border(T^{
I
}) is not restricted to a single vertex, and loop 9–13 can be executed many times. Moreover, at the end of loop 9–13, the resulting tree is not guaranteed to be MD-consistent with S, as NAD vertices higher than those in NAD-border(T^{
I
}) may exist. Algorithm Correct-Tree may therefore be applied many times.

Theorem 2

Algorithm Correct-tree has time-complexity$O({n}^{2}lgn)$, where n is the size of T.

Proof

Let n be the size of T. Loop 2–4 requires the LCA mapping between T and S, and the identification of AD and NAD vertices. As the LCA mapping can be computed in linear time[8, 36], testing whether a tree T is MD-consistent with S can be tested in time O(n). Clearly, Loop 6–8 can be executed in time O(n). As for Loop 9–13, it has the time complexity O(C) of WMAST. Therefore, the complexity for one execution of the recursive ALGORITHM CORRECT-TREE is O(C). As in the worst case the algorithm can be executed Ω(n) times, the total worst case running time is O(nC).

Let us consider the complexity of WMAST. The O(n^{2}) algorithm of Steel and Warnow[33] naturally generalizes to the case of weighted trees, and leads to the same complexity, O(n^{2}). However, we show in the rest of this proof that an instance of WMAST can be transformed into an instance of MAST in linear-time, which allows us to consider C as being the best complexity found for MAST, namely the$O(nlog(n\left)\right)$ running-time of the algorithm given in[30].

Let G be a genome set, W be a weighted tree on G and S be a species tree for G. Then consider the expanded genome setG_{
exp
}obtained from G by replacing each genome g with weight c in W by a set of genomes {g_{1},⋯g_{
c
}}, the expanded gene treeW_{
exp
}obtained from W by replacing each leaf g with weight c>1 by an expanded leaf , i.e. a caterpillar tree of size c containing the leaves g_{1},⋯g_{
c
}(i.e. the tree (g_{
c
},(⋯g_{3},(g_{2},g_{1})⋯ )), and the expanded species treeS_{
exp
}obtained from S by replacing each leaf g with weight c>1 in W by a caterpillar tree of size c containing the leaves g_{1},⋯g_{
c
}. Then W_{
exp
}and S_{
exp
}are uniquely leaf-labelled trees. It is easy to see that a solution${W}_{\mathrm{exp}}^{\mathrm{MAX}}$ of MAST will contain, for any g ∈ G, either c or 0 leaves labelled g (i.e. either 0 or all leaves labelled g removed from W_{
exp
}). Therefore the compressed tree${W}_{\mathrm{exp}}^{\mathrm{MAX}}$, obtained by recovering a single weighted leaf from each expanded leaf of W_{
exp
}that was not removed by MAST, is a solution to WMAST. Further, since we add at most a constant number of genes per leaf, we will affect the running time of MAST by at most a constant factor. □

Empirical results

We test the optimality of Algorithm Correct-Tree in the case of a gene tree satisfying Property AD-above-NAD (i.e. containing at least one AD vertex above a NAD vertex). Indeed, the algorithm is guaranteed to give the optimal solution otherwise (i.e. for trees satisfying the constraints of Section “Uniquely leaf-labelled gene trees” or Section “No AD above NAD”). We compared the number N of leaf-removals obtained by Algorithm Correct-Tree with the number N_{
opt
}obtained by the exact algorithm that tries all possible leaf-subset removals. More precisely, if the minimum number of leaf-removals output by Algorithm Correct-Tree is r, we try all subsets of r−1,r−2,…,r−i leaf removals, and stop as soon as a tree that is MD-consistent with S is obtained. As the naive algorithm has clearly an exponential-time complexity, tests are performed on trees of limited size.

We considered a genome set of fixed size 5, and gene trees with 6 to 24 leaves. For each size s (from 6 to 24, with steps of 2), we generated 500 random gene trees of s leaves, and kept only those satisfying Property AD-above-NAD. The left diagram of Figure4 shows that Algorithm Correct-Tree gives an exact solution for more than 65% of the trees (among all of those satisfying Property AD-above-NAD). Moreover, when N_{
opt
} differs from N_{
opt
}, in most cases the difference is 1. The right diagram of Figure4 is obtained by averaging, for each size s, the results obtained for all the gene trees of that size. We can see that the error-rate, computed as (N−N_{
opt
}/N, is independent from the size of the tree, and did not exceed 0.15, based on our simulation parameters. After testing other dependency factors (non-shown results), it appears that the error-rate only depends on the number of times the loop 9–13 of Algorithm Correct-Tree is executed, which is not directly related to the number of NADs or ADs in the tree.

Finally, we tested the ability of the approach to identify misplaced genes. To do so, we considered a genome set of fixed size 10, and gene trees of size s varying from 10 to 100 (with a step of 10). For a random species tree S and a random tree T of size s that is MD-consistent with S, we incorporate randomly NbAdded=s/10 leaves with randomly chosen labels. We then test how many “misplaced” leaves our method is able to detect. For each size s, results are averaged over 100 trees. Figure5 shows the detection percentage of Algorithm Correct-Tree, which is computed as (N/NbAdded)×100. This detection percentage decreases with increasing size of the gene tree. This is mainly due to the fact that as an MD-consistent tree needs no leaf removal, its detection percentage is always 100%, and that the more leaves we add (1 for a gene tree of size 10, but 10 for a gene tree of size 100) the less chance we have to end up with an MD-consistent tree. Removing the cases of MD-consistent trees lead to a detection percentage around 40%.

Conclusion

Based on observations pointing to NAD vertices of a gene tree as indicating potentially misplaced genes, we developed a polynomial-time algorithm for inferring the minimum number of leaf-removals required to transform a gene tree into an MD-tree, i.e. a tree with no NAD vertices. The algorithm is exact in the case of a uniquely leaf-labelled gene tree, or in the case of a gene tree that does not contain any AD vertex above a NAD vertex. In the general case, our algorithm exhibited near-optimal results under our simulation parameters. Unfortunately, NAD vertices can only reveal a subset of misplaced genes, as a randomly placed gene does not necessarily lead to a NAD vertex. Our experiments show that, on average, we are able to infer 40% of misplaced genes. However, the additional damage caused by a misplaced leaf leading to a NAD is an excessive increase of the real mutation-cost of the tree. Therefore, removing NADs can be seen as a preprocessing of the gene tree preceding a reconciliation approach, in order to obtain a better view of the duplication-loss history of the gene family.

Another use of our method would be to choose, among a set of equally supported gene trees output by a given phylogenetic method, the one that can be transformed to an MD-consistent tree by a minimum number of leaf removals.

A limitation of our approach is that a NAD resulting from a wrong bipartition {a,b;c} can be, a priori, solved by removing any gene from this bipartition. Our present approach is able to detect a number of misplaced genes but, in general, it is insufficient to detect precisely the genes that have been erroneously added in the tree. An extension would be to infer all optimal subsets of leaf removals, and to use bootstrapping values on the edges of the tree for a judicious choice of the genes to be removed.

Funding

Research supported by grants to N.E.M. from the Natural Sciences and Engineering Research Council of Canada, and “Fonds de Recherche Nature et Technologie” of Quebec.

Declarations

Authors’ Affiliations

(1)

Département d’Informatique et de Recherche Opérationnelle, Université de Montréal

(2)

Departement of Computer Science, McGill University

References

Li WH, Gu Z, Wang H, Nekrutenko A: Evolutionary analysis of the human genome. Nature. 2001, 409: 847-849. 10.1038/35057039PubMedView Article

Durand D, Haldórsson BV, Vernot B: A hybrid micro-macro-evolutionary approach to gene tree reconstruction. J Comput Biol. 2006, 13: 320-335. 10.1089/cmb.2006.13.320PubMedView Article

Zhang J: Evolution by gene duplication: an update. TRENDS Ecol Evol. 2003, 18 (6): 292-298. 10.1016/S0169-5347(03)00033-8View Article

Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979, 28: 132-163. 10.2307/2412519View Article

Doyon J-P, Scornavacca C, Gorbunov K, Szolloso G, Ranwez V, Berry V: An effi. algo. for gene/species trees parsim. reconc. with losses, dup. and transf. J Comp Biol. 2010, 6398: 93-108.

Hallett M, Lagergren J, Tofigh A: Simultaneous identification of duplications and lateral transfers. Proceedings of the Eight Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Bourne PE, Gusfield D. New York: ACM, 2004, 347-356.View Article

Tofigh A, Hallett M, Lagergren J: Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinf. 2011, 8: 517-535.View Article

Chauve C, El-Mabrouk N: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. Proceedings of the Thirteenth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Batzoglou S S. volume 5541 of LNCS, Springer, 2009, 46-58.

Chen K, Durand D, Farach-Colton M: Notung: Dating gene duplications using gene family trees. J Comput Biol. 2000, 7: 429-447. 10.1089/106652700750050871PubMedView Article

Ma B, Li M, Zhang L: From gene trees to species trees. SIAM J Comput. 2000, 30: 729-752. 10.1137/S0097539798343362View Article

Guigó R, Muchnik I, Smith TF: Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996, 6: 189-213. 10.1006/mpev.1996.0071PubMedView Article

Page RDM, Charleston MA: Reconciled trees and incongruent gene and species trees. DIMACS Ser Discrete Mathematics and Theor Comput Sci. 1997, 37: 57-70.

Bonizzoni P, Vedova Della G, Dondi R: Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci. 2005, 347: 36-53. 10.1016/j.tcs.2005.05.016View Article

Gorecki P, Tiuryn J: DLS-trees: a model of evolutionary scenarios. Theor Comput Sci. 2006, 359: 378-399. 10.1016/j.tcs.2006.05.019View Article

Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol. 1994, 43: 58-77.

Page RDM: Genetree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998, 14: 819-820. 10.1093/bioinformatics/14.9.819PubMedView Article

Hallett MT, Lagergren J: New algorithms for the duplication-loss model. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, RECOMB. Edited by: Shamir R, Miyano S, Istrail S, Pevzner P, Waterman MS. New York: ACM, 2000, 138-146.View Article

Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 2007, 8: R141. 10.1186/gb-2007-8-7-r141PubMedPubMed CentralView Article

Chang WC, Eulenstein O: Reconciling gene trees with apparent polytomies. Proceedings of the 12th Conference on Computing and Combinatorics (COCOON), volume 4112 of Lecture Notes in Computer Science. Edited by: Chen DZ, Leepages DT. Taipei, Taiwan, 2006, 235-244.

Lafond M, Swenson KM, El-Mabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. WABI, volume 7534 of LNBI/LNBI. 2012, 106-122.

Dondi R, El-Mabrouk N: Minimum leaf removal for reconciliation: complexity and algorithms. Combinatorial Pattern Matching (CPM). accepted 2012.

Doroftei A, El-Mabrouk N: Removing noise from gene trees. WABI, volume 6833 of LNBI/LNBI. 2011, 76-91.

Chauve C, Doyon J-P, El-Mabrouk N: Gene family evolution by duplication, speciation and loss. J Comput Biol. 2008, 15: 1043-1062. 10.1089/cmb.2008.0054PubMedView Article

Hahn MW, Han MV, Han S-G: Gene family evolution across 12 drosophilia genomes. PLoS Genet. 2007, 3: e197. 10.1371/journal.pgen.0030197PubMedPubMed CentralView Article

Modified mincut supertrees. LNCS, volume 2452 of WABI. 2002, 537-551.

Semple C, Steel M: A supertree method for rooted trees. Discrete Appl Math. 2000, 105: 147-158. 10.1016/S0166-218X(00)00202-XView Article

Snir S, Rao S: Using max cut to enhance rooted trees consistency. IEEE/ACM Trans Comput Biol Bioinf. 2006, 3: 323-333.View Article

Hao J, Orlin JB: A faster algorithm for finding the minimum cut in a graph. Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms. 1992, 165-174.

Cole R, Farach-Colton M, Hariharan R, Przytycka T, Thorup M: An o(n\log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput. 2000, 30 (5): 1385-1404. 10.1137/S0097539796313477View Article

Finden CR, Gordon AD: Obtaining common pruned trees. J Classif. 1995, 2: 255-276.View Article

Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: matrics and efficient algorithms. SIAM J Comput. 1997, 26: 1656-1669. 10.1137/S0097539794269461View Article

Steel M, Warnow T: Kaikoura tree theorems:computing the maximum agreement subtree. Inform Process Lett. 1993, 48: 77-82. 10.1016/0020-0190(93)90181-8View Article

Bryant D: Building trees, hunting for trees and comparing trees: Theory and method in phylogenetic analysis. PhD dissertation, Department of Mathematics, University of Canterbury, UK; 1997.

Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett. 1995, 55 (6): 297-301. 10.1016/0020-0190(95)00110-XView Article

Zhang LX: On Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol. 1997, 4: 177-188. 10.1089/cmb.1997.4.177PubMedView Article

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.