Reconciliation and local gene tree rearrangement can be of mutual profit
 Thi Hau Nguyen^{1, 2, 3},
 Vincent Ranwez^{2, 3},
 Stéphanie Pointet^{1, 3},
 Anne-Muriel Arigon Chifolleau^{1, 3},
 Jean-Philippe Doyon^{1, 3} and
 Vincent Berry^{1, 3}
DOI: 10.1186/1748-7188-8-12
© Nguyen et al.; licensee BioMed Central Ltd. 2013
Received: 17 December 2012
Accepted: 5 February 2013
Published: 8 April 2013
Abstract
Background
Reconciliation methods compare gene trees and species trees to recover evolutionary events such as duplications, transfers and losses explaining the history and composition of genomes. It is well known that gene trees inferred from molecular sequences can be partly erroneous due to incorrect sequence alignments as well as phylogenetic reconstruction artifacts such as long branch attraction. In practice, this leads reconciliation methods to overestimate the number of evolutionary events. Several methods have been proposed to circumvent this problem, by collapsing the unsupported edges and then resolving the obtained multifurcating nodes, or by directly rearranging the binary gene trees. Yet these methods have been defined for models of evolution accounting only for duplications and losses, i.e. they cannot be applied to prokaryotic gene families.
Results
We propose a reconciliation method accounting for gene duplications, losses and horizontal transfers that specifically takes into account the uncertainties in gene trees by rearranging their weakly supported edges. Rearrangements are performed on edges having a low confidence value, and are accepted whenever they improve the reconciliation cost. We prove useful properties of the dynamic programming matrix used to compute reconciliations, which make it possible to speed up the tree space exploration when rearrangements are generated by Nearest Neighbor Interchange (NNI) edit operations. Experiments on synthetic data show that gene trees modified by such NNI rearrangements are closer to the correct simulated trees and lead to better event predictions on average. Experiments on real data demonstrate that the proposed method leads to a decrease in the reconciliation cost and the number of inferred events. Finally, on a dataset of 30 k gene families, this reconciliation method yields a ranking of prokaryotic phyla by transfer rates identical to that proposed by a different approach dedicated to transfer detection [BMCBIOINF 11:324, 2010, PNAS 109(13):4962–4967, 2012].
Conclusions
Prokaryotic gene trees can now be reconciled with their species phylogeny while accounting for the uncertainty of the gene tree. More accurate and more precise reconciliations are obtained with respect to previous parsimony algorithms not accounting for such uncertainties [LNCS 6398:93–108, 2010, BIOINF 28(12): i283–i291, 2012].
Software implementing the method is freely available at http://www.atgcmontpellier.fr/Mowgli/.
Keywords
Evolution; Reconciliation; Gene Tree Correction; Method; Software; Duplication; Transfer; Loss; Nearest Neighbor Interchange
Background
A phylogenetic tree or phylogeny is a tree depicting evolutionary relationships among biological entities that are believed to have a common ancestor. A gene family is a group of genes descending from a common ancestor, that retains similar sequences and often similar functions [1]. A species tree depicts the evolutionary history of a group of species, whereas a gene tree depicts the evolutionary history of a gene family. Gene trees often differ from the species tree due to family-specific evolutionary events such as gene duplications, gene losses and horizontal gene transfers. By comparing a gene tree with the species tree, reconciliation methods try to recover those major evolutionary events. Reconciliation is indeed the process of constructing a mapping between a gene tree and a species tree to explain their differences and similarities with evolutionary events such as speciation ($\mathbb{S}$), duplication ($\mathbb{D}$), loss ($\mathbb{L}$), and horizontal gene transfer ($\mathbb{T}$) events. Reconciliations are most often inferred on the basis of a parsimony criterion: a cost is given to each event type, the total cost of a reconciliation is the sum of the costs of the individual events it uses, and a reconciliation of minimum total cost is sought. This computational problem is often called Most Parsimonious Reconciliation, or MPR for short, and many works have been devoted to it recently [2–8].
The first proposed models focused on parsimonious reconciliations involving only duplications and losses (the DL model) [9–11] or only horizontal transfers and losses [12]. Probabilistic methods have also been developed for the DL model, such as that of Arvestad et al. [13] (see Doyon et al. [14] for a review). Most recent works using a parsimony approach have been devoted to models incorporating duplications, losses and transfers all together (the DTL model) [2, 4, 5, 8], which is necessary to handle prokaryotes. When accounting for transfer events, the history proposed by a reconciliation is consistent if, for any transfer, the donor and receiver species coexist. Ensuring such time consistency is difficult and leads to an NP-hard problem in the general case [7, 15], which cannot be solved by just examining couples of species tree edges. However, when divergence dates are available for the nodes of the species tree, the problem becomes tractable [2, 16]. The difficulty of handling transfers has led to a split within proposed DTL methods, namely those that ensure time consistency [2, 16] and those that do not [3, 4, 7]. The fastest parsimony algorithm in the latter category runs in O(m n log n), where m and n are the sizes of the gene and species trees respectively [3], while the fastest time-consistent algorithm runs in O(m n^{2}) [2]. Probabilistic methods have also been extended recently to the DTL model. Inspired by the work of Tofigh [17], Szőllösi et al. recently proposed a time-consistent procedure to estimate the species tree by reconciliations from a set of gene trees [18].
A major problem, when applying reconciliation methods, is that parts of the gene trees can be incorrect. This leads reconciliation methods to overestimate the number of evolutionary events [19, 20]. Errors within a binary gene tree can be due to sequence alignment problems, phylogenetic reconstruction artifacts (e.g. long branch attraction) or a lack of phylogenetic signal (especially for genes encoded by short sequences). Such phenomena are well known in phylogenetics and several support measures, such as bootstrap values or Bayesian posterior probabilities, have been proposed to detect unreliable edges in a gene tree. Up to now, very few works have tackled the reconciliation problem in the presence of unsupported edges, and most of them consider only the DL model [19, 21–26]. Durand et al. proposed an exponential exact algorithm to find the best rearrangement of a gene tree while preserving its strongly supported edges [19]. Another approach is to collapse unsupported edges, thereby creating nodes with more than two children (i.e., polytomies), and then to rely on a generalization of the least common ancestor mapping (LCA) to avoid the need for examining all possible binary rearrangements of the polytomies [21–23, 26]. In this way, Chang et al. and Lafond et al. proposed polynomial time algorithms to solve the MPR problem for a binary species tree and a non-binary gene tree [22, 26]. When both the species tree and the gene tree are non-binary, Berglund et al. proved that finding a refinement of the gene tree using fewer than a given number of duplications is an NP-complete problem [21]. They also proposed a heuristic approach to refine the gene tree by first minimizing duplications and then losses. Zheng et al. showed that jointly minimizing duplication and loss costs is NP-hard when reconciling a non-binary species tree with a binary gene tree [25]. For this specific case, Vernot et al. proposed a fixed parameter tractable (FPT) algorithm whose complexity is exponential only in the maximum degree of nodes [23]. More recently, Stolzer et al. extended this FPT algorithm by allowing transfers [27].
Overall, several works relied on tree edit operations to deal with uncertainties in the gene trees. Durand et al. used Nearest Neighbor Interchange (NNI) edit operations to rearrange the local topology of the gene trees in the regions of low supports [19]. Górecki and Eulenstein proposed an efficient algorithm to do a similar task and at the same time root the gene trees, while restraining their search to trees that are at most k NNI moves away from the original gene trees [28]. Chaudhary et al. investigated Subtree Prune and Regraft (SPR) and Tree Bisection and Reconstruction (TBR) edit operations to search for the gene tree rearrangement that minimizes the number of duplications, regardless of losses [24].
It seems hard to have an exact polynomial time algorithm for the MPR problem under the DTL model even when the polytomies are present only in the gene tree or in the species tree. Following the works cited above to deal with uncertainties in the gene trees, we propose a heuristic method relying on NNI edit operations to search for a gene tree rearrangement that preserves strongly supported edges and minimizes the cost of reconciliation to a fixed binary species tree, but in the context of the more complex DTL model. The resulting dynamic program, called MowgliNNI, is a generalization of Mowgli [2], a program initially developed for fixed binary gene trees.
Experiments on simulated data show that MowgliNNI provides a gene tree that is closer to the true evolutionary history of the gene family, and leads to more accurate $\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$ predictions. Experiments on real data show a significant decrease in the number of predicted events and an increased precision, that is a decrease in the number of equally most parsimonious reconciliations. We conducted a large scale experiment where 30 k prokaryotic gene families covering several phyla were reconciled using MowgliNNI. These phyla were then ordered according to their inferred transfer rate. We obtained the same phyla ordering as the one obtained using Prunier, a method dedicated to transfer prediction [29, 30], and our reconciliation-based approach has the advantage of providing extra information: explicit donor and receiver branches for transfers, and the prediction and localization of duplications and losses.
Methods
Basic notations
Trees considered in this paper are rooted and labeled at their leaves only, each leaf being labeled with the name of a studied species. Given a tree T, its node set, edge set, leaf node set and root are resp. denoted V(T), E(T), L(T) and r(T). The label of a leaf u of T is denoted by $\mathcal{L}(u)$ and the set of labels of leaves of T is denoted by $\mathcal{L}(T)$. When a node u has two children, they are denoted $u_1$ and $u_2$.
Given two nodes u and v of T, we write $u \le_T v$ (resp. $u <_T v$) if and only if v is on the unique path from u to r(T) (resp. and u ≠ v); if neither $u <_T v$ nor $v <_T u$ then u and v are said to be incomparable. As we consider rooted trees T only, we adopt the convention that an edge denoted (v,u) means that $u <_T v$. For a node u of T, $T_u$ denotes the subtree of T rooted at u, p(u) the parent node of u, while $(u_p, u)$ is the parent edge of u. A tree $T'$ is a refinement of a tree T if T can be obtained from $T'$ by collapsing some edges in $T'$, i.e. by merging the two extremities of these edges [31].
A species tree is a rooted binary tree depicting the evolutionary relationships of ancestral (internal nodes) species leading to a set of extant (leaf) species. A species tree S is considered here to be dated, that is associated to a time function $\theta_S : V(S) \to \mathbb{R}^{+}$ such that if $y <_S x$ then $\theta_S(y) < \theta_S(x)$. Such times are usually estimated on the basis of molecular sequences [32] and fossil records. Note that to ensure the time consistency of inferred transfers, absolute dates are not required, the important information being the ordering of the nodes of S induced by the dating.
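The ordering constraint on the time function can be checked mechanically. The following is a minimal sketch, assuming a dictionary-based encoding of the species tree; the names `is_valid_dating` and `node_order` are hypothetical and are not part of Mowgli.

```python
def is_valid_dating(children, theta):
    """Check that every child is strictly younger than its parent.

    children: dict mapping each internal node to the list of its children
              (hypothetical encoding chosen for this example).
    theta:    dict mapping each node to its date (0 for extant species).
    """
    return all(theta[kid] < theta[parent]
               for parent, kids in children.items()
               for kid in kids)


def node_order(theta):
    """Return the nodes sorted from the most recent to the oldest.

    Only this ordering matters for time consistency, not the absolute dates.
    """
    return sorted(theta, key=theta.get)


# Toy species tree ((A,B),C) with dates in arbitrary units.
children = {"r": ["ab", "C"], "ab": ["A", "B"]}
theta = {"A": 0, "B": 0, "C": 0, "ab": 1, "r": 2}
print(is_valid_dating(children, theta))
```

Rescaling all dates by a positive factor leaves `node_order` unchanged, which illustrates why absolute dates are not required.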
A gene tree is a rooted binary tree depicting the evolutionary history of a gene family, that led to a set of homologous sequences observed in current organisms. Each leaf of the gene tree has a unique label, corresponding to a specific extant sequence of the gene. Indeed, several leaves of a gene tree can be associated to the same species due to duplication and transfer events. We denote by s(u) the species associated to leaf u ∈ V(G).
A gene tree G with supports is a gene tree whose internal edges each have a support value. Let $wk_t(G) \subseteq E(G)$ be the set of edges having a support value weaker than threshold t and let $str_t(G)$ be $E(G) \setminus wk_t(G)$, that is the edges having a support equal to or stronger than t.
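As an illustration of these two sets, the sketch below partitions the supported edges of a gene tree according to a threshold. The encoding of edges as (parent, child) pairs and the function name are assumptions made for this example only.

```python
def split_edges_by_support(supports, t):
    """Partition edges into (weak, strong) sets for threshold t.

    supports: dict mapping an edge, encoded here as a (parent, child) pair,
              to its support value (e.g. a bootstrap percentage).
    An edge is weak when its support is strictly below t, and strong otherwise.
    """
    weak = {edge for edge, value in supports.items() if value < t}
    strong = set(supports) - weak
    return weak, strong


# Hypothetical bootstrap supports on three internal edges.
supports = {("g0", "w"): 95, ("w", "v"): 42, ("g0", "a0"): 80}
weak, strong = split_edges_by_support(supports, 80)
print(sorted(weak), sorted(strong))
```

Note that an edge whose support equals the threshold is classified as strong, in agreement with the definition above.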
Reconciliation model
Reconciling a gene tree G with a species tree S means building a mapping α that associates each gene g ∈ V(G) to a sequence of nodes in V(S), namely the species in which the sequence g evolved. This evolution is subject to different kinds of biological events such as speciation, duplication and transfer. The following definition presents a discrete model of this evolution.
Definition 1
(Reconciliation model). Consider a gene tree G, a species tree S with a time function θ_{ S }, and its subdivision S^{′} with a time function θ_{S′}.
Let α be a function that maps each node u of G onto an ordered sequence of nodes of $S'$, denoted α(u). For u ∈ V(G), let ℓ denote the length of α(u) and let $\alpha_i(u)$ be its $i^{th}$ element (where 1 ≤ i ≤ ℓ). α is said to be a reconciliation between G and $S'$ if and only if exactly one of the following atomic events occurs for each couple of nodes u of G and $\alpha_i(u)$ of $S'$ (where $\alpha_i(u)$ is denoted x):
If x is the last vertex $\alpha_{\ell}(u)$, exactly one of the cases below is true:
 1. u ∈ L(G), x ∈ L($S'$) and $\mathcal{L}(x) = s(u)$ ($\mathbb{C}$ event);
 2. x is not artificial and $\{\alpha_1(u_1), \alpha_1(u_2)\} = \{x_1, x_2\}$ ($\mathbb{S}$ event);
 3. $\alpha_1(u_1) = \alpha_1(u_2) = x$ ($\mathbb{D}$ event);
 4. $\alpha_1(u_1) = x$, and $\alpha_1(u_2) = x'$ is such that $x' \ne x$ and $\theta_{S'}(x') = \theta_{S'}(x)$ ($\mathbb{T}$ event).
Otherwise, one of the cases below is true:
 5. x is an artificial vertex and $\alpha_{i+1}(u)$ is its only child ($\varnothing$ event);
 6. x is not artificial and $\alpha_{i+1}(u) \in \{x_1, x_2\}$ ($\mathbb{S}\mathbb{L}$ event);
 7. $\alpha_{i+1}(u) = x'$ is such that $x' \ne x$ and $\theta_{S'}(x') = \theta_{S'}(x)$ ($\mathbb{T}\mathbb{L}$ event).
Note that among these events, $\mathbb{T}\mathbb{L}$ and $\mathbb{S}\mathbb{L}$ are in fact a combination of two independent biological events. However, the fact that a loss is always taken into account jointly with another event makes it possible to obtain a recursive algorithm and is done without loss of generality, i.e. it does not reduce the power of the model [2].
Given a gene tree G and a species tree S, there is an infinite number of possible reconciliations. Discrete evolutionary models compare them by counting the number of events they respectively induce. As different types of events can have different expectancies (some kinds of events are thought to be more frequent than others), reconciliation models allow for a specific cost to be given to each kind of event. The cost of a reconciliation is then the sum of the costs of the individual events it induces. In that setting, the parsimony approach is to prefer a reconciliation of lower cost. This is formalized in the following definition.
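Definition 2.
(Reconciliation cost). The cost of a reconciliation α between G and $S'$ is $c(\alpha) = d\,\delta + t\,\tau + l\,\lambda$, where δ, τ, and λ respectively denote the cost of $\mathbb{D}$, $\mathbb{T}$, and $\mathbb{L}$ events, while d, t, and l denote the number of the corresponding events in α. Moreover, a $\mathbb{T}\mathbb{L}$ event is atomic and costs (τ + λ), while a $\mathbb{S}\mathbb{L}$ event just costs λ. Indeed, speciation events are most of the time considered as having a null cost, but the model easily accommodates non-null costs if necessary.
The optimal reconciliation cost, denoted $C(G,S')$, is the minimum of $c(\alpha)$ over all reconciliations α between G and $S'$.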
Note that several distinct alternative reconciliations can have an optimal reconciliation cost.
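The cost of a reconciliation can be computed directly from its list of atomic events. The sketch below is illustrative only: the event codes are plain strings chosen for this example, and the numerical costs used in the usage lines are arbitrary assumptions.

```python
from collections import Counter


def reconciliation_cost(events, delta, tau, lam):
    """Weighted sum d*delta + t*tau + l*lambda over a list of atomic events.

    events: iterable of event codes among "C", "S", "D", "T", "0" (the
            no-event case), "SL" and "TL" (string codes are hypothetical).
    A "TL" event counts as one transfer and one loss, an "SL" event as one
    loss; speciations and contemporary events are free here.
    """
    n = Counter(events)
    d = n["D"]
    t = n["T"] + n["TL"]
    l = n["SL"] + n["TL"]
    return d * delta + t * tau + l * lam


# One duplication, one transfer-loss and one speciation-loss.
events = ["S", "D", "TL", "SL", "C", "C"]
print(reconciliation_cost(events, delta=2, tau=3, lam=1))  # 2 + 3 + 2 = 7
```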
Lemma 1 (Consecutive $\mathbb{T}\mathbb{L}$ events)
Consider a gene tree G, the subdivision S^{′} of a species tree, and a reconciliation α of optimal cost C(G,S^{′}) = c(α). For any node u of G, if α_{ i }(u) corresponds to a $\mathbb{T}\mathbb{L}$ event, then α_{i + 1}(u) does not.
This results from the observation that two consecutive $\mathbb{T}\mathbb{L}$ events can be replaced by a single $\mathbb{T}\mathbb{L}$ event, leading to a reconciliation of lesser cost.
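The argument can be spelled out as follows, under the assumption that $\tau + \lambda > 0$. Suppose a gene lineage undergoes two consecutive $\mathbb{T}\mathbb{L}$ events, from $x$ to $x'$ and then from $x'$ to $x''$. By Definition 1, all three vertices lie at the same time, so a direct transfer from $x$ to $x''$ is also time consistent (and if $x'' = x$ both events can simply be dropped):

```latex
\[
\underbrace{(\tau+\lambda) + (\tau+\lambda)}_{x \,\to\, x' \,\to\, x''}
\;=\; 2(\tau+\lambda)
\;>\; \underbrace{\tau+\lambda}_{x \,\to\, x''}
\qquad \text{whenever } \tau+\lambda > 0 .
\]
```

Hence a reconciliation containing two consecutive $\mathbb{T}\mathbb{L}$ events cannot be of optimal cost.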
Finding a most parsimonious reconciliation
To find one of the most parsimonious reconciliations between a gene tree G and a species tree S we rely on the dynamic programming algorithm of Doyon et al. [2] that computes the optimal reconciliation cost $C(G,S')$ on G and the subdivision $S'$ of S. This algorithm successively examines the nodes u of G and their possible mapping on nodes x of $S'$ (or equivalently on edges ending at such nodes). A node u of G can be mapped on such a vertex x according to different scenarios, each postulating a different event at node u among those of Definition 1. The optimal cost for mapping u at x is defined according to the scenario of minimal cost. For running time optimization reasons, the scenario involving a $\mathbb{T}\mathbb{L}$ event, whose cost is denoted $c_{\mathbb{T}\mathbb{L}}(u,x)$, is computed after the other possible scenarios, $c_{\overline{\mathbb{T}\mathbb{L}}}(u,x)$ denoting the minimum cost that can be achieved among the latter. This decomposition is possible since a $\mathbb{T}\mathbb{L}$ event is always followed by a $\mathbb{C}$, $\mathbb{S}$, $\mathbb{D}$, $\mathbb{T}$, $\varnothing$, or $\mathbb{S}\mathbb{L}$ event (see Lemma 1). As a result, the best receiver for a $\mathbb{T}\mathbb{L}$ event of node u with donor branch x can be computed from the costs $c_{\overline{\mathbb{T}\mathbb{L}}}(u,y)$ over all vertices y other than x such that $\theta_{S'}(y) = \theta_{S'}(x)$. The costs $c_{\overline{\mathbb{T}\mathbb{L}}}(u,y)$ are themselves computed from $c_{\mathbb{T}\mathbb{L}}(u_i,x)$ values, but for children $u_i$ of u (see below). These intricate notions are formally detailed in Definition 3.
Definition 3 (Reconciliation cost matrix).
Consider a gene tree G and the subdivision $S'$ of a species tree S with a time function $\theta_{S'}$. Let $c : V(G) \times V(S') \to \mathbb{R}$ denote the cost matrix recursively defined as follows for a node u of G and a vertex x of $S'$: $c_{\overline{\mathbb{T}\mathbb{L}}}(u,x) = \min\{c_{\mathbb{E}}(u,x) : \mathbb{E} \in \{\mathbb{C},\mathbb{S},\mathbb{D},\mathbb{T},\varnothing,\mathbb{S}\mathbb{L}\}\}$ and $c(u,x) = \min\{c_{\mathbb{T}\mathbb{L}}(u,x), c_{\overline{\mathbb{T}\mathbb{L}}}(u,x)\}$, where the costs $c_{\mathbb{E}}(u,x)$ for all events $\mathbb{E} \in \{\mathbb{C},\mathbb{S},\mathbb{D},\mathbb{T},\varnothing,\mathbb{S}\mathbb{L},\mathbb{T}\mathbb{L}\}$ are defined below:
$c_{\mathbb{C}}(u,x) = 0$, if $u \in L(G)$, $x \in L(S')$ and $\mathcal{L}(x) = s(u)$.
$c_{\mathbb{S}}(u,x) = \min\{c(u_1,x_1) + c(u_2,x_2),\ c(u_1,x_2) + c(u_2,x_1)\}$, if $u \notin L(G)$ and $x \notin L(S')$.
$c_{\mathbb{D}}(u,x) = c(u_1,x) + c(u_2,x) + \delta$, if $u \notin L(G)$.
$c_{\mathbb{T}}(u,x) = \min\{c(u_1,x) + c(u_2,z),\ c(u_1,y) + c(u_2,x)\} + \tau$, with $u \notin L(G)$ and z (resp. y) denoting a vertex that minimizes $c(u_2,z)$ (resp. $c(u_1,y)$) over all vertices $x' \in V(S') \setminus \{x\}$ such that $\theta_{S'}(x') = \theta_{S'}(x)$.
$c_{\varnothing}(u,x) = c(u,x_1)$, if x has a single child.
$c_{\mathbb{S}\mathbb{L}}(u,x) = \min\{c(u,x_1),\ c(u,x_2)\} + \lambda$, if x has two children.
$c_{\mathbb{T}\mathbb{L}}(u,x) = c_{\overline{\mathbb{T}\mathbb{L}}}(u,y) + \tau + \lambda$, where y denotes a vertex that minimizes $c_{\overline{\mathbb{T}\mathbb{L}}}(u,y)$ over all vertices $x' \in V(S') \setminus \{x\}$ such that $\theta_{S'}(x') = \theta_{S'}(x)$.
If the above constraints for an event $\mathbb{E} \in \{\mathbb{C},\mathbb{S},\mathbb{D},\mathbb{T},\varnothing,\mathbb{S}\mathbb{L},\mathbb{T}\mathbb{L}\}$ on node u and vertex x are not respected, the corresponding cost $c_{\mathbb{E}}(u,x)$ is set to ∞.
The value $c(u,x)$ is the optimal cost when mapping gene node u to node x in $S'$. The optimal cost for reconciling G with $S'$, denoted $C(G,S')$, is then $\min_{x \in V(S')} c(r(G),x)$.
The algorithm of Doyon et al. [2], called Mowgli, fills the dynamic programming cost matrix $V(S') \times V(G) \to \mathbb{R}^{+}$ by two embedded loops: one loop visits all species nodes of $S'$ in time order (i.e. according to the $\theta_{S'}$ partial order), while the other loop visits the nodes of the gene tree G in postorder. Due to an optimization in precomputing the best receiver edge for transfer events of nodes u at a given time, this algorithm has $O(|S|^2 \cdot |G|)$ time complexity.
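To make the recurrences of Definition 3 concrete, here is a small self-contained sketch. It is not the Mowgli implementation: the subdivided species tree is written by hand for the toy tree ((A,B),C), gene trees are nested tuples, the default costs (δ = 2, τ = 3, λ = 1) are arbitrary choices, the loops are nested the other way round (gene nodes outside, time slices inside, which respects the same dependencies), and the best receiver is found by a plain scan rather than by the precomputation mentioned above.

```python
import math

INF = math.inf

# Hand-built subdivision S' of ((A,B),C): node -> (time, children).
# "c1" is the artificial node placed on the edge leading to C at time 1.
S_PRIME = {
    "A": (0, ()), "B": (0, ()), "C": (0, ()),
    "ab": (1, ("A", "B")), "c1": (1, ("C",)),
    "r": (2, ("ab", "c1")),
}


def reconcile_cost(gene, species_of, s_prime, delta=2.0, tau=3.0, lam=1.0):
    """Optimal cost C(G, S') for a gene tree given as nested 2-tuples."""
    slices = {}
    for x, (time, _) in s_prime.items():
        slices.setdefault(time, []).append(x)

    def column(u):
        """Return the column c(u, .) as a dict over the vertices of S'."""
        kids = None if isinstance(u, str) else (column(u[0]), column(u[1]))
        c = {}
        for time in sorted(slices):              # from the leaves upward
            not_tl = {}
            for x in slices[time]:
                ch = s_prime[x][1]
                others = [y for y in slices[time] if y != x]
                costs = []
                if kids is None:
                    if not ch and species_of[u] == x:
                        costs.append(0.0)                          # C event
                else:
                    k1, k2 = kids
                    if len(ch) == 2:                               # S event
                        costs.append(min(k1[ch[0]] + k2[ch[1]],
                                         k1[ch[1]] + k2[ch[0]]))
                    costs.append(k1[x] + k2[x] + delta)            # D event
                    best1 = min((k1[y] for y in others), default=INF)
                    best2 = min((k2[y] for y in others), default=INF)
                    costs.append(min(k1[x] + best2,
                                     best1 + k2[x]) + tau)         # T event
                if len(ch) == 1:
                    costs.append(c[ch[0]])                         # no event
                if len(ch) == 2:
                    costs.append(min(c[ch[0]], c[ch[1]]) + lam)    # SL event
                not_tl[x] = min(costs, default=INF)
            for x in slices[time]:                                 # TL event
                tl = min((not_tl[y] for y in slices[time] if y != x),
                         default=INF) + tau + lam
                c[x] = min(not_tl[x], tl)
        return c

    return min(column(gene).values())


species_of = {"a": "A", "b": "B", "c": "C"}
print(reconcile_cost((("a", "b"), "c"), species_of, S_PRIME))
print(reconcile_cost((("a", "c"), "b"), species_of, S_PRIME))
```

With these assumed costs, the gene tree congruent with the species tree is reconciled at cost 0, whereas the tree ((a,c),b) is best explained by a single transfer, at cost 3.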
The problem considered in this paper is the following:
MOST PARSIMONIOUS RECONCILIATION GENE TREE (MPRGT)
INPUT:
A dated species tree S with a time function $\theta_S$;
a gene tree G with supports on its edges and whose leaves are associated to leaves of S;
costs δ, τ and λ for $\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$ events, respectively;
a threshold t.
OUTPUT: a gene tree $G^{\ast}$ such that both $\mathcal{L}(G) = \mathcal{L}(G^{\ast})$ and $str_t(G) \subseteq E(G^{\ast})$, and such that $C(G^{\ast},S')$ is minimum among all such trees.
Algorithm
We describe here a heuristic for the MPRGT problem that relies on a hill-climbing strategy to seek a (rooted) gene tree G of minimum reconciliation cost (see Definition 3) using NNI edit operations [33].
Consider now the time complexity of MowgliNNI. Identifying the weak edges is done in $O(|G|)$ and generating the two alternative gene trees for an NNI operation is done in constant time. Hence, the complexity bottleneck of MowgliNNI is the number of times (denoted N) the $\Theta(|S|^2 \cdot |G|)$ Mowgli algorithm is called. Overall, the time complexity of MowgliNNI is $\Theta(|S|^2 \cdot |G| \cdot N)$. The next section describes how we can avoid recomputing large parts of the cost matrix, and hence greatly reduce the running time of MowgliNNI.
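The two ingredients discussed so far, NNI alternatives around an edge and a hill-climbing loop that calls a cost function for each candidate, can be sketched as follows. This is a simplified illustration and not Algorithm 1: gene trees are nested tuples, every internal edge is tried (the restriction to weak edges and the bookkeeping of supports are omitted), and `cost` is any user-supplied function standing in for the reconciliation cost. All names are hypothetical.

```python
def nni_at(tree, path, side):
    """Return the two NNI alternatives around the edge (w, v).

    w is reached from the root by following `path` (a tuple of 0/1 indices)
    and v = w[side] must be an internal node with children b and c; d is the
    sibling of v. The alternatives regroup the subtrees as ((d,c),b) and
    ((b,d),c).
    """
    if path:
        i = path[0]
        results = []
        for sub in nni_at(tree[i], path[1:], side):
            rebuilt = list(tree)
            rebuilt[i] = sub
            results.append(tuple(rebuilt))
        return results
    v, d = tree[side], tree[1 - side]
    b, c = v
    return [((d, c), b), ((b, d), c)]


def internal_edges(tree, path=()):
    """Yield (path_to_w, side) for each edge (w, v) whose lower end v is internal."""
    if isinstance(tree, str):
        return
    for side in (0, 1):
        if not isinstance(tree[side], str):
            yield path, side
        yield from internal_edges(tree[side], path + (side,))


def hill_climb(tree, cost):
    """Accept the first NNI that strictly lowers the cost, until none does."""
    best = cost(tree)
    improved = True
    while improved:
        improved = False
        for path, side in list(internal_edges(tree)):
            for candidate in nni_at(tree, path, side):
                candidate_cost = cost(candidate)   # one cost evaluation per try
                if candidate_cost < best:
                    tree, best, improved = candidate, candidate_cost, True
                    break
            if improved:
                break
    return tree, best


# Toy cost: 0 for the target topology, 1 otherwise.
target = (("a", "b"), "c")
print(hill_climb((("a", "c"), "b"), lambda t: 0 if t == target else 1))
```

In this naive form, each call to `cost` corresponds to one full run of the reconciliation algorithm, which is the bottleneck counted by N above.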
Combinatorial optimization
We now present results that take advantage of the way the dynamic programming matrix is computed (Definition 3) to avoid recomputing from scratch the cost matrix associated to a gene tree $G'$ obtained by an NNI edit operation from a gene tree G. Consider the gene tree G of Figure 4, the NNI operation applied on edge (w,v) that swaps the two subtrees $G_b$ and $G_c$, and the resulting gene tree denoted $G'$. We can observe that although the global architectures of G and $G'$ differ, the local architectures of the subtrees $G_b, G_c, G_d, G_{a_0}, \dots, G_{a_k}$ remain unchanged. Hence, any cost that differs between the cost matrices c and $c'$ computed for G and $G'$ respectively (see Definition 3) is located in a column (i.e. node of the gene tree) associated to an ancestor of v (including v itself). For each of those nodes, there are two cases: (i) the node belongs to the NNI edge and its two children have subtrees that have been modified (e.g. nodes w and v); (ii) the node is a strict ancestor of the NNI edge (w,v) and has exactly one child with a subtree that has been modified (e.g. $g_k, \dots, g_0$).
Lemma 2 below indicates which columns of the cost matrix do not need to be recomputed.
Lemma 2.
Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w,v) of G, and the gene tree G^{′} obtained from G by an NNI operation on (w,v). For each node z of G that is not an ancestor of v in G and for each vertex x of S^{′}, the equality c(z,x) = c^{′}(z,x) holds, where c and c^{′} denote the cost matrices computed for G and G^{′} respectively.
This observation results from the fact that the dynamic programming algorithm of Mowgli computes the value of a cell (z,x) in the cost matrix using cells storing values either for the same node z or for its children (see formulas of Definition 3). Hence the value of a cell (z,x) directly or indirectly depends only on values for cells corresponding to z and its descendants. Going from gene tree G to G^{′} by an NNI operation changes precisely the descendant relationships of v and its ancestors, i.e. all other nodes z have the same descendants in both G and G^{′} (see Figure 4), hence c(z,x) = c^{′}(z,x) holds for all these nodes.
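In practice, Lemma 2 means that only the columns of v and of its ancestors have to be refreshed after an NNI on (w,v). A minimal sketch, assuming the gene tree is stored as a child-to-parent dictionary (an encoding chosen for this example, with hypothetical node names):

```python
def columns_to_recompute(parent, v):
    """List v and all its ancestors, bottom-up.

    parent: dict mapping each non-root node of the gene tree to its parent.
    The bottom-up order is also a valid recomputation order, since each
    column only depends on the columns of the node's children.
    """
    columns = []
    node = v
    while node is not None:
        columns.append(node)
        node = parent.get(node)   # None once the root has been reached
    return columns


# Hypothetical gene tree: v has children b and c, w has children v and d,
# and g0, g1 are the strict ancestors of w.
parent = {"b": "v", "c": "v", "v": "w", "d": "w",
          "w": "g0", "a0": "g0", "g0": "g1", "a1": "g1"}
print(columns_to_recompute(parent, "v"))  # ['v', 'w', 'g0', 'g1']
```

All other columns (here those of b, c, d, a0 and a1) can be reused unchanged.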
Unfortunately, there is no extension of Lemma 2 to ensure that when an edge has already been unsuccessfully tried for an NNI, it is useless to reconsider it later, even if it is a descendant in G of the edge leading to the last successful NNI.
Theorem 1.
Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w,v) of G, a gene tree G^{′} obtained by an NNI operation on (w,v), and any strict ancestor u of w in G where the unique child of u that is an ancestor of w is u_{1} w.l.o.g. (i.e. w ≤ u_{1} in both G and G^{′}). If c(u_{1},x) ≤ c^{′}(u_{1},x) holds for all x ∈ V(S^{′}), then c(u,x) ≤ c^{′}(u,x) holds for all x ∈ V(S^{′}), and as a corollary C(G,S^{′}) ≤ C(G^{′},S^{′}).
The proof of Theorem 1 is given in the Appendix. This theorem leads to the optimized algorithm of MowgliNNI, formally stated in Algorithm 1 as an integrated procedure run after Mowgli. The latter computes a dynamic programming matrix $c : V(G) \times V(S') \to \mathbb{R}$ that MowgliNNI then partly recomputes given a rearrangement performed on the gene tree G. For each rearrangement, the matrix recomputed by MowgliNNI, denoted $c' : V(G') \times V(S') \to \mathbb{R}$, is obtained in worst case time $O(|S'| \cdot h(G))$, where h(G) is the height of G (i.e. the maximum number of ancestors of a node of G).
Algorithm 1 MowgliNNI(G, c): seeking a gene tree $G'$ of minimum reconciliation cost, starting from a gene tree G and the precomputed reconciliation cost matrix $c : V(G) \times V(S') \to \mathbb{R}$, where $S'$ is the subdivided species tree.
Theorem 2.
MowgliNNI has worst case running time $O(|S|^2 \cdot |G| + |S|^2 \cdot h(G) \cdot N)$.
Indeed the steps of Algorithm 1 can be described as follows: initializing the reconciliation matrix for the initial gene tree is done in $O(|S|^2 \cdot |G|)$ time; then updating the matrix for each of the N NNIs now only costs $O(|S'| \cdot h(G)) = O(|S|^2 \cdot h(G))$.
In the naïve implementation of MowgliNNI, each rearrangement requires recomputing the cost associated to each and every node of the gene tree. In contrast, in the optimized version, an NNI around edge (w,v) is examined after updating only those costs associated to ancestral nodes of w. This has no impact on the worst case complexity (when the gene tree is a caterpillar, h(G) is in $O(|G|)$) but significantly reduces the running times in practice since in most cases the number of nodes in G is much larger than their average height. For some random tree models the average height of a node in an n-leaf tree is indeed proportional to $\log(n)$ [34].
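The gain can be illustrated by plugging numbers into the two bounds. The sketch below simply evaluates the asymptotic expressions with all constants ignored, so the results are rough proxies and not running times. The sizes used (37 species, a gene tree of size 100, height 7 as for a balanced tree, 50 NNIs tried) are assumptions chosen for the example.

```python
def naive_ops(s, g, n):
    """Proxy for the naive bound |S|^2 * |G| * N (constants ignored)."""
    return s ** 2 * g * n


def optimized_ops(s, g, h, n):
    """Proxy for the optimized bound |S|^2 * |G| + |S|^2 * h(G) * N."""
    return s ** 2 * g + s ** 2 * h * n


s, g, h, n = 37, 100, 7, 50
print(naive_ops(s, g, n))          # 6845000
print(optimized_ops(s, g, h, n))   # 616050
```

For a caterpillar gene tree, h is of the order of g and the two quantities become comparable, in line with the worst case discussed above.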
Results and discussion
Experiments on simulated datasets
Simulated gene trees and evolutionary histories
A phylogeny of 37 proteobacteria was used as a reference species tree (denoted S) [8]. Along this tree, we simulated the evolutionary history (denoted $R_{True}$) of 1000 gene families ($G_{True}$), each containing from 10 to 100 genes, according to a birth and death process [35]. Birth events can be one of three kinds of evolutionary events, i.e. speciation, duplication, and horizontal gene transfer. During the simulation process along the species tree, a speciation occurs every time a gene lineage reaches an internal node of the species tree, leading to a split in two gene lineages. A birth event happening strictly between two nodes of the species tree can only correspond to a gene duplication or a horizontal gene transfer event. A birth is decided to be a duplication or a transfer according to the input rates of these events.
The death of a gene lineage corresponds to a loss event, which happens according to an input loss rate. The species tree was scaled to a height of 500 million years. The speciation rate is determined by the topology and the height of the species tree. Each of the 1000 gene families was generated with different event rates, the loss rate being randomly chosen in the range [0.0010, 0.0018] events/gene per million years. The ratio between the sum of duplication and transfer rates and the loss rate was randomly chosen in the range [0.5, 1.1], the duplication rate being [70%, 100%] of the mentioned sum. This birth and death process first output a complete gene tree $G^{o}$, then the “true” gene tree $G_{True}$ was obtained from $G^{o}$ by pruning extinct subtrees. The “true” evolutionary events to be recovered by the reconciliation programs are those appearing in $G_{True}$. We denote by $R_{True}$ the history composed of these events. We only considered gene families containing at most ten duplication and transfer events in their true evolution. In particular for the transfer events, this constraint allowed us to limit the number of cases where the true evolution contains a sequence of consecutive transfers where non-transferred genes are lost (i.e. a sequence of $\mathbb{T}\mathbb{L}$ events). Such a piece of history can hardly be recovered by reconciliation methods as it leaves no trace at all in the gene tree.
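The way event rates are drawn for one gene family can be sketched as follows. The text only states that the values are "randomly chosen" in the given ranges; uniform sampling is an assumption of this example, as are the function and key names.

```python
import random


def sample_rates(rng):
    """Draw loss, duplication and transfer rates for one gene family.

    Rates are in events/gene per million years. Uniform draws are an
    assumption of this sketch.
    """
    loss = rng.uniform(0.0010, 0.0018)
    ratio = rng.uniform(0.5, 1.1)        # (duplication + transfer) / loss
    dup_share = rng.uniform(0.70, 1.00)  # share of duplications in that sum
    birth = ratio * loss
    return {
        "loss": loss,
        "duplication": dup_share * birth,
        "transfer": (1.0 - dup_share) * birth,
    }


rates = sample_rates(random.Random(0))
print(rates)
```

Under these ranges the transfer rate never exceeds 30% of the combined duplication and transfer rate.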
Here $\mathbb{E}_{R_{True}}$ stands for the number of events of the corresponding kind in $R_{True}$.
Measuring the accuracy
First, we estimated the improvement in the accuracy of the gene tree’s topology, as measured by the Robinson-Foulds (RF) distance [39] between the true gene tree ($G_{True}$) and the inferred gene tree ($G_{ML}$). As a second measurement of the accuracy of inferred reconciliations, we compared the positions of $\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$ events predicted by MowgliNNI and Mowgli with those present in the true history. This is achieved by studying the proportion of true positives (TP), false positives (FP) and false negatives (FN) separately for duplications, transfers and losses [2]. True negatives (TN) were not studied as their number is very large (if even finite) and hard to determine. An event of $R_{True}$ is declared as correctly predicted when it concerns the right part of the gene tree (node or edge) placed in the correct branch or node of the species tree (see [2] for more details). In particular, both the receiver and the donor edge of the species tree have to be correctly indicated for a predicted transfer event to be declared as correct.
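Both measurements can be illustrated on toy data. The sketch below is not the evaluation code used for the experiments. It assumes gene trees encoded as nested tuples, uses the rooted (cluster-based) variant of the RF distance, and represents events as tuples whose fields are hypothetical; an event counts as correct only if the whole tuple matches.

```python
def clades(tree):
    """Return (leaf set of tree, set of the clusters of its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def rf_distance(t1, t2):
    """Number of clusters present in exactly one of the two rooted trees."""
    return len(clades(t1)[1] ^ clades(t2)[1])


def confusion(true_events, predicted_events):
    """Return (TP, FP, FN) for two sets of events."""
    tp = len(true_events & predicted_events)
    fp = len(predicted_events - true_events)
    fn = len(true_events - predicted_events)
    return tp, fp, fn


g_true = ((("a", "b"), "c"), "d")
g_ml = ((("a", "c"), "b"), "d")
print(rf_distance(g_true, g_ml))  # 2

# Hypothetical events: (kind, gene node, species branch[, receiver branch]).
true_events = {("D", "u1", "x3"), ("T", "u2", "x1", "x4")}
predicted = {("D", "u1", "x3"), ("L", "u5", "x2")}
print(confusion(true_events, predicted))  # (1, 1, 1)
```

Because a transfer tuple carries both the donor and the receiver branch, a predicted transfer with a wrong receiver is counted as a false positive, as required above.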
MowgliNNI provides more accurate inferences
We explored the ability of MowgliNNI to improve the set of G_{ ML } trees using six different bootstrap values as threshold for defining weak edges, i.e. 20, 40, 60, 80, 90, and 95. The G_{ ML } trees were inferred from relatively long sequences, they thus contained a large proportion of high bootstrap values, e.g. more than 63% edges had a bootstrap value ≥ 80. Though this left only a moderate number of edges in each gene tree to be considered by MowgliNNI for rearrangement, the method was still able to improve their quality (see below).
MowgliNNI progressively reduced the number of predicted duplications, transfers and losses as the threshold increased. At threshold 0 (where MowgliNNI = Mowgli), 5510 duplications, 2494 transfers and 12190 losses were predicted on the whole dataset; going to threshold 80, these numbers dropped to 4602 duplications, 1676 transfers and 8133 losses, i.e. values that are much closer to the 4535 duplications and 8260 losses contained in the true reconciliations.
Influence of the sequence length parameter
Quality of the gene trees (G_{ NNI }) and reconciliations (R_{ NNI }) inferred by MowgliNNI depending on the length of the sequences used to obtain G_{ ML } trees and on the threshold indicating weak edges

| Sequence length / threshold | Short, 20 | Short, 80 | Short, 95 | Long, 20 | Long, 80 | Long, 95 |
|---|---|---|---|---|---|---|
| Number of gene families containing weak edges | 163 | 323 | 327 | 118 | 328 | 332 |
| % cases s.t. Cost(S, G_{ NNI }) < Cost(S, G_{ ML }) | 80 | 92 | 91 | 75 | 83 | 84 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) < RF(G_{ True }, G_{ ML }) | 43 | 74 | 73 | 29 | 67 | 67 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) = RF(G_{ True }, G_{ ML }) | 53 | 17 | 18 | 67 | 26 | 24 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) > RF(G_{ True }, G_{ ML }) | 4 | 9 | 9 | 4 | 7 | 9 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) < ED(R_{ True }, R_{ ML }) | 66 | 82 | 83 | 51 | 76 | 76 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) = ED(R_{ True }, R_{ ML }) | 24 | 12 | 12 | 33 | 20 | 19 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) > ED(R_{ True }, R_{ ML }) | 10 | 6 | 5 | 16 | 4 | 5 |
Quality of the gene trees (G_{ NNI }) and reconciliations (R_{ NNI }) inferred by MowgliNNI on very short sequences

| Threshold | 20 | 80 | 95 |
|---|---|---|---|
| Number of gene families containing weak edges | 794 | 1000 | 1000 |
| % cases s.t. Cost(S, G_{ NNI }) < Cost(S, G_{ ML }) | 89 | 97 | 97 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) < RF(G_{ True }, G_{ ML }) | 58 | 77 | 75 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) = RF(G_{ True }, G_{ ML }) | 39 | 16 | 16 |
| % cases s.t. RF(G_{ True }, G_{ NNI }) > RF(G_{ True }, G_{ ML }) | 3 | 7 | 9 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) < ED(R_{ True }, R_{ ML }) | 78 | 91 | 91 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) = ED(R_{ True }, R_{ ML }) | 15 | 5 | 5 |
| % cases s.t. ED(R_{ True }, R_{ NNI }) > ED(R_{ True }, R_{ ML }) | 7 | 4 | 4 |
Robustness of reconciliations to imprecision in the event costs
The robustness of MowgliNNI to changes in event costs with respect to the initial ones computed by Formula (1) (Column 1)

| Event cost variation | RF | FP Dup | FP Tran | FP Loss | FN Dup | FN Tran | FN Loss |
|---|---|---|---|---|---|---|---|
| 0% | 12.8 | 14.1 | 69.9 | 14.9 | 12.8 | 42.5 | 16.2 |
| 10% | 12.9 | 14.3 | 72.1 | 16.4 | 13.0 | 45.0 | 17.6 |
| 20% | 12.9 | 14.4 | 71.1 | 16.8 | 13.2 | 44.2 | 17.7 |
| 50% | 12.9 | 14.6 | 74.1 | 17.8 | 15.0 | 45.6 | 18.9 |
Room for future improvement
To measure how much of the achievable improvement over the G_{ ML } trees was realized by MowgliNNI, we studied the distribution of reconciliation costs of all possible gene trees for several cases involving a computationally manageable number of species. The shape of the distribution, together with the relative position of the costs obtained for G_{ True }, G_{ ML } and G_{ NNI } within those distributions, indicates how much improvement could be achieved in the future by more sophisticated methods (e.g., relying on SPR moves). We report here on two cases representative of our observations: two true gene trees $G_{\mathit{True}}^{A}$ and $G_{\mathit{True}}^{B}$ of 8 taxa were generated from the species tree S of 37 proteobacteria according to the protocol described in Figure 6. Their two associated histories A and B were used as starting points to obtain both sequence alignments and reconciliation costs (according to Equation 1). This time, 50 sequence alignments were generated from each of the two gene trees. A maximum likelihood tree was obtained from each of the 100 alignments, with bootstrap supports associated to its edges. These trees were then submitted for improvement to MowgliNNI, applying a threshold of 50 to specify weak edges, and relying on event costs corresponding to histories A and B respectively. All reconciliations were performed with respect to the species tree S.
History A involved 2 duplications, no transfer and 7 losses and was correctly recovered by the parsimonious reconciliation of Mowgli from $G_{\mathit{True}}^{A}$. However, History B (involving 2 duplications, 1 transfer and 5 losses) was incorrectly recovered from $G_{\mathit{True}}^{B}$, the achieved cost $C(G_{\mathit{True}}^{B},S)=3.98$ being less than the 5.75 cost of the real history. Though the real cost is in the left part of the distribution, it is not the minimum point of the distribution, showing that parsimony can sometimes be misleading when followed to its extreme.
Nevertheless, in both cases, the true gene tree is among the ones having the minimum reconciliation costs: it is precisely the one leading to the minimum cost for history A and among the nine best trees for history B. On these examples (and other cases not shown), parsimony can be considered as a very good guide towards the correct gene tree, even if the reconciliation from this correct tree can underestimate the number of real events (as discussed above).
For both histories A and B, MowgliNNI proposed a gene tree G_{ NNI } whose reconciliation cost was on average closer to that of the true gene tree (and to the real cost) than the cost obtained from the maximum likelihood tree.
Conclusion on simulated datasets
In summary, MowgliNNI successfully uses the reconciliation cost as additional information to resolve the uncertain parts of gene trees inferred from sequences alone. Though the gene tree resolutions are partly guided by reconciliations with the species tree, they are not attracted away from the true gene trees, but are closer to them than the initial gene trees. As a result, MowgliNNI infers gene events more accurately, which is of prime importance to distinguish orthologs from paralogs and xenologs [14].
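The principle of accepting a rearrangement only when it lowers the reconciliation cost can be sketched as a generic greedy local search. This is an illustration under stated assumptions, not the MowgliNNI implementation: `cost` and `neighbors` are hypothetical callables supplied by the caller (for instance, `neighbors` would enumerate the trees reachable by one NNI around a weak edge), and the first-improvement strategy is a choice made here for brevity.

```python
def improve_tree(tree, cost, neighbors):
    """Greedy local search: move to a cheaper neighbor until none exists.

    cost(tree) returns a number; neighbors(tree) returns an iterable of
    candidate trees. Both are placeholders supplied by the caller.
    """
    best, best_cost = tree, cost(tree)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:  # accept only strict improvements
                best, best_cost = candidate, candidate_cost
                improved = True
                break  # restart the scan from the improved tree
    return best, best_cost


# Toy stand-in: "trees" are integers, the cost is the distance to 5.
print(improve_tree(0, lambda x: abs(x - 5), lambda x: [x - 1, x + 1]))  # (5, 0)
```

Because only strict improvements are accepted and the cost is bounded below, the loop terminates; like any local search it may stop in a local optimum.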
Experiments on real data
As species tree S, we chose a phylogeny covering 336 genomes of Bacteria and Archaea recently inferred by Abby et al. [30].
Then, a dataset of 29,709 homologous gene families spanning these taxa was collected from the HOGENOM database (release 04) [40]. Each such family contains from 3 to 312 taxa. The gene tree of each family in this dataset was reconciled with the species tree by Mowgli and MowgliNNI using costs τ = 3 for transfers, δ = 3.5 for duplications and λ = 1 for losses. These costs were estimated on the basis of several bacterial phyla by a maximum likelihood method [18, 41]. A threshold of 50% on branch support values was used to indicate the weak edges of the gene trees to MowgliNNI.
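Assuming, as is usual in parsimony reconciliation, that the cost of a reconciliation is the sum of its event costs, the costs above combine as in the following sketch. The function name and the event counts in the example are hypothetical; only the three default cost values come from the text.

```python
def reconciliation_cost(n_dup, n_transfer, n_loss, delta=3.5, tau=3.0, lam=1.0):
    """Weighted sum of event counts (assumed additive parsimony cost).

    delta, tau and lam are the duplication, transfer and loss costs.
    """
    return delta * n_dup + tau * n_transfer + lam * n_loss


# Hypothetical history with 1 duplication, 2 transfers and 4 losses.
print(reconciliation_cost(1, 2, 4))  # 13.5
```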
A decrease in the number of inferred events and reconciliation costs
MowgliNNI changed the gene tree, and hence lowered the reconciliation cost, in 24% of the ≈30,000 families. This gain is non-negligible and matters in practice, as changing the gene tree topology has an important impact on the inferred events (as already shown on simulated datasets and discussed below). In turn, these inferred events may serve to predict the function of new sequences on the basis of their orthology relationships with annotated sequences, orthology following from the chosen reconciliation. Among previous reconciliation studies that allowed gene trees to be modified, Berglund-Sonnhammer et al. report that 10% of their families were improved [21] when allowing rearrangements on weak edges under the DL model, while Chaudhary et al. improved all their gene trees in a pure D model when rearranging gene trees with Subtree Prune and Regraft (SPR) operations [24]. Note that the heterogeneity of models and datasets used in these studies limits the comparison of their results, but we cite them for completeness.
For gene families with a lower reconciliation cost (24% of all families), we counted the number of events of each kind ($\mathbb{D},\mathbb{T},\mathbb{L}$) inferred by Mowgli and MowgliNNI. As a rule, MowgliNNI led to a decrease in the number of events in inferred evolutionary histories. In particular, the number of transfers is reduced in 88.3% of these gene families and the number of losses in 59.9%, while the number of duplications is almost the same (decrease in 5.2% of the families). These results obtained in the DTL model echo those of Durand et al., who report that in the DL model gene tree rearrangements substantially reduce the number of events needed to explain the data [19]. The differences in reductions we observed among the kinds of events can be explained by the costs, estimated from [18, 41], that we used for the events (τ = 3, δ = 3.5, λ = 1). Given those costs, it is usually more parsimonious to explain the conflicts between a gene tree and the species tree by a combination of $\mathbb{T}$ and $\mathbb{L}$ rather than a combination of $\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$. Thus, when MowgliNNI infers a gene tree closer to the species tree, it mostly removes the need for artificial transfers (and losses to a lesser extent), while leaving the number of duplications largely unchanged.
A decrease in the number of equally most parsimonious reconciliations
In addition to reductions in the number of events and hence in reconciliation cost, the modified gene trees proposed by MowgliNNI usually reduced the number of alternative MPRs, i.e. equally most parsimonious histories. On a random sample of two dozen modified gene trees, the number of MPRs is reduced in 63% of the cases (by a factor of 18 in the best case), and increased in 21% (by a factor of 3 at worst). This echoes similar observations made by other authors.
The improvement in running time due to the optimized version of MowgliNNI
Figure 13 shows that MowgliNNI is 20 (resp. 50 and 80) times faster than MowgliNNI^{ n }, when facing 1–20 (resp. 20–40 and 40–60) weak edges. This shows that the combinatorial optimization proposed in the Methods section is crucial in practice.
Now, when compared to Mowgli, the rearrangements tried by MowgliNNI^{ n } on weak edges to obtain a better gene tree come at the price of a relatively small computation time overhead. We also indicate the regression line of MowgliNNI running times with respect to those of Mowgli, plotted against the number of weak edges. Its slope is only 0.01186, meaning that MowgliNNI (the optimized version) is able to take into account the gene tree uncertainties with just a slight increase in the running time.
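For readers who want to reproduce this kind of summary on their own timings, an ordinary least-squares line can be fitted as sketched below. The fitting method actually used for the figure is not stated in the text, so ordinary least squares is an assumption, and the data points in the example are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: number of weak edges versus running-time ratio.
weak_edge_counts = [0, 10, 20, 30]
time_ratios = [1.0, 1.1, 1.2, 1.3]
slope, intercept = least_squares_line(weak_edge_counts, time_ratios)
print(round(slope, 4), round(intercept, 4))
```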
Transfers in prokaryotic phyla
On our whole dataset of 29,709 homologous gene families, we particularly studied transfers in 5 bacterial and 1 archaeal phyla: Proteobacteria (169 genomes), Actinobacteria (31 genomes), Cyanobacteria (14 genomes), Chlamydiae (7 genomes), Spirochaetes (7 genomes) and Crenarchaeota (10 genomes). We compared our results obtained with Mowgli and MowgliNNI to those of Abby et al. [30] obtained with the Prunier method [29] that infers transfers in monocopy gene families on another basis than reconciliation.
In order to compare our results to the Abby et al. study, we extracted particular families from HOGENOM v4. For each of the 6 phyla of interest, we collected the list of families having at most one copy of the gene for the genomes of this phylum and separated them into two groups: families having one copy of the gene for each genome of the phylum, so-called “universal families”, and families having a copy of the gene for at least 7 genomes of the phylum, so-called “non-universal families”.
For each phylum, we computed the number of intra-phylum transfers inferred by reconciliations of Mowgli and MowgliNNI for families of the two groups (universal and non-universal). As the number of families we found in several groups among the various phyla varied slightly from those reported by Abby et al. [30], we summarized the findings of both studies in terms of transfer rates, expressed in number of transfers per million years and per family.
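The normalization described above amounts to dividing a transfer count by a number of families and a time span. A minimal sketch follows; the function name, the argument `time_span_myr` (how the time span of a phylum is obtained is not specified here) and the numbers in the example are all hypothetical.

```python
def transfer_rate(n_transfers, n_families, time_span_myr):
    """Transfers per million years and per family.

    time_span_myr is the time span, in million years, over which the
    transfers were counted (an assumed input, not defined in the text).
    """
    return n_transfers / (n_families * time_span_myr)


# Hypothetical phylum: 120 intra-phylum transfers in 40 families over 1500 Myr.
print(transfer_rate(120, 40, 1500.0))
```

Expressing results as rates makes the two studies comparable even though the sets of families differ slightly.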
Finally, as expected, MowgliNNI reduced the number of inferred transfers compared to Mowgli, leading to transfer rates closer to those inferred by Prunier.
Conclusion
We introduce the MowgliNNI heuristic method, which relies on NNI rearrangements of the uncertain parts of the gene trees to solve a parsimony optimization problem for reconciliations accounting for duplications ($\mathbb{D}$), losses ($\mathbb{L}$) and transfers ($\mathbb{T}$). We show experimental evidence that reconciliations computed under the parsimony criterion can efficiently correct erroneous parts of gene trees inferred from sequence data. On simulated data, MowgliNNI often proposes a new gene tree topology that is closer to the correct one and that also leads to better $\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$ predictions. Moreover, the number of events and the number of most parsimonious reconciliations predicted by MowgliNNI are significantly lower than those obtained without questioning the gene tree topology. This is confirmed on real data. A critical point for parsimony methods is the choice of respective costs for the considered evolutionary events. We show here that MowgliNNI’s performance is only slightly altered when changing the costs given to the individual events ($\mathbb{D}$, $\mathbb{T}$ and $\mathbb{L}$), that is, the method is robust to cost misspecification.
Appendix
Proof of Theorem 1
Theorem 1.
Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w,v) of G, a gene tree G^{′} obtained by an NNI operation on (w,v), and any strict ancestor u of w in G, where w.l.o.g. u_{1} denotes the unique child of u that is an ancestor of w (i.e. w ≤ u_{1} in both G and G^{′}). If c(u_{1},x) ≤ c^{′}(u_{1},x) holds for all x ∈ V(S^{′}), then c(u,x) ≤ c^{′}(u,x) holds for all x ∈ V(S^{′}), and as a corollary C(G,S^{′}) ≤ C(G^{′},S^{′}).
Proof.
First remark that an NNI operation performed around the edge (w,v) of G to obtain a modified tree G^{′} does not alter the order of the nodes above v, which are therefore considered below regardless of the tree they belong to.
The proof proceeds by induction on increasing time t ∈ {0,1,…,h(r(S^{′}))} for the subset of nodes V_{ t }(S^{′}) ⊂ V(S^{′}). Recall that, in S^{′}, the height of a node u (denoted h(u)) is a valid time function (see Figure 1) and that u_{1} is the child of u that is an ancestor of w (whereas u_{2} is incomparable with w).
Base case
For time t = 0, the possible events for the internal node u and any leaf x ∈ V_{0}(S^{′}) are $\mathbb{D}$, $\mathbb{T}$, and $\mathbb{T}\mathbb{L}$ (see the reconciliation model of Definition 1).
For each event $\mathbb{E}\in \{\mathbb{D},\mathbb{T}\}$, $c_{\mathbb{E}}(u,x)$ (resp. $c_{\mathbb{E}}^{\prime}(u,x)$) depends on the costs c(u_{ i },y) (resp. c^{′}(u_{ i },y)) over all children u_{ i } ∈ {u_{1},u_{2}} and vertices y ∈ V_{ t }(S^{′}) (see Definition 3). Since u_{2} (resp. u_{1}) is incomparable to (resp. an ancestor of) w, Lemma 2 implies that c(u_{2},y) = c^{′}(u_{2},y) and the assumption states that c(u_{1},y) ≤ c^{′}(u_{1},y).
The fact that all costs used in the equation of $c_{\mathbb{E}}^{\prime}(u,x)$ are not lower than the corresponding costs in that of $c_{\mathbb{E}}(u,x)$ leads to the properties listed in the following remark.
Remark 1.
 1. For all events $\mathbb{E}\in \{\mathbb{D},\mathbb{T}\}$ and leaves x ∈ V_{0}(S^{′}), $c_{\mathbb{E}}(u,x)\le c_{\mathbb{E}}^{\prime}(u,x)$ holds.
 2. For all leaves x ∈ V_{0}(S^{′}), $\min\{c_{\mathbb{E}}(u,x):\mathbb{E}\in \{\mathbb{D},\mathbb{T}\}\}\le \min\{c_{\mathbb{E}}^{\prime}(u,x):\mathbb{E}\in \{\mathbb{D},\mathbb{T}\}\}$.
 3. $\min_{x\in V_{0}(S^{\prime})} c_{\overline{\mathbb{T}\mathbb{L}}}(u,x)\le \min_{x^{\prime}\in V_{0}(S^{\prime})} c_{\overline{\mathbb{T}\mathbb{L}}}^{\prime}(u,x^{\prime})$, since $\mathbb{S}$, $\varnothing$ and $\mathbb{S}\mathbb{L}$ are impossible events at height 0.
Hence, we have the following result:
Remark 2.
For all internal nodes u ∈ V(G) ∖ L(G) and leaves x ∈ V_{0}(S^{′}), $c_{\mathbb{T}\mathbb{L}}(u,x)\le c_{\mathbb{T}\mathbb{L}}^{\prime}(u,x)$ holds.
Therefore, c(u,x) ≤ c^{′}(u,x) holds for each leaf x ∈ V_{0}(S^{′}).
Inductive step
For a height 0 ≤ t < h(S), we now suppose that the expected property c(u,y) ≤ c^{′}(u,y) holds for all vertices y ∈ V_{ t }(S^{′}) and prove that it still holds for any vertex x ∈ V_{t+1}(S^{′}). The possible events for node u and vertex x are $\mathbb{S}$, $\mathbb{D}$, $\mathbb{T}$, $\varnothing$, $\mathbb{S}\mathbb{L}$, and $\mathbb{T}\mathbb{L}$. Following exactly the same arguments as in the base case, Remark 1 ($\mathbb{D}$ and $\mathbb{T}$) and Remark 2 ($\mathbb{T}\mathbb{L}$) still hold for the current time (t+1).
The dependencies of the corresponding costs for $\mathbb{S}$, $\varnothing$, and $\mathbb{S}\mathbb{L}$ events are as follows: $c_{\mathbb{S}}(u,x)$ depends on the costs c(u_{ i },x_{ i }) for u_{ i } ∈ {u_{1},u_{2}} and x_{ i } ∈ {x_{1},x_{2}}, with x_{ i } ∈ V_{ t }(S^{′}); $c_{\varnothing}(u,x)$ on c(u,x_{1}), with x_{1} ∈ V_{ t }(S^{′}); and $c_{\mathbb{S}\mathbb{L}}(u,x)$ on c(u,x_{ i }) for x_{ i } ∈ {x_{1},x_{2}}, with x_{ i } ∈ V_{ t }(S^{′}). The same dependencies apply for $c_{\mathbb{S}}^{\prime}(u,x)$, $c_{\varnothing}^{\prime}(u,x)$, and $c_{\mathbb{S}\mathbb{L}}^{\prime}(u,x)$. Recall that u_{2} (resp. u_{1}) is incomparable to (resp. an ancestor of) w and that Lemma 2 (resp. the assumption) implies that c(u_{2},x_{ i }) = c^{′}(u_{2},x_{ i }) (resp. c(u_{1},x_{ i }) ≤ c^{′}(u_{1},x_{ i })) for each x_{ i } ∈ {x_{1},x_{2}}. Moreover, the inductive hypothesis states that c(u,x_{ i }) ≤ c^{′}(u,x_{ i }) holds for each child x_{ i } of x since x_{ i } ∈ V_{ t }(S^{′}). For each event $\mathbb{E}\in \{\mathbb{S},\varnothing ,\mathbb{S}\mathbb{L}\}$, the fact that all costs used in the equation of $c_{\mathbb{E}}^{\prime}(u,x)$ are not lower than the corresponding costs used in the equation of $c_{\mathbb{E}}(u,x)$ leads to the following result:
Remark 3.
For all events $\mathbb{E}\in \{\mathbb{S},\varnothing ,\mathbb{S}\mathbb{L}\}$, internal nodes u ∈ V(G) ∖ L(G) and internal vertices x ∈ V_{t+1}(S^{′}), $c_{\mathbb{E}}(u,x)\le c_{\mathbb{E}}^{\prime}(u,x)$ holds.
Therefore, c(u,x) ≤ c^{′}(u,x) holds for each vertex x ∈ V_{t+1}(S^{′}), and thus for all vertices of S^{′}.
As a corollary, the same inequality holds between the root nodes r of G and G^{′}, since w ≤ r. Then C(G,S^{′}) ≤ C(G^{′},S^{′}).
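The step invoked repeatedly in this proof is that sums and minima are monotone in their arguments. It can be stated generically as follows, where the $a_i$, $a_i^{\prime}$ and the constant $\kappa$ are placeholder symbols introduced here for illustration and do not reproduce the exact formulas of Definition 3:

```latex
\text{If } a_i \le a_i^{\prime} \text{ for all } i \in I, \text{ then}
\qquad
\kappa + \sum_{i \in I} a_i \;\le\; \kappa + \sum_{i \in I} a_i^{\prime}
\qquad\text{and}\qquad
\min_{i \in I} a_i \;\le\; \min_{i \in I} a_i^{\prime} .
```

Under the assumption that each event cost is built from such sums and minima of already-compared entries plus a fixed event cost, every inequality on the entries for the children of u and x carries over to the entry for (u,x).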
Declarations
Acknowledgements
We thank Gergely J. Szöllősi for providing event costs of the real dataset. This work was funded by the French Agence Nationale de la Recherche Investissements d’avenir / Bioinformatique (ANR10BINF01–02, Ancestrome), Programme 6ème Extinction (ANR09PEXT000 PhyloSpace) and by the Institut de Biologie Computationnelle.
References
 1. Dayhoff MO: The origin and evolution of protein superfamilies. Fed Proc. 1976, 35 (10): 2132–2138.
 2. Doyon JP, Scornavacca C, Gorbunov KY, Szöllösi G, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. RECOMB-CG 2010, LNCS. 2010, 6398: 93–108.
 3. Bansal MS, Alm EJ, Kellis M: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer, and loss. Bioinformatics. 2012, 28 (12): i283–i291. 10.1093/bioinformatics/bts225.
 4. Hallett M, Lagergren J, Tofigh A: Simultaneous identification of duplications and lateral transfers. RECOMB ’04. Edited by: Bourne PE, Gusfield D. New York: ACM 2004, 347–356.
 5. Górecki P: Reconciliation problems for duplication, loss and horizontal gene transfer. RECOMB. Edited by: Bourne PE, Gusfield D. New York, NY, USA: ACM 2004, 316–325.
 6. Conow C, Fielder D, Ovadia Y, Libeskind-Hadas R: Jane: a new tool for the cophylogeny reconstruction problem. Algorithms Mol Biol. 2010, 5: 16.
 7. Tofigh A, Hallett M, Lagergren J: Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM TCBB. 2011, 8 (2): 517–535.
 8. David LA, Alm EJ: Rapid evolutionary innovation during an archaean genetic expansion. Nature. 2011, 469 (7328): 93–96.
 9. Goodman M, Czelusniak J, Moore GW, Romero Herrera A, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979, 28: 132–163. 10.2307/2412519.
 10. Page RD: Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol. 2000, 14: 89–106.
 11. Ma B, Li M, Zhang L: From gene trees to species trees. SIAM J Comput. 2001, 30 (3): 729–752.
 12. Nakhleh L, Warnow T, Linder CR: Reconstructing reticulate evolution in species: theory and practice. Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology, RECOMB ’04. 2004, New York: ACM, 337–346.
 13. Arvestad L, Lagergren J, Sennblad B: The gene evolution model and computing its associated probabilities. J ACM. 2009, 56 (2): 1–44.
 14. Doyon JP, Ranwez V, Daubin V, Berry V: Models, algorithms and programs for phylogeny reconciliation. Brief Bioinformatics. 2011, 12 (5): 392–400.
 15. Ovadia Y, Fielder D, Conow C, Libeskind-Hadas R: The cophylogeny reconstruction problem is NP-complete. J Comput Biol. 2011, 18 (1): 59–65. 10.1089/cmb.2009.0240.
 16. Libeskind-Hadas R, Charleston MA: On the computational complexity of the reticulate cophylogeny reconstruction problem. J Comput Biol. 2009, 16 (1): 105–117.
 17. Tofigh A: Using Trees to Capture Reticulate Evolution, Lateral Gene Transfers and Cancer Progression. PhD thesis. 2009, Sweden: KTH Royal Institute of Technology.
 18. Szőllösi GJ, Daubin V: Modeling gene family evolution and reconciling phylogenetic discord. Methods Mol Biol. 2012, 856: 29–51.
 19. Durand D, Halldorsson BV, Vernot B: A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol. 2006, 13 (2): 320–335. 10.1089/cmb.2006.13.320.
 20. Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 2007, 8 (7): R141.
 21. Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA: Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol. 2006, 63 (2): 240–250. 10.1007/s00239-005-0096-1.
 22. Chang W, Eulenstein O: Reconciling gene tree with apparent polytomies. COCOON, LNCS. 2006, 4112: 235–244.
 23. Vernot B, Stolzer M, Goldman A, Durand D: Reconciliation with non-binary species trees. J Comput Biol. 2008, 15: 981–1006. 10.1089/cmb.2008.0092.
 24. Chaudhary R, Burleigh JG, Eulenstein O: Algorithms for rapid error correction for the gene duplication problem. Proceedings of the 7th International Conference on Bioinformatics Research and Applications, ISBRA’11. 2011, Berlin, Heidelberg: Springer-Verlag, 227–239.
 25. Zheng Y, Wu T, Zhang L: Reconciliation of gene and species trees with polytomies. ArXiv. 2012, 1201.3995v2 [q-bio.PE].
 26. Lafond M, Krister Swenson M, El-Mabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. WABI 2012, LNBI 7534. Edited by: Raphael B, Tang J. 2012, Berlin Heidelberg: Springer-Verlag, 106–122.
 27. Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D: Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics. 2012, 28: 409–415. 10.1093/bioinformatics/bts386.
 28. Górecki P, Eulenstein O: Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinformatics. 2012, 13 (Suppl 10): S14. 10.1186/1471-2105-13-S10-S14.
 29. Abby S, Tannier E, Gouy M, Daubin V: Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics. 2010, 11: 324. 10.1186/1471-2105-11-324.
 30. Abby S, Tannier E, Gouy M, Daubin V: Lateral gene transfer as a support for the tree of life. PNAS. 2012, 109 (13): 4962–4967.
 31. Semple C, Steel MA: Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, New York, USA: Oxford University Press.
 32. Sanderson MJ: Inferring absolute rates of evolution and divergence times in the absence of a molecular clock. Bioinformatics. 2003, 19: 301–302.
 33. Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates.
 34. Knuth DE: The Art of Computer Programming. 1998, Redwood City: Addison-Wesley Longman Publishing Co., Inc.
 35. Kendall DG: On the generalized birth-and-death process. Ann Math Stat. 1948, 19: 1–15. 10.1214/aoms/1177730285.
 36. Galtier N: A model of horizontal gene transfer and the bacterial phylogeny problem. Syst Biol. 2007, 56: 633–642.
 37. Rambaut A, Grass NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997, 13 (3): 235–238. 10.1093/bioinformatics/13.3.235.
 38. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 2688–2690.
 39. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131–147. 10.1016/0025-5564(81)90043-2.
 40. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perriere G: Databases of homologous gene families for comparative genomics. BMC Bioinformatics. 2009, 6 (Suppl 10): S3.
 41. Szőllösi GJ, Boussau B, Tannier E, Daubin V: Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. PNAS. 2012, 109 (43): 17513–17518.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.