 Research
 Open Access
 Published:
Inferring species trees from incongruent multicopy gene trees using the RobinsonFoulds distance
Algorithms for Molecular Biology volume 8, Article number: 28 (2013)
Abstract
Background
Constructing species trees from multicopy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation.
Results
We present a new approach for inferring species trees from incongruent multicopy gene trees that is based on a generalization of the RobinsonFoulds (RF) distance measure to multilabeled trees (multrees). We prove that it is NPhard to compute the RF distance between two multrees; however, it is easy to calculate this distance between a multree and a singlylabeled species tree. Motivated by this, we formulate the RF problem for multrees (MulRF) as follows: Given a collection of multicopy gene trees, find a singlylabeled species tree that minimizes the total RF distance from the input multrees. We develop and implement a fast SPRbased heuristic algorithm for the NPhard MulRF problem.
We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees.
Conclusions
Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from largescale genomic data sets.
Background
With the proliferation of next generation sequencing technologies, there is great interest in using large genomic data sets for phylogenetic inference. One challenge for such phylogenomic analyses is that the genes sampled from the same set of species often produce conflicting trees [1]. Some of the incongruence among trees may be due to errors in the phylogenetic analyses. Alternately, the discordance may reflect biological processes such as recombination, gene duplication, gene loss, deep coalescence, or lateral gene transfer (LGT) [1–6]. Thus, in order to construct phylogenetic hypotheses from genomic data, it is necessary to address the incongruence among gene trees. Furthermore, any method for such phylogenetic analyses also must be computationally tractable for extremely large genomic data sets.
Constructing species phylogenies from a collection of gene trees requires summarizing and reconciling the phylogenetic information contained in the genes. The majority of such species tree reconstruction methods reconcile the gene tree and species tree topologies using an optimality criterion based on a specific evolutionary process, such as gene duplication and loss or deep coalescence. In this paper we consider the problem of constructing species tree from gene trees using a tree distance measure that is not based on a specific biological process. We evaluate how our method performs in gene tree simulation experiments and with a large genomic data set from plants.
Existing methods for inferring species trees from collections of gene trees can be divided into two broad categories: nonparametric methods based on gene tree parsimony (GTP), and parametric methods that use likelihood (e.g., [7, 8]) or Bayesian frameworks (e.g., [9–11]). GTP methods take a collection of discordant gene trees and try to find the species tree that implies the fewest evolutionary events. GeneTree [12], DupTree [13], and DupLoss [14] seek to minimize the number of duplications or duplications and losses. GeneTree [12], Mesquite [1], PhyloNet [15], and the method of [14] minimize deep coalescence events. The Subtree Prune and Regraft (SPR) supertree method [16] is based on minimizing the number of LGT events, and thus, it also can be considered a GTP method. Some of these methods have fast and effective heuristics, enabling the analysis of very large data sets. However, errors in the gene trees can mislead GTP analyses [17–19]. Furthermore, in some cases GTP methods may be statistically inconsistent, even when the gene tree topologies are correct [20]. Parametric methods exist based on either coalescence [7, 10] or gene duplication and loss models [8]. Although such approaches have a strong statistical foundation, they can be extremely computationally expensive.
While the existing methods differ widely in their details, with the exception of [9], they are based on assumptions about the specific biological cause of discordance among gene trees. For example, GTP methods based on a duplication and loss cost implicitly assume that the differences between a gene tree and the species tree are caused by either gene duplications or losses. This does not necessarily mean that these methods will fail when their assumptions are incorrect, but it suggests that it is important to explore a range of different objectives for reconciling gene trees.
We present a new approach for constructing a species tree from discordant multicopy gene trees based on a generic, nonbiological distance measure. Our distance measure generalizes the RobinsonFoulds (RF) distance measure to multilabeled trees (multrees) or trees where multiple leaves can have the same label. Our method takes as input a collection of multicopy gene trees (multrees) and finds a species tree at minimum RF distance to the input gene trees. Our contributions are as follows:

We study the problem of computing the RF distance between two multrees, and show that it is NPcomplete (Theorem 1).

We formulate an RF problem for multrees (MulRF) that takes a collection of multicopy gene trees as input and constructs a binary species tree that is at minimum RF distance from each input gene tree (Section The MulRF Problem). A key component of this approach is a simple and efficient technique to compute the RF distance between an input multicopy gene tree and a singlylabeled species tree. (Note the contrast with the previouslymentioned NPcompleteness result.)

MulRF is an NPhard problem, so heuristics are required to estimate solutions for large data sets. We provide a fast Θ(n m k)time algorithm for the MulRF problem, where n is the total number of distinct leaf labels in the input collection of gene trees, m is the sum of n and the number of gene sequences in a input gene tree (assuming for convenience that all the input gene trees are built on approximately the same number of gene sequences), and k is the number of input gene trees (Section Solving the MulRF problem).

We implemented the MulRF heuristic and examined its performance on simulated gene tree data sets that incorporate gene tree error, gene duplication and loss, and/or lateral gene transfer and a data set of nearly 2000 plant gene trees (Section Experimental evaluation).
We note that there has been much recent work on multrees ranging from constructing strict and majority rule consensus multrees to deriving diameter bounds for various metrics on multrees (see, [21–26]). Further, various problems related to RF distance have received attention. The RF distance has been extended to increase its robustness without sacrificing polynomial time computability [27, 28]. These methods appear to work well when both input trees are singlylabeled, but there are no direct extensions of them for multrees. Alternatively, RF distance has been used in the supertree method for singlylabeled input trees [29, 30] and the maximumlikelihood supertree approach of [31]. Here, we use RF distance for building species trees from multrees, which allows us to incorporate a wealth of genomic data from multicopy genes into phylogenetic inference.
Our heuristic algorithm for MulRF problem shares several core concepts with unrooted RF supertree (URF) algorithm of [30], but there are theoretical and practical differences. In particular, our local search heuristic of MulRF is based on the SPR operation, unlike the pEdge Contract and Refine operation (pECR) used for URF [30]. Typically, the SPR operation is more effective in exploring the space of trees compared to pECR operation; this also enables the MulRF heuristic to run as a standalone application on the given gene trees, independent from the rooted RF heuristic of [29]. In contrast, the pECRbased URF algorithm of [30] uses the output of the rooted method of [29] as a starting tree.
We performed gene tree simulation experiments to evaluate the accuracy of our method by comparing it against the model species tree used to simulate the data. We compared the species trees constructed by MulRF and GTP methods that consider only duplication [13], duplication and loss [14], and only LGT [16] with the true species trees. Our simulated data sets were too large to analyze with parametric methods, so we were unable to compare MulRF with these approaches. For example, when we ran Phyldog [8] on a single 50taxon, 400 gene data set using 4 cores, it did not converge on a species tree in 110 hours. In contrast, MulRF gave an answer within a few seconds.
In all experiments, MulRF produced trees that are more similar to the true species trees than those obtained by the three GTP methods. This suggests that MulRF may be more robust than GTP methods to complex processes of gene evolution, including LGT, and in the presence of gene tree error. Furthermore, our algorithm runs quickly on moderatesize data sets, finishing in under two minutes on data sets containing 300 gene trees evolved over 100 taxon species trees. This suggests it is scalable for largescale phylogenomic analyses. Finally, we examined the performance of the MulRF method on an unpublished plant gene tree data set with nearly 2000 gene trees from 22 species. The resulting species tree from the MulRF method was largely consistent with published plant phylogenies.
Preliminaries
Let X be a finite set of labels. A phylogenetic multree on X (or multree, for short) is a pair $\mathcal{T}=(T,\phi )$ consisting of an unrooted tree T, whose leaf set is denoted by $\mathcal{\mathcal{L}}\left(T\right)$, and where every internal vertex has degree at least three, along with a surjective map $\phi :\mathcal{\mathcal{L}}\left(T\right)\to X$. The tree T is called the underlying tree of and φ is called the labeling map of . We say that is a singlylabeled tree if φ is a bijection between $\mathcal{\mathcal{L}}\left(T\right)$ and X (i.e., φ^{1}(x)=1 for all x∈X). Singlylabeled trees are also referred to as phylogenetic Xtrees ([32]; page 17).
A multree $\mathcal{T}=(T,\phi )$ is binary if every internal vertex of T has degree 3. A vertex of T is said to be unresolved if its degree is greater than three. We use V(T) and E(T) to denote the set of vertices and the set of edges of T. The set of all internal vertices of T is $I\left(T\right):=V\left(T\right)\setminus \mathcal{\mathcal{L}}\left(T\right)$. The size of , denoted by $\left\mathcal{T}\right$, is the number of elements in $\mathcal{\mathcal{L}}\left(T\right)$.
Let $\mathcal{T}=(T,\phi )$ be a multree on X and U⊆X. Let T[ U] denotes the minimum subtree of T induced by the elements of $\{v\in \mathcal{\mathcal{L}}(T):\phi (v)\in U\}$. The restriction of T to , denoted T_{U} is the tree obtained from T[ U] by suppressing all vertices of degree two. The restriction of φ to U, denoted φ_{U} is the surjective mapping ${\phi}_{U}:\mathcal{\mathcal{L}}\left({T}_{U}\right)\to U$, where for each $v\in \mathcal{\mathcal{L}}\left({T}_{U}\right),{\phi}_{U}\left(v\right)=\phi \left(v\right)$. The restriction of to U, denoted by ${\mathcal{T}}_{U}$, is the multree on U given by ${\mathcal{T}}_{U}=({T}_{U},{\phi}_{U})$.
Two multrees ${\mathcal{T}}_{1}=({T}_{1}=({V}_{1},{E}_{1}),{\phi}_{1})$ and ${\mathcal{T}}_{2}=({T}_{2}=({V}_{2},{E}_{2}),{\phi}_{2})$ on X are isomorphic if there exists a bijection τ:V_{1}→V_{2}, which induces a bijection between E_{1} and E_{2}, subject to the condition that φ_{1}(u)=φ_{2}(τ(u)) for all $u\in \mathcal{\mathcal{L}}\left({T}_{1}\right)$.
We define two basic operations on a multree (T,φ). The contraction of an internal edge of T collapses that edge and identifies its two endpoints, yielding a new tree T^{′} and a corresponding multree (T^{′},φ). (Note that, since T^{′} and T have the same leaf sets, φ is also defined on T^{′}.) Let v be an unresolved vertex of T. A refinement of v is obtained by partitioning the set of neighbors of v into two sets N_{1} and N_{2}, such that N_{1},N_{2}>1, replacing v by two vertices v_{1} and v_{2} connected by an edge, and making the vertices of N_{1} neighbors of v_{1} and those in N_{2} neighbors of v_{2}. This yields a new tree T^{′}, with the same leaf set as T, and a corresponding multree (T^{′},φ). Contraction and refinement can be viewed as inverses of each other (Figure 1).
Let ${\mathcal{T}}_{1}=({T}_{1},{\phi}_{1})$ and ${\mathcal{T}}_{2}=({T}_{2},{\phi}_{2})$ be multrees on X_{1} and X_{2}, respectively, such that X_{1}∩X_{2}≠∅. We say that ${\mathcal{T}}_{1}=({T}_{1},{\phi}_{1})$ and ${\mathcal{T}}_{2}=({T}_{2},{\phi}_{2})$ have matching label multiplicities if $\left{\phi}_{1}^{1}\right(x\left)\right=\left{\phi}_{2}^{1}\right(x\left)\right$ for all x∈X_{1}∩X_{2}. The RobinsonFoulds (RF) distance between two multrees ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ with identical label sets and matching label multiplicities, denoted by $\mathit{\text{RF}}({\mathcal{T}}_{1},{\mathcal{T}}_{2})$, is defined as the minimum number of contractions and refinements necessary to transform ${\mathcal{T}}_{1}$ into another multree isomorphic to ${\mathcal{T}}_{2}$[33, 34]. (Note that [33] originally defined their distance measure for singlylabeled trees. Later on, [34] showed that the definition extends naturally to multrees.) We extend the RF distance to multrees ${\mathcal{T}}_{1}$, on X_{1}, and ${\mathcal{T}}_{2}$, on X_{2}, with X_{1}⊆X_{2} and matching label multiplicities, as $\mathit{\text{RF}}({\mathcal{T}}_{1},{\mathcal{T}}_{2}):=\mathit{\text{RF}}({\mathcal{T}}_{1},{{\mathcal{T}}_{2}}_{{X}_{1}})$.
Let $\mathcal{T}=(T,\phi )$ be a multree on X. Let M be a multiset on X such that the multiplicity in M of each element x∈X is φ^{1}(x). A split AB of is a bipartition of M, i.e., the sum of multiplicities of each element x∈X in A and B is equal to the multiplicity of x in M. Multisets A and B are the parts of split AB. (Note that if is singlylabeled, then M, A, and B are sets.) The set of all splits induced by the internal edges of a multree is denoted by $\Sigma \left(\mathcal{T}\right)$.
As Figure 2 illustrates, two multrees ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ such that $\Sigma \left({\mathcal{T}}_{1}\right)=\Sigma \left({\mathcal{T}}_{2}\right)$ may not be isomorphic. (See also ([34], Figure five) for a larger example.) On the other hand, by the Splits Equivalence Theorem ([32]; p. 44), if ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ are singlylabeled trees, then $\Sigma \left({\mathcal{T}}_{1}\right)=\Sigma \left({\mathcal{T}}_{2}\right)$ implies that ${\mathcal{T}}_{1}$ and ${\mathcal{T}}_{2}$ are isomorphic. Further, in this case, [33].
Since multrees do not satisfy the Splits Equivalence Theorem, the RF distance between two of them cannot be computed by splits using expression (1). Nevertheless, as we show in Section The MulRF Problem, the formula will be useful for computing the RF distance between input gene tree and a species tree.
Ganapathy et al. [34] gave a worstcase exponential time algorithm for computing the RF distance between two multrees. The next result suggests that a polynomial time algorithm is unlikely.
Theorem 1
Computing the RF distance between two multrees is NPcomplete.
Proof
See the Additional file 1. □
The MulRF Problem
A profile$\mathcal{P}=({\mathcal{T}}_{1},{\mathcal{T}}_{2},\dots ,{\mathcal{T}}_{k})$ is a tuple of multrees, also called input multrees, representing multicopy gene trees, where, for each $i\in \{1,\dots ,k\},{\mathcal{T}}_{i}$ has label set X_{ i }. A species tree for is a singlylabeled phylogenetic tree on Y, where $Y=\bigcup _{i=1}^{k}{X}_{i}$.
A species tree for and a tree in will not, in general, have matching label multiplicities, since is singlylabeled, while need not be. In order to define $\mathit{\text{RF}}(\mathcal{T},\mathcal{S})$, we will extend the species tree to add the missing duplicate leaf labels, thereby converting it into a multree. We explain this formally next.
Let $\mathcal{T}=(T,\phi )$ be an input multree on X and $\mathcal{S}=(S,\varphi )$ be a species tree on Y; thus, X⊆Y. The extension of relative to is a multree$\mathcal{S}\text{*}=(S\text{*},\varphi \text{*})$ on Y, constructed from by doing the following for each vertex $s\in \mathcal{\mathcal{L}}\left(S\right)$ such that φ^{1}(ϕ(s))>1. Let k:=φ^{1}(ϕ(s)). Replace s by an internal vertex connecting to k leaves {l_{1},…,l_{ k }} labeled with ϕ(s); i.e., ∀i(1≤i≤k),ϕ*(l_{ i }) = ϕ(s). See Figure 3. We now define $\mathit{\text{RF}}(\mathcal{T},\mathcal{S})$ to be $\mathit{\text{RF}}(\mathcal{T},\mathcal{S}\text{*})$, where $\mathcal{S}\text{*}$ is the extension of relative to . We define the RF distance from a profileto a species tree for as $\mathit{\text{RF}}(\mathcal{P},\mathcal{S}):=\sum _{\mathcal{T}\in \mathcal{P}}\mathit{\text{RF}}(\mathcal{T},\mathcal{S})$.
Let $\mathcal{\mathcal{B}}\left(\mathcal{P}\right)$ be the set of all binary species trees for .
Problem 1
(RF for MULTrees (MulRF))Instance: A profile $\mathcal{P}=({\mathcal{T}}_{1},{\mathcal{T}}_{2},\dots ,{\mathcal{T}}_{k})$ of multrees.Find: A species tree ${\mathcal{S}}^{\star}$ for such that $\mathit{\text{RF}}(\mathcal{P},{\mathcal{S}}^{\star})=\underset{\mathcal{S}\in \mathcal{\mathcal{B}}\left(\mathcal{P}\right)}{min}\mathit{\text{RF}}(\mathcal{P},\mathcal{S})$.
Observe that the solution to the MulRF problem may not be unique. Further, the MulRF problem is NPhard even when all the input multrees are singly labeled and their leaf label sets are identical [35]. Nevertheless, the “small” version of the problem —computing the RF distance between a profile of multrees and a species tree— is easy to solve. For each input multree , we (i) construct the extension $\mathcal{S}\text{*}$ of the species tree relative to ; (ii) differentiate duplicate leaf labels in both $\mathcal{S}\text{*}$ and ; and (iii) apply the splitbased formula (1) to compute the RF distance between the resulting singlylabeled phylogenetic trees. Next, we explain this process formally.
A full differentiation of a multree $\mathcal{T}=(T,\phi )$ on X is a singlylabeled tree T=(T,φ^{′}) on X^{′}[34]. Note that both and T have identical underlying trees, but the labeling map is surjective in the former, and bijective in the latter. Thus, X and X^{′} may be different sets and X≤X^{′}. Intuitively, a full differentiation of a multree differentiates the leaves that have identical leaf labels.
Let $\mathcal{T}=(T,\phi )$ and $\mathcal{S}=(S,\varphi )$ be two multrees, on X and Y, respectively, such that and have matching label multiplicities. Two full differentiations T=(T,φ^{′}) and S=(S,ϕ^{′}) of and , respectively, are consistent if for each a∈X∩Y,φ^{′}(φ^{1}(a))=ϕ^{′}(ϕ^{1}(a)), i.e, both T and S have same set of new leaf labels for each common leaf label in and . For instance, a consistent full differentiation can be obtained by relabeling each of the k copies of each leaf label a by a_{1},a_{2},…,a_{ k } in both the multrees.
Theorem 2 ([34])
Let $\mathcal{T}=(T,\phi )$ and $\mathcal{S}=(S,\varphi )$ be multrees with matching label multiplicities. Then, $\mathit{\text{RF}}(\mathcal{T},\mathcal{S})=min\left\{\mathit{\text{RF}}\right(\mathbf{T},\mathbf{S}):\mathbf{T}$ and S are mutually consistent full differentiations of and , respectively }.
We can prove the following result.
Theorem 3
Let be a multree in a profile and let be a species tree for . Let $\mathcal{S}\text{*}$ be the extension of relative to . Then, for each pair of consistent full differentiations (T_{1},S_{1}) and (T_{2},S_{2}) of and $\mathcal{S}\text{*}$ we have RF (T_{1},S_{1})= RF (T_{2},S_{2}).
Proof
Let $\mathcal{T}=(T,\phi )$ be the input multree on X. We prove the theorem by showing that for each a∈X, where φ^{1}(a)=k, all k! ways of uniquely relabeling corresponding k leaves in both and $\mathcal{S}\text{*}$ result into the same number of matched and unmatched splits in the corresponding mutually consistent full differentiations. The set of splits in can be divided into two categories: □

Category 1: Splits that have all the leaves labeled with a in one part. Such a split will always have a match, irrespective of the labeling.

Category 2: The remaining splits. Such splits are not present in $\mathcal{S}\text{*}$, therefore, they will never have a match, irrespective of the labeling.
Thus, we can compute the RF distance between an input phylogenetic multree and a species tree by computing the RF distance between any consistent full differentiations of the two trees. Since these full differentiations are singlylabeled trees, the RF distance between them can be found using Equation (1).
Solving the MulRF problem
Our local search heuristic for the MulRF problem starts with an initial (singlylabeled) species tree and explores the space of possible species trees in search of a locally optimum species tree, a species tree for whose score is minimum within its “neighborhood”. The neighborhood is defined in terms of the Subtree Prune and Regraft (SPR) operation [36]. An SPR operation on an unrooted, binary, singlylabeled phylogenetic tree $\mathcal{T}=(T,\phi )$ on X cuts any edge e∈E(T), thereby pruning a subtree t, and then regrafts t by the same cut edge to a new vertex obtained by subdividing a preexisting edge in Tt (Figure 4). The resulting phylogenetic tree is said to be obtained fromby a single SPR operation. The set of all phylogenetic trees obtained by the application of a single SPR operation on is called the SPR neighborhood of , and is denoted by $\mathit{\text{SP}}{R}_{\mathcal{T}}$.
Problem 2
(SPR Search) Instance: A profile $\mathcal{P}=({\mathcal{T}}_{1},{\mathcal{T}}_{2},\dots ,{\mathcal{T}}_{k})$ of multrees and a binary species tree for . Find: A species tree ${\mathcal{S}}^{\star}$ for such that ${\mathcal{S}}^{\star}\in \mathit{\text{SP}}{R}_{\mathcal{S}}$ and $\mathit{\text{RF}}(\mathcal{P},{\mathcal{S}}^{\star})=\underset{{\mathcal{S}}^{\prime}\in \mathit{\text{SP}}{R}_{\mathcal{S}}}{min}\mathit{\text{RF}}(\mathcal{P},{\mathcal{S}}^{\prime})$.
The SPR Search based MulRF algorithm runs in two phases. In phase I, the algorithm quickly builds a likely suboptimal initial species tree using a greedy leaf adding procedure. This procedure first builds a phylogenetic tree on three randomly selected labels, and then it adds the remaining labels one at a time in a randomized order. In phase II, the algorithm performs a series of SPR Search iterations, each of which starts with an initial species tree and the input multrees. The output species tree of one SPR Search iteration serves as the initial species tree for the next iteration. When the resulting species tree of an SPR Search iteration is same as its initial species tree (i.e., there is no improvement in the score), the MulRF algorithm stops and returns the initial species tree of that iteration as the output.
Let the size of the input species tree for the SPR Search problem be n, i.e. $n:=\left\mathcal{S}\right$. For each $\mathcal{T}\in \mathcal{P}$, let $m:=\left\mathcal{T}\right+\left\mathcal{S}\right$. (For convenience, we assume that all the input gene trees have approximately the same size.) Let k be the number of input gene trees in . In Section Solving the SPR search problem, we present an algorithm for the SPR search problem that runs in time Θ(n m k). (More precisely, if the size of the i th gene tree in the input profile be t_{ i } then the complexity of our algorithm is $\Theta \left(\sum _{i=1}^{k}n\right(n+{t}_{i}\left)\right)$. We made the assumption about the size of the input gene trees to simplify this complexity.) The algorithm relies on results from [30], which characterize the RF distance between unrooted phylogenetic trees in terms of least common ancestors in rooted versions of those phylogenies. These properties enable us to update the RF distance quickly after an SPR operation has been applied to one of the trees. For completeness, we briefly review these results in the next subsection. For a full discussion with proofs, see [30].
RobinsonFoulds distance and least common ancestors
In this subsection, we deal exclusively with singlylabeled trees, which we refer to simply as phylogenetic trees.
A phylogenetic tree $\mathbb{T}=(T,\phi )$ is rooted if the underlying tree T is rooted; this means that T has exactly one distinguished vertex r t(T), called the root. A rooted phylogenetic tree is binary if the root has degree two and every other internal vertex has degree three.
Let $\mathbb{T}=(T,\phi )$ be a rooted phylogenetic tree on X. A vertex v of T is internal if $v\in V\left(T\right)\setminus \left(\mathcal{\mathcal{L}}\right(T)\cup \mathit{\text{rt}}(T\left)\right)$. The set of all internal vertices of T is denoted by I(T). We define ≼_{ T } to be the partial order on V(T) where x≼_{ T }y if y is a vertex on the path from r t(T) to x. If {x,y}∈E(T) and x≼_{ T }y, then y is the parent of x and x is a child of y. The least common ancestor (LCA) of a nonempty subset L⊆V(T), denoted by LCA_{ T }(L), is the unique smallest upper bound of L under ≼_{ T }.
For a rooted phylogenetic tree $\mathbb{T}=(T,\phi )$ on X, let T_{ v } denotes the subtree of T rooted at vertex v∈V(T). For each node $v\in I\left(T\right),{C}_{\mathbb{T}}\left(v\right)$ is defined to be the set of leaf labels $\left\{\phi \right(u)\in X:u\in \mathcal{\mathcal{L}}({T}_{v}\left)\right\}$. Set ${C}_{\mathbb{T}}\left(v\right)$ is called a cluster. Let $\mathcal{\mathscr{H}}\left(\mathbb{T}\right)$ denote the set of all clusters of .
The RF distance between rooted phylogenetic trees $\mathbb{T}=(T,\phi ),\mathbb{S}=(S,\varphi )$ on X, Y, respectively, such that X=Y, is defined as [33]
When X⊂Y, we extend the RF distance in the same way as for unrooted trees. That is, $\mathit{\text{RF}}(\mathbb{T},\mathbb{S}):=\mathit{\text{RF}}(\mathbb{T},{\mathbb{S}}_{X})$, where ${\mathbb{S}}_{X}:=({S}_{X},{\varphi}_{X})$ is the rooted phylogenetic tree; here, S_{X} is obtained from S[ X] by suppressing all nonroot degreetwo vertices, ϕ_{X} is the bijective mapping ${\varphi}_{X}:\mathcal{\mathcal{L}}\left({S}_{X}\right)\to X$, where for each $v\in \mathcal{\mathcal{L}}\left({S}_{X}\right),{\varphi}_{X}\left(v\right)=\varphi \left(v\right)$.
Let T and S be two unrooted phylogenetic trees on X and Y, respectively, such that X⊆Y. Let and be the rooted phylogenetic trees that result from rooting the underlying trees of T and S at the branches incident on some arbitrarilychosen but fixed leaf label r∈X (Figure 5). (The leaf label sets of and are X and Y, respectively.)
Lemma 1 ([30])
Let T and S be two unrooted phylogenetic trees on the same leaf label set, then $\mathit{\text{RF}}(\mathbf{T},\mathbf{S})=\mathit{\text{RF}}(\mathbb{T},\mathbb{S}).$
We now show how to compute the RF distance between $\mathbb{T}=(T,\phi )$ on X and $\mathbb{S}=(S,\varphi )$ on Y, when X⊆Y, without explicitly building ${\mathbb{S}}_{X}$. We need two concepts. Let v∈I(S). The restriction of ${C}_{\mathbb{S}}\left(v\right)$ to X is ${\u0108}_{\mathbb{T}}\left(v\right):=\{w\in Y:{\varphi}^{1}(w)\in \mathcal{\mathcal{L}}({S}_{v})\text{and}w\in X\}.$
The vertex function${f}_{\mathbb{S}}$ assigns each u∈I(T) the value ${f}_{\mathbb{S}}\left(u\right)=\leftU\right$, where $U:=\{v\in I(S):{C}_{\mathbb{T}}(u)={\u0108}_{\mathbb{T}}(v\left)\right\}$. Observe that if X=Y, then for all $u\in I\left(T\right),{f}_{\mathbb{S}}\left(u\right)\le 1$.
Lemma 2 ([30])
$\mathit{\text{RF}}(\mathbb{T},\mathbb{S})=\left\mathcal{\mathcal{L}}\right(T\left)\right\leftI\right(T\left)\right+2\left{\mathcal{F}}_{\mathbb{S}}\right2$, where ${\mathcal{F}}_{\mathbb{S}}:=\{u\in I(T):{f}_{\mathbb{S}}(u)=0\}$.
We now describe a O(n)time algorithm to compute the initial vertex function for relative to , along with the RF distance between these two trees. The algorithm relies on LCAs. For and , the LCA mapping${\mathcal{\mathcal{M}}}_{\mathbb{S},\mathbb{T}}:V\left(S\right)\to V\left(T\right)\cup \left\{\xi \right\}$ is defined as
See Figure 6.
Lemma 3 ([30])
For all $u\in I\left(T\right),{f}_{\mathbb{S}}\left(u\right)=\leftB\right$, where $B:=\{v\in I(S):{\mathcal{\mathcal{M}}}_{\mathbb{S},\mathbb{T}}(v)=u$ and $\left{C}_{\mathbb{T}}\right(u\left)\right=\left{\u0108}_{\mathbb{T}}\right(v\left)\right\}$.
The LCA computation of is lineartime in the size of , and the LCA mapping from to can be done in O(n) time [37] in bottomup manner. Further, from Lemmas 2 and 3 we can compute the RF distance between and in O(m) time as well.
Solving the SPR search problem
Let $\mathcal{T}=(T,\phi )$ be a multree (on X) in and $\mathcal{S}=(S,\varphi )$ be the input species tree (on Y). We now show how to compute the RF distance from to each tree in the ${\text{SPR}}_{\mathcal{S}}$ neighborhood in time that is linear in the size of the neighborhood. By Theorem 3, computing the RF distance between and each ${\mathcal{S}}^{\prime}\in \mathit{\text{SP}}{R}_{\mathcal{S}}$ reduces to computing the RF distance between T and each S^{′}, where T and S^{′} are the mutually consistent full differentiations of and the extension of relative to .
Suppose an SPR operation on cuts the edge e={x,y}∈S, and that $\stackrel{\u0307}{x},\stackrel{\u0307}{y}$ are the subtrees of Se containing x, y, respectively. Suppose subtree $\stackrel{\u0307}{y}$ is pruned and regrafted by the same cut edge to a new vertex obtained by subdividing an edge in $\stackrel{\u0307}{x}$. The degreetwo vertex x is suppressed and the new vertex is denoted by x. Observe that there are O(n) possible edges in $\stackrel{\u0307}{x}$ to regraft $\stackrel{\u0307}{y}$. We perform regrafts in an order that leads to a constant time RF distance computation for each successive regraft.
We begin by regrafting $\stackrel{\u0307}{y}$ at an edge incident to a leaf in $\stackrel{\u0307}{x}$. Let $\overline{\mathcal{S}}$ be the phylogenetic tree obtained from performing the pruneandregraft. Let T (on X^{′}) and $\overline{\mathbf{S}}$ (on Y^{′}) be the mutually consistent full differentiations of and the extension of $\overline{\mathcal{S}}$. We compute the RF distance between T and $\overline{\mathbf{S}}$ using the algorithm described in Section RobinsonFoulds distance and least common ancestors. This method works by computing the RF distance between the rooted phylogenetic trees and $\overline{\mathbb{S}}$ obtained by rooting T and $\overline{\mathbf{S}}$ at any leaf label in X^{′}∩Y^{′}. (Note that if $X\cap \varphi \left(\mathcal{\mathcal{L}}\right(\stackrel{\u0307}{x}\left)\right)=\varnothing $ or $X\cap \varphi \left(\mathcal{\mathcal{L}}\right(\stackrel{\u0307}{y}\left)\right)=\varnothing $, then ’s distance from $\overline{\mathcal{S}}$ is the same as its distance from .) The algorithm also computes the LCAs for and the LCA mapping from $\overline{\mathbb{S}}$ to .
We perform the remaining regrafts of $\stackrel{\u0307}{y}$ on edges in $\stackrel{\u0307}{x}$ by iterating through the vertices of $\stackrel{\u0307}{x}$, starting from a leaf and exploring as far as possible along each branch before backtracking. The k^{th} regraft is performed on the edge between the k^{th} and k+1^{st} vertices in this iteration. Let us denote this ordering of edges by ℵ. See Figure 7. Observe that each two distinct consecutive edges in ℵ are adjacent. We will show that, after the initial RF distance computation for $\overline{\mathcal{S}}$, we can compute in constant time the RF distance for the result of regrafting on each successive (adjacent) edges in ℵ.
Beginning with $\overline{\mathcal{S}}$, each ${\mathcal{S}}^{\prime}\in \mathit{\text{SP}}{R}_{\mathcal{S}}$ helps in computing the RF distance of the next tree in the above regraft order. Assume that ${\mathcal{S}}^{\prime}\in \mathit{\text{SP}}{R}_{\mathcal{S}}$ results from regrafting $\stackrel{\u0307}{y}$ at edge {a,b} in $\stackrel{\u0307}{x}$, such that x subdivides the edge {a,b} and neighbors to vertex y in $\stackrel{\u0307}{y}$, as shown in Figure 7. Let the rooted phylogenetic tree obtained after extending and differentiating ${\mathcal{S}}^{\prime}$ be denoted by ${\mathbb{S}}^{\prime}$. The LCA mapping and RF distance have been computed for ${\mathbb{S}}^{\prime}$. Let ${\mathcal{S}}^{\mathrm{\prime \prime}}\in \mathit{\text{SP}}{R}_{\mathcal{S}}$ denote the tree obtained by regrafting $\stackrel{\u0307}{y}$ on edge {b,c} in $\stackrel{\u0307}{x}$ and the rooted counterpart of ${\mathcal{S}}^{\mathrm{\prime \prime}}$ is ${\mathbb{S}}^{\mathrm{\prime \prime}}$.
Next, we find the vertices of ${\mathbb{S}}^{\mathrm{\prime \prime}}$ whose LCA mappings have changed as a result of the SPR operation. Let T, S^{′} and S^{′′} be the underlying trees of $\mathbb{T},{\mathbb{S}}^{\prime}$ and ${\mathbb{S}}^{\mathrm{\prime \prime}}$, respectively. Based on the topology of S^{′}, there are three cases:

1.
x is parent of b and b is parent of c. For all $t\in I\left({S}^{\mathrm{\prime \prime}}\right)\setminus \{x,b\},{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(t\right)$ = ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\left(t\right)$. Further, ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(b\right):={\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\left(x\right)$, and ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(x\right):=\text{LCA}\left({\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\right(c),{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}(y\left)\right)$.

2.
b is parent of c and x. For all $t\in I\left({S}^{\mathrm{\prime \prime}}\right)\setminus \left\{x\right\},{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(t\right)$ = ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\left(t\right)$. Further, ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(x\right):=\text{LCA}\left({\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\right(c),{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}(y\left)\right)$.

3.
b is parent of x and c is parent of b. For all $t\in I\left({S}^{\mathrm{\prime \prime}}\right)\setminus \{b,x\},{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(t\right)$ = ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\left(t\right)$. Moreover, ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(x\right):={\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\left(b\right)$, and ${\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T}}\left(b\right):=\text{LCA}\left({\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}\right(d),{\mathcal{\mathcal{M}}}_{{\mathbb{S}}^{\prime},\mathbb{T}}(a\left)\right)$.
Since we can check in constant time which one of the above three cases holds, the LCA mappings can be updated in constant time too. Let H be a set $\{u\in I(T):{f}_{{\mathbb{S}}^{\mathrm{\prime \prime}}}(u)\ne {f}_{{\mathbb{S}}^{\prime}}(u\left)\right\}$. Observe that H has at most four vertices, and thus it be computed in constant time. Let G denote the set $\{w\in H:{f}_{{\mathbb{S}}^{\prime}}(w)=0,\text{but}{f}_{{\mathbb{S}}^{\mathrm{\prime \prime}}}(w)\ge 1\}$, and L denote the set $\{w\in H:{f}_{{\mathbb{S}}^{\prime}}(w)\ge 1,\text{but}{f}_{{\mathbb{S}}^{\mathrm{\prime \prime}}}(w)=0\}$.
Lemma 4
$\mathit{\text{RF}}({\mathbb{S}}^{\mathrm{\prime \prime}},\mathbb{T})=\mathit{\text{RF}}({\mathbb{S}}^{\prime},\mathbb{T})2\leftG\right+2\leftL\right$.
Proof
□
Thus, after the initial regraft of $\stackrel{\u0307}{y}$ at a leaf in $\stackrel{\u0307}{x}$, we can compute in constant time the RFdistance between and the species tree that results from each subsequent regraft.
Next, we present the results on complexity of our algorithm. Recall that n is the size of the species tree and m is the sum of n and the size of an input gene tree, where all the input gene trees are considered to have approximately the same size.
Lemma 5
Let {x,y} be an edge of S and let $\stackrel{\u0307}{x}$ and $\stackrel{\u0307}{y}$ be the subtrees of S containing x and y, respectively, that result from deleting {x,y}. The RF distance for the set of trees obtained by regrafting $\stackrel{\u0307}{x}$ (resp. $\stackrel{\u0307}{y}$) on each edge in $\stackrel{\u0307}{y}$ (resp. $\stackrel{\u0307}{x}$) can be computed in Θ(m) time.
Proof
The RF distance computation for $\overline{\mathcal{S}}$, obtained by pruning $\stackrel{\u0307}{y}$ and regrafting at a leaf in $\stackrel{\u0307}{x}$, can be done in Θ(m) time. After $\overline{\mathcal{S}}$, the RF distance for each phylogenetic tree ${\mathcal{S}}^{\prime}$ obtained by regrafting $\stackrel{\u0307}{y}$ on each edge in $\stackrel{\u0307}{x}$, can be computed in constant time by performing regrafts in the order of ℵ. There are Θ(n) edges in ℵ, thus the RF computation for all the phylogenetic trees can be done in Θ(m) time. The same argument applies for pruning $\stackrel{\u0307}{x}$ and regrafting on the edges in $\stackrel{\u0307}{y}$. □
Theorem 4
The SPR Search problem can be solved in Θ(n m k) time.
Proof
The underlying tree S of has Θ(n) internal edges. For each edge {x,y} in S, let $\stackrel{\u0307}{x}$ and $\stackrel{\u0307}{y}$ be the subtrees of S defined in the statement of Lemma 5. The RF distance for all the phylogenetic trees obtained by regrafting $\stackrel{\u0307}{x}$ (or $\stackrel{\u0307}{y}$) on each edge in $\stackrel{\u0307}{y}$ (or $\stackrel{\u0307}{x}$) can be computed in Θ(m) time from Lemma 5. Thus the RF distance for k input multrees can be checked in Θ(m k) time. The total execution time for Θ(n) internal edges must be Θ(n m k). □
Experimental evaluation
In order to evaluate the performance of the MulRF method, we implemented the heuristic algorithm of Section Solving the MulRF problem using C/C++. The MulRF software as well as simulated data sets (explained next) are freely available for download at http://genome.cs.iastate.edu/CBL/MulRF/.
Simulated data set
Methods
Methods.
We used simulation experiments to evaluate the performance of MulRF and compare it to GTP methods. Since the MulRF method is designed for use with multicopy gene trees, we focus on simulating genes that could have a history of duplication and/or lateral transfer. We first generated model species trees using the uniform speciation (Yule) module in the program Mesquite [38]. Two sets of model trees were generated: i) 50taxon trees of height 220 thousand years (tyrs), ii) 100taxon trees of height 440 tyrs. Note that the dates are relative; they do not have to represent thousands of years.
Next, we evolved 150 and 300 gene trees for each 50 and 100taxon model species tree, respectively. For each gene tree a single gene birth node is chosen from the species tree nodes. Among all the simulated gene trees for a species tree, four gene trees have the gene birth node that is same as the root of the model tree. This represents the sampling that would result from an experiment looking at genes from a few distantly related species. The rest had a gene birth node, which is selected at random using the model species tree topology and branch lengths. Starting from the children of the root, a Poisson process is tested along the parent edge of each node. If the birth occurs, the corresponding node becomes the birth node for that gene tree. This represents the sampling that would result from a study of closely related species.
We simulated the evolution of the gene trees within the model species tree using our C++ implementation of the duplicationloss model of [39]. We applied LGT events on the evolved gene trees, using the standard subtree transfer model of LGT. One LGT event causes the subtree rooted at a vertex c to be pruned and regrafted at an edge (a,b), where a and b together are not in the path from the root (of the tree) to c. We used gene duplication and loss (D/L) rate of 0.002 events/gene per tyrs and LGT rate of 0 to 2 events per gene tree. Note that the gene tree simulations without LGT follow a molecular clock model (equal rates of molecular evolution along all branches of the gene tree), but the simulations with LGT violate the molecular clock.
We generated gene trees based on four evolutionary scenarios: i) no duplications, losses, or LGT (called none), ii) D/L rate 0.002 and no LGT (called dl), iii) no duplication or loss, and LGT rate 2 (called lgt), and iv) D/L rate 0.002 and LGT rate 2 (called both). The parameter values (evolutionary scenario and model tree size) for each simulation are called the model condition; 20 model species trees were generated for each model condition. We deleted 0 to 25% of leaves (selected at random) from each gene tree to represent missing data or unsampled, which is common in almost all phylogenomic studies. For each gene tree, we used SeqGen [40] to simulate a DNA sequence alignment of length 500 based on the GTR+Gamma+I model. The parameters of the model were chosen with equal probability from the parameter sets estimated in [41] on three biological data sets, following [42]. We estimated maximum likelihood trees from each simulated sequence alignment using RAxML [43], performing searches from 5 different starting trees and saving the best tree. Since the true root of a gene tree with possible duplication and loss often is unknown, we rooted each estimated gene tree at the midpoint of the longest leaftoleaf path using Retree [44] before the species tree construction.
Species tree estimation.
We estimated species trees with GTP minimizing only the number of duplications (Onlydup) [13], GTP minimizing duplications and losses (Duploss) [14], GTP minimizing LGT events (SPR supertree or SPRS for short) [16], and the MulRF heuristic. Both Onlydup and Duploss were executed with their default settings, including a fast leafadding heuristic for initial species tree construction. SPRS was run with 25 iterations of the global rearrangement search option. For 50taxon data sets, it calculated the exact rSPR distance if it was 15 or less, and otherwise it estimated the rSPR distance using the 3approximation. For the 100taxon data sets, we used the 3approximation of the rSPR distance. SPRS does not allow multrees as input. Therefore we only ran it on none and lgt data sets. Experiments were performed on the University of Florida High Performance Computing (HPC) cluster. We performed the experiments on the HPC cluster in order to simultaneously run the many simulations and phylogenetic analyses. However, all of the analyses (including SPRS, GTP, and MulRF) are sequential and easily run on a desktop machine. The running times are given in Table 1. The HPC cluster has cores of 2.3, 2.6, 2.9, or 2.66GHz on Opteron or Intel processors with 2 to 4GB RAM.
Results.
We report the average topological error (ATE) for each model condition. This is the average of the normalized RF distance (dividing the RF distance by number of internal edges in both trees) between each of the 20 model species trees and their estimated species trees. An ATE of 0 indicates that two trees are identical, and an ATE of 100 indicates that two trees share no common splits.
For each set of 50 and 100taxon model trees, the MulRF species trees are more accurate (lower ATE rate) than those produced by the other three methods. The ATE rate of MulRF is 16.75% to 39.91% lower than the method of lowest ATE rate among other three methods (Figure 8).
In order to examine how Onlydup, Duploss, and SPRS methods perform when gene tree simulations only include events that these methods assume to be the source of discordance, we studied the performance of Onlydup and Duploss on evolutionary scenario dl and SPRS on lgt. In both evolutionary scenarios we found that the ATE rate of MulRF was lowest for both 50 and 100taxon data sets (Figure 8). Surprisingly for lgt, while the ATE rate of SPRS was lower than Onlydup and Duploss in 50taxon, the ATE rate of the former was much higher than that of the latter two in 100taxon data sets (Figure 8).
We also examined the accuracy of species tree estimates by Onlydup, Duploss, and SPRS when gene tree simulations include events that these methods do not assume to be the source of discordance (e.g., dl and both for SPRS, lgt and both for Onlydup and Duploss). While SPRS could not be tested on dl and both because they included multrees, Onlydup and Duploss had high ATE rate for lgt and both (Figure 8). In general, Onlydup’s estimate had much higher ATE rate compared to Duploss in the presence of LGT events; the ATE rate of MulRF was lowest among all the methods (Figure 8).
Biological data set
We also tested the performance of the MulRF method on a gene tree data set from 22 plant species. These species were chosen because they are phylogenetically diverse, and they all have fully sequenced and annotated genome sequences. This makes it possible to obtain a large number of gene trees with potentially no missing sequences. Furthermore, there is much support for the relationships among most of these species (e.g., [45]), and therefore, it provides an empirical system on which we can evaluate the performance of MulRF. We obtained nucleotide alignments from gene families that had been generated from genome sequences with OrthMCL [46] and aligned with MAFFT [47]. We selected the gene family alignments that contained sequences from at least 20 of the 22 species and had a maximum of 50 gene sequences. This was a total of 1910 gene alignments. We estimated maximum likelihood trees for all of the genes using GTRCAT model in RAxML [43]. The unrooted gene trees were used as input for the MulRF heuristic. The Onlydup and Duploss methods require rooted input trees. Thus, we rooted all of the gene trees on the longest branch using Newick utilities [48]. This is similar to midpoint rooting, and in our experience it often provides a good starting point for input gene trees in GTP analyses. The rooted gene trees were used as input for GTP analysis using Onlydup [13] and Duploss [14]. Onlydup and Duploss were executed with the default SPR search, including a fast leafadding heuristic for initial species tree construction, and searching for an optimal root by rerooting the gene trees after each SPR search (e.g., [13]; [17]). We could not run SPRS on this data set because it contains multrees.
The MulRF heuristic completed in 4 minutes and 4 seconds on a Mac laptop with a 2.26 GHz Intel processor and 4GB RAM. The resulting species tree is largely consistent with the most recent phylogenetic analyses (Figure 9; e.g., [45]). Amborella is sister to the other angiosperms and monocots and eudicots form clades. Within the eudicots there is a coreeudicot clade, and within the coreeudicots the rosid clade is sister to the asterid clade. The malvids are sister to the fabids within the rosids. Interestingly, Populus groups with the malvids, consistent with recent analyses of nuclear and mitochondrial, but not chloroplast, data (e.g., [49]; [17]). There are two minor differences from the generally accepted reltionships: Phoenix should be sister to Musa + Poaceae rather than sister to Musa, and Aquilegia should be sister to the other eudicots rather than Nelumbo. The ATE for the MulRF tree is 0.11. Thus, it appears that MulRF can quickly estimate a nearly accurate species trees from largescale plant genomic data sets. The Onlydup tree heuristic completed in 7 seconds, and if we unroot the result, it is identical to the MulRF tree. The Duploss tree, which completed in 7 seconds, had a less accepted topology, placing Amborella sister to the monocots instead of sister to other angiosperms and Vitis sister to the asterids rather than with the rosids. The ATE for the Duploss tree is 0.21.
Conclusion
We presented the new MulRF method for inferring species tree from incongruent gene trees that is based on a generalized form of the RF distance. Unlike most previous phylogenetic methods using gene trees, our approach is based on a generic tree distance measure that is not linked to any specific biological processes. As a result, it is intuitively appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplications and losses, and LGT, as well as phylogenetic error likely contribute to gene tree discord. In simulation experiments, the MulRF method estimated species trees more accurately than several GTP methods, and it appears to be relatively robust to the effects of phylogenetic error, gene duplication and loss, and LGT. In addition, the MulRF method is fast, estimating 100taxon species trees from hundreds of gene trees in under two minutes and a plant data set with 22 taxa and nearly 2000 gene trees in just over 4 minutes.
Our simulation experiments greatly simplify the true processes of genomic evolution. We focused only on processes that reflect the objectives of the GTP methods, and we emphasized on duplication and loss, because that especially relevant to the evolution of multicopy gene trees. Still, even in these conditions in which we might expect GTP to perform well, we find that MulRF obtains more accurate results than GTP in most instances. This does not mean that MulRF will always outperform GTP, but we suggest that MulRF can quickly provide an interesting alternate perspective on species tree inference. More tests are needed to characterize the performance of MulRF methods under different evolutionary scenarios.
Another future direction will be to incorporate estimates of gene tree uncertainty into the supertree analysis by weighing the splits differently when computing the RF distance. Also, the effectiveness of the MulRF method in inferring species trees from multicopy gene trees suggests that other tree distance measures can be used in the same context. A natural candidate for study is the quartet distance. Future work should also evaluate the suitability of different distance metrics in estimating species trees under different error models and evolutionary scenarios.
References
 1.
Maddison WP: Gene trees in species trees. Syst Biol. 1997, 46: 523536. 10.1093/sysbio/46.3.523.
 2.
Avise J, Shapira J, Daniel S, Aquadro C, Lansman R: Mitochondrial DNA differentiation during the speciation process in Peromyscus. Mol Biol Evol. 1983, 1: 3856.
 3.
Doyle J: Gene trees and species trees: molecular systematics as onecharacter taxonomy. Syst Bot. 1993, 17: 144163.
 4.
Goodman M, Czelusniak J, Moore GW, RomeroHerrera AE, Matsuda G: Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979, 28: 132163. 10.2307/2412519.
 5.
Maddison W: Molecular approaches and the growth of phylogenetic biolog. Molecular Zoology: Advances, Strategies and Protocols. Edited by: Ferraris JD, Palumbi SR. 1996, 4763. New York: WileyLiss
 6.
Pamilo P, Nei M: Relationships between gene trees and species trees. Mol Biol Evol. 1988, 5: 568583.
 7.
Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009, 25 (7): 971973.
 8.
Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V: Genomescale coestimation of species and gene trees. Genome Res. 2012, 23: 323330.
 9.
Ané C, Larget B, Baum DA, Smith SD, Rokas A: Bayesian estimation of concordance among gene trees. Mol Biol Evol. 2007, 24 (7): 1575
 10.
Liu L, Pearl DK: Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007, 56 (3): 504514.
 11.
Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007, 7: 214.
 12.
GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998, 14 (9): 819820.
 13.
Wehe A, Bansal MS, Burleigh JG, Eulenstein O: DupTree: a program for largescale phylogenetic analyses using gene tree parsimony. Bioinformatics. 2008, 24: 13
 14.
Bansal MS, Burleigh JG, Eulenstein O: Efficient genomescale phylogenetic analysis under the duplicationloss and deep coalescence cost models. BMC Bioinformatics. 2010, 11 (Suppl 1): S42
 15.
Yu Y, Warnow T, Nakhleh L: Algorithms for MDCbased multilocus phylogeny inference. RECOMB. 2011, 531545. Heidelberg: SpringerVerlag Berlin
 16.
Whidden C, Zeh N, Beiko R: SPRSupertrees. Version 1.1.0. 2012, [http://kiwi.cs.dal.ca/Software/SPRSupertrees], []
 17.
Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ: Genomescale phylogenetics: inferring the plant tree of life from 18, 896 discordant gene trees. Syst Biol. 2011, 60 (2): 117125.
 18.
Huang H, Knowles LL: What is the danger of the anomaly zone for empirical phylogenetics?. Syst Biol. 2009, 58: 527536.
 19.
Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evol Biol. 2007, 7 (suppl 1): S310.1186/147121487S1S3. [http://www.biomedcentral.com/14712148/7/S1/S3].
 20.
Than C, Rosenberg N: Consistency properties of species tree inference by minimizing deep coalescences. J Comput Biol. 2011, 18: 115.
 21.
Cui Y, Jansson J, Sung WK: Algorithms for building consensus MULtrees. International Symposium on Algorithms and Computation (ISAAC’2011), LNCS 7074. 2011, 744753. Heidelberg: SpringerVerlag Berlin
 22.
Cui Y, Jansson J, Sung W: Polynomialtime algorithms for building a consensus MULtree. J Comput Biol. 2012, 19: 10731088.
 23.
Huber KT, Lott M, Moulton V, Spillner A: The complexity of deriving multilabeled trees from bipartitions. J Comput Biol. 2008, 15: 639651.
 24.
Huber K, Moulton V, Spillner A: Computing a consensus of multilabeled trees. Proceedings of the 14th Workshop on Algorithm Engineering and Experiments (ALENEX 2012). 2012, 8492.
 25.
Huber KT, Spillner A, Suchecki R, Moulton V: Metrics on multilabeled trees: interrelationships and diameter bounds. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8: 10291040.
 26.
Guillemot S, Jansson J, Sung WK: Computing a smallest multilabeled phylogenetic tree from rooted triplets. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8: 11411147.
 27.
Bogdanowicz D, Giaro K: Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9: 150160.
 28.
Lin Y, Rajan V, Moret BM: A metric for phylogenetic trees based on matching. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9: 10141022.
 29.
Bansal MS, Burleigh JG, Eulenstein O, FernándezBaca D: RobinsonFoulds supertrees. Algorithms Mol Biol. 2010, 5: 18.
 30.
Chaudhary R, Burleigh JG, FernándezBaca D: Fast local search for unrooted RobinsonFoulds supertrees. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9: 10041013.
 31.
Steel M, Rodrigo A: Maximum likelihood supertrees. Syst Biol. 2008, 57: 2.
 32.
Semple C, Steel M: Phylogenetics. 2003, New York: Oxford University Press Inc
 33.
Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131147. 10.1016/00255564(81)900432.
 34.
Ganapathy G, Goodson B, Jansen R, Le H, Ramachandran V, Warnow T: Pattern identification in biogeography. IEEE/ACM Trans Comput Biol Bioinform. 2006, 3: 334346.
 35.
McMorris FR, Steel MA: The complexity of the median procedure for binary trees. Proceedings of the International Federation of Classification Societies. 1994, Heidelberg: SpringerVerlag Berlin
 36.
Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Ann Combinatorics. 2001, 5: 115. 10.1007/s0002600180068.
 37.
Bender MA, FarachColton M: The LCA Problem Revisited. LATIN, Volume 1776 of Lecture Notes in Computer Science. Edited by: Gonnet GH, Panario D, Viola A. 2000, 8894. Heidelberg: SpringerVerlag Berlin
 38.
Maddison WP, Maddison D: Mesquite: a modular system for evolutionary analysis. Version 2.6. 2009, [http://mesquiteproject.org], []
 39.
Arvestad L, Berglund AC, Lagergren J, Sennblad B: Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics. 2003, 19 (Suppl 1): i7i15.
 40.
Rambaut A, Grassly NC: SeqGen: An application for the MonteCarlo simulation of DNA sequence evolution along phylogenetic trees. Copmput Appl Biosci. 1997, 13: 235238.
 41.
Ganapathy G: Algorithms and heuristics for combinatorial optimization in phylogeny. PhD thesis, University of Texas at Austin 2006
 42.
Swenson MS, Barbançon F, Warnow T, Linder CR: A simulation study comparing supertree and combined analysis methods using SMIDGen. Algorithms Mol Biol. 2010, 5: 8
 43.
Stamatakis A: RAxMLVIHPC: Maximum likelihood based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 26882690.
 44.
Felsenstein J: Retree software. 1993, [http://evolution.genetics.washington.edu/phylip/doc/retree.html].
 45.
Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, RefulioRodriguez NF, Walker JB, Moore MJ, Carlsward BS, Bell CD, Latvis M, Crawley S, Black C, Diouf D, Xi Z, Rushworth CA, Gitzendanner MA, Sytsma KJ, Qiu YL, Hilu KW, Davis CC, Sanderson MJ, Beaman RS, Olmstead RG, Judd WS, Donoghue MJ, Soltis PS: Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot. 2011, 98: 704730.
 46.
Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS: OrthoMCLDB: querying a comprehensive multispecies collection of ortholog groups. Nucleic Acids Res. 2006, 34: D363D368.
 47.
Katoh K, ichi Kuma, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511518.
 48.
Junier T, Zdobnov EM: The Newick utilities: highthroughput phylogenetic tree processing in the Unix shell. Bioinformatics. 2010, 26: 16691670.
 49.
Qiu YL, Li L, Wang B, Xue JY, Hendry TA, Li RQ, Brown JW, Liu Y, Hudson GT, Chen ZD: Angiosperm phylogeny inferred from sequences of four mitochondrial genes. J Syst Evol. 2010, 48: 391425. 10.1111/j.17596831.2010.00097.
Acknowledgements
This work was supported in part by NSF grants DEB0829674 (to D.F.B), CCF1048217 (to J.G.B), and DEB1208428 (to J.G.B.). We thank the reviewers for carefully reading the entire manuscript and offering many useful suggestions.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RC designed and implemented the algorithm, developed the NPcompleteness proof, performed simulation experiments, and wrote major parts of the paper. JGB conducted experiments on biological data set and contributed to the writing of the paper. DFB supervised the project and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Chaudhary, R., Burleigh, J.G. & FernándezBaca, D. Inferring species trees from incongruent multicopy gene trees using the RobinsonFoulds distance. Algorithms Mol Biol 8, 28 (2013) doi:10.1186/17487188828
Received
Accepted
Published
DOI
Keywords
 Gene Tree
 Lateral Gene Transfer
 Lateral Gene Transfer Event
 Rooted Phylogenetic Tree
 Deep Coalescence