Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance

Background Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation. Results We present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem. We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees. Conclusions Our new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets.

http://www.almob.org/content/8/1/28 the gene tree and species tree topologies using an optimality criterion based on a specific evolutionary process, such as gene duplication and loss or deep coalescence. In this paper we consider the problem of constructing species tree from gene trees using a tree distance measure that is not based on a specific biological process. We evaluate how our method performs in gene tree simulation experiments and with a large genomic data set from plants.
Existing methods for inferring species trees from collections of gene trees can be divided into two broad categories: non-parametric methods based on gene tree parsimony (GTP), and parametric methods that use likelihood (e.g., [7,8]) or Bayesian frameworks (e.g., [9][10][11]). GTP methods take a collection of discordant gene trees and try to find the species tree that implies the fewest evolutionary events. GeneTree [12], DupTree [13], and DupLoss [14] seek to minimize the number of duplications or duplications and losses. GeneTree [12], Mesquite [1], PhyloNet [15], and the method of [14] minimize deep coalescence events. The Subtree Prune and Regraft (SPR) supertree method [16] is based on minimizing the number of LGT events, and thus, it also can be considered a GTP method. Some of these methods have fast and effective heuristics, enabling the analysis of very large data sets. However, errors in the gene trees can mislead GTP analyses [17][18][19]. Furthermore, in some cases GTP methods may be statistically inconsistent, even when the gene tree topologies are correct [20]. Parametric methods exist based on either coalescence [7,10] or gene duplication and loss models [8]. Although such approaches have a strong statistical foundation, they can be extremely computationally expensive.
While the existing methods differ widely in their details, with the exception of [9], they are based on assumptions about the specific biological cause of discordance among gene trees. For example, GTP methods based on a duplication and loss cost implicitly assume that the differences between a gene tree and the species tree are caused by either gene duplications or losses. This does not necessarily mean that these methods will fail when their assumptions are incorrect, but it suggests that it is important to explore a range of different objectives for reconciling gene trees.
We present a new approach for constructing a species tree from discordant multi-copy gene trees based on a generic, non-biological distance measure. Our distance measure generalizes the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees) or trees where multiple leaves can have the same label. Our method takes as input a collection of multi-copy gene trees (mul-trees) and finds a species tree at minimum RF distance to the input gene trees. Our contributions are as follows: • We study the problem of computing the RF distance between two mul-trees, and show that it is NP-complete (Theorem 1). • We formulate an RF problem for mul-trees (MulRF) that takes a collection of multi-copy gene trees as input and constructs a binary species tree that is at minimum RF distance from each input gene tree (Section The MulRF Problem). A key component of this approach is a simple and efficient technique to compute the RF distance between an input multi-copy gene tree and a singly-labeled species tree. (Note the contrast with the previously-mentioned NP-completeness result.) • MulRF is an NP-hard problem, so heuristics are required to estimate solutions for large data sets. We provide a fast (nmk)-time algorithm for the MulRF problem, where n is the total number of distinct leaf labels in the input collection of gene trees, m is the sum of n and the number of gene sequences in a input gene tree (assuming for convenience that all the input gene trees are built on approximately the same number of gene sequences), and k is the number of input gene trees (Section Solving the MulRF problem). • We implemented the MulRF heuristic and examined its performance on simulated gene tree data sets that incorporate gene tree error, gene duplication and loss, and/or lateral gene transfer and a data set of nearly 2000 plant gene trees (Section Experimental evaluation).
We note that there has been much recent work on mul-trees ranging from constructing strict and majority rule consensus mul-trees to deriving diameter bounds for various metrics on mul-trees (see, [21][22][23][24][25][26]). Further, various problems related to RF distance have received attention. The RF distance has been extended to increase its robustness without sacrificing polynomial time computability [27,28]. These methods appear to work well when both input trees are singly-labeled, but there are no direct extensions of them for mul-trees. Alternatively, RF distance has been used in the supertree method for singly-labeled input trees [29,30] and the maximum-likelihood supertree approach of [31]. Here, we use RF distance for building species trees from mul-trees, which allows us to incorporate a wealth of genomic data from multi-copy genes into phylogenetic inference.
Our heuristic algorithm for MulRF problem shares several core concepts with unrooted RF supertree (URF) algorithm of [30], but there are theoretical and practical differences. In particular, our local search heuristic of MulRF is based on the SPR operation, unlike the p-Edge Contract and Refine operation (p-ECR) used for URF [30]. http://www.almob.org/content/8/1/28 Typically, the SPR operation is more effective in exploring the space of trees compared to p-ECR operation; this also enables the MulRF heuristic to run as a standalone application on the given gene trees, independent from the rooted RF heuristic of [29]. In contrast, the p-ECRbased URF algorithm of [30] uses the output of the rooted method of [29] as a starting tree.
We performed gene tree simulation experiments to evaluate the accuracy of our method by comparing it against the model species tree used to simulate the data. We compared the species trees constructed by MulRF and GTP methods that consider only duplication [13], duplication and loss [14], and only LGT [16] with the true species trees. Our simulated data sets were too large to analyze with parametric methods, so we were unable to compare MulRF with these approaches. For example, when we ran Phyldog [8] on a single 50-taxon, 400 gene data set using 4 cores, it did not converge on a species tree in 110 hours. In contrast, MulRF gave an answer within a few seconds.
In all experiments, MulRF produced trees that are more similar to the true species trees than those obtained by the three GTP methods. This suggests that MulRF may be more robust than GTP methods to complex processes of gene evolution, including LGT, and in the presence of gene tree error. Furthermore, our algorithm runs quickly on moderate-size data sets, finishing in under two minutes on data sets containing 300 gene trees evolved over 100 taxon species trees. This suggests it is scalable for large-scale phylogenomic analyses. Finally, we examined the performance of the MulRF method on an unpublished plant gene tree data set with nearly 2000 gene trees from 22 species. The resulting species tree from the MulRF method was largely consistent with published plant phylogenies.

Preliminaries
Let X be a finite set of labels. A phylogenetic mul-tree on X (or mul-tree, for short) is a pair T = (T, ϕ) consisting of an unrooted tree T, whose leaf set is denoted by L(T), and where every internal vertex has degree at least three, along with a surjective map ϕ : L(T) → X. The tree T is called the underlying tree of T and ϕ is called the labeling map of T . We say that T is a singly-labeled tree if ϕ is a bijection between L(T) and X (i.e., |ϕ −1 (x)| = 1 for all x ∈ X). Singly-labeled trees are also referred to as phylogenetic X-trees ( [32]; page 17).
A mul-tree T = (T, ϕ) is binary if every internal vertex of T has degree 3. A vertex of T is said to be unresolved if its degree is greater than three. We use V (T) and E(T) to denote the set of vertices and the set of edges of T. The set of all internal vertices of T is I(T) := V (T)\L(T). The size of T , denoted by |T |, is the number of elements in L(T).
Let T = (T, ϕ) be a mul-tree on X and U ⊆ X. Let T[U] denotes the minimum subtree of T induced by the elements of {v ∈ L(T) : ϕ(v) ∈ U}. The restriction of T to U, denoted T |U is the tree obtained from T[U] by suppressing all vertices of degree two. The restriction of ϕ to U, denoted ϕ |U is the surjective mapping ϕ |U : The restriction of T to U, denoted by T |U , is the mul-tree on U given by T |U = (T |U , ϕ |U ).
Two mul-trees which induces a bijection between E 1 and E 2 , subject to the condition that ϕ 1 (u) = ϕ 2 (τ (u)) for all u ∈ L(T 1 ).
We define two basic operations on a mul-tree (T, ϕ). The contraction of an internal edge of T collapses that edge and identifies its two endpoints, yielding a new tree T and a corresponding mul-tree (T , ϕ). (Note that, since T and T have the same leaf sets, ϕ is also defined on T .) Let v be an unresolved vertex of T. A refinement of v is obtained by partitioning the set of neighbors of v into two sets N 1 and N 2 , such that |N 1 |, |N 2 | > 1, replacing v by two vertices v 1 and v 2 connected by an edge, and making the vertices of N 1 neighbors of v 1 and those in N 2 neighbors of v 2 . This yields a new tree T , with the same leaf set as T, and a corresponding mul-tree (T , ϕ). Contraction and refinement can be viewed as inverses of each other ( Figure 1).
Let T 1 = (T 1 , ϕ 1 ) and T 2 = (T 2 , ϕ 2 ) be mul-trees on X 1 and X 2 , respectively, such that X 1 ∩ X 2 = ∅. We say that T 1 = (T 1 , ϕ 1 ) and Robinson-Foulds (RF) distance between two mul-trees T 1 and T 2 with identical label sets and matching label multiplicities, denoted by RF(T 1 , T 2 ), is defined as the minimum number of contractions and refinements Figure 1 Contraction and refinement. Contracting edge {u,v} in the mul-tree on the left produces the mul-tree on the right. Conversely, refinement of vertex u in the mul-tree on the right produces the mul-tree on the left. http://www.almob.org/content/8/1/28 necessary to transform T 1 into another mul-tree isomorphic to T 2 [33,34]. (Note that [33] originally defined their distance measure for singly-labeled trees. Later on, [34] showed that the definition extends naturally to mul-trees.) We extend the RF distance to mul-trees T 1 , on X 1 , and T 2 , on X 2 , with X 1 ⊆ X 2 and matching label multiplicities, as , if T 1 and T 2 are singly-labeled trees, then (T 1 ) = (T 2 ) implies that T 1 and T 2 are isomorphic. Further, in this case, [33].
Since mul-trees do not satisfy the Splits Equivalence Theorem, the RF distance between two of them cannot be computed by splits using expression (1). Nevertheless, as we show in Section The MulRF Problem, the formula will be useful for computing the RF distance between input gene tree and a species tree.
Ganapathy et al. [34] gave a worst-case exponential time algorithm for computing the RF distance between two mul-trees. The next result suggests that a polynomial time algorithm is unlikely.

Theorem 1. Computing the RF distance between two multrees is NP-complete.
Proof. See the Additional file 1.

The MulRF Problem
A species tree S for P and a tree T in P will not, in general, have matching label multiplicities, since S is singlylabeled, while T need not be. In order to define RF(T , S), we will extend the species tree to add the missing duplicate leaf labels, thereby converting it into a mul-tree. We explain this formally next.
Let T = (T, ϕ) be an input mul-tree on X and S = (S, φ) be a species tree on Y ; thus, X⊆Y . The extension of S relative to T is a mul-tree S* = (S*, φ*) on Y, constructed from S by doing the following for each vertex Replace s by an internal vertex connecting to k leaves Figure 3. We now define RF(T , S) to be RF(T , S*), where S* is the extension of S relative to T . We define the RF distance from a profile P to a species tree S for P as RF(P, S) := T ∈P RF(T , S).
Let B(P) be the set of all binary species trees for P.

Problem 1. (RF for MUL-Trees (MulRF))
Observe that the solution to the MulRF problem may not be unique. Further, the MulRF problem is NP-hard even when all the input mul-trees are singly labeled and their leaf label sets are identical [35]. Nevertheless, the "small" version of the problem -computing the RF distance between a profile of mul-trees and a species treeis easy to solve. For each input mul-tree T , we (i) construct the extension S* of the species tree relative to T ; (ii) differentiate duplicate leaf labels in both S* and T ; and (iii) apply the split-based formula (1) to compute the RF distance between the resulting singly-labeled phylogenetic trees. Next, we explain this process formally.
A full differentiation of a mul-tree T = (T, ϕ) on X is a singly-labeled tree T = (T, ϕ ) on X [34]. Note that both T and T have identical underlying trees, but the labeling map is surjective in the former, and bijective in the latter. Thus, X and X may be different sets and |X| ≤ |X |. Intuitively, a full differentiation of a mul-tree differentiates the leaves that have identical leaf labels.
Let T = (T, ϕ) and S = (S, φ) be two mul-trees, on X and Y, respectively, such that T and S have matching label multiplicities. Two full differentiations T = (T, ϕ ) and We can prove the following result.
Theorem 3. Let T be a mul-tree in a profile P and let S be a species tree for P. Let S* be the extension of S relative to T . Then, for each pair of consistent full differentiations Proof. Let T = (T, ϕ) be the input mul-tree on X. We prove the theorem by showing that for each a ∈ X, where |ϕ −1 (a)| = k, all k! ways of uniquely relabeling corresponding k leaves in both T and S* result into the same number of matched and unmatched splits in the corresponding mutually consistent full differentiations. The set of splits in T can be divided into two categories: • Category 1: Splits that have all the leaves labeled with a in one part. Such a split will always have a match, irrespective of the labeling. • Category 2: The remaining splits. Such splits are not present in S*, therefore, they will never have a match, irrespective of the labeling.
Thus, we can compute the RF distance between an input phylogenetic mul-tree and a species tree by computing the RF distance between any consistent full differentiations of the two trees. Since these full differentiations are singlylabeled trees, the RF distance between them can be found using Equation (1).

Solving the MulRF problem
Our local search heuristic for the MulRF problem starts with an initial (singly-labeled) species tree and explores the space of possible species trees in search of a locally optimum species tree, a species tree for P whose score is minimum within its "neighborhood". The neighborhood is defined in terms of the Subtree Prune and Regraft (SPR) operation [36]. An SPR operation on an unrooted, binary, singly-labeled phylogenetic tree T = (T, ϕ) on X cuts any edge e ∈ E(T), thereby pruning a subtree t, and then regrafts t by the same cut edge to a new vertex obtained by subdividing a pre-existing edge in T − t (Figure 4). The resulting phylogenetic tree is said to be obtained from T by a single SPR operation. The set of all phylogenetic trees obtained by the application of a single SPR operation on T is called the SPR neighborhood of T , and is denoted by SPR T .

Problem 2. (SPR Search)
Instance: A profile P = (T 1 , T 2 , . . . , T k ) of mul-trees and a binary species tree S for P. Find: A species tree S for P such that S ∈ SPR S and RF(P, S ) = min S ∈SPR S RF(P, S ).
The SPR Search based MulRF algorithm runs in two phases. In phase I, the algorithm quickly builds a likely suboptimal initial species tree using a greedy leaf adding procedure. This procedure first builds a phylogenetic tree on three randomly selected labels, and then it adds the remaining labels one at a time in a randomized order. In phase II, the algorithm performs a series of SPR Search iterations, each of which starts with an initial species tree and the input mul-trees. The output species tree of one SPR Search iteration serves as the initial species tree for the next iteration. When the resulting species tree of an SPR Search iteration is same as its initial species tree (i.e., Figure 4 Species tree extension. From left to right, input mul-tree T , the species tree S, and the mul-tree S*. S* is the extension of S relative to T . http://www.almob.org/content/8/1/28 there is no improvement in the score), the MulRF algorithm stops and returns the initial species tree of that iteration as the output.
Let the size of the input species tree S for the SPR Search problem be n, i.e. n := |S|. For each T ∈ P, let m := |T | + |S|. (For convenience, we assume that all the input gene trees have approximately the same size.) Let k be the number of input gene trees in P. In Section Solving the SPR search problem, we present an algorithm for the SPR search problem that runs in time (nmk). (More precisely, if the size of the ith gene tree in the input profile be t i then the complexity of our algorithm is ( We made the assumption about the size of the input gene trees to simplify this complexity.) The algorithm relies on results from [30], which characterize the RF distance between unrooted phylogenetic trees in terms of least common ancestors in rooted versions of those phylogenies. These properties enable us to update the RF distance quickly after an SPR operation has been applied to one of the trees. For completeness, we briefly review these results in the next subsection. For a full discussion with proofs, see [30].

Robinson-Foulds distance and least common ancestors
In this subsection, we deal exclusively with singly-labeled trees, which we refer to simply as phylogenetic trees.
A phylogenetic tree T = (T, ϕ) is rooted if the underlying tree T is rooted; this means that T has exactly one distinguished vertex rt(T), called the root. A rooted phylogenetic tree is binary if the root has degree two and every other internal vertex has degree three.
Let T = (T, ϕ) be a rooted phylogenetic tree on X. A ver- The RF distance between rooted phylogenetic trees T = (T, ϕ), S = (S, φ) on X, Y, respectively, such that X = Y , is defined as [33]

RF(T, S) := |(H(T)\H(S)) ∪ (H(S)\H(T))|.
When X ⊂ Y , we extend the RF distance in the same way as for unrooted trees. That is, RF(T, S) := RF(T, S |X ), where S |X := (S |X , φ |X ) is the rooted phylogenetic tree; here, S |X is obtained from S[X] by suppressing all nonroot degree-two vertices, φ |X is the bijective mapping φ |X : Let T and S be two unrooted phylogenetic trees on X and Y, respectively, such that X ⊆ Y . Let T and S be the rooted phylogenetic trees that result from rooting the underlying trees of T and S at the branches incident on some arbitrarily-chosen but fixed leaf label r ∈ X ( Figure 5). (The leaf label sets of T and S are X and Y, respectively.)

Lemma 1 ([30]). Let T and S be two unrooted phylogenetic trees on the same leaf label set, then RF(T, S) = RF(T, S).
We now show how to compute the RF distance between T = (T, ϕ) on X and S = (S, φ) on Y, when X ⊆ Y , without explicitly building S |X . We need two concepts.
The vertex function f S assigns each u ∈ I(T) the value The LCA computation of T is linear-time in the size of T, and the LCA mapping from S to T can be done in O(n) time [37] in bottom-up manner. Further, from Lemmas 2 and 3 we can compute the RF distance between S and T in O(m) time as well.

Solving the SPR search problem
Let T = (T, ϕ) be a mul-tree (on X) in P and S = (S, φ) be the input species tree (on Y ). We now show how to compute the RF distance from T to each tree in the SPR S neighborhood in time that is linear in the size of the neighborhood. By Theorem 3, computing the RF distance between T and each S ∈ SPR S reduces to computing the RF distance between T and each S , where T and S are the mutually consistent full differentiations of T and the extension of S relative to T .
Suppose an SPR operation on S cuts the edge e = {x, y} ∈ S, and thatẋ,ẏ are the subtrees of S − e containing x, y, respectively. Suppose subtreeẏ is pruned and regrafted by the same cut edge to a new vertex obtained by subdividing an edge inẋ. The degree-two vertex x is suppressed and the new vertex is denoted by x. Observe that there are O(n) possible edges inẋ to regraftẏ. We perform regrafts in an order that leads to a constant time RF distance computation for each successive regraft.
We begin by regraftingẏ at an edge incident to a leaf iṅ x. Let S be the phylogenetic tree obtained from performing the prune-and-regraft. Let T (on X ) and S (on Y ) be the mutually consistent full differentiations of T and the extension of S. We compute the RF distance between T and S using the algorithm described in Section Robinson-Foulds distance and least common ancestors. This method works by computing the RF distance between the rooted phylogenetic trees T and S obtained by rooting T and S at any leaf label in X ∩ Y . (Note that if X ∩ φ(L(ẋ)) = ∅ or X ∩ φ(L(ẏ)) = ∅, then T 's distance from S is the same as its distance from S.) The algorithm also computes the LCAs for T and the LCA mapping from S to T.
We perform the remaining regrafts ofẏ on edges iṅ x by iterating through the vertices ofẋ, starting from a leaf and exploring as far as possible along each branch before backtracking. The k th regraft is performed on the edge between the k th and k + 1 st vertices in this iteration. Let us denote this ordering of edges by ℵ. See Figure 7. Observe that each two distinct consecutive edges in ℵ are adjacent. We will show that, after the initial RF distance computation for S, we can compute in constant time the RF distance for the result of regrafting on each successive (adjacent) edges in ℵ.
Beginning with S, each S ∈ SPR S helps in computing the RF distance of the next tree in the above regraft order. Assume that S ∈ SPR S results from regraftingẏ at edge {a, b} inẋ, such that x subdivides the edge {a, b} and neighbors to vertex y inẏ, as shown in Figure 7. Let the rooted phylogenetic tree obtained after extending and differentiating S be denoted by S . The LCA mapping and RF distance have been computed for S . Let S ∈ SPR S denote the tree obtained by regraftingẏ on edge {b, c} inẋ and the rooted counterpart of S is S .
Next, we find the vertices of S whose LCA mappings have changed as a result of the SPR operation. Let T, S and S be the underlying trees of T, S and S , respectively. Based on the topology of S , there are three cases: Thus, after the initial regraft ofẏ at a leaf inẋ, we can compute in constant time the RF-distance between T and the species tree that results from each subsequent regraft.
Next, we present the results on complexity of our algorithm. Recall that n is the size of the species tree and m is the sum of n and the size of an input gene tree, where all the input gene trees are considered to have approximately the same size.

Lemma 5.
Let {x, y} be an edge of S and letẋ andẏ be the subtrees of S containing x and y, respectively, that result from deleting {x, y}. The RF distance for the set of trees obtained by regraftingẋ (resp.ẏ) on each edge inẏ (resp.ẋ) can be computed in (m) time.
Proof. The RF distance computation for S, obtained by pruningẏ and regrafting at a leaf inẋ, can be done in (m) time. After S, the RF distance for each phylogenetic tree S obtained by regraftingẏ on each edge inẋ, can be computed in constant time by performing regrafts in the order of ℵ. There are (n) edges in ℵ, thus the RF computation for all the phylogenetic trees can be done in (m) time. The same argument applies for pruningẋ and regrafting on the edges inẏ.

Theorem 4. The SPR Search problem can be solved in
(nmk) time.
Proof. The underlying tree S of S has (n) internal edges. For each edge {x, y} in S, letẋ andẏ be the subtrees of S defined in the statement of Lemma 5. The RF distance for all the phylogenetic trees obtained by regraftingẋ (oṙ y) on each edge inẏ (orẋ) can be computed in (m) time from Lemma 5. Thus the RF distance for k input mul-trees can be checked in (mk) time. The total execution time for (n) internal edges must be (nmk).

Experimental evaluation
In order to evaluate the performance of the MulRF method, we implemented the heuristic algorithm of Section Solving the MulRF problem using C/C++. The MulRF software as well as simulated data sets (explained next) are freely available for download at http://genome. cs.iastate.edu/CBL/MulRF/.

Simulated data set
Methods. We used simulation experiments to evaluate the performance of MulRF and compare it to GTP methods. Since the MulRF method is designed for use with multi-copy gene trees, we focus on simulating genes that could have a history of duplication and/or lateral transfer. We first generated model species trees using the uniform speciation (Yule) module in the program Mesquite [38]. Two sets of model trees were generated: i) 50-taxon trees of height 220 thousand years (tyrs), ii) 100-taxon trees of height 440 tyrs. Note that the dates are relative; they do not have to represent thousands of years.
Next, we evolved 150 and 300 gene trees for each 50and 100-taxon model species tree, respectively. For each gene tree a single gene birth node is chosen from the species tree nodes. Among all the simulated gene trees for a species tree, four gene trees have the gene birth node that is same as the root of the model tree. This represents http://www.almob.org/content/8/1/28 the sampling that would result from an experiment looking at genes from a few distantly related species. The rest had a gene birth node, which is selected at random using the model species tree topology and branch lengths. Starting from the children of the root, a Poisson process is tested along the parent edge of each node. If the birth occurs, the corresponding node becomes the birth node for that gene tree. This represents the sampling that would result from a study of closely related species.
We simulated the evolution of the gene trees within the model species tree using our C++ implementation of the duplication-loss model of [39]. We applied LGT events on the evolved gene trees, using the standard subtree transfer model of LGT. One LGT event causes the subtree rooted at a vertex c to be pruned and regrafted at an edge (a, b), where a and b together are not in the path from the root (of the tree) to c. We used gene duplication and loss (D/L) rate of 0.002 events/gene per tyrs and LGT rate of 0 to 2 events per gene tree. Note that the gene tree simulations without LGT follow a molecular clock model (equal rates of molecular evolution along all branches of the gene tree), but the simulations with LGT violate the molecular clock.
We generated gene trees based on four evolutionary scenarios: i) no duplications, losses, or LGT (called none), ii) D/L rate 0.002 and no LGT (called dl), iii) no duplication or loss, and LGT rate 2 (called lgt), and iv) D/L rate 0.002 and LGT rate 2 (called both). The parameter values (evolutionary scenario and model tree size) for each simulation are called the model condition; 20 model species trees were generated for each model condition. We deleted 0 to 25% of leaves (selected at random) from each gene tree to represent missing data or unsampled, which is common in almost all phylogenomic studies. For each gene tree, we used Seq-Gen [40] to simulate a DNA sequence alignment of length 500 based on the GTR+Gamma+I model. The parameters of the model were chosen with equal probability from the parameter sets estimated in [41] on three biological data sets, following [42]. We estimated maximum likelihood trees from each simulated sequence alignment using RAxML [43], performing searches from 5 different starting trees and saving the best tree. Since the true root of a gene tree with possible duplication and loss often is unknown, we rooted each estimated gene tree at the midpoint of the longest leaf-to-leaf path using Retree [44] before the species tree construction.

Species tree estimation.
We estimated species trees with GTP minimizing only the number of duplications (Onlydup) [13], GTP minimizing duplications and losses (Duploss) [14], GTP minimizing LGT events (SPR supertree or SPRS for short) [16], and the MulRF heuristic. Both Only-dup and Dup-loss were executed with their default settings, including a fast leaf-adding heuristic for initial species tree construction. SPRS was run with 25 iterations of the global rearrangement search option. For 50-taxon data sets, it calculated the exact rSPR distance if it was 15 or less, and otherwise it estimated the rSPR distance using the 3-approximation. For the 100-taxon data sets, we used the 3-approximation of the rSPR distance. SPRS does not allow mul-trees as input. Therefore we only ran it on none and lgt data sets. Experiments were performed on the University of Florida High Performance Computing (HPC) cluster. We performed the experiments on the HPC cluster in order to simultaneously run the many simulations and phylogenetic analyses. However, all of the analyses (including SPRS, GTP, and MulRF) are sequential and easily run on a desktop machine. The running times are given in Table 1. The HPC cluster has cores of 2.3, 2.6, 2.9, or 2.66GHz on Opteron or Intel processors with 2 to 4GB RAM.

Results.
We report the average topological error (ATE) for each model condition. This is the average of the normalized RF distance (dividing the RF distance by number of internal edges in both trees) between each of the 20 model species trees and their estimated species trees. An ATE of 0 indicates that two trees are identical, and an ATE of 100 indicates that two trees share no common splits.
For each set of 50-and 100-taxon model trees, the MulRF species trees are more accurate (lower ATE rate)  (Figure 8).
In order to examine how Only-dup, Dup-loss, and SPRS methods perform when gene tree simulations only include events that these methods assume to be the source of discordance, we studied the performance of Only-dup and Dup-loss on evolutionary scenario dl and SPRS on lgt. In both evolutionary scenarios we found that the ATE rate of MulRF was lowest for both 50-and 100-taxon data sets ( Figure 8). Surprisingly for lgt, while the ATE rate of SPRS was lower than Only-dup and Dup-loss in 50-taxon, the ATE rate of the former was much higher than that of the latter two in 100-taxon data sets (Figure 8).
We also examined the accuracy of species tree estimates by Only-dup, Dup-loss, and SPRS when gene tree simulations include events that these methods do not assume to be the source of discordance (e.g., dl and both for SPRS, lgt and both for Only-dup and Duploss). While SPRS could not be tested on dl and both because they included mul-trees, Only-dup and Dup-loss had high ATE rate for lgt and both (Figure 8). In general, Only-dup's estimate had much higher ATE rate compared to Dup-loss in the presence of LGT events; the ATE rate of MulRF was lowest among all the methods ( Figure 8).

Biological data set
We also tested the performance of the MulRF method on a gene tree data set from 22 plant species. These species were chosen because they are phylogenetically diverse, and they all have fully sequenced and annotated genome sequences. This makes it possible to obtain a large number of gene trees with potentially no missing sequences. Furthermore, there is much support for the relationships among most of these species (e.g., [45]), and therefore, it provides an empirical system on which we can evaluate the performance of MulRF. We obtained nucleotide alignments from gene families that had been generated from genome sequences with OrthMCL [46] and aligned with MAFFT [47]. We selected the gene family alignments that contained sequences from at least 20 of the 22 species and had a maximum of 50 gene sequences. This was a total of 1910 gene alignments. We estimated maximum likelihood trees for all of the genes using GTRCAT model in RAxML [43]. The unrooted gene trees were used as input for the MulRF heuristic. The Only-dup and Dup-loss methods require rooted input trees. Thus, we rooted all of the gene trees on the longest branch using Newick utilities [48]. This is similar to mid-point rooting, and in our experience it often provides a good starting point for input gene trees in GTP analyses. The rooted gene trees were used as input for GTP analysis using Only-dup [13] and Duploss [14]. Only-dup and Dup-loss were executed with the default SPR search, including a fast leaf-adding heuristic for initial species tree construction, and searching for an optimal root by re-rooting the gene trees after each SPR search (e.g., [13]; [17]). We could not run SPRS on this data set because it contains mul-trees.
The MulRF heuristic completed in 4 minutes and 4 seconds on a Mac laptop with a 2.26 GHz Intel processor and 4GB RAM. The resulting species tree is largely consistent with the most recent phylogenetic analyses (Figure 9; e.g., [45]). Amborella is sister to the other angiosperms and monocots and eudicots form clades. Within the eudicots there is a core-eudicot clade, and within the core-eudicots the rosid clade is sister to the asterid clade. The malvids are sister to the fabids within the rosids. Interestingly, Populus groups with the malvids, consistent with recent analyses of nuclear and mitochondrial, but not chloroplast, data (e.g., [49]; [17]). There are two minor differences from the generally accepted reltionships: Phoenix should be sister to Musa + Poaceae rather than sister to Musa, and Aquilegia should be sister to the other eudicots rather than Nelumbo. The ATE for the MulRF tree is 0.11. Thus, it appears that MulRF can quickly estimate a nearly accurate species trees from large-scale plant genomic data sets. The Only-dup tree heuristic completed in 7 seconds, and if we unroot the result, it is identical to the MulRF tree. The Dup-loss tree, which completed in 7 seconds, had a less accepted topology, placing Amborella sister to the monocots instead of sister to other angiosperms and Vitis sister to the asterids rather than with the rosids. The ATE for the Dup-loss tree is 0.21.

Conclusion
We presented the new MulRF method for inferring species tree from incongruent gene trees that is based on a generalized form of the RF distance. Unlike most previous phylogenetic methods using gene trees, our approach is based on a generic tree distance measure that is not linked to any specific biological processes. As a result, it is intuitively appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplications and losses, and LGT, as well as phylogenetic error likely contribute to gene tree discord.
In simulation experiments, the MulRF method estimated species trees more accurately than several GTP methods, and it appears to be relatively robust to the effects of phylogenetic error, gene duplication and loss, and LGT. In addition, the MulRF method is fast, estimating 100-taxon species trees from hundreds of gene trees in under two minutes and a plant data set with 22 taxa and nearly 2000 gene trees in just over 4 minutes.
Our simulation experiments greatly simplify the true processes of genomic evolution. We focused only on processes that reflect the objectives of the GTP methods, and we emphasized on duplication and loss, because that especially relevant to the evolution of multi-copy gene trees. Still, even in these conditions in which we might expect GTP to perform well, we find that MulRF obtains more accurate results than GTP in most instances. This does not mean that MulRF will always outperform GTP, but we suggest that MulRF can quickly provide an interesting alternate perspective on species tree inference. More tests are needed to characterize the performance of MulRF methods under different evolutionary scenarios.
Another future direction will be to incorporate estimates of gene tree uncertainty into the supertree analysis by weighing the splits differently when computing the RF distance. Also, the effectiveness of the MulRF method in inferring species trees from multi-copy gene trees suggests that other tree distance measures can be used in the same context. A natural candidate for study is the quartet distance. Future work should also evaluate the suitability of different distance metrics in estimating species trees under different error models and evolutionary scenarios.