Accuracy of phylogeny reconstruction methods combining overlapping gene data sets
© Kupczok et al; licensee BioMed Central Ltd. 2010
Received: 9 March 2010
Accepted: 6 December 2010
Published: 6 December 2010
Skip to main content
© Kupczok et al; licensee BioMed Central Ltd. 2010
Received: 9 March 2010
Accepted: 6 December 2010
Published: 6 December 2010
The availability of many gene alignments with overlapping taxon sets raises the question of which strategy is the best to infer species phylogenies from multiple gene information. Methods and programs abound that use the gene alignment in different ways to reconstruct the species tree. In particular, different methods combine the original data at different points along the way from the underlying sequences to the final tree. Accordingly, they are classified into superalignment, supertree and medium-level approaches. Here, we present a simulation study to compare different methods from each of these three approaches.
We observe that superalignment methods usually outperform the other approaches over a wide range of parameters including sparse data and gene-specific evolutionary parameters. In the presence of high incongruency among gene trees, however, other combination methods show better performance than the superalignment approach. Surprisingly, some supertree and medium-level methods exhibit, on average, worse results than a single gene phylogeny with complete taxon information.
For some methods, using the reconstructed gene tree as an estimation of the species tree is superior to the combination of incomplete information. Superalignment usually performs best since it is less susceptible to stochastic error. Supertree methods can outperform superalignment in the presence of gene-tree conflict.
The phylogenetic information inherent in sequence data from different genes can be combined to yield a species phylogeny rather than gene trees. The gene data for these phylogenies are mainly collected following two strategies: (a) using only genes that provide full information, i.e., cover all taxa of interest (e.g. ) or (b) using all available genes that are present in some taxa and fulfill special overlap conditions (e.g. [2–4]). The latter approach is able to use many more genes and taxa, since it allows for missing data. It can also be applied for phylogeny reconstruction from expressed sequence tags (ESTs, e.g. ). Before the gene alignments are obtained, two important steps can influence the phylogeny result: First, orthologs must be assigned correctly (see e.g. [6, 7] for method comparisons). Second, these orthologs need to be aligned with sufficient accuracy (see e.g.  for a review and  for an example of the impact of alignment accuracy on phylogeny reconstruction).
After reliable alignments are obtained, different methods are available to combine the original data at different points along the way from the underlying sequences to the final tree [4, 10]: First, superalignment methods combine the data at an early level by directly concatenating the gene alignments without any intermediate computations (early-level combination; also called "supermatrix", "concatenation" or "total evidence" [11, 12]). Superalignment methods have been used to infer phylogenies for eukaryotes , Metazoa and green plants , legumes  or species from all three domains of life .
Second, medium level combination methods first compute intermediate results from the gene alignments, e.g. pairwise distances [14, 15] or quartets , and subsequently reconstruct a phylogeny by combining this information.
Third, supertree methods combine the data at the late level of gene trees (late-level combination; e.g. ). Gordon  first suggested supertree methods to combine overlapping trees. The so-called source trees are first computed for each gene, or are obtained from the literature, and are subsequently combined into a supertree. The prevalent method for reconstructing supertrees is matrix representation with parsimony (MRP) [18, 19], especially when only published trees but not the original data are available or when data of different kind are combined. MRP has been applied to many different kinds of species data, for instance to Mammalia  or Bacteria .
Each of these approaches has general advantages and disadvantages. The superalignment method uses all character information but assumes the same underlying topology and often the same parameters for all genes. Supertree approaches account for differing topologies and parameters between genes. On the other hand, they are more susceptible to stochastic errors since estimating substitution parameters and a topology for each gene independently may lead to overfitting. Furthermore, they try to minimize the amount of missing data when constructing the gene trees. Medium-level approaches also allow for gene-specific parameters, but they use quartet likelihoods or distances, not gene trees, when building the final tree. In the consensus setting, i.e., where all data sets contain the same taxa, the differences between concatenated alignments and tree combination have been extensively discussed (e.g. [22–27]).
Practical investigations using real data sets or simulated data are of interest to compare different methods. Various authors used real data sets to compare superalignment and supertree approaches [7, 28–31]. Those real data sets have the advantage of a realistic setting, however, the true tree is usually unknown. Then it is only possible to use well-established clades for assessing the performance (e.g. ) or to compare methods to one another (e.g. ). In simulations, on the other hand, the results can be compared to a model tree. Then the performance of the methods can be measured at an absolute scale. Several studies investigating supertree methods using simulations were carried out [32–35]. They employed the following general scheme: (1) Generation of a model tree assuming a Yule process, (2) generation of alignments along that tree, (3) random deletion of a proportion of taxa, (4) reconstruction of gene trees by maximum parsimony, (5) construction of the supertree from the inferred gene trees, and (6) comparison of the supertree to the model tree. Bininda-Emonds and Sanderson  compared superalignment and MRP for different degrees of divergence and observed that, with increasing divergence, the distance of the MRP trees to the superalignment tree increased. Levasseur and Lapointe  compared average consensus, superalignment with distances and MRP for gene trees with complete taxon sets. They found average consensus to perform nearly as well as superalignment, whereas MRP was substantially worse since it ignores gene tree branch lengths.
Simulations can also be used to evaluate the impact of undesired properties for a particular supertree method. For instance, one of these properties is the emergence of "novel clades", i.e., clades contradicted by all gene trees. Bininda-Emonds  found such clades to be very rare. However, note that due to missing taxa and multifurcating trees, it is not straightforward to measure supporting and conflicting relationships between a supertree and the gene trees (an alternative definition is presented in ).
Each of the above simulation studies focused on a special subset of methods for supertree construction. A general performance assessment, however, has not yet been carried out, and the strengths and weaknesses of the different methods are unknown. Here, we present an extensive simulation study about combining gene alignments. Thus, we take the orthology relationships and the alignment as correctly given. We compare different data combination methods, including supertree, superalignment and medium-level methods, to assess their accuracy in biologically reasonable situations. This leads to suggestions of applicable methods in the case of overlapping data sets. Moreover, we discuss the issue of complete versus incomplete data.
Overview of reconstruction methods and corresponding abbreviations
Matrix representation with parsimony and Baum/Ragan coding
Matrix representation with parsimony and Purvis coding
Matrix representation with irreversible parsimony and Baum/Ragan coding
Matrix representation with flipping and Baum/Ragan coding
Matrix representation with flipping and Purvis coding
Matrix representation with compatibility and Baum/Ragan coding
Modified minimal cut
Quartet inference and local inconsistency
Super quartet puzzling
Super distance matrix
A superalignment is generated from single gene alignments by concatenating the different alignments and adding gaps where no sequence information is present for a specific taxon. The superalignment method (SA) refers to reconstructing the superalignment tree. Here, we use maximum likelihood (ML) or maximum parsimony (MP), depending on the size of the data set. ML phylogenies are computed with IQPNNI version 3.1 , assuming the substitution model HKY for DNA sequences  and JTT for protein sequences . In both cases, site heterogeneity is modeled with four -distributed rate categories. MP phylogenies are computed with PAUP* 4.0b10  and the following parameters: heuristic search with TBR branch swapping, random addition of sequences, and a maximum of 10,000 trees in memory.
For complete data, where each gene is present in each taxon, we also apply the majority-rule consensus as implementated in consense.
Three methods based on matrix representation (MR) coding schemes are available: MR with parsimony (MRP), MR with flipping (MRF), and MR with compatibility (MRC). All three aim to optimize an objective function. If more than one optimal tree is found, we take the strict consensus tree as the reconstructed tree.
Example of coding a gene tree as a matrix representation
MRP trees are reconstructed by searching the most parsimonious tree for the matrix representation [18, 19, 42]. We apply two kinds of parsimony: (1) reversible Fitch parsimony , which assumes the character changes to be undirected, and (2) irreversible Camin-Sokal parsimony, which only allows changes from 0 to 1 and thus uses the root information in the trees . The most parsimonious tree with the respective criterion is determined by PAUP* 4.0b10 (heuristic search with TBR branch swapping and random addition of sequences, and a maximum of 10,000 trees in memory). Overall, we consider three MRP variants: MRP_BR (reversible parsimony and BR coding), MRP_I (irreversible parsimony and BR coding) and MRP_PU (reversible parsimony and PU coding).
The objective function of MRF is to minimize the number of binary flips (changes from '0' to '1' and vice versa) necessary to convert the original MR into an MR compatible with a tree [47, 48]. Here, we apply MRF to both coding schemes, BR and PU. So far, MRF has only been applied to matrices with Baum/Ragan-coding. Since MRF, like MRP, is an NP-complete problem, we use the heuristic implemented in HeuristicMRF2 (http://genome.cs.iastate.edu/CBL/).
The objective of MRC is to maximize the number of columns in the MR congruent with a tree [50, 51]. We use Clann version 3.0.2 as a heuristic to find the MRC tree for a BR coded matrix representation (the sfit criterion with default parameters ).
The "Build" algorithm  is only able to construct a supertree for a set of compatible and rooted gene trees. In case of compatible gene trees, each gene tree is a subtree of the supertree. "Build" and its variants are graph-based rooted triplet methods, thus, rooted trees are required. To combine incompatible gene trees, different cut methods have been developed.
MinCut (minimal cut) is an extension of the "Build" algorithm . In case of a conflict, MinCut introduces an edge in the supertree that conflicts with the fewest possible number of triplets.
ModMinCut (modified MinCut) improves MinCut by not only considering the contradicting triplets for an edge but, additionally, by trying to keep subtrees that are uncontradicted by the gene trees . Both MinCut and ModMinCut are polynomial-time algorithms implemented in supertree by Rod Page. We use a precompiled version of this program taken from Rainbow 1.2 beta .
MaxCut considers two types of triplet topologies: bad ones which occur in a gene tree, and good ones for which another possible topology occurs in a gene tree. In case of a conflict, the ratio of these counts is maximized, which is an NP-hard problem. Snir and Rao  suggested a heuristic based on semidefinite programming. We compute the MaxCut tree from a set of triplets with a program provided by Sagi Snir. To apply it, we first extract triples from the gene trees using a program provided by Gregory Ewing.
Quartet-based methods QILI (Quartet Inference and Local Inconsistency)  is based on quartet topologies extracted from unrooted gene trees. First, a set of weighted quartets is computed, where the weights for each quartet are smaller if they occur in more trees. Missing quartets are inferred by a rectifying process using quintet information. From this collection of quartets, a tree is estimated by minimizing the weighted sum of the quartets represented in a tree using Willson's local inconsistency method . QILI is available in the QuartetSuite 1.0 package.
combines the sequence data based on the quartet likelihoods . For each gene, TREE-PUZZLE  computes all quartet tree likelihoods. These likelihoods are combined for every possible quartet topology across all genes containing the respective quartet. The likelihoods are used to combine the data into so-called superquartets, the building blocks for SuperQuartetPuzzling (SuperQP). SuperQP is related to the QP algorithm , but it takes also missing data into account, using an overlap-graph guided insertion scheme and a voting procedure that is aware of missing quartets. We compute the SuperQP tree with an upcoming version of the TREE-PUZZLE package.
The medium-level information for distance-based methods are pairwise distance matrices computed separately for each gene. Here, we estimate pairwise ML distances with IQPNNI. The distances are combined into one distance matrix for all taxa, which is subsequently fitted to a tree with the least-squares method of Fitch-Margoliash . We use the fitch implementation in the PHYLIP package with the Subreplicates option, thus allowing for missing data by considering only available entries. Two distance-based medium-level methods, differing only in the combination of the matrices, have been devised so far:
Super Distance Matrix (SDM)  inserts two types of parameters: (1) weighting factors for each distance matrix, which correspond to a branch lengths scaling for each gene tree, and (2) additive constants for each taxon in each matrix, which correspond to an elongation of terminal branches. Utilizing several contraints, the variance of the scaled and shifted gene distance matrices to the combined distance matrix is minimized. Both methods are impemented in the SDM program .
Parameters varied in the simulations
True gene trees
subtrees of species tree
R α :
rate of evolution assigned randomly from a Γ-distribution with parameter α (i.e., mean 1 and variance 1/α )
substitution parameters and branch lengths gene-specific
T θ :
trees random by coalescent process with parameter θ
equal to true gene trees
normal sequence length and ML estimation
For most simulations, the superalignment ML tree for the real data is taken to be the true species tree. Estimated nucleotide and amino acid frequencies as well as the parameter of the -distribution are used as parameters for Monte-Carlo simulations with seq-gen. Unless stated otherwise, protein data are generated with JTT and nucleotide data with an HKY model with the transition/transversion ratio taken from the original ML estimation. Sequences are simulated with the same lengths distribution as in the original data. If simulations were performed taking missing taxa into account, those taxa were deleted from the genes which were also absent in the original data.
There is also the possibility to use the gene trees from the original data as the true gene trees (true gene trees gene-specific, G in Table 3). In this case there is no true species tree known.
For each simulated data set, at most fifteen different methods are applied to reconstruct a tree (Table 1). Note that not all methods are applicable for all settings. Consensus is only applicable for complete data and the medium- and low-level methods are only applicable if sequence information is present.
If applicable, we measured the accuracy of the methods by the normalized Robinson-Foulds distance (RF) of the inferred species tree to the true species tree. The Robinson-Foulds distance  is the number of splits that are present in one tree but not in the other one, and vice versa. Since unrooted n-taxa trees have a maximum of n - 3 inner branches, the maximal Robinson-Foulds distance is 2(n - 3). In the following, RF denotes the normalized Robinson-Foulds distance, where the distances are divided by 2(n - 3). This yields a value between 0% and 100%, which can be interpreted as the percentage of false or missing splits in the inferred tree compared to the true tree.
Each simulation setting is abbreviated by four letters corresponding to values for each of the four categories of simulation parameters (Table 3).
Next, we use the same 500 simulated alignments as before, but delete those sequences from the simulated gene alignments which are not present in the original alignment (cf. Figure 2a). The resulting distributions of the RF distances (Figure 4b) show that all methods are strongly affected by missing data. With a mean RF distance of about 6.2%, SA is again the most accurate method. Among the remaining methods, MRP_BR (10.8%) and SuperQP (11%) show the smallest mean RF distances. The cut methods, QILI, and average consensus show mean RF distances larger than the baseline distance of 16.5%. Thus, these methods perform on average worse on incomplete data sets than the ML reconstruction using only one gene present in all taxa. These methods seem to be unable to efficiently utilize the additional information provided by extra, but incomplete, gene data.
In general, the results of the large data set are similar to those for the small data set: In both settings, the methods that improve the baseline distance are the same, superalignment outperforms the other methods, the MR methods are the best supertree methods, and SuperQP is the best medium-level method. Thus, we expect the results also to be similar when introducing deviating settings. In the following, we only present the results for the small data set.
We also test whether the methods are able to combine highly informative, but incomplete, data sets. Thus, we minimize the effect of erroneous gene tree reconstruction by generating gene sequences ten times longer than the original gene sequences while taxa occurrences are the same as in Figure 2a. The accuracy of inferred species trees and gene trees is substantially improved for all methods (data not shown). High mean RF distances for QILI (30.3%) and AvCon (8.1%), however, show that these methods fail to reconstruct reasonable trees from highly informative data sets with missing data. The mean RF distances for MinCut, SuperQP and SDM are between 1% and 2% and all remaining methods show an average RF distance of ≤1%.
We extended the simulation with missing data (S, m, E, n) by bootstrapping the superalignment and the gene trees. In this case, reconstructed gene trees were the majority-rule consensus of trees reconstructed from bootstrapped alignments. Since branches with low support are discarded from each gene tree, the accuracy of supertree methods is expected to improve. Note that this bootstrap procedure does not a ect the medium-level methods. Here, we measured the accuracy of reconstruction for 200 of the alignments that were the basis of the simulations summarized in Figure 4b (S, m, E, n). The bootstrapped gene trees lead to an improvement of the accuracy of all supertree methods when compared to the results without bootstrapping (data not shown). The mean RF distance is now 5.6% for superalignment, between 9 and 10.3% for all MR methods, and between 12 and 22% for the cut methods.
Now we introduce a more complicated setting where the evolutionary rates vary between genes. The true gene trees were generated from the species tree by stretching or shrinking all branch lengths with a Γ-distributed random factor drawn independently for each gene in each simulation. In two different settings, the shape parameter for the Γ-distribution was α = 3 and α = 1:67, respectively. As in the previous simulations, the substitution parameters for the sequence simulation were equal for each gene. The gene trees and the SA tree were also obtained by bootstrapping. For each choice of α, we computed 100 simulated alignments. For neither setting do the results differ substantially from the previous simulation with bootstrapping (data not shown).
Here, as in the previous setting, the true gene trees differ from the species tree by their branch lengths. However, this time the branch lengths were fitted from the original data to obtain the true gene trees. For each alignment, the species tree was pruned to the respective taxon set. Afterwards, GTR parameters and branch lengths were fitted to the pruned tree using the original alignment. If a branch length got down to 10-6, the lower bound in IQPNNI, the respective branch length was set to 1/l, where l is the length of the corresponding alignment. The trees constructed this way were used as the true gene trees for the simulations. The sequence simulations used the estimated GTR parameters for each gene.
The resulting distances clearly show that SA is the most consistent method, since it has the smallest average distance to the SA tree from the original data (7.8%). It is followed by MRP_BR with a mean RF distance of 13.2%.
First, we investigate the performance of the supertree methods in the presence of incongruent gene trees without any reconstruction error. In Figure 8a, we see that the matrix representation methods can estimate the species tree quite accurately in the presence of complete data; MRP_PU and MRF_PU give the best results with a mean reconstruction error of 4.6% and 4.7%, respectively. The matrix representation methods, headed by MRF_PU (12.5%), are also the best methods when data are missing (Figure 8b).
This might be due to the small number of genes (10) and the different sequence lengths (Figure 2a). More than 50% of all positions in the superalignment stem from only three genes. The corresponding three inferred gene tree topologies also show the smallest average RF-distances to the superalignment tree (numbers not shown). Thus these three genes mainly drive the superalignment reconstruction. If their gene trees are distant from the true species tree, the superalignment result will also deviate.
We also tested the methods on incongruent gene trees together with missing data. That is, the same alignments were used but the information was pruned according to Figure 2a. Several methods show a lower mean accuracy than the phylogeny of a full gene, namely MinCut, ModMinCut, QILI and AvCon (Figure 9b). MRP_BR (20.4%), MRP_PU (21.6%), MRP_I (21.1%), MRF_BR (21.7%) and MRF_PU (22.2%) still outperform superalignment (22.3%) on average, but the difference is marginal.
Paired Wilcoxon signed-rank test
<2.2 × 10 -16
<2.2 × 10 -16
2.2 × 10 -9
2.2 × 10 -16
1.8 × 10-5
[-0.5, 1:9 × 10-5]
2.2 × 10 -6
2.478 × 10-5
3.2 × 10-5
1.8 × 10 -3
[-1.5, 3:2 × 10-5]
-6 × 10-5
1:6 × 10 -14
1.1 × 10 -8
6:1 × 10 -16
7.4 × 10 -15
We presented a detailed simulation study to assess the accuracy of superalignment, supertree and medium-level methods for reconstructing phylogenetic trees from multiple data sets. Although supertrees are often used to combine data of different kinds, our simulations only refer to sequence-based approaches. Morphological characters are not included due to the lack of reasonable probabilistic models to simulate their evolution. This study is first in comparing a broad range of methods for combining incomplete data sets. Furthermore, the true gene trees were generated from the true species tree in different ways (see also Table 3): (a) all gene trees were identical to the species tree, (b) the branch lengths but not the topology were gene-specific, (c) the gene trees from the original data were used as true gene trees and (d) the gene trees showed different topologies modeling incomplete lineage sorting. All conclusions are based on the specific implementations used for these methods as described in the methods section.
Gene features like sequence lengths and taxon overlap influence the accuracy of the methods presented. Instead of covering many different parameter combinations, we used the parameters of two very different natural data sets for the simulations. They cover 10 genes of 25 taxa and and 254 genes of 69 taxa, respectively. Note that supertree methods can be applied to substantially larger data sets (e.g. ). We expect that the accuracy of the methods will be influenced by the amount and distribution of missing data as well as the taxon overlap between alignments. Additionally, the incongruency among the true gene trees and alignment lengths influence the relative performance of the methods. Adding more genes may increase the number of incongruent trees , while adding more taxa typically increases the amount of missing data. Thus, accuracy will generally decrease.
The first main result is that one of the matrix representation methods, which are the most abundant supertree reconstruction methods used in the literature, usually shows the second-best result after superalignment. Especially the MRP and MRF methods with Baum/Ragan-coding result in very accurate trees. Since these methods are based on splits, bootstrap-based weighting can be easily incorporated, which is expected to further increase the accuracy of the reconstructed trees [28, 32]. Among the medium-level methods, SuperQP yields better results than the distance-based approaches, especially when data are missing. The accuracy of SuperQP is often consecutive to or among the accuracies of the MR-based supertree methods.
Second, in the case of complete gene trees, the majority-rule consensus method is also applicable. In all simulation settings with complete gene trees, some supertree methods perform better on average than the consensus method. In these cases, supertree branches that are supported by less than half of the gene trees are correctly resolved, while remaining unresolved in the consensus tree. This shows that, although supertree methods have been criticized for not being majority-rule methods , the resolution of additional branches can be favorable. However, as for majority-rule consensus trees, it is desirable to also label the supertree branches with the support in the gene trees.
Third, we introduced the baseline distance as a measure to judge the benefit of the combination methods. The baseline distance for one setting is defined as the mean RF distance between the true species tree and the reconstructed gene trees using complete alignments. We observe that, for most of the simulation parameters studied here, QILI, average consensus, MinCut and modified MinCut show larger mean RF distances than the single gene trees. QILI has already been observed to be slightly worse than MRP . Average consensus is clearly outperformed by SDM when data are missing. We applied both methods as medium-level methods by taking pairwise distances directly from the alignment distances, not from the reconstructed gene trees. While average consensus was suggested as a late-level method , SDM has been explicitly designed as a medium level method . Thus, average consensus may not be able to resolve the conflicts in the non-tree-like distances. The behavior of MinCut can be partly explained by the fact that it resembles Adams consensus . This means that uncertain taxa will be placed at the root of subtrees, which can disturb quite a few splits, leading to high RF distances. The cut methods presented here implement a heuristic based on the rooted triplets in the gene trees. Recently, Lin et al.  suggested another approach which maximizes the common rooted triplets in the supertree and the gene trees. They show that their method outperforms modified MinCut and MaxCut on example data sets.
Finally, we observe that superalignment methods usually show the highest accuracy on average. This applies to incomplete data as well as gene-specific substitution parameters. Superalignment also results in the most consistent phylogenetic estimation when each method is not compared to a model tree but to the original result obtained with that very method (Figure 7). However, in the presence of high incongruency among true gene trees, that means, if reconstruction error is not the main cause that gene trees differ from the species tree, the implicit weighting by sequence length can have a negative effect on the performance of superalignment leading to outperformance by the supertree method MRP BR. This bias might be avoided by introducing a normalization, but then, the opposite and still unwanted bias could emerge. Furthermore, it has been discussed (e.g. ) that SA should be preferred over supertree methods since the former does not imply character weighting. Furthermore, Edwards argued recently  that in the presence of gene tree conflict caused by coalescence effects as many genes as possible should be used and they should be weighted equally. This is consistent with our observation, that supertree methods outperform superalignment in the presence of strong coalescence effects. There are also species tree reconstruction methods that use a coalescent model to account for the differences between true gene trees (e.g. ). Kubatko et al.  have shown that concatenation of gene alignments may be inappropriate when the gene tree histories differ considerably. The coalescent model can be applied for closely related species (e.g. grasshoppers ), but severe problems caused by incomplete lineage sorting seem not to play a role among taxa of deep phylogenetic trees (e.g. for Metazoa ). Since these methods typically require complete data, we did not include them in our comparison. We rather concentrated on methods which were explicitly designed for missing data and that resolve conflict of unknown source.
Our results are in general concordance with previously published comparisons. Dutilh et al.  used real data sets and also found superalignment to perform best. Eulenstein et al.  used simulated data and find MRP and MRF to perform similar and better than MinCut and ModMinCut. Swenson et al.  compare superalignment and weighted and unweighted MRP using biologically motivated simulations and also find the highest accuracy for superalignment. We apply, however, a broader range of methods than previous studies.
All conclusions presented here are based on the accuracy measured by the mean RF distance. This does not imply that the methods presented as better on average always show superior results and could, thus, be used as a gold standard. Rather, we highly recommend to use several of the superior methods (considering also various levels of data combination) and to compare their results. By comparing the accuracies of the reconstructed supertrees with the accuracies of the ML gene trees, we showed the baseline distance to be a reasonable criterion for excluding unsuitable methods. If the baseline distance cannot be improved by a data combination method, it is preferable to use only genes for ML reconstruction that are present in all taxa and to possibly sequence the missing genes in some taxa. For a real data supertree analysis, not the baseline distance but only the distances between the reconstructed gene trees are available to assess which method may be appropriate. Thus, the homogeneity of the gene trees can be an indicator whether variation is present in the gene trees. Assessing the homogeneity of overlapping gene trees is a complex task of itself [23, 77–79] and is not covered in our study.
The source of variation, i.e., why the reconstructed gene trees differ from the species tree, should also be taken into account, since it has an influence on the relative performance of the methods. If a tree-like evolutionary history is assumed and true gene tree incongruence is unlikely or rare, superalignment results in the most accurate trees. This also holds in the presence of gene-specific substitution parameters and branch lengths, as has been observed before . But if the difference of the true gene trees to the species trees is the main source of variation, supertree methods are favorable. Applying a superalignment method to data with different underlying topologies or highly varying parameters has also been shown to be problematic (e.g. [80, 81]).
In the case of known gene tree variation, methods that model the assumed causes can also be applied (e.g. [67, 82] for incomplete lineage sorting). When exploring gene tree effects, like horizontal gene transfer or incomplete lineage sorting, gene trees have to be reconstructed and compared to a species tree. If the intention of an analysis is species tree reconstruction, however, external information may be considered: External information, like the rates of horizontal gene transfer, gene duplication or incomplete lineage sorting helps to judge whether complex evolutionary models are necessary to reconstruct the species tree. If these complex scenarios are not assumed to play a major role, application of superalignment minimizes the stochastic error. On the other hand, if gene-tree conflict is present but the underlying biological model is unknown, supertree or medium-level methods may be more reasonable. They account for gene tree variation but make no assumptions on the underlying evolutionary model causing the variation.
Our study provides comparative data for methods combining the data at different levels. This broad collection of methods hence provides valuable help to choose a promising set of approaches to reconstruct species trees from sets of orthologous genes.
A Software to simulate data as in this study is available from our webpage http://www.cibiv.at/software/supi. The software published under GPL is written in Python and Java and is platform independent.
The authors like to thank Gregory Ewing for assistance in the simulation of incomplete lineage sorting and Michael Kopp and Sascha Strauss for helpful comments on the manuscript. Financial support from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF) is greatly appreciated. A.v.H. acknowledges support from the German Research Foundation (DFG, SPP-1174).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.