Basics
Supertree datasets
Because of the taxon sampling strategies used by biologists, source trees tend to focus either on intensively sampled smaller subgroups, such as big cats, or on larger, sparsely sampled groups, such as all vertebrates. We refer to the first type as a clade-based source tree and the second type as a scaffold. Supertree profiles include scaffolds to ensure sufficient overlap with the clade-based trees.
Matrix representation with parsimony
MRP encodes source trees as a matrix of partial binary characters: all entries in the matrix are 0, 1, or ?, with each column in the matrix defined by a single edge in a source tree. The matrix is then analyzed using a heuristic for the NP-hard maximum parsimony problem [22].
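The encoding step can be illustrated with a minimal sketch. It assumes that source trees are given as nested tuples of taxon names, that taxa on one side of an edge are coded 1 and the remaining taxa of that tree 0, and that taxa absent from the source tree are coded ?. The function names (leaf_set, tree_splits, mrp_matrix) are hypothetical and are not taken from any MRP software.

```python
def leaf_set(tree):
    """Return the set of leaf labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def tree_splits(tree):
    """Return one leaf set per internal edge of the (unrooted) tree.

    Each split is stored as the side that does not contain the
    alphabetically smallest taxon, so the two children of a binary root,
    which describe the same edge, collapse to a single entry.
    """
    present = leaf_set(tree)
    anchor = min(present)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= visit(child)
        # Keep only internal edges: both sides need at least two taxa.
        if 2 <= len(below) <= len(present) - 2:
            side = frozenset(below)
            splits.add(frozenset(present) - side if anchor in side else side)
        return below

    visit(tree)
    return sorted(splits, key=sorted)


def mrp_matrix(source_trees):
    """Encode source trees as rows of '0', '1' and '?' characters."""
    taxa = sorted(set().union(*(leaf_set(t) for t in source_trees)))
    rows = {taxon: "" for taxon in taxa}
    for tree in source_trees:
        present = leaf_set(tree)
        for side in tree_splits(tree):
            for taxon in taxa:
                if taxon not in present:
                    rows[taxon] += "?"  # taxon absent from this source tree
                elif taxon in side:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
    return rows
```

The resulting rows would then be passed to a maximum parsimony heuristic, which is not shown here.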
Quartets MaxCut (QMC)
QMC is a quartet amalgamation method, operating in polynomial time and providing no guarantees with respect to its optimization problem, MQC. The source trees are encoded by sets of quartet trees, and QMC is applied to the union of these sets.
Quartet encodings of source trees
The work presented here explored several techniques for representing source trees by sets of quartet trees. Two of these techniques use random sampling strategies [14], which are based upon computation of the topological distance between leaves in the source tree. The topological diameter of a quartet tree q with respect to a source tree t (denoted diam_{t}(q)) is the maximum of its leaf-to-leaf topological distances within t. The quartet encoding strategies used in [14] also included calculation of the Topologically-Short Quartet (TSQ) trees, defined as follows. For each edge in a source tree, pick the topologically nearest leaves in each of the subtrees around the edge. If two or more leaves within a subtree have the same topological distance to the edge, pick all such leaves. The set of quartet trees formed by picking one such leaf from each subtree forms the TSQs around that edge. The union of all these is the set of TSQ trees.
We tested three strategies for encoding a source tree t by a set of quartet trees:
All quartets: include all induced four-taxon trees.
Geo+TSQ: include a quartet q with probability d^{-3}, where d = diam_{t}(q), and add the TSQ trees (this method was studied in [14]).
Exp+TSQ: compute the topological distance between every pair of leaves, include a quartet q with probability 1.5^{-d}, where d = diam_{t}(q), and add the TSQ trees (this method was also studied in [14]).
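The two random sampling strategies can be sketched as follows. This is an illustration only, under several assumptions: the source tree is stored as an adjacency dictionary of an unrooted tree, the inclusion probabilities carry negative exponents (d^{-3} and 1.5^{-d}, so that they lie between 0 and 1), and the induced quartet topology is read off the topological distances with the four-point condition. All function names are hypothetical, and the TSQ trees that both strategies add are not computed here.

```python
from collections import deque
from itertools import combinations
import random


def distances_from(adj, start):
    """Breadth-first search: number of edges from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def quartet_diameter(adj, quartet):
    """diam_t(q): largest leaf-to-leaf topological distance among four leaves."""
    return max(distances_from(adj, a)[b] for a, b in combinations(quartet, 2))


def induced_quartet(adj, quartet):
    """Return the split ((a, b), (c, d)) that the tree induces on four leaves.

    Uses the four-point condition: the induced pairing has the smallest
    sum of within-pair distances. On ties (unresolved quartets) min()
    keeps the first pairing listed.
    """
    a, b, c, d = quartet
    dist = {x: distances_from(adj, x) for x in quartet}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(
        pairings,
        key=lambda p: dist[p[0][0]][p[0][1]] + dist[p[1][0]][p[1][1]],
    )


def inclusion_probability(diameter, scheme):
    """Assumed sampling probabilities: d^-3 for 'geo', 1.5^-d for 'exp'."""
    if scheme == "geo":
        return diameter ** -3
    if scheme == "exp":
        return 1.5 ** -diameter
    raise ValueError(f"unknown scheme: {scheme}")


def sample_quartets(adj, leaves, scheme, rng):
    """Keep each induced quartet independently with its inclusion probability."""
    sampled = []
    for quartet in combinations(sorted(leaves), 4):
        p = inclusion_probability(quartet_diameter(adj, quartet), scheme)
        if rng.random() < p:
            sampled.append(induced_quartet(adj, quartet))
    return sampled
```

Passing a seeded generator such as random.Random(0) as rng makes the sample reproducible. The sketch recomputes distances for every quartet, which is adequate for small trees only.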
Performance study design
Our simulation study used datasets that have properties typical of biological supertree datasets and that were used in a previous study [23] to compare supertree methods to combined analysis using maximum likelihood. These datasets had 100, 500, and 1000 taxa and came in two types: (1) mixed source trees, consisting of one scaffold dataset (produced by a random selection of taxa from the entire dataset) and many clade-based datasets (focused dense taxon sampling within a rooted subtree), and (2) all-scaffold source trees, in which all source tree datasets were obtained by sampling randomly within the full dataset. Here we describe the simulation methodology in brief; for details, see [23].
Step 1: Generate model trees
We generated trees with 100, 500, and 1000 leaves (taxa) under a pure birth process, and then deviated them from ultrametricity (that is, from the molecular clock hypothesis). We generated 30 datasets for each 100- and 500-taxon model condition, and 10 datasets for each 1000-taxon model condition.
Step 2: Evolve gene sequences down the model tree
We first determined the subtree within the model tree for which each gene would be present, using a gene "birth-death" process (gene gain and loss); this produced missing data patterns that reflect biological processes. Each gene was then evolved down its subtree under a General Time Reversible process with rates for sites drawn from a Gamma plus Invariable distribution (GTR+Gamma+I) [24], using a variety of GTR matrices estimated for different biological datasets (see Appendix [Additional file 1]).
Step 3: Dataset production
We selected (1) datasets of genes to estimate trees on specific clades (rooted subtrees) within the tree and (2) datasets of genes to form the scaffold tree. We selected three genes for each clade dataset, and four genes for each scaffold dataset. Each model condition is indicated by the number of taxa in the model tree and by the density of the scaffold dataset, which is the percentage of the entire taxon set in the scaffold dataset, with scaffold densities ranging from 20% to 100%. We generated two types of source tree dataset profiles: those containing only scaffolds, and those containing one scaffold and several clade-based datasets (as described earlier).
Step 4: Estimation of source trees
We used RAxML [25], one of the most accurate ML phylogeny estimation methods.
Step 5: Estimation of the supertrees
We used MRP with a very effective heuristic search technique called the Ratchet [26] (see Appendix [Additional file 1] for commands used). This returns a set of trees, each of which has the best score found; we then computed the greedy consensus (gMRP) tree for this set. The greedy consensus is a refinement of the majority consensus, and thus contains all the bipartitions present in more than half the input trees; it is a common consensus method, and in our experiments it produced the most accurate supertrees when applied to results produced by the Ratchet. We also computed supertrees based upon three ways of encoding the source trees as sets of quartet trees and then applying QMC, as described above. Finally, we computed supertrees using five other methods: QImp, RFS, MinFlip, SFIT, and PhySIC (see Appendix [Additional file 1] for details on software and commands used).
Because MinFlip, RFS, and PhySIC require that the source trees be rooted, we rooted each source tree at the midpoint of the longest leaf-to-leaf path (a standard method for rooting trees when there is no outgroup available) before passing the source trees to these three methods.
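The relationship between the majority and greedy consensus used in Step 5 can be sketched as follows. The sketch assumes that every input tree has already been reduced to a set of bipartitions on the same taxon set, each stored as the frozenset of taxa on the side that excludes one fixed reference taxon. The function names and the tie-breaking rule (alphabetical order among equally frequent bipartitions) are assumptions made for illustration, not the behavior of any particular consensus program.

```python
from collections import Counter


def compatible(a, b):
    """Two splits (sides excluding a shared reference taxon) can coexist in
    one tree if their sides are disjoint or one side contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def majority_consensus(split_sets):
    """Bipartitions present in more than half of the input trees."""
    counts = Counter(split for splits in split_sets for split in splits)
    return {s for s, c in counts.items() if c > len(split_sets) / 2}


def greedy_consensus(split_sets):
    """Add bipartitions in order of decreasing frequency, skipping any that
    conflicts with one already accepted."""
    counts = Counter(split for splits in split_sets for split in splits)
    ordered = sorted(counts, key=lambda s: (-counts[s], sorted(s)))
    accepted = []
    for split in ordered:
        if all(compatible(split, other) for other in accepted):
            accepted.append(split)
    return accepted
```

Because majority bipartitions are pairwise compatible and are visited first, the greedy result always contains the majority consensus and may add further, less frequent bipartitions.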
Step 6: Performance evaluation
Topological error for each estimated supertree was measured as follows. We represented each tree T on leaf set S by the set ∑(T) of bipartitions induced on the leaf set, one bipartition for each internal edge in the tree. If T is an estimated supertree and T_{0} is the true (model) tree, then the false positive rate is |∑(T) − ∑(T_{0})| / |∑(T)|, and the false negative rate is |∑(T_{0}) − ∑(T)| / |∑(T_{0})|.
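The two error rates can be computed directly from bipartition sets, as in the sketch below. It assumes that both trees are nested tuples on the same leaf set, that the false positive rate is the fraction of bipartitions of the estimated tree that are absent from the model tree, and that the false negative rate is the fraction of bipartitions of the model tree that are absent from the estimated tree (matching the normalization described in Step 6). The function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def bipartitions(tree):
    """Set of non-trivial bipartitions, one per internal edge.

    Each bipartition is stored as the side that excludes the smallest
    taxon, so both trees being compared use the same canonical form.
    """
    all_leaves = frozenset(leaf_set(tree))
    anchor = min(all_leaves)
    result = set()

    def visit(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= visit(child)
        if 2 <= len(below) <= len(all_leaves) - 2:
            side = frozenset(below)
            result.add(all_leaves - side if anchor in side else side)
        return below

    visit(tree)
    return result


def fp_rate(estimated, model):
    """Fraction of estimated-tree bipartitions missing from the model tree."""
    est, true = bipartitions(estimated), bipartitions(model)
    return len(est - true) / len(est) if est else 0.0


def fn_rate(estimated, model):
    """Fraction of model-tree bipartitions missing from the estimated tree."""
    est, true = bipartitions(estimated), bipartitions(model)
    return len(true - est) / len(true) if true else 0.0
```

An unresolved estimate illustrates why the two rates can diverge: a tree with few internal edges can have a false positive rate of 0 while still missing many model-tree bipartitions.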
We also computed the total topological distance of each supertree to its source trees. To do this, we restricted the supertree to the subset of taxa for each source tree, and then computed the topological distances between the two trees. We computed the following three distance measures for each supertree T to its source tree profile 𝒯.
SumFN, defined as follows: SumFN(T, 𝒯) = (∑_{t ∈ 𝒯} FN(T, t)) / M, where FN(T, t) is the number of edges in t that do not appear in T, and M = ∑_{t ∈ 𝒯} m_{t}, where m_{t} is the number of internal edges in t.
SumFP and SumRF, defined similarly, with FP(T, t) and RF(T, t) replacing FN(T, t), respectively. FP denotes the false positive distance and RF denotes the Robinson-Foulds ("bipartition") distance. The false positive distance between a supertree T and a source tree t in the profile is the number of edges in T that do not appear in t. The Robinson-Foulds error rate is the average of the FP and FN error rates.
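The supertree-to-profile distance can be sketched as follows for the false negative case. The sketch assumes nested-tuple trees, assumes that every source tree's taxa also occur in the supertree, and assumes that the total false negative count is normalized by the total number of internal edges across the source trees. That normalization is an assumption made for illustration, and all function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def restrict(tree, keep):
    """Prune the tree to the taxa in keep, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def bipartitions(tree):
    """Non-trivial bipartitions, stored as the side excluding the smallest taxon."""
    all_leaves = frozenset(leaf_set(tree))
    anchor = min(all_leaves)
    result = set()

    def visit(node):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= visit(child)
        if 2 <= len(below) <= len(all_leaves) - 2:
            side = frozenset(below)
            result.add(all_leaves - side if anchor in side else side)
        return below

    visit(tree)
    return result


def false_negatives(supertree, source_tree):
    """FN(T, t): edges of t missing from T restricted to the taxa of t."""
    restricted = restrict(supertree, leaf_set(source_tree))
    return len(bipartitions(source_tree) - bipartitions(restricted))


def sum_fn_rate(supertree, profile):
    """Total FN over the profile divided by the total internal edge count
    of the source trees (assumed normalization)."""
    total_fn = sum(false_negatives(supertree, t) for t in profile)
    total_edges = sum(len(bipartitions(t)) for t in profile)
    return total_fn / total_edges if total_edges else 0.0
```

The false positive and Robinson-Foulds variants would follow the same pattern, swapping the direction of the set difference or averaging the two rates.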
Each distance measure was normalized by the number of edges (bipartitions) in the relevant tree (the model tree for false negatives, and the estimated tree for false positives), to produce error rates between 0 and 1. Note that if the supertree and all source trees are binary, then RF(T, t) = 2FN(T, t) = 2FP(T, t), and after normalization all three distances are equal. When the estimated trees are not binary, the RF distance is biased in favor of unresolved trees [27]. Our source trees were generally fully binary or nearly so. With the exception of PhySIC, the supertree methods we studied produced fully resolved or almost fully resolved supertrees. PhySIC is highly conservative and therefore tended to produce highly unresolved trees; consequently, it tended to have very low false positive rates at the expense of very high false negative rates. We therefore report false negative error rates, since, except for PhySIC, the relative performance of the different supertree methods does not depend upon the error metric used. This allows us to provide a more nuanced evaluation than would be possible with RF.
We calculated average error rates and standard errors for each model condition. However, because QMC failed to return trees on some inputs, we restricted our results to datasets for which all the reported methods returned trees. This reduced the number of replicates for some model conditions. We also recorded the running time and space usage of each method on each dataset. Because the analyses were run under Condor (a distributed software environment [28]), running times are approximate (particularly for the larger datasets) and are larger than they would have been on a dedicated processor.