A sub-cubic time algorithm for computing the quartet distance between two general trees
© Nielsen et al; licensee BioMed Central Ltd. 2011
Received: 12 April 2011
Accepted: 3 June 2011
Published: 3 June 2011
When inferring phylogenetic trees, different algorithms may give different trees. To study such effects, a measure for the distance between two trees is useful. The quartet distance is one such measure: it is the number of quartet topologies that differ between two trees.
We have derived a new algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is sub-cubic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm so far for computing the quartet distance between general trees independent of the degree of the inner nodes.
We have implemented our algorithm and two of the best competitors. Our new algorithm is significantly faster than the competition and seems to run in close to quadratic time in practice.
The evolutionary relationship between a set of species is conveniently described as a tree, where the leaves represent the species and the inner nodes represent speciation events. Using different inference methods to infer such trees from biological data, or using different biological data from the same set of species, often yields slightly different trees. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods. One approach is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference, the nearest-neighbour interchange, the subtree transfer distance, the Robinson and Foulds distance, and the quartet distance. Each distance measure has different properties and reflects different aspects of the relationship between the trees.
Most previous work has focused on comparing binary trees and therefore avoided star quartets. Steel and Penny developed an algorithm for computing the quartet distance in time O(n^3). Bryant et al. improved this result with an algorithm that computes the quartet distance in time O(n^2). Brodal et al. presented the currently best known algorithm, which computes the quartet distance in time O(n log n).
Recently, we have developed algorithms for computing the quartet distance between two trees of arbitrary degrees, i.e. trees that can contain star quartets. In earlier work we developed two algorithms. The first runs in time O(n^3) and space O(n^2) and is thus independent of the degree of the inner nodes. The second runs in time O(n^2 d^2) and space O(n^2), where d is the maximal degree of inner nodes in the trees, and thus depends on the degree of the nodes. The O(n^2 d^2) bound was later improved to O(n^2 d), and by taking an approach similar to the O(n log n) algorithm of Brodal et al. we developed an algorithm that is sub-quadratic in n but at a significant cost in terms of d: O(d^9 n log n).
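To make the definition concrete, the following is a minimal brute-force sketch of the quartet distance for general trees, including star quartets. It is not any of the algorithms discussed above: it enumerates all quartets and is only meant for small examples. The tree representation (an adjacency dictionary built from an edge list, with leaves shared by name) and all function names are assumptions made for illustration. The topology of a quartet is read off from path lengths with the four-point condition: the pairing with the strictly smallest distance sum is the butterfly, and three equal sums indicate a star.

```python
from collections import deque
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency dictionary from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaf_distances(adj, leaves):
    """BFS from every leaf: dist[a][x] is the number of edges between a and x."""
    dist = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[src] = seen
    return dist


def quartet_topology(dist, a, b, c, d):
    """Return the butterfly pairing of {a, b, c, d}, or "star" if unresolved."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [dist[p[0]][p[1]] + dist[q[0]][q[1]] for p, q in pairings]
    smallest = min(sums)
    if sums.count(smallest) > 1:  # in a tree this means all three sums are equal
        return "star"
    return pairings[sums.index(smallest)]


def quartet_distance(adj1, adj2):
    """Count the quartets whose topology differs between the two trees."""
    leaves = sorted(v for v in adj1 if len(adj1[v]) == 1)
    if leaves != sorted(v for v in adj2 if len(adj2[v]) == 1):
        raise ValueError("the trees must have the same leaf set")
    d1 = leaf_distances(adj1, leaves)
    d2 = leaf_distances(adj2, leaves)
    return sum(
        quartet_topology(d1, *q) != quartet_topology(d2, *q)
        for q in combinations(leaves, 4)
    )
```

For example, a binary tree on five leaves differs from the star tree on the same leaves in all five quartets, since every quartet is a butterfly in the former and a star in the latter.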
In this paper we develop an O(n^{2+α}) algorithm, where α = (ω − 1)/2 and O(n^ω) is the time it takes to multiply two n × n matrices. Using the Coppersmith-Winograd algorithm, where ω = 2.376, this yields a running time of O(n^{2.688}). The running time is thus independent of the degrees of the inner nodes of the input trees, and this is the first sub-cubic time algorithm with this property. Furthermore, we have implemented the algorithm, along with two of the previous methods, and show experimentally that our new algorithm performs well in practice.
Methods: A sub-cubic time and space algorithm
Since B = shared_B(T, T) and B' = shared_B(T', T'), an algorithm for computing shared_B(T, T') and diff_B(T, T') gives an algorithm for computing the quartet distance between T and T'.
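The step from these counts to the distance can be sketched as follows. The derivation assumes that B and B' denote the number of butterfly quartets in T and T' (consistent with B = shared_B(T, T)), that shared_B(T, T') counts quartets that are butterflies with the same topology in both trees, and that diff_B(T, T') counts quartets that are butterflies in both trees but with different topologies. The symbol qdist is a name introduced here for the quartet distance.

```latex
\begin{align*}
\text{butterfly in } T,\ \text{star in } T' &: \quad B - \mathrm{shared}_B(T,T') - \mathrm{diff}_B(T,T') \\
\text{star in } T,\ \text{butterfly in } T' &: \quad B' - \mathrm{shared}_B(T,T') - \mathrm{diff}_B(T,T') \\
\text{butterfly in both, different topology} &: \quad \mathrm{diff}_B(T,T') \\[4pt]
\mathrm{qdist}(T,T') &= B + B' - 2\,\mathrm{shared}_B(T,T') - \mathrm{diff}_B(T,T')
\end{align*}
```

Quartets that are stars in both trees, or butterflies with the same topology in both trees, contribute nothing to the distance, so summing the three differing cases gives the last line.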
Before counting shared and different butterflies, we calculate a number of values in two preprocessing steps. First, we calculate a matrix that, for each pair of subtrees F ∈ T and G ∈ T', stores the number of leaves the two subtrees have in common, |F ∩ G|. This can be achieved in time and space O(n^2).
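As an illustration of what this matrix contains, the sketch below computes |F ∩ G| by memoized recursion over directed edges. It assumes that a subtree is identified by a directed edge (u, v) and consists of the leaves reachable from v without passing through u; the adjacency-dictionary representation and the function names are hypothetical. This simple recursion is not the O(n^2) method referred to above, and it can take more than quadratic time when inner nodes have high degree.

```python
from functools import lru_cache


def adjacency(edges):
    """Build an undirected adjacency dictionary from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def shared_leaf_counts(adj1, adj2):
    """Return count(u, v, u2, v2) = |F ∩ G| for F behind edge (u, v) in the
    first tree and G behind edge (u2, v2) in the second tree."""

    @lru_cache(maxsize=None)
    def count(u, v, u2, v2):
        below1 = [w for w in adj1[v] if w != u]
        below2 = [w for w in adj2[v2] if w != u2]
        if below1:  # v is an inner node: F is the union of the subtrees below v
            return sum(count(v, w, u2, v2) for w in below1)
        if below2:  # v is a leaf, v2 is an inner node: split G instead
            return sum(count(u, v, v2, w) for w in below2)
        return 1 if v == v2 else 0  # both subtrees are single leaves

    return count
```

Each pair of directed edges is evaluated once and then served from the cache, so the table has one entry per pair of subtrees.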
Inspired by the sums (S.3) - (S.6) in Additional file 1 we calculate a matrix I', vectors of its row and column sums, the total sum of its entries, and some further values
Calculating the values in Eq. (15) and (16) takes time if padding the matrices to become square and with ω = 2.376 if using the Coppersmith-Winograd algorithm  for matrix multiplication, or time if using naive matrix multiplication. Similarly, calculating the values in Eq. (17) and (18) takes time or . Computing either and , or and , thus takes time .
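The remark about padding can be illustrated with a small sketch: rectangular matrices are extended with zeros to a common square size, multiplied with a routine for square matrices, and the result is cropped. Here a naive triple-loop product stands in for a fast matrix multiplication algorithm, and the function names are assumptions made for this example. The zero entries contribute nothing to any inner product, so the cropped result equals the product of the original matrices.

```python
def matmul(a, b):
    """Naive matrix product of a (p x q) and b (q x r), given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def pad_square(m, size):
    """Extend m with zero columns and zero rows to a size x size matrix."""
    padded = [row + [0] * (size - len(row)) for row in m]
    padded += [[0] * size for _ in range(size - len(m))]
    return padded


def padded_product(a, b):
    """Multiply a and b via square padding, then crop to the true shape."""
    size = max(len(a), len(b), len(b[0]))
    square = matmul(pad_square(a, size), pad_square(b, size))
    return [row[: len(b[0])] for row in square[: len(a)]]
```

With a fast algorithm in place of `matmul`, the cost of the padded product is governed by the larger dimension, which is why the padded size appears in the running time.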
Counting shared butterfly topologies
which can be computed in time O(1), if the referenced matrices have been precomputed. Thus we can compute all shared directed butterflies in total time O(n^2). Dividing by two, we get the number of shared undirected butterflies.
Counting different butterfly topologies
depending on whether we have precomputed and , or and . We can thus compute Eq. (22) in time O(1) for each pair of inner edges e ∈ T and e' ∈ T', giving a total time of O(n^2) to compute different directed, and thus different undirected, butterfly topologies in the two trees.
To get the actual number of different butterflies we have to divide by four.
The running time of the algorithm is dominated by the time it takes to compute either and , or and , for each pair of nodes v ∈ T and v' ∈ T'. Let O(n^ω) be the time it takes to multiply two n × n matrices. In section 4 of Additional file 1 we show that the running time of our algorithm is O(n^{2+α}), where α = (ω − 1)/2. Using the Coppersmith-Winograd algorithm for matrix multiplication, where ω = 2.376, this yields a running time of O(n^{2.688}).
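The relation between the matrix multiplication exponent and the running time can be checked with a few lines of arithmetic. The sketch below assumes α = (ω − 1)/2, which is the relation consistent with the two exponents quoted in this article (2.688 for ω = 2.376 and 2.5 for ω = 2); the function name is hypothetical.

```python
def quartet_exponent(omega):
    """Exponent 2 + alpha of the running time, assuming alpha = (omega - 1) / 2."""
    return 2 + (omega - 1) / 2


for omega in (3.0, 2.376, 2.0):
    # omega = 3 corresponds to naive matrix multiplication.
    print(f"omega = {omega}: O(n^{quartet_exponent(omega):.3f})")
```

Under this assumption, naive matrix multiplication (ω = 3) gives a cubic bound, so the sub-cubic running time relies on a fast matrix multiplication algorithm.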
We have derived, implemented and tested a new algorithm for computing the quartet distance. In theory our algorithm has execution time O(n^{2+α}), where α = (ω − 1)/2. With current knowledge of matrix multiplication this is O(n^{2.688}). If an algorithm for matrix multiplication in time O(n^2) is found, our algorithm would run in time O(n^{2.5}). Experiments on our implementation show it to be fast in practice, and that it can have a running time significantly better than the theoretical upper bound, depending on the topology of the trees being compared.
The software is available from http://www.birc.au.dk/software/qdist. It has been tested on Ubuntu Linux and Mac OS X.
We are grateful to Chris Christiansen, Martin Randers and Martin S. Stissing for many fruitful discussions about the quartet distance.
- Robinson DF, Foulds LR: Comparison of weighted labelled trees. Combinatorial mathematics, VI (Proc. 6th Austral. Conf), Lecture Notes in Mathematics, Springer. 1979, 119-126.
- Waterman MS, Smith TF: On the similarity of dendrograms. Journal of Theoretical Biology. 1978, 73: 789-800. doi:10.1016/0022-5193(78)90137-6
- Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics. 2001, 5: 1-13. doi:10.1007/s00026-001-8006-8
- Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53: 131-147. doi:10.1016/0025-5564(81)90043-2
- Estabrook G, McMorris F, Meacham C: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool. 1985, 34: 193-200. doi:10.2307/2413326
- Steel M, Penny D: Distribution of tree comparison metrics-some new results. Syst Biol. 1993, 42 (2): 126-141.
- Bryant D, Tsang J, Kearney PE, Li M: Computing the quartet distance between evolutionary trees. Proceedings of the 11th Annual Symposium on Discrete Algorithms (SODA). 2000, 285-286.
- Brodal GS, Fagerberg R, Pedersen CNS: Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). Algorithmica. 2003, 38: 377-395.
- Christiansen C, Mailund T, Pedersen CNS, Randers M: Algorithms for Computing the Quartet Distance between Trees of Arbitrary Degree. Proc. of Workshop on Algorithms in Bioinformatics (WABI), Volume 3692 of Lecture Notes in Bioinformatics (LNBI), Springer-Verlag. 2005, 77-88.
- Christiansen C, Mailund T, Pedersen CNS, Randers M, Stissing MS: Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms for Molecular Biology. 2006, 1:
- Stissing M, Pedersen CNS, Mailund T, Brodal GS, Fagerberg R: Computing the quartet distance between evolutionary trees of bounded degree. Proceedings of the 5th Asia-Pacific Bioinformatics Conference 2007, Volume 5 of Series on Advances in Bioinformatics and Computational Biology. Edited by: Sankoff D, Wang L, Chin F. 2007, 101-110.
- Coppersmith D, Winograd S: Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation. 1990, 9: 251-281. doi:10.1016/S0747-7171(08)80013-2
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.