A sub-cubic time algorithm for computing the quartet distance between two general trees

Background When inferring phylogenetic trees, different algorithms may give different trees. To study such effects, a measure for the distance between two trees is useful. Quartet distance is one such measure: it is the number of quartet topologies that differ between two trees. Results We have derived a new algorithm for computing the quartet distance between a pair of general trees, i.e. trees where inner nodes can have any degree ≥ 3. The time and space complexity of our algorithm is sub-cubic in the number of leaves and does not depend on the degree of the inner nodes. This makes it the fastest algorithm so far for computing the quartet distance between general trees independent of the degree of the inner nodes. Conclusions We have implemented our algorithm and two of the best competitors. Our new algorithm is significantly faster than the competition and seems to run in close to quadratic time in practice.


Background
The evolutionary relationship between a set of species is conveniently described as a tree, where the leaves represent the species and the inner nodes represent speciation events. Using different inference methods to infer such trees from biological data, or using different biological data from the same set of species, often yields slightly different trees. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods. One approach is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference [1], the nearest-neighbour interchange [2], the subtree transfer distance [3], the Robinson and Foulds distance [4], and the quartet distance [5]. Each distance measure has different properties and reflects different properties of the tree relationship.
For an evolutionary tree, the quartet topology of four species is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Figure 1. Given two evolutionary trees on the same set of n species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
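The definition above can be illustrated with a brute-force sketch. This is not the algorithm developed in this paper; it only enumerates all sets of four leaves and compares their topologies directly. The nested-tuple tree encoding and the function names `clusters`, `quartet_topology` and `quartet_distance` are assumptions made for this illustration.

```python
from itertools import combinations

def clusters(tree):
    """Return the leaf set below every inner node of a nested-tuple tree."""
    found = []

    def walk(node):
        if not isinstance(node, tuple):          # a leaf label
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found

def quartet_topology(tree_clusters, quartet):
    """Return the butterfly split {{a, b}, {c, d}}, or None for a star quartet."""
    q = frozenset(quartet)
    for cluster in tree_clusters:
        inside = cluster & q
        if len(inside) == 2:                     # some edge separates two from two
            return frozenset([inside, q - inside])
    return None

def quartet_distance(t1, t2):
    """Count the four-leaf sets whose topologies differ (brute force)."""
    c1, c2 = clusters(t1), clusters(t2)
    leaves = sorted(max(c1, key=len))            # the root cluster holds all leaves
    return sum(quartet_topology(c1, q) != quartet_topology(c2, q)
               for q in combinations(leaves, 4))

t1 = (("a", "b"), "c", ("d", "e"))
t3 = (("a", "c"), "b", ("d", "e"))
star = ("a", "b", "c", "d", "e")
print(quartet_distance(t1, t3), quartet_distance(t1, star))
```

The enumeration visits all four-leaf subsets, so it is only practical for very small trees, but it is convenient as a reference when testing faster implementations.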
Most previous work has focused on comparing binary trees and therefore avoided star quartets. Steel and Penny [6] developed an algorithm for computing the quartet distance in time $O(n^3)$. Bryant et al. [7] improved this result with an algorithm that computes the quartet distance in time $O(n^2)$. Brodal et al. [8] presented the currently best known algorithm, which computes the quartet distance in time $O(n \log n)$.
Recently, we have developed algorithms for computing the quartet distance between two trees of arbitrary degrees, i.e. trees that can contain star quartets. In [9] we developed two algorithms. The first runs in time $O(n^3)$ and space $O(n^2)$ and is thus independent of the degree of the inner nodes. The second runs in time $O(n^2 d^2)$ and space $O(n^2)$, where $d$ is the maximal degree of inner nodes in the trees, and thus depends on the degree of the nodes. The $O(n^2 d^2)$ bound was later improved to $O(n^2 d)$ [10], and by taking an approach similar to the $O(n \log n)$ algorithm of Brodal et al. [8] we developed an algorithm that is sub-quadratic in $n$ but comes at a significant cost in terms of $d$: $O(d^9 n \log n)$ [11].

Methods: A sub-cubic time and space algorithm
The quartet distance between two trees is the number of quartets whose topology differs between the two trees, i.e. the number of quartets where one tree has the star topology and the other a butterfly topology, plus the number of quartets where the trees have different butterfly topologies. As observed in [9], the former (one tree has the star topology and the other a butterfly topology) can be expressed in terms of the total number of butterflies in the two trees, the number of shared butterflies and the number of different butterflies. For trees $T$ and $T'$, the number of different topologies due to one being a star and the other a butterfly, $\mathrm{diff}_S(T, T')$, is given by
$$\mathrm{diff}_S(T, T') = B + B' - 2\left(\mathrm{shared}_B(T, T') + \mathrm{diff}_B(T, T')\right) \tag{1}$$
where $B$ is the number of butterflies in $T$, $B'$ the number of butterflies in $T'$, $\mathrm{shared}_B(T, T')$ the number of quartets with the same butterfly topology in $T$ and $T'$, and $\mathrm{diff}_B(T, T')$ the number of quartets with different butterfly topologies in $T$ and $T'$. Every butterfly in one tree is either shared, different, or a star in the other tree, so removing the shared and different butterflies from both $B$ and $B'$ leaves exactly the star/butterfly mismatches. Thus the quartet distance between $T$ and $T'$ is given by the expression
$$\mathrm{diff}_B(T, T') + \mathrm{diff}_S(T, T') = B + B' - 2\,\mathrm{shared}_B(T, T') - \mathrm{diff}_B(T, T') \tag{2}$$
Since $B = \mathrm{shared}_B(T, T)$ and $B' = \mathrm{shared}_B(T', T')$, an algorithm for computing $\mathrm{shared}_B(T, T')$ and $\mathrm{diff}_B(T, T')$ gives an algorithm for computing the quartet distance between $T$ and $T'$.
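As a small arithmetic check of this decomposition, the sketch below assumes the counting identity that every butterfly of one tree is either shared, different, or a star in the other tree, so that the star/butterfly mismatches equal B + B' - 2(shared + diff). The function names and the example counts are hypothetical and chosen only for illustration.

```python
def star_butterfly_mismatches(b, b_prime, shared, diff):
    """Quartets that are a butterfly in one tree and a star in the other."""
    # b - shared - diff butterflies of T are stars in T', and symmetrically for T'.
    return b + b_prime - 2 * (shared + diff)

def quartet_distance_from_counts(b, b_prime, shared, diff):
    """Different butterflies plus star/butterfly mismatches."""
    return diff + star_butterfly_mismatches(b, b_prime, shared, diff)

# Two fully resolved five-leaf trees that share 3 butterflies and differ on 2.
print(quartet_distance_from_counts(b=5, b_prime=5, shared=3, diff=2))
# A fully resolved five-leaf tree against a star tree (no butterflies at all).
print(quartet_distance_from_counts(b=5, b_prime=0, shared=0, diff=0))
```

With these inputs the first call yields 2 and the second 5, which is the number of four-leaf subsets of five leaves, as expected when one tree is a star.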
Our approach to counting the shared and different quartets is based on directed quartets and claims [8,9]. An (undirected) butterfly quartet topology $ab|cd$ induces two directed quartet topologies, $ab \to cd$ and $ab \leftarrow cd$, by the orientation of the middle edge of the topology, as shown in Figure 2. There are twice as many directed butterflies as undirected. If $e = (s_e, t_e)$ is a directed edge from $s_e$ to $t_e$, we call $s_e$ the source of $e$ and $t_e$ the target. To each directed quartet $ab \to cd$ we can uniquely associate the directed edge $e$ such that $a$ and $b$ are leaves in the subtree rooted at $s_e$, and $c$ and $d$ are leaves in different subtrees rooted at $t_e$; see Figure 3. We call such a tree substructure, consisting of a directed edge $e$ with a subtree $A$ behind $e$ and two distinct subtrees $C$ and $D$ in front of $e$, a claim, written $A \xrightarrow{e} (C, D)$. We say that the edge $e$ claims the directed quartet $ab \to cd$, and we also say that an edge $e$ claims an undirected quartet $ab|cd$ if it claims one of its directed quartets. Each (undirected) butterfly quartet defines exactly two directed butterfly quartets, and each directed quartet is claimed by exactly one directed edge. By considering each claim, and implicitly each directed butterfly claimed by it, we can examine each directed butterfly in a tree once, or each undirected butterfly twice.
The crux of the algorithm is to consider each pair of claims, one from each tree, and for each such pair count the number of shared and different directed butterflies claimed in the two trees. This way each shared butterfly is counted twice, and each different butterfly is counted four times, as shown in Figure 4. Dividing the counts by two and four, respectively, gives us $\mathrm{shared}_B(T, T')$ and $\mathrm{diff}_B(T, T')$.
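The claim structure also gives a direct way to count the butterflies of a single tree, sketched below. The sketch assumes that a claim with a subtree A behind the edge and subtrees C and D in front of it claims C(|A|, 2) · |C| · |D| directed butterflies, which follows from choosing two leaves in A and one leaf in each of C and D. The input encoding (one list of subtree sizes per inner node) and the function names are assumptions for illustration.

```python
from itertools import combinations
from math import comb

def directed_butterflies_at_node(subtree_sizes):
    """Directed butterflies claimed by the edges pointing into one inner node."""
    total = 0
    for i, behind in enumerate(subtree_sizes):
        # The subtree at index i lies behind the edge; all others are in front.
        front = [size for k, size in enumerate(subtree_sizes) if k != i]
        front_pairs = sum(c * d for c, d in combinations(front, 2))
        total += comb(behind, 2) * front_pairs
    return total

def undirected_butterflies(nodes):
    """Halve the directed count: each butterfly has two directed versions."""
    return sum(directed_butterflies_at_node(sizes) for sizes in nodes) // 2

# Tree ((a,b),c,(d,e)): centre node, parent of {a,b}, parent of {d,e}.
resolved = [[2, 1, 2], [1, 1, 3], [3, 1, 1]]
print(undirected_butterflies(resolved))
```

For the five-leaf example every four-leaf subset is a butterfly, so the directed count is 10 and the undirected count is 5; a star tree, encoded as the single node `[1, 1, 1, 1, 1]`, yields 0.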

Preprocessing
Before counting shared and different butterflies, we calculate a number of values in two preprocessing steps. First, we calculate a matrix that, for each pair of subtrees $F \in T$ and $G \in T'$, stores the number of leaves the two subtrees have in common, $|F \cap G|$. This can be achieved in time and space $O(n^2)$ [7].
Next, for each pair of inner nodes $v \in T$ and $v' \in T'$, with subtrees $F_i$, $i = 1, \ldots, d_v$, and $G_j$, $j = 1, \ldots, d_{v'}$, respectively, we calculate a matrix $I$ such that $I[i, j] = |F_i \cap G_j|$, and we calculate vectors of its row and column sums and the total sum of its entries:
$$R[i] = \sum_{j=1}^{d_{v'}} I[i, j], \qquad C[j] = \sum_{i=1}^{d_v} I[i, j], \qquad M = \sum_{i=1}^{d_v} \sum_{j=1}^{d_{v'}} I[i, j] \tag{3}$$
Inspired by the sums (S.3)-(S.6) in Additional file 1, we calculate a matrix $I'$, vectors of its row and column sums, the total sum of its entries, and some further values.
Figure 4 Counting directed claims. A shared butterfly induces two butterflies in each tree, which gives four pairs of claims; however, the butterflies are only identical in two of these pairs, so a shared butterfly is counted twice. A different butterfly also induces four pairs of claims, but since we are counting different butterflies all four are counted. The way we count shared butterflies prevents the two different butterflies induced by the shared (undirected) butterfly from being counted.
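The intersection matrix and its row sums, column sums and total for one pair of inner nodes can be sketched as follows. This is a naive illustration that intersects explicit leaf sets rather than the $O(n^2)$ preprocessing of [7]; the function names and the example subtrees are assumptions.

```python
def intersection_matrix(subtrees_v, subtrees_v_prime):
    """I[i][j] = number of leaves shared by subtree F_i of v and G_j of v'."""
    return [[len(f & g) for g in subtrees_v_prime] for f in subtrees_v]

def row_col_total(matrix):
    """Row sums R, column sums C and the total sum M of a matrix."""
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    return rows, cols, sum(rows)

# Subtrees around the centre nodes of ((a,b),c,(d,e)) and ((a,c),b,(d,e)).
F = [{"a", "b"}, {"c"}, {"d", "e"}]
G = [{"a", "c"}, {"b"}, {"d", "e"}]
I = intersection_matrix(F, G)
R, C, M = row_col_total(I)
print(I, R, C, M)
```

Because the subtrees around a node partition the leaf set, the total M equals the number of leaves whenever both nodes see all leaves, as in this example.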
Calculating the values in Eq.
Finally, we need to calculate the following values:

Counting shared butterfly topologies
or the sum of $\tfrac{1}{2}$ for all distinct entries in $I$ but fixed $(i, j)$, see Figure 6(a). We divide by two since we count each quartet twice, due to the symmetry between the $(k, l)$ and $(m, n)$ pairs. Notice, however, that the inner sum is simply the total sum of entries in $I$, $M$, except for the rows $i$ and $k$ and columns $j$ and $l$, see Figure 6, which can be computed in time $O(1)$ if the referenced matrices have been precomputed. Thus we can compute all shared directed butterflies in total time $O(n^2)$. Dividing by two, we get the number of shared undirected butterflies.
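The constant-time evaluation of the inner sum can be sketched with inclusion-exclusion: subtract the two excluded rows and the two excluded columns from the total, then add back the four entries where they cross, since each of those was subtracted twice. The formula below is a reconstruction from that description rather than a quotation of the paper's equation, it assumes i ≠ k and j ≠ l, and the function names are hypothetical.

```python
def inner_sum_fast(I, R, C, M, i, j, k, l):
    """Sum of I outside rows i, k and columns j, l, in constant time."""
    return (M - R[i] - R[k] - C[j] - C[l]
            + I[i][j] + I[i][l] + I[k][j] + I[k][l])

def inner_sum_naive(I, i, j, k, l):
    """The same sum computed by visiting every remaining entry."""
    return sum(I[m][n]
               for m in range(len(I)) if m not in (i, k)
               for n in range(len(I[0])) if n not in (j, l))

I = [[1, 2, 0, 3],
     [0, 1, 4, 1],
     [2, 0, 1, 5]]
R = [sum(row) for row in I]
C = [sum(col) for col in zip(*I)]
M = sum(R)
print(inner_sum_fast(I, R, C, M, 0, 0, 1, 1), inner_sum_naive(I, 0, 0, 1, 1))
```

Both calls return the same value for every admissible choice of indices; the fast version touches only nine precomputed numbers regardless of the size of the matrix.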

Counting different butterfly topologies
Counting the number of different butterflies in the two trees is done similarly to counting the number of shared butterflies. As before, we consider a pair of inner edges, $e \in T$ and $e' \in T'$. The quartets claimed by both $e$ and $e'$, but with different butterfly topologies, are of the form $a \in F_i \cap G_j$, $b \in F_i \cap G_l$, $c \in F_k \cap G_j$ and $d \in F_m \cap G_n$ for some claims $F_i \xrightarrow{e} (F_k, F_m)$ and $G_j \xrightarrow{e'} (G_l, G_n)$. The number of butterflies claimed by both $e$ and $e'$ but with different topology is therefore given by
$$\sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i, k} \sum_{n \neq j, l} I[i, j] \cdot I[i, l] \cdot I[k, j] \cdot I[m, n]$$
or the sum of $I[i, j] \cdot I[i, l] \cdot I[k, j] \cdot I[m, n]$ for all distinct entries in $I$ but fixed $(i, j)$, see Figure 7. In this case there is no need to divide by any normalizing constant, since there are no symmetries between $k$ and $m$ or between $l$ and $n$.
As before, the inner sum can be expressed as in Eq. (20), and using the precomputed values we can, as shown in Section 3 of Additional file 1, rewrite the expression in Eq. (22) as

I[i, j] (M − R[i] − C[j] + I[i, j])(R[i] − I[i, j])(C[j] − I[i, j])+ (R[i] − I[i, j])(I[i, j](R[i] − I[i, j]) − C [j])+ (C[j] − I[i, j])(I[i, j](C[j] − I[i, j]) − R [i])+
depending on whether we have precomputed I 1 and I 1 , or I 2 and I 2 . We can thus compute Eq. (22) in time $O(1)$ for each pair of inner edges $e \in T$ and $e' \in T'$, giving a total time of $O(n^2)$ to compute the different directed, and thus different undirected, butterfly topologies in the two trees.
To get the actual number of different butterflies we have to divide by four.
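The reduction of the four-fold sum to a constant-time lookup can be sanity-checked numerically. The sketch below is one possible way to carry it out and is not taken from the paper: it expands the inner sum by inclusion-exclusion and handles the remaining cross term with the triple product I · Iᵀ · I, which is where matrix multiplication enters. The helper tables `U`, `V`, `QR`, `QC` and `P` are hypothetical names, and the pure-Python matrix product is cubic and would be replaced by a fast routine in practice.

```python
def diff_count_naive(I, i, j):
    """Four-fold sum of I[i][j]*I[i][l]*I[k][j]*I[m][n] over distinct rows and columns."""
    rows, cols = range(len(I)), range(len(I[0]))
    return sum(I[i][j] * I[i][l] * I[k][j] * I[m][n]
               for k in rows if k != i
               for l in cols if l != j
               for m in rows if m not in (i, k)
               for n in cols if n not in (j, l))

def precompute(I):
    """Tables that make each (i, j) query constant time (hypothetical names)."""
    nr, nc = len(I), len(I[0])
    R = [sum(row) for row in I]                    # row sums
    C = [sum(col) for col in zip(*I)]              # column sums
    M = sum(R)                                     # total sum
    U = [sum(I[k][j] * (R[k] - I[k][j]) for k in range(nr)) for j in range(nc)]
    V = [sum(I[i][l] * (C[l] - I[i][l]) for l in range(nc)) for i in range(nr)]
    QR = [sum(x * x for x in row) for row in I]    # row sums of squares
    QC = [sum(x * x for x in col) for col in zip(*I)]
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in I] for r1 in I]  # I * I^T
    P = [[sum(gram[i][k] * I[k][j] for k in range(nr)) for j in range(nc)]
         for i in range(nr)]                       # I * I^T * I
    return R, C, M, U, V, QR, QC, P

def diff_count_fast(I, tables, i, j):
    """Constant-time evaluation of the same four-fold sum."""
    R, C, M, U, V, QR, QC, P = tables
    a = I[i][j]
    r, c = R[i] - a, C[j] - a
    return a * ((M - R[i] - C[j] + a) * r * c
                + r * (a * r - U[j])
                + c * (a * c - V[i])
                + P[i][j] - a * QR[i] - a * QC[j] + a ** 3)

I = [[1, 2, 0, 3], [0, 1, 4, 1], [2, 0, 1, 5]]
tables = precompute(I)
print(diff_count_naive(I, 0, 0), diff_count_fast(I, tables, 0, 0))
```

The first three terms inside the parentheses have the same shape as the terms shown in the expression above, with `U` and `V` standing in for precomputed vectors; the last term is the part that needs a matrix product.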

Time analysis
The running time of the algorithm is dominated by the

Results
We have implemented our new algorithm and, for comparison, the $O(n^3)$ and $O(n^4)$ algorithms [9] for general trees. We chose these algorithms instead of those from [10,11] because the running time of the latter depends on the degree of the nodes, while a major feature of our new algorithm is that it has a good asymptotic running time independent of the degree of the nodes. For matrix multiplication we link to a BLAS library and expect it to choose the most efficient algorithm for matrix multiplication. In our experiments the vecLib library from Mac OS X is used.
We have run benchmarks on trees ranging from ten leaves up to almost 15,000 leaves. For each size, trees were generated in four different ways: general trees, binary trees, star trees and trees with one node of degree $n/2$ surrounded by degree 3 nodes. The code that generated the trees is available in Additional file 2. For each of the ten possible combinations of topologies, one pair of trees was randomly generated, and the time used for the computation of the quartet distance was measured and plotted. Our experiments were run on a Mac Pro with two Intel quad-core Xeon processors running at 2.26 GHz and with 8 GB RAM.
As seen in Figure 8, the implementation of our new algorithm is significantly faster than the implementations of the competing algorithms on trees with many leaves. In the worst cases our algorithm approaches $O(n^3)$, which is expected if the BLAS implementation uses the $O(n^3)$ matrix multiplication algorithm. Indeed, Figure 9 shows that the slowest of our runs are on two star-shaped trees, where we need to multiply two $n \times n$ matrices and where the time complexity of the matrix multiplication algorithm is most important. However, in most cases our algorithm seems to be close to quadratic execution time, even though it apparently uses an asymptotically slow matrix multiplication algorithm.
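One way to judge an empirical exponent from such timing data is to fit a straight line in log-log space, since a running time proportional to n^b has slope b there. The sketch below is an assumption about how this could be done and is not the procedure used for the figures; the function name `loglog_slope` is hypothetical and the timings are synthetic values generated from an exact quadratic, not measurements.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

sizes = [100, 200, 400, 800, 1600]
synthetic_times = [3e-6 * n ** 2 for n in sizes]   # synthetic, exactly quadratic
print(round(loglog_slope(sizes, synthetic_times), 6))
```

On real measurements the fitted slope would also absorb lower-order terms and noise, so it is best read as a rough indicator over the largest input sizes.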

Conclusion
We
Figure 9 Comparison of tree topologies. This plot shows the two best and the two worst pairs of tree topologies, for our new algorithm only. In a log-log plot, $x^b$ becomes a straight line with the slope determined by $b$. The lines in the plot are not regression lines, but are inserted to help the reader judge the time complexity of our implementation.