Fast calculation of the quartet distance between trees of arbitrary degrees

  • Chris Christiansen1,

    Affiliated with

    • Thomas Mailund2Email author,

      Affiliated with

      • Christian NS Pedersen1, 3Email author,

        Affiliated with

        • Martin Randers1 and

          Affiliated with

          • Martin Stig Stissing1

            Affiliated with

            Algorithms for Molecular Biology20061:16

            DOI: 10.1186/1748-7188-1-16

            Received: 18 May 2006

            Accepted: 25 September 2006

            Published: 25 September 2006

            Abstract

            Background

            A number of algorithms have been developed for calculating the quartet distance between two evolutionary trees on the same set of species. The quartet distance is the number of quartets – sub-trees induced by four leaves – that differs between the trees. Mostly, these algorithms are restricted to work on binary trees, but recently we have developed algorithms that work on trees of arbitrary degree.

            Results

            We present a fast algorithm for computing the quartet distance between trees of arbitrary degree. Given input trees T and T', the algorithm runs in time O(n + |V|·|V'| min{id, id'}) and space O(n + |V|·|V'|), where n is the number of leaves in the two trees, V and V are the non-leaf nodes in T and T', respectively, and id and id' are the maximal number of non-leaf nodes adjacent to a non-leaf node in T and T', respectively. The fastest algorithms previously published for arbitrary degree trees run in O(n3) (independent of the degree of the tree) and O(|V|·|V'id·id'), respectively. We experimentally compare the algorithm with existing algorithms for computing the quartet distance for general trees.

            Conclusion

            We present a new algorithm for computing the quartet distance between two trees of arbitrary degree. The new algorithm improves the asymptotic running time for computing the quartet distance, compared to previous methods, and experimental results indicate that the new method also performs significantly better in practice.

            Background

            The evolutionary relationship for a set of species is conveniently described by a tree in which the leaves correspond to the species, and the internal nodes correspond to speciation events. The true evolutionary tree for a set of species is rarely known, so inferring it from obtainable information is of great interest. Many different methods have been developed for this, see e.g. [1] for an overview.

            Different methods often yield different inferred trees for the same set of species, and even the same method can give rise to different evolutionary trees for the same set of species when applied to different information about the species. To study such differences in a systematic manner, one must be able to quantify differences between evolutionary trees using well-defined and efficient methods.

            One approach for comparing evolutionary trees is to define a distance measure between trees and compare two trees by computing this distance. Several distance measures have been proposed, e.g. the symmetric difference metric [2], the nearest-neighbour interchange metric [3], the subtree transfer distance [4], the Robinson and Foulds distance [5], and the quartet distance [6]. Each distance measure has different properties and reflects different aspects of biology.

            This paper is concerned with calculating the quartet distance. A quartet is a set of four species, the quartet topology induced by an evolutionary tree is determined by the minimal topological subtree containing the four species. The four possible quartet topologies of four species are shown in Fig. 1. Given two evolutionary trees on the same set of n species, the quartet distance between them is the number of sets of four species for which the quartet topologies differ in the two trees.
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig1_HTML.jpg
            Figure 1

            The four possible quartet topologies. The four possible quartet topologies of species a, b, c, d. Topologies (a): ab|cd, (b): ac|bd, and (c): ad|bc are butterfly quartets, while topology (d): http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq1_HTML.gif , is a star quartet.

            Steel and Penny [7] pointed at Doucettes unpublished work [8] which presented an algorithm for computing the quartet distance in time O(n3), where n is the number of species. Bryant et al. in [9] presented an improved algorithm which computes the quartet distance in time O(n2) for binary trees. Brodal et al. in [10] showed how to compute the quartet distance in time O(n log n) considering binary trees. For arbitrary degree trees, the quartet distance can be calculated in time O(n3) or O(n2d2), where d is the maximum degree of any node in any of the two trees, as shown by Christiansen et al. [11].

            Results and discussion

            In [11], we presented an algorithm for computing the quartet distance between trees of arbitrary degree. It runs in time O(n2d2) and space O(n2), where n is the number of leaves in each tree and d is the maximal degree found in either of the trees. In this paper, we present an improved algorithm running in time O(n + |V||V'| min{id, id'}) and space O(n + |V||V'|), where |V| and |V'| are the number of internal (non-leaf) nodes in the two input trees, and id and id' are the maximal degree of an internal node, when disregarding edges to leaves, in the two trees.

            Time analysis for different types of trees

            The terms |V|, id, |V'| and id' are all clearly O(n), but on the other hand neither |V| and id nor |V'| and id' are independent. Intuitively, if there are a lot of internal nodes in a tree, they will not have a very large internal degree. We address in this section, how this dependency will affect the running time for different types on trees.

            The worst theoretical running time of the algorithm for calculating the quartet distance presented above is O(n3). Consider a tree with an internal node of degree http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq2_HTML.gif , connected to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq2_HTML.gif internal nodes of degree three each connected to two leaves, see Fig. 2. Such a tree has n leaves, O(n) internal nodes and a maximal internal degree that is O(n). If the algorithm is run on two such trees, the running time will be O(n3). In d-ary trees (trees where all internal nodes have degree d) |V| = O ( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq3_HTML.gif ), the time complexity of calculating the quartet distance will be O ( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq4_HTML.gif ).
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig2_HTML.jpg
            Figure 2

            A worst case input tree for the algorithm. A tree with an internal node of degree http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq2_HTML.gif , connected to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq2_HTML.gif internal nodes of degree three each connected to two leaves. This tree has both a maximal degree of O(n) and at the same time O(n) inner nodes.

            The two cases above are somewhat extreme. The first case has a very large gap between the maximal and minimal degree of internal nodes, while the second has little or no gap. The theoretical performance of the algorithm on the two types of trees reflects this difference. Let d min = min{min v d v , min v' d v' }, be the minimal degree of any internal node in either tree, then each tree has O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq5_HTML.gif ) internal nodes and the time complexity is O ( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq6_HTML.gif min{id, id'}). If min{id, id'} is O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq7_HTML.gif ) the time usage of calculating the quartet distance will be O (n2). In the following section we will do practical verification of the theoretical results in this section.

            Experimental running times

            The graphs in Fig. 3 show the running time for comparing worst case trees (see Fig. 2), (d-ary trees and random trees. There are six types of (d-ary trees; binary, 6-ary, 15-ary, and 30-ary and two types of random trees; r8s-based (see [12]) and trees with random topologies. The trees generated by r8s are binary, but by contracting edges, we can get trees of arbitrary degree (contracting an edge e connecting nodes u and v means removing u and e and attaching the rest of u's edges to v). Each edge is contracted with a probability that is inversely proportional with its length, i.e. a short edge has a higher probability of being contracted than a long edge. The trees with random topology are generated by adding leaves one by one, starting with a tree of size 2. A leaf can be added by attaching it to a random inner node or by spliting a random edge with a new node, to which the leaf is attached.
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig3_HTML.jpg
            Figure 3

            Experimental running times. The running time of the algorithm for worst case trees, d-ary trees and random trees. The lines plots the polynomials c. n i , where c is a fitted constant and i ∈ [1, 4]. The two bottommost plots are in log-scale on both the x- and y-axis.

            The running time for worst case input trees (as described in the previous section) is O(n3), because such trees have O(n) internal nodes and min{id, id'} is O(n). This is supported by the first graph in Fig. 3, which shows that the plot of the polynomial n3 (representing the best sum-of-squares fit of the polynomial c·n3 to the data-points) is closest to the plot of the running times with regard to slope.

            The running time on the algorithm on d-ary trees is O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq4_HTML.gif ). The plots of the running times in the second graph are parallel, and one of them is plotted directly on top of a plot of the polynomial n2 (here c·n2 is fitted to the data-points for each d separately; the different colors match the d colors). This supports that they all have a running time of O(n2) for fixed d's. The graph also shows that higher degrees give lower running times, which is also expected. The reason why the algorithm is more than twice as fast on 6-ary trees than it is on binary trees, is that the number of internal nodes in 6-ary trees is less than in binary trees, and even though |V| is O(n) in both cases, that difference has an impact on the running time. The last graph shows the running time of the algorithm on trees created as either random trees (each topology is equally likely) or trees simulated using r8s (with edge contraction as described above). We have no theoretical running time for this data, but the graphs show that the running time is O(n2). Even though the plotted data is only a small random sample, this indicates that many pairs of trees actually have the property that min{id, id'} is O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq7_HTML.gif ). Therefore it is not unreasonable to expect that our algorithm runs in time O(n2) on trees used in practice. All experiments were performed on a standard PC (Pentium 4, 3 GHz, 1 Gb Ram) running Linux Fedora Core 3.

            Comparison with existing algorithms

            In Fig. 4 we compare the running time of the new algorithm with the O(n2d2) and O(n3) time algorithms from [11] on random and r8s simulated trees. In Fig. 5 we compare the running time of the new algorithm with the other two algorithms on Buneman and refined Buneman trees built for a range of Pfam [13] derived distance matrices using the tool in [14]. Buneman and refined Buneman trees are not binary unless this is well supported by the input distance matrix, and thus represent the kind of trees which can only be compared by methods which allow for trees of arbitrary degrees. In both experiments, the O(n3) time algorithm is slowest by a large margin for all plotted sizes of n. The new algorithm is consistently faster than the O(n2d2) time algorithm for the r8s (with edge contraction) simulated trees and for the Buneman and refined Buneman trees. For random trees the previous O(n2d2) time algorithm is slightly faster in practice. This difference is most likely caused by the additional overhead of precomputing the sums used by the new O(n2d) time algorithm compared to the previous O(n2d2) time algorithm in order to improve the asymptotic worst case running time (see method section). For trees of low degree, the overhead might dominate the factor d by which the worst case running time of the new algorithm is improved. The observed running times on random trees thus indicate that over selection of random trees consists of trees of low degree, whereas the r8s simulated, Buneman, and refined Buneman trees are trees with a few nodes of high degree which more than compensate for the additional overhead of dealing with nodes of low degree. In conclusion, we find that the experimental comparison of the new algorithm with the previously developed algorithms indicate that the new algorithm not only improves on the theoretical asymptotic running time, but also improves the running time in practice if the input trees contain a few nodes of high degree.
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig4_HTML.jpg
            Figure 4

            Comparisons with earlier algorithms on random and r8s trees. The running time for the new algorithm compared to the existing O(n2d2) and O(n3) time algorithms for random and r8s trees. The lines are fitted polynomial c. n2, for the case of the new algorithm (denoted n2d in the legend) and the O(n2d2) algorithm, and the polynomial c. n3 for the O(n3) algorithm. The plot is in log-scale on both the x- and y-axis.

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig5_HTML.jpg
            Figure 5

            Comparisons with earlier algorithms on Buneman and refined Buneman trees. The running time for the new O(n2d) time algorithm compared to the existing O(n2d2) and O(n3) time algorithms on the Buneman and the refined Buneman trees for range of Pfam based distance matrices. The plot is in log-scale on both the x- and y-axis.

            Conclusion

            We have constructed an algorithm for finding the quartet distance between two trees of arbitrary degree. It runs in time O(n + |V||V'| min{id, id'}) and uses space O(n + |V||V'|), where n is the number of leaves in the trees, |V| and |V'| are the number of internal nodes in the trees and id and id' are the maximal internal degree of internal nodes in input tree T and T' respectively. Internal degree of an internal node is the number of internal nodes connected to it, so neighbouring leaves do not add to this value. The values |V|, |V'|, id and id' are not independent, therefore we have investigated how the structure of the trees affect the running time of the algorithm. We show that the time used to count the butterfly quartets – topologies where one pair of the four leaves is separated from the other pair by an edge – is reduced to O(n2) when min{id, id'} = O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq7_HTML.gif ), where d min is the minimal degree of all internal nodes in the trees. If the input trees are d-ary, that is all internal nodes have degree d, the running time is O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq4_HTML.gif ), excluding the time to find intersections. These theoretical running times have been validated by running a series of tests using a Java implementation of the algorithm, available at [15]. We also done a series of tests on random trees, trees generated by the program r8s, Buneman trees, and refined Buneman trees. Running the algorithm on these trees gives an impression on how it performs on trees used in practice. On both types of trees the running time appears to be O(n2). It is however still an open problem to develop an algorithm running in time O(n2)for all types of trees.

            Methods

            Consider two input trees, and assume that a quartet has butterfly topology in both trees, i.e. that one pair of the four leaves is separated from the other pair by an edge in the tree in both trees. We say that the butterfly quartet is shared, if it has the same butterfly topology in both trees. Otherwise, we say that the butterfly quartet is nonshared. We let shared(T, T') denote the number of butterflies shared between tree T and tree T', i.e. the number of quartets that are butterflies with the same topology in tree T and tree T', and let nonshared (T, T') denote the number of quartets that are butterflies in both T and T' but with different topology. By our definition of shared, the number of butterfly quartets in a single tree can be stated as the number of butterfly quartets shared between the tree and itself, i.e. shared(T, T) or shared(T', T') for the number of butterfly quartets in T and T' respectively. (This notation also emphasizes that computing the number of butterfly quartets in a single tree by our algorithm is performed as a comparison of the tree against itself.)

            In [11] we argue that the quartet distance between T and T', qdist(T, T'), can be found by focusing only on the computation of the number of shared and nonshared butterfly quartets between two trees, i.e. it is unnecessary to consider non-butterfly quartets explicitely. More specifically, we show that:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq8_HTML.gif

            The proof of this formula is a follows. Let Q denote the number of quartets which have butterfly topology in T and non-butterfly topology in T'. Symmetrically, let Q' denote the number of quartets which have butterfly topology in T' and non-butterfly topology in T. A butterfly quartet in T is either a butterfly quartet in T' or a non-butterfly quartet in T'. The number of butterfly quartets in T, shared(T, T), can thus be expressed as the sum shared(T, T') + nonshared(T, T') + Q. Similarly, the sum shared (T', T') = Q' + shared (T, T') + nonshared (T, T'). It is now straightforward to verify that the righthand side of (1) adds up Q + Q' + nonshared(T, T') which is the number of quartets where the quartet topologies differ in T and T', i.e. qdist(T,T').

            In the section below, we describe how to use (1) to compute the quartet distance in time O(n + |V||V'| min{id, id'}), more precisely O(n + |V||V'|) for a preprocessing step, after which we can use O(|V||V'|) for calculating shared(T, T'), O(|V||V'|{id, id'}) for calculating nonshared(T,T'), O(|V|) for calculating shared(T,T) and O(|V'|) for calculating shared(T',T').

            Terminology

            Let T and T' be two unrooted trees. In this paper we will explicitly refer to the leaves of a tree as leaves and the non-leaf nodes as internal node. We will assume that T and T' each has n labelled leaves numbered 1,..., n such that the leaf numbered x in T has the same label as the leaf numbered x in T'. The leaf sets are denoted L and L' for T and T' respectively, note that L = L'. We will use V and V' to denote the internal nodes in T and T' respectively. The degree of an internal node v is the number of subtrees connected to it, and is denoted d v . The internal degree of an internal node v, id v , is the number of non-leaf subtrees connected to it. We will assume that no internal node in T and T' has degree two, and we will denote the maximal internal degree of all internal nodes in T and T' by id and id' respectively. Let v be an internal node in T, and let F1, ..., http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq9_HTML.gif be the subtrees connected to it, as shown in Fig. 6 We call these the subtrees of v. We say that v claims all butterfly quartets ab|cd where a,bF i , cF k and dF m for ikm (see Fig. 7). With this definition, each butterfly quartet is claimed by exactly two internal nodes.
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig6_HTML.jpg
            Figure 6

            An internal node vT with subtrees F1,...,F d , here d v = 6.

            Adding the subscript yz|w|x to an internal node claiming the butterfly quartet wx|yz, indicates that the leaves y and z are found in a single subtree of the internal node, while the leaves w and x are found in different subtrees. For example, considering the quartet ab|cd, v and v' in Fig. 7 are written as vab|c|dand http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq10_HTML.gif .
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig7_HTML.jpg
            Figure 7

            Internal nodes vT and v'T', each claiming the quartet ab|cd.

            Given a subtree F of T, and a subtree G of T', we call the intersection FG a shared leaf set, i.e. the set of leaves present in both F and G. The size of the shared leaf set, |FG|, then denotes the number of leaves present in both F and G. The size of a single subtree F is similarly denoted |F|. We will use http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif to represent the subtree of T containing all leaves not in F and similarly for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif and G in T', see Fig. 8 for an example. Note that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif are also subtrees of T and T' respectively, and thus |FG|, | http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif G|, |F http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif | and | http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif | are all sizes of shared leaf sets between a single subtree from T and a single subtree from T'. In the presentation of the algorithms below, we will assume that we have access to |F|, |G| and |FG| for all subtrees F of T and G of T' in time O(1). At the end of the section we will describe how this can be achieved by an O(n) time preprocessing step, which does not affect the asymptotic worst case running time of the presented algorithms.
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig8_HTML.jpg
            Figure 8

            A rooted subtree F, and its complement rooted subtree http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif .

            Counting shared butterfly quartets

            For each pair of internal nodes v, v' from T, T' we want to count the number of shared butterfly quartets claimed by both internal nodes, shared(v, v'). Assume that F1, ..., http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq9_HTML.gif are subtrees of v and G1, ..., http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq13_HTML.gif are subtrees of v'. We wish to count all quartets on the form ab|cd where a, bF i G j , cF k G l and dF m G n , ikm, jln (see Fig. 7). Counting the possible combinations of a and b is expressed by the following double sum, which sums over all pairs of subtrees of v and v':

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq14_HTML.gif

            Given that a and b are in F i G j , we need to find c and d in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif i http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif j . The number of possible choices of c and d is expressed by:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq15_HTML.gif

            However when finding c and d in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif i http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif j , the condition that c and d must be in different subtrees is not satisfied. Therefore we subtract the number of times c and d are in the same subtree of v and v':

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq16_HTML.gif

            Any pair in F k G l is counted twice, once in |F k http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif j | and once in | http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif i G l |, therefore these pairs are subtracted once using the double sum above. (2) expresses the number of ways c and d can be found in different subtrees, given that a and b are found in F i G j :

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq17_HTML.gif

            We can now compute the number of shared butterfly quartets between two internal nodes, ie. the number of butterfly quartets claimed by both internal nodes with the same topology:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq18_HTML.gif

            If the trees, T and T', have a shared quartet ab|cd, then there are two internal nodes in each tree that claims this quartet: vab|c|dand vcd|a|bin T and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq10_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq19_HTML.gif in T'. Since both shared(vab|c|d, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq10_HTML.gif ) and shared(vcd|a|b, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq19_HTML.gif ) will count the quartet, the total number of shared quartets between the two trees is:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq20_HTML.gif

            It is straightforward to observe that calculating shared(v, v') using a direct computation of (3) takes time O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq21_HTML.gif ). It is however not necessary for shared(v, v') to sum over all subtrees of v and v'. Since each term in the sums involves taking a 2-subset from a shared leaf set, we need only to consider subtrees that are not leaves. This reduces the running time to O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq22_HTML.gif ). This running time can be improved even more, we start by expressing (2) in a different way:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq23_HTML.gif

            Let

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq24_HTML.gif

            We can ignore leaf subtrees, so we need to compute id v' different http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq25_HTML.gif j 's and S j 's which can each be computed in O(id v ) time. Symmetrically each of the id v http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq26_HTML.gif 's and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq27_HTML.gif 's takes time O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq28_HTML.gif ) to compute, and the total time of computing S is O(id v id v' ). The total time of computing all sums mentioned is thus O(id v id v' ) and this is the key to reducing the time usage of shared(v,v'). Using the sums we can express (4) as:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq29_HTML.gif

            Provided that the sums http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq25_HTML.gif j , http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq26_HTML.gif , S j , http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq27_HTML.gif and S have been calculated, (4) can be calculated in time O(l). Since calculation of the sums is independent on the calculation of shared(v, v'), these calculations can be done serially as shown in the algorithm below, thereby reducing the time usage of shared(T, T') to:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq30_HTML.gif

            ALGORITHM – CALCULATING THE NUMBER OF SHARED BUTTERFLY QUARTETS BETWEEN T AND T'

            Requires: T, T' two input trees with the same leaf set.

            Ensures: Res = shared(T, T')

            Res ← 0

            for v internal node in T do

               for v' internal node in T' do

                  Calculate sums http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq25_HTML.gif j , http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq26_HTML.gif , S j , http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq27_HTML.gif and S

                  ResRes + shared(v, v')

               end for

            end for

            Res http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq31_HTML.gif

            Counting nonshared butterfly quartets

            For each pair of internal nodes v, v' we want to count the number of nonshared butterfly quartets claimed by both internal nodes, nonshared(v, v'). Such quartets have the property that a pair of leaves found in the same subtree of v will be found in different subtrees of v' and vice versa, i.e. a nonshared quartet with leaves a, b, c and d, has a ∈ F i G j , bF i G l , cF k G n and dF m G j (see Fig. 9). The following expression counts all nonshared quartets related to a pair of nodes v and v', obeying that if two leaves of the quartet are in one subtree of v they are in different subtrees of v' and vice versa:
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig9_HTML.jpg
            Figure 9

            Internal nodes vT claiming the quartet ab|cd and v'T' claiming the quartet ad|bc.

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq32_HTML.gif

            Even though (6) satisfies the property of nonshared quartets, it possibly counts more than the number of nonshared quartets claimed by an internal node in each tree. The problem is that given two internal nodes, they do not nescessarily claim the quartets counted by (6). If we denote the leaves of an nonshared quartet a, b, c and d, the first, second, third and fourth factors in (6) counts the number of choices of a, b, c and d respectively. The first and second factor choose a and b from F i , while the third and fourth choose c and d from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif i . In the cases where c and d are chosen from the same subtree F k , ki of v, v does not claim the quartet. We must subtract these quartets, which can be counted as:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq33_HTML.gif

            Similarly there are cases where b and c are chosen from the same subtree G l , lj of v', which we must also subtract. These can be counted as:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq34_HTML.gif

            The cases where both c and d are chosen from the same subtree F k , ki of v and b and c are chosen from the same subtree G l , lj of v' are included in both the expressions above and therefore they must be added again. The following expression counts the number of these cases:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq35_HTML.gif

            Combining equations (6), (7), (8) and (9), gives a way of calculating the number of nonshared quartets between two internal nodes v and v':

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq36_HTML.gif

            Assuming that the trees have a nonshared quartet with topology form ab|cd in T, and ad|bc in T', there are two internal nodes in each tree claiming the quartet: vab|c|dand vcd|a|bin T and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq37_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq38_HTML.gif in T'. All of the four combinations of these will identify the quartet as nonshared. Therefore the total number of nonshared quartets between the two trees is:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq39_HTML.gif

            Direct computation of nonshared(v, v') using (10) takes time O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq21_HTML.gif ). Each of the subtrees has to be intersected with two disjoint sets in each of the sums. This means that if a subtree is only a leaf, at least one of these intersections will be zero, and the term will be zero. Therefore we can ignore subtrees that consist of a single leaf, just like when computing shared(T, T'), and reduce the time usage to O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq22_HTML.gif ). The time usage of the calculation of nonshared(v, v') can be further improved by rewriting (9):

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq40_HTML.gif

            Inspired by the precomputing of sums used in shared(v, v'), (5), we calculate for each i, k, ki the sum:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq41_HTML.gif

            There are O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq42_HTML.gif ) of these sums and each takes time O(id v' ) to calculate, so the time complexity for calculating all sums is O( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq43_HTML.gif ). In the case that id v' id v , we can switch v and v' and thus get time usage O(id v id v' min{id v , id v' }). Assuming that the sums have been calculated, (9) can now be calculated in time O(id v id v' min{id v , id v' }) by the expression:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq44_HTML.gif

            By substituting (9) with (12) in (10), we can calculate nonshared(v, v', in time O(id v id v' min{id v , id v' }). Since calculation of the sums is independent of the calculation of nonshared(v, v'), these calculations can be done serially as shown in the algorithm below.

            ALGORITHM – CALCULATING THE NUMBER OF NONSHARED BUTTERFLY QUARTETS BETWEEN T AND T'

            Requires: T, T' two input trees with the same leaf set.

            Ensures: Res = nonshared(T, T')

            Res ← 0

            for v internal node in T do

               for v' internal node in T' do

                  Calculate sums Si,k

                  ResRes + nonshared(v, v')

               end for

            end for

            Res http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq45_HTML.gif

            The time complexity of the algorithm is:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq46_HTML.gif

            Counting butterfly quartets in a single tree

            Reusing the idea of precomputing certain sums enables us to calculate the number of butterfly quartets in a single tree T in time O(|V|). Since the number of butterfly quartets in a single tree is the number of butterfly quartets shared between the tree and itself, we will use shared(T, T) to denote the number of butterfly quartet in T. This notation also emphasizes that computing the number is essentially a comparison of the tree against itself. Given a node v in T we can express the number of quartets it claims in the following way:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq47_HTML.gif

            where the F i 's are the subtrees of v. Now let

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq48_HTML.gif

            we can now express (13) as

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq49_HTML.gif

            S can be calculated in time O(id v )and using the precomputed S, (14) can be also calculated in time O(id v ). Summing the results of (14) for all nodes in T gives the number of quartets in the tree, shared(T, T). The total time usage is

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq50_HTML.gif

            Calculating the shared leaf set sizes

            The algorithms presented above all rely on O(1) time access to the size of the shared leaf set |FG| for any pair of subtrees F of T and G of T', where F and G each has size at least two, i.e. contains more than a single leaf. We will refer to |FG| as the intersection size of subtree F and G. In [9] and O(n2) time and space algorithm is presented for computing the intersection sizes of all pairs of subtrees F and G of two binary input trees. A straightforward generalization of this algorithm to two input trees T and T' of arbitrary degrees results in an O(n2dd') time and O(n2) space algorithm, which gives a worst case running time of O(n4).

            In this section we will present an improved algorithm for computing the intersection sizes of all pairs of subtrees of T and T' which runs in time O(n + |V||V'|) and space O(|V||V'|). We will assume that the size of each subtree F of T and G of T', i.e. |F| and |G|, is available in time O(1). This can be achieved as presented in the next section. Our algorithm for computing all intersection sizes is as follows. Choose an arbitrary node r in T and an arbitrary node r' in T'. Rooting the trees in r and r' respectively gives rise to two rooted trees T r and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif , Fig. 10 shows an example of rooting a tree. Calculating the shared leaf set sizes of T r and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif and all subtrees in both trees can be done using:
            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_Fig10_HTML.jpg
            Figure 10

            An arbitrary node, r, is chosen as the root in the tree, leading to a rooted tree.

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq52_HTML.gif

            where F i are all subtrees of T r and G j are all subtrees of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif . This can be calculated using dynamic programming in time O(n2):

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq53_HTML.gif

            Except |T r http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif |, the shared leafset sizes calculated by (15) are the shared leaf set sizes of all rooted subtrees of T and T' that do not contain the nodes r and r'. Assuming that the subtree F of T does not contain r, then the subtree http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif does contain r and similarly for r' and subtrees G and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq12_HTML.gif of T'. The shared leaf set sizes of these trees that do contain r and r' can be calculated from the intersection sizes that we have available using (16):

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq54_HTML.gif

            In other words all shared leaf set sizes can be calculated in time O(n2). First the shared leaf set sizes for subtrees that to not contain an arbitrary node r and r' from each tree are calculated in time O(n2). Then the shared leaf set sizes of subtrees that do contain r or r' (or both) are calculated constant time for each shared leaf set. Since there are O(n2) shared leaf sets the total time usage is O(n2).

            The reduction to time O(n + |V||V'|) and space O(|V||V'|) is done by handling the cases where F, F i , G or G j is a leaf in a special way. For each pair of nodes v, v' we let Leaf [v, v'] be the number of leaves directly connected to v that have the same label as a leaf directly connected to v'. Leaf [v, v'] is constructable in time O(n + |V|||V'|) in the following way: First, set Leaf [v, v'] = 0 for all pairs of nodes v, v'. Given a leaf number, x, there is a unique node, node(x), in T and a unique node, node'(x), in T'. For each leaf number, x, we increment Leaf [node(x), node'(x)]. There are n such numbers, and by assumption, the leaves can be found in constant time given a number. Thus Leaf [v, v'] is constructable in time O(n + |V||V'|). We choose r and r' in T and T' and create two rooted trees T r and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif . The non-leaf children of r in T r are F1,..., F x and the non-leaf children of r' in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif are G1,..., G y . The intersection size of the two trees can be defined recursively as:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq55_HTML.gif

            The first term counts all leaves directly connected to both r and r'. The second term counts all leaves connected directly to r', that are also in T r , but not directly connected to r. The third term counts all leaves connected directly to r, that are also in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq51_HTML.gif , but not directly connected to r'. Summing these three terms counts all leaves present in both subtrees, but leaves not connected directly to the roots are counted twice, and are subtracted by the last term.

            Since (17) is only summing over the non-leaf children of a given internal node, calculating the shared leaf set sizes of all these pairs of subtrees can be done using dynamic programming in time:

            http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq56_HTML.gif

            By the same arguments as above, the rest of the shared leaf set sizes can be computed in time O(|V||V'|) and space O(|V||V'|). Therefore the total running time of the algorithm is O(n + |V||V'|) and space usage is O(|V||V'|).

            Calculating the sizes of all subtree leaf sets

            All of the above algorithms make use of the sizes of the leaf sets of the rooted subtrees of the input trees, either directly or indirectly. Rooting T in an arbitrary node r gives rise to the rooted tree T r . Every subtree F x of T r is a rooted subtree of T, and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif x is also a rooted subtree of T. Note that the set of subtrees F x http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif x , since one tree is the complement of the other, contains all subtrees of T and that | http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif x | = n - |F x |. By using dynamic programming the sizes of all subtrees, F x , can be computed by a single traversal of T r . For each F x the size of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-1-16/MediaObjects/13015_2006_Article_16_IEq11_HTML.gif x can be computed in constant time, since n is known. This means that all leaf set sizes of a tree of arbitrary degree can be calculated in time O(n).

            Declarations

            Authors’ Affiliations

            (1)
            Department of Computer Science, University of Aarhus
            (2)
            Department of Statistics, University of Oxford
            (3)
            Bioinformatics Research Center, University of Aarhus

            References

            1. Felsenstein J: Inferring Phylogenies. 2004, Sinauer Associates Inc
            2. Robinson DP, Foulds LR: Comparison of weighted labelled trees. Combinatorial mathematics, VI (Proc 6th Austral Conf). 1979, Lecture Notes in Mathematics, Springer, 119-126.View Article
            3. Waterman MS, Smith TF: On the similarity of dendrograms. Journal of Theoretical Biology. 1978, 73: 789-800.PubMedView Article
            4. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics. 2001, 5: 1-13.View Article
            5. Robinson DP, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53: 131-147.View Article
            6. Estabrook G, McMorris F, Meacham C: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool. 1985, 34: 193-200.View Article
            7. Steel M, Penny D: Distribution of tree comparison metrics–some new results. Syst Biol. 1993, 42 (2): 126-141.
            8. Doucette CR: An Efficient Algorithm to Compute Quartet Dissimilarity Measures. 1985, [Unpublished, Bachelor of Science (Honours) Dissertation. Memorial University of Newfoundland]
            9. Bryant D, Tsang J, Kearney PE, Li M: Computing the quartet distance between evolutionary trees. Proceedings of the 11th Annual Symposium on Discrete Algorithms (SODA). 2000, 285-286.
            10. Brodal GS, Fagerberg R, Pedersen CNS: Computing the Quartet Distance Between Evolutionary Trees in Time O(n log n). Algorithmica. 2003, 38: 377-395.View Article
            11. Christiansen C, Mailund T, Pedersen CNS, Randers M: Computing the Quartet Distance Between Trees of Arbitrary Degree. Proceedings of Workshop on Algorithms in Bioinformatics (WABI). 2005, LNBI, Springer-Verlag, 3692: 77-88.View Article
            12. r8s. [http://​ginger.​ucdavis.​edu/​r8s/​]
            13. Pfam. [http://​www.​sanger.​ac.​uk/​Software/​Pfam/​]
            14. Besenbacher S, Mailund T, Westh-Nielsen L, Pedersen CNS: RBT – A tool for building refined Buneman trees. Bioinformatics. 2005, 21: 1711-1712.PubMedView Article
            15. QuartetDist. [http://​www.​daimi.​au.​dk/​~chrisc/​qdist/​]

            Copyright

            © Christiansen et al; licensee BioMed Central Ltd. 2006

            This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

            Advertisement