Computing the skewness of the phylogenetic mean pairwise distance in linear time
© Tsirogiannis and Sandel; licensee BioMed Central Ltd. 2014
Received: 9 December 2013
Accepted: 27 May 2014
Published: 14 June 2014
Skip to main content
© Tsirogiannis and Sandel; licensee BioMed Central Ltd. 2014
Received: 9 December 2013
Accepted: 27 May 2014
Published: 14 June 2014
The phylogenetic Mean Pairwise Distance (MPD) is one of the most popular measures for computing the phylogenetic distance between a given group of species. More specifically, for a phylogenetic tree and for a set of species R represented by a subset of the leaf nodes of , the MPD of R is equal to the average cost of all possible simple paths in that connect pairs of nodes in R.
Among other phylogenetic measures, the MPD is used as a tool for deciding if the species of a given group R are closely related. To do this, it is important to compute not only the value of the MPD for this group but also the expectation, the variance, and the skewness of this metric. Although efficient algorithms have been developed for computing the expectation and the variance the MPD, there has been no approach so far for computing the skewness of this measure.
In the present work we describe how to compute the skewness of the MPD on a tree optimally, in Θ(n) time; here n is the size of the tree . So far this is the first result that leads to an exact, let alone efficient, computation of the skewness for any popular phylogenetic distance measure. Moreover, we show how we can compute in Θ(n) time several interesting quantities in , that can be possibly used as building blocks for computing efficiently the skewness of other phylogenetic measures.
The optimal computation of the skewness of the MPD that is outlined in this work provides one more tool for studying the phylogenetic relatedness of species in large phylogenetic trees. Until now this has been infeasible, given that traditional techniques for computing the skewness are inefficient and based on inexact resampling.
Communities of co-occuring species may be described as “clustered” if species in the community tend to be close phylogenetic relatives of one another, or “overdispersed” if they are distant relatives . To define these terms we need a function that measures the phylogenetic relatedness of a set of species, and also a point of reference for how this function should behave in the absence of ecological and evolutionary processes. One such function is the mean pairwise distance (MPD); given a phylogenetic tree and a subset of species R that are represented by leaf nodes of , the MPD of the species in R is equal to average cost of all possible simple paths that connect pairs of nodes in R.
where is the value of the MPD for R in , and and are the expected value and the standard deviation respectively of the MPD calculated over all subsets of r species in .
In a previous paper we presented optimal algorithms for computing the expectation and the standard deviation of the MPD of a phylogenetic tree in O(n) time, where n is the number of the edges of . This enabled exact computations of these statistical moments of the MPD on large trees, which were previously infeasible using traditional slow and inexact resampling techniques. However, an important problem remained unsolved; quantifying our degree of confidence that the NRI value observed in a community reflects non-random ecological and evolutionary processes.
This degree of confidence can be expressed as a statistical P value, that is the probability that we would observe an NRI value as extreme or more so if the community was randomly assembled. Traditionally, estimating P is accomplished by ranking the observed MPD against the distribution of randomized MPD values . If the MPD falls far enough into one of the tails of the distribution (generally below the 2.5 percentile or above the 97.5 percentile, yielding P<0.05), the community is said to be significantly overdispersed or significantly clustered. However, this approach relies on sampling a large number of random subsets of species in , and recomputing the MPD for each random subset. Therefore, this method is slow and imprecise. This problem is exacerbated when it is necessary to consider multiple trees at once, arising for example from a Bayesian posterior sample of trees [4, 5]. In such cases, sufficient resampling from all trees in the sample can be computationally limiting.
We can approximate the P value of an observed NRI by assuming a particular distribution of the possible MPD values and evaluating its cumulative distribution function at the observed MPD. Because the NRI measures the difference between the observed values and expectation in units of standard deviations, this yields a very simple rule if we assume that possible MPD values are normally distributed: any NRI value larger than 1.96 or smaller than -1.96 is significant. Unfortunately, the distribution of MPD values is often skewed, such that this simple rule will lead to incorrect P value estimates [6, 7]. Of particular concern, this skewness introduces a bias towards detecting either significant clustering or significant overdispersion . If the distribution of MPD values for a particular tree can be reasonably approximated using a skew-normal distribution, calculating the skewness analytically would enable us to remove this bias and improve the accuracy of P value estimates. In the last part of the paper, we describe experiments on large randomly generated trees, supporting this argument. Further, when a large sample of trees should be considered, the full distribution of MPD values can be considered as a mixture of skew-normal distributions [9, 10], greatly simplifying and speeding up the process of calculating P values across the entire set of trees.
However, so far there has been no result in the related literature that shows how to compute the needed skewness measure efficiently. Hence, given a phylogenetic tree and an integer r there is the need to design an efficient and exact algorithm that can compute the skewness of the MPD for r species in . This would provide the last critical piece required for the adoption of a fully analytical and efficient approach for analysing ecological communities using the MPD and the NRI.
In the present work we show how we can compute efficiently the skewness of the MPD. More specifically, given a tree that consists of n edges and a positive integer r, we prove that we can compute the skewness of of the MPD over all subsets of r leaf nodes in optimally, in Θ(n) time. For the calculation of this skewness value we consider that every subset of exactly r species in is picked uniform with probability out of all possible subsets that have r species. The main contribution of this paper is a constructive proof that leads straightforwardly to an algorithm that computes the skewness of the MPD in Θ(n) time. This is clearly optimal, and it outperforms even the best algorithms that are known so far for computing lower-order statistics for other phylogenetic measures; for example the best known algorithm for computing the variance of the popular Phylogenetic Diversity (PD) runs in O(n2) time .
More than that, we prove how we can compute in Θ(n) time several quantities that are related with groups of paths in the given tree; these quantities can be possibly used as building blocks for computing efficiently the skewness (and other statistical moments) of phylogenetic measures that are similar to the MPD. Such an example is the measure which is the equivalent of the MPD for computing the distance between two subsets of species in .
The rest of this paper is, almost in its entirety, an elaborate proof for computing the skewness of the MPD on a tree in Θ(n) time. In the next section we define the problem that we want to tackle, and we present a group of quantities that we use as building blocks for computing the skewness of the MPD. We prove that all of these quantities can be computed in linear time with respect to the size of the input tree. Then, we provide the main proof of this paper; there we show how we can express the value of the skewness of the MPD in terms of the quantities that we introduced earlier. The proof implies a straightforward linear time algorithm for the computation of the skewness as well. In the last section we provide experimental results that indicate that computing the skewness of the MPD can be a useful tool for improving the estimation of P values when a skew-normal distibution is assumed. There we describe experiments that we conducted on large randomly generated trees to compare two different methods for estimating P values; one method is based on random sampling of a large number of tip sets, and the other method relies in calculating the mean, variance, and skewness of the MPD for the given tree.
Let be a phylogenetic tree, and let E be the set of its edges. We denote the number of the edges in by n, that is n=|E|. For an edge e∈E, we use w e to indicate the weight of this edge. We use S to denote the set of the leaf nodes of . We call these nodes the tips of the tree, and we use s to denote the number of these nodes.
Since a phylogenetic tree is a rooted tree, for any edge e∈E we distinguish the two nodes adjacent to e into a parent node and a child node; among these two, the parent node of e is the one for which the simple path from this node to the root does not contain e. We use Ch(e) to indicate the set of edges whose parent node is the child node of e, which of course implies that e∉Ch(e). We indicate the edge whose child node is the parent node of e by parent(e). For any edge e∈E, tree is the subtree of whose root is the child node of edge e. We denote the set of tips that appear in as S(e), and we denote the number of these tips by s(e).
Given any edge e∈E, we partition the edges of into three subsets. The first subset consists of all the edges that appear in the subtree of e. We denote this set by Off(e). The second subset consists of all edges e′∈E for which e appears in the subtree of e′. We use Anc(e) to indicate this subset. For the rest of this paper, we define that e∈Anc(e), and that e∉Off(e). The third subset contains all the tree edges that do not appear neither in Off(e), nor in Anc(e); we indicate this subset by Ind(e).
where and are the expectation and the variance of the MPD for subsets of exactly r tips in , and ER∈Sub(S,r)[·] denotes the function of the expectation over all subsets of exactly r tips in S. In a previous paper, we showed how we can compute the expectation and the variance of the MPD on in O(n) time . Therefore, in the rest of this work we focus on analysing the value and expressing this quantity in a way that can be computed efficiently, in linear time with respect to the size of .
The quantities that we use for expressing the skewness of the MPD
We provide now the following lemma.
Given a phylogenetic tree that consists of n edges, we can compute all the quantities that are presented in Table 1 in O(n) time in total.
Each of the quantities (I)-(X) in Table 1 can be computed by scanning a constant number of times the input tree , either bottom-up or top-to-bottom. For computing quantity (XI) we follow a more involved divide-and-conquer approach.
We showed in a previous paper how we can compute quantity (I) and the quantities in (III) for all e∈E in O(n) time in total .
We explain now how we can compute the six quantities in (2) in O(n) time, assuming that we have already computed TC s u b (e) and s(e) for every e∈E. To make the description simpler, we show in detail how we can compute the second and fourth quantities that appear in the last expression; it is easy to show that the rest of the quantities in (2) can be calculated in a similar manner.
The remaining quantities in (2) can be computed in a quite similar manner as the two quantities that we already described.
Since we have already computed TC(v) for every tip v∈S, we can trivially evaluate in O(n) time. Hence, to compute quantity (V) it remains now to calculate the values for every edge e∈E. We can do this in O(n) time as follows: at each tip u∈S we store the value TC(u) that we have already computed. Then we scan bottom-up and we calculate SUM4(e) by summing up the values SUM4(l) for all edges l∈Ch(e).
From the two last expressions, and given the description that we provided for other similar quantities in Table 1, it easy to conclude that PC(e) can be evaluated for every edge e in O(n) time by scanning a constant number of times. Having computed PC(e) for all edges e∈E, the quantity PSQ(e) can be computed in a similar manner.
Therefore, according to (3) we can compute the sum for all edges e∈E in linear time in total, given that we have already computed TC(e) for every e∈E, and TC(u) for every u∈S.
and can be computed in Θ(|Ch(e)|) time, given that we have already computed S Q s u b (l),∀l∈Ch(e).
The last expression can be computed in Θ(|Ch(e)|) time in a similar way as the previous terms of the sum in (5).
which takes Θ(|Ch(e)|) time to be computed for each edge e.
which can also be computed in Θ(|Ch(e)|) time.
All the sums that we analysed from (4) up to (12) can be computed in Θ(|Ch(e)|) time for every edge e in the tree. From this we conclude that for every edge e∈E we can evaluate QD(e) in (4) in Θ(|Ch(e)|) time from the respective values of the edges in Ch(e). Since , we prove that we can compute QD(e) for all the edges in in Θ(n).
In the previous section we defined the problem of computing the skewness of the MPD for a given phylogenetic tree . Given a positive integer r≤s, we showed that to solve this problem efficiently it remains to find an efficient algorithm for computing ; this is the mean value of the cube of the MPD among all possible subsets of tips in that consist of exactly r elements. To compute this efficiently, we introduced in Table 1 eleven different quantities which we want to use in order to express this mean value. In Lemma 1 we proved that these quantities can be computed in O(n) time, where n is the size of .
Next we prove how we can calculate the value for the mean of the cube of the MPD based on the quantities in Table 1. In particular, in the proof of the following lemma we show how the value can be written analytically as an expression that contains the quantities in Table 1. This expression can then be straightforwardly evaluated in O(n) time, given that we have already computed the aforementioned quantities. Because the full form of this expression is very long (it consists of a large number of terms), we have chosen not to include it in the definition of the following lemma. We chose to do so because we considered that including the entire expression would not make this work more readable. In any case, the full expression can be easily infered from the proof of the lemma.
For any given natural r ≤ s, we can compute in Θ (n) time.
Hence, our goal now is to compute a sum whose elements are the product of costs of triples of paths. Recall that for each of these paths, the end-nodes of the path are a pair of distinct tips in the tree. Although the end-nodes of each path are distinct, in a given triple the paths may share one or more end-nodes with each other. Therefore, the distinct tips in any triple of paths may vary from two up to six tips. Indeed, in (15) we get a sum where the triples of paths in the sum are partitioned in five groups; a triple of paths is assigned to a group depending on the number of distinct tips in this triple. In (15) the sum for each group of triples is multiplied by the same factor (r)θ(u,v,x,y,c,d)/(s)θ(u,v,x,y,c,d), hence we have to calculate the sum for each group of triples separately.
Notice that some of the terms in (16) are multiplied with an extra constant factor. This happens for the following reason; the sum in TRS(X) counts each triple once for every different combination of three pairs of tips. However, in the triple sum in (15) some triples appear more than once. For example, every triple that belongs in class B appears three times in (15), hence there is an extra factor three in front of TRS(B) in (16).
To compute efficiently , it remains to compute efficiently each value TRS(X) for every isomorphism class X that is presented in Figure 2. Next we show in detail how we can do that by expressing each quantity TRS(X) as a function of the quantities that appear in Table 1.
We get the value of by plugging into (16) the values that we got for all eight isomorphism classes of triples. For any isomorphism class X we showed that the value TRS(X) can be computed by using the quantities in Table 1. The lemma follows from the fact that each quantity that appears in this table is used a constant number of times for computing value TRS(X) for any class X, and since we showed that we can precompute all these quantities in Θ(n) time in total.
Let be a phylogenetic tree that contains s tips, and let r be a natural number with r ≤ s . The skewness of the mean pairwise distance on among all subsets of exactly r tips of can be computed in Θ (n) time.
According to the definition of skewness, as it is also presented in (1), we need to prove that we can compute in Θ(n) time the expectation and the variance of the MPD, and the value of the expression . In a previous paper we showed that the expectation and the variance of the MPD can be computed in Θ(n) time. By combining this with Lemma 2 we get the proof of the theorem.
Earlier in this paper, we mentioned that distributions of MPD values are often found to be skewed, suggesting that it is necessary to incorporate this skewness into analytical P value estimation. However, it is unclear whether good P value estimates are possible with only the first three moments of the distribution, or if more detailed distributional information is required.
The sizes of tip samples that we considered for our experiments, together with the number of sets that we sampled for each tip size in order to derive the “true” values
Size of each tip sample
Number of sampled sets
Thus, we conclude that the errors introduced by assuming a skew-normal distribution of MPD values appear to be comparable to or smaller than those introduced by standard resampling procedures, while also showing better scaling with increased tip sample size. Finally, the computation of P values using skew-normal approximation is typically faster than with resampling, particularly in cases involving large samples of trees.
Given a rooted tree and a non-negative integer r, we proved that we can compute the skewness of the MPD among all subsets of r leaves in in O(n) time. An interesting problem for future research would be to implement the algorithm that is outlined by our proof, and show its efficiency in practice. Also, it would be interesting to derive a similar result for the so-called Community Distance measure; this is the equivalent of the MPD when distances between two sets of species are considered .
We want to thank Dr. Erick Matsen for his insightful comments on the paper and the productive discussion that we had regarding the motivation for deriving an algorithm for the MPD skewness.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.