Skip to main content

Recursive algorithms for phylogenetic tree counting

Abstract

Background

In Bayesian phylogenetic inference we are interested in distributions over a space of trees. The number of trees in a tree space is an important characteristic of the space and is useful for specifying prior distributions. When all samples come from the same time point and no prior information available on divergence times, the tree counting problem is easy. However, when fossil evidence is used in the inference to constrain the tree or data are sampled serially, new tree spaces arise and counting the number of trees is more difficult.

Results

We describe an algorithm that is polynomial in the number of sampled individuals for counting of resolutions of a constraint tree assuming that the number of constraints is fixed. We generalise this algorithm to counting resolutions of a fully ranked constraint tree. We describe a quadratic algorithm for counting the number of possible fully ranked trees on n sampled individuals. We introduce a new type of tree, called a fully ranked tree with sampled ancestors, and describe a cubic time algorithm for counting the number of such trees on n sampled individuals.

Conclusions

These algorithms should be employed for Bayesian Markov chain Monte Carlo inference when fossil data are included or data are serially sampled.

Background

A phylogenetic tree is the common object of interest in many areas of biological science. The tree represents the ancestral relationships between a group of individuals. Given molecular sequence data sampled from a group of organisms it is possible to infer the historical relationships between these organisms using a statistical model of molecular evolution. At present, Bayesian Markov chain Monte Carlo (MCMC) methods are the dominant inferential tool for inferring molecular phylogenies [1].

It is a recent trend to include fossil evidence into the inference to obtain absolute estimates of divergence times [2, 3]. Fossils may restrict the age of the most resent common ancestor of a subgroup of individuals. This imposes a constraint on the tree topology (the discrete component of a genealogy) and therefore reduces the space of allowable genealogies.

Another trend in phylogenetic analyses is serial (or heterochronous) sampling in which molecular data is obtained from significantly different time points and analysed together. This type of data arises most frequently with ancient DNA and rapidly evolving pathogens [4]‐[6]. In this case tip dates become a part of the genealogy.

Including serially sampled or fossil data modifies or restricts the shape of a phylogenetic tree. Little has been done to describe and classify these modified trees. In this paper, we aim to explore the new spaces formed by these trees.

A genealogy consists of discrete and continuous components — the tree topology and the divergence times. The tree topologies form a finite tree space when the number of tips is bounded. An important characteristic of this space is the number of trees in it and we aim to find an efficient way to calculate this number.

In the case that fossil data restricts the tree topology, counting the number of trees that satisfy the imposed constraints reveals how much the constraints reduce the tree space.

The number of trees arises as a constant in tree prior distributions. Typically we model the distribution of tree topologies as independent of the distribution of divergence times. The density function of the distribution of genealogies is then a product of the density function for the divergence times and the distribution function for tree topologies. A common prior on tree topologies is uniform over all allowable topologies so the distribution function is a constant that is equal to one over the number of tree topologies. When inferring tree topologies using Bayesian MCMC methods, we do not usually need to know this constant but in some cases, as described below, the absolute value of the prior distribution is of interest and the constant has to be calculated.

When fossils are used to restrict the age of internal nodes, the tree prior should accurately account for this fact. Heled and Drummond [3] introduced a natural approach for tree prior specification when fossil evidence is employed in the inference. Their method requires counting of ranked phylogenetic trees that obey a number of constraints that arise from including the fossil evidence. The construction requires calculation of the marginal density for the time of the calibration node, the node representing the most recent common ancestor of a clade which may or may not be monophyletic. For a particular location of the calibration node, or particular constraints on the tree topology, the marginal density function is the marginal density function for the divergence times weighted by the number of trees satisfying the constraints. In this case, the weight constants do not cancel in the MCMC scheme and therefore have to be calculated.

Tree counting has a long history. For phylogenetic trees, the counting problem is to find the number of all possible trees on n leaves. For some types of phylogenetic tree, there are known closed form solutions to this problem. For other types, only recursive equations have been derived. In this paper, we consider only rooted trees.

A survey of results on counting different types of rooted trees is presented in [7] where trees with different combinations of the following properties are considered: trees are either labeled (only leaves are labeled) or unlabeled, ranked or non‐ranked, and bifurcating or multifurcating. The results presented in the survey can also be found in [6, 8, 9].

In [10], Griffiths considered unlabeled, non‐ranked rooted trees such that interior nodes can have one child or more and the root has at least two children. Using generating functions, he derived recursive equations for counting the number of all possible such trees on n leaves with s interior nodes. In [11], Felsenstein considered partially labeled trees, i.e., a tree in which all the leaves are labeled and some interior nodes also may be labeled. He derived the recursive equations for counting the number of rooted, non‐ranked, partially labeled trees with n labeled nodes.

In this paper, we consider a number of counting problems for different classes of phylogenetic trees. First, we describe an effective way of counting the number of all possible fully ranked trees on n leaves, that is, trees on n leaves in which all internal and leaf nodes are ranked.

Second, we find the number of bifurcating trees that resolve a given multifurcating tree with n leaves. We give a solution to this problem for rooted, ranked, labeled trees and generalise the algorithm to count resolutions to fully ranked trees.

Finally, we introduce and formally describe a new type of phylogenetic tree and describe an algorithm for counting the number of all such trees on n leaves. This type of tree is important when we have a serial sample and sampled individuals can be direct ancestors of later sampled individuals. When the population size is small or the fraction of individuals sampled from the population is large, this type of tree should be included in the inference [12, 13].

Serial sampling

We mainly follow the terminology from [9] for the definitions of phylogenetic trees. A tree is a finite connected undirected graph with no cycles. A rooted tree is a tree with a single node ρ designated as a root. Every rooted tree T=(V,E,ρ) imposes a partial order on V that is defined as follows: v1 T v2 if a unique simple path from the root to v2 passes through v1. So the root is the smallest element. If v1 T v2 then we say that v1 is an ancestor of v2 and v2 is a descendant of v1. A node in a rooted tree is called interior if it has descendants and a leaf if it has no descendants. The root is considered interior. Denote V the set of interior nodes of T. A node u is a parent of a node v and v is a child of u if v< T u and there is no wV such that v< T w< T u. A rooted tree is called binary if every interior node has exactly two children. It is called weakly binary if every interior node has at most two children. We have chosen this terminology to fit with the usage of “binary” in the phylogenetics literature which may not agree with that in other literatures.

Let X be a finite non‐empty set of labels. A phylogenetic X‐tree is a pair T=(T,ϕ), where T is a tree and ϕ is a bijection from X onto the set of leaves of T (we may omit X and say “tree” instead of “X‐tree” if the set of labels is not specified). The tree T is called an underlying tree or a shape of the phylogenetic tree and ϕ is a labeling function. If the underlying tree of is rooted then is called a rooted phylogenetic tree. In what follows, we consider only rooted trees unless explicitly stated otherwise. A phylogenetic tree is binary (weak binary) if its underlying tree is binary (weak binary). A ranked phylogenetic tree is a pair (T,h), where is a rooted phylogenetic tree and h is an injective function (ranking function) from the set V into the set {1,,| V |} such that v1 T v2 implies h(v1)≤h(v2) for every v 1 , v 2 V . In other words, there is a linear order on the interior nodes of T that is consistent with the partial order of T.

Definition 1

A ranked X‐tree is a binary ranked phylogenetic X‐tree.

An example of a ranked tree is given in Figure 1.

Figure 1
figure 1

Ranked tree. Ranked X‐tree, X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

In biology, a phylogenetic tree represents the evolutionary history of a collection of sampled individuals. The collection of individuals is represented by the set X. The root of the tree is the most recent common ancestor of X and interior nodes are bifurcation events. The ranking function represents the time order of the bifurcation events. A general problem in evolutionary biology is how to reconstruct the phylogenetic tree from sequence data obtained from sampled individuals. Tackling this problem in a Bayesian framework may require counting the number of all possible histories on a sample of individuals.

When all individuals are sampled at the same time (as in Figure 1) counting tree problem has a simple solution.

Let X be a fixed label set such that |X|=n. The number of all ranked X‐trees up to isomorphism is

R ( n ) = n ! ( n 1 ) ! 2 n 1

This formula has been derived by many authors. Proofs can be found in [6, 7], or [9]. The letter R in the equation comes from the word “ranked”.

The situation is different when individuals are sampled at different times (serially sampled). In this case, we need to define another kind of phylogenetic tree in which leaves are also ranked.

Definition 2

A fully ranked (FR) X‐tree is a pair(T,h), whereis a binary rooted phylogenetic X‐tree and h:V→{1,…,l} with| V |<l|V|is a surjective function such that

  • v1 T v2 implies h(v1)≤h(v2) and

  • h(v1)=h(v2) implies v1=v2 or v 1 , v 2 V V .

An example of a fully ranked X‐tree is given in Figure 2.

Figure 2
figure 2

Fully ranked tree. Fully ranked X‐tree. X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

Before the tree is reconstructed we observe only leaves (sampled individuals) of the tree that are grouped (pre‐ranked) according to the times they were sampled. For the tree shown in Figure 2, we have two sampling times and hence two groups: A, B, and C form the first group, D and E form the second group.

Let T=(T,h) be a fully ranked X‐tree with h:V→{1,…,l}. Let m=|h(ϕ(X))|, that is, the number of sampling times. Define a pre‐ranking functionĥ from X onto {1,…,m} for tree T such that for all x1,x2X

  • h(ϕ(x1))≤h(ϕ(x2)) implies ĥ( x 1 )ĥ( x 2 ) and

  • h(ϕ(x1))=h(ϕ(x2)) iff ĥ( x 1 )=ĥ( x 2 ).

For the tree given in Figure 1, ĥ(A)=ĥ(B)=ĥ(C)=1 and ĥ(D)=ĥ(E)=2.

Let X and ĥ:X{1,,m} be fixed. We are interested in the number of all fully ranked X‐trees that haveĥ as a pre‐ranking function. Note that this number depends only on the numbers n i =|{x|ĥ(x)=i}|, the number of individuals sampled at the ith time point, not on X and ĥ directly. We denote this quantity by F(n1,…,n m ), where F stands for “fully ranked”. Then

F( n 1 ,, n m )= i = 1 n m R ( n m ) R ( i ) F( n 1 ,, n m 1 +i)
(1)

and F(n)=R(n).

Proof

Consider a continuous process of bifurcation in which lineages may bifurcate in time or be cut and labeled (sampled). The process finishes when all lineages are cut producing a tree. The discrete structure of the tree produced by this process is a fully ranked X‐tree. It is easy to see that every fully ranked X‐tree can be obtained as a result of this process. To count the required number we can count the number of different trees which can be produced by the process if we know that after it finishes there are n i sampled individuals (i.e., cut and labeled lineages) at the ith time point, i.e., we have the sequence (n1,…,n m ).

Suppose that at the (m−1)th time point there are i lineages that are ancestral to n m individuals sampled at time m. When we look at this process backwards in time the bifurcation events become coalescence events. The number of different ways these n m lineages coalesce to i lineages is R ( n m ) R ( i ) . This is the number of all possible ranked X‐trees on n m individuals but since we are not interested in the structure of the coalescent after we reach i lineages, it is divided by the number of ways in which the remaining i lineages can coalesce. Note that if coalescence patterns are different between the (m−1)th and mth time points then the trees are also different.

Further, for each of these coalescence patterns, we need to count the number of different ways these i lineages and other n1,…,nm−1 lineages can coalesce. This is where we can apply the recursion. We can consider that we also cut these i lineages at time m−1 and label them with the ranked subtrees descendant from these lineages. Then, at time m−1, we have nm−1 sampled individuals and another i sampled individuals and it remains to count the number of trees on the sequence (n1,…,nm−1+i).

Note that two trees are different if they have different numbers of lineages at time m−1. The number of additional i lineages can be between 1 and n m and we need to sum over all possible i to complete the recursion. □

We introduce a third type of tree in which sampled individuals may be direct ancestors of later sampled individuals. We call it a tree with sampled ancestors. This type of tree is not usually considered in phylogenetics since the probability of sampling a direct ancestor is often negligible. In small populations or when a large portion of the population is sampled, however, this can not be ignored.

Let T=(V,E,ρ) be a weak binary tree. Define a set V ̂ as follows:

V ̂ = { v V | deg ( v ) = 1 or [ deg ( v ) = 2 and v is not the root ] }

A rooted S‐phylogenetic X‐tree is a pair T=(T,ϕ), where T is a weak binary tree and ϕ:X V ̂ is a bijection.

Definition 3

A fully ranked X‐tree with sampled ancestors (FRS X‐tree) is a pair (T,h), where is a rooted S‐phylogenetic X‐tree and h:V→{1,…,l} is a surjective function such that

  • v1< T v2 implies h(v1)<h(v2) and

  • h(v1)=h(v2) implies v1=v2 or v 1 , v 2 V ̂ ,

(see Figure 3).

Figure 3
figure 3

FRS tree. FRS X‐tree with the labeled 1‐degree root. X={A,B,C,D,E}. The numbers on the right are values of the ranking function.

The definition of a pre‐ranking function remains the same for FRS trees. Let S(n1,…,n m ) (with S standing for ‘sampled ancestors’) denote the number of all FRS X‐trees that have the same pre‐ranking function ĥ, where n i =|{x|ĥ(x)=i}|. Then

S ( n 1 , , n m ) = i = 1 n m j = 0 min { i , n m 1 } i j n m 1 j × R ( n m ) R ( i ) S ( n 1 , , n m 1 + i j )
(2)

and S(n)=R(n).

Proof

Consider the same process as before with only change of sampling events. Now, at some points of time, some lineages are cut and labeled and others are only labeled but not cut.

Then this equation can be obtained as follows. We have n m individuals that are sampled at the mth time point. At time m−1, there are between 1 and n m ancestral lineages of those n m individuals, depending on the number of coalescence between times m and m−1. Let i denote the number of these lineages. Then there are R ( n m ) R ( i ) different possible coalescent patterns that can lead to this situation. Some of these i ancestral lineages may be among the individuals sampled at time m−1, i.e. lineages that are labeled but not cut at time m−1. Let j be the number of those ancestral lineages that are sampled at time m−1. There are i j ways to chose these j lineages out of i and there are n m 1 j possible ways to chose j sampled at time m−1 individuals that are not cut at time m−1.

Further, at time m−1, there are nm−1 sampled lineages and ij ancestral lineages that are not sampled and it remains to count the number of FRS trees on the sequence (n1,…,nm−1+ij).

Finally, we sum over all possible i and j to complete the recursion. □

Dynamic counting

Calculating the recursions (1) and (2) directly is inefficient and impractical. Here we describe a more efficient algorithm for counting fully ranked trees using these recursions. Rewrite equation (1) as

F ( n 1 , , n m ) = R ( n m ) i = 1 n m F ( n 1 , , n m 1 + i ) R ( i )

Then instead of calculating F(n1,…,n j +α) for j{1,…,m−1} and α{0,…,nj+1+…+n m } we can calculate

A j (α)= F ( n 1 , , n j + α ) R ( α )
(3)

using recurrence equations:

A j ( 0 ) = R ( n j ) i = 1 n j A j 1 ( i ) and
(4)
A j ( α + 1 ) = ( n j + α ) ( n j + α + 1 ) α ( α + 1 ) A j ( α ) + R ( n j + α + 1 ) R ( α + 1 ) A j 1 ( n j + α + 1 ) .
(5)

This leads to Algorithm 1 to calculate F(n1,…,n m ). Let n be the number of sampled individuals, i.e., n= i = 1 m n i . Calculation of all the R(i) takes O(n) steps and calculation of all the Aj takes at most O(n) steps. In total, the algorithm takes O(m n) steps.

Algorithm 1 Calculating the number of fully ranked trees

A similar approach leads to an O(m n2)‐time algorithm for counting FRS trees. The description of this algorithm is in Appendix Appendix 1: Algorithm for counting FRS trees.

Constraints

In phylogenetic analysis, it is common to have some limited information about the ancestors of sampled individuals. We consider two types of such information. First, we may know that a subgroup of sampled individuals is monophyletic. That means that the most recent common ancestor of the subgroup is not an ancestor of any other individual that does not belong to the subgroup. Second, we may know the relative ages of the most recent common ancestors of monophyletic subgroups. This known information imposes constraints on the space of possible phylogenetic trees representing the evolutionary history of sampled individuals. The question is how many phylogenetic trees satisfy the constraints on a group of sampled individuals?

The number of resolutions of a constraint tree

We first describe a problem for contemporaneous sampling in terms of constraint trees. We call a rooted tree multifurcated if each interior node has at least 2 children. Note that in contrast to the common terminology we assume that a binary tree is also multifurcated. If we replace the word “binar” with “multifurcated” in the definition of a ranked tree we obtain a more general class of trees.

Definition 4

A constraint X‐tree is a multifurcated ranked phylogenetic X‐tree.

An example of a constraint tree is given in Figure 4. A constraint tree represents prior information about clades and ranking. Each interior node constrains a subgroup of individuals, leaves that are descendant from this node, to by monophyletic. The most recent common ancestor of the whole group of individuals, the root node, is also regarded as a constraint. The ranking function constrains the ages of the most recent common ancestors of the monophyletic subgroups to have a specified order.

Figure 4
figure 4

Constraint tree. Constraint tree, labels are omitted. Subtree 2 is coloured green. It has two child nodes that are leaves, therefore, n 2 =2. The ancestor function for this tree is defined as f(2)=f(3)=1 and f(4)=2. A compact notation for this constraint tree is ( n 1 ,, n k ,f)=(0,2,3,2,{(2,1),(3,1),(4,2)}).

We say that a ranked X‐tree T 1 =( T 1 , h 1 )resolves a constraint X‐tree T 2 =( T 2 , h 2 ) if there is an isomorphic embedding of T 2 into T 1 , i.e., there is an injective mapping f:V 2 →V 1 such that

  • ϕ 1 (x)=f(ϕ 2 (x)) for each xX;

  • v T 2 u

    iff f(v) T 1 f(u) for each u,vV 2 , and

  • h 2 (v)≤h 2 (u) implies h 1 (f(v))≤h 1 (f(u)) for each u,v V 2 .

We wish to calculate the number of ranked trees that resolve a given constraint tree T=(T,ϕ,h). It is easy to see that this number depends only on the underlying tree T and ranking function h, but does not depend on the labeling function ϕ or the label set X.

We now introduce some notation in order to define recursive equations. We label interior nodes according to their ranks such that node i is a node v such that h(v)=i. A subtree induced by node i and its children is called subtree i. Child nodes in a subtree may be leaves in the initial tree. Let n i ≥0 denote the number of such child nodes in subtree i. Let f:{2,…,k}→{1,…k−1} be the parent function on interior nodes, i.e., f(i)=j whenever j is a parent to i. When k=1, i.e., there is only one interior node, f=. See Figure 4 for an example of introduced notation. Note that a tuple (n 1 ,…,n k ,f) completely defines a pair (T,h).

Let Rr(n 1 ,…,n k ,f) be the number of ranked trees resolving a constraint tree defined by the tuple (n 1 ,…,n k ,f). The superscript r stands for “resolution”. Then the following equations hold.

R r (2,)=1,
(6)
R r ( n 1 , , n k 1 , 2 , f ) = i C n i 2 R r ( n 1 , , n i 1 , , n k 1 , 2 , f ) + R r ( n 1 , , n f ( k ) + 1 , , n k 1 , f | { 2 , , k 1 } ) and
(7)
R r ( n 1 , , n k , f ) = i C n i 2 R r ( n 2 , , n i 1 , , n k , f ) , if n k > 2 ,
(8)

where C is the collection of nodes that have more than 2 children and at least 2 of them are leaves, i.e., C={i<k | n i ≥2 and n i i >2} and α i =|{a| f(a)=i}|. Note that n i i is the number of children of node i.

Proof

When a constraint tree has 2 leaves, it is unique and is resolution of itself. So Equation (6) is trivial. To explain the main sum in Equation (7) and (8) we consider the constraint tree which is defined by (3,3,{(2,1)}) and shown in Figure 5, left. The last interior node of a resolving tree (that is, the interior node with the highest rank or the furthest node from the root) is either a parent to leaves in subtree 1 or leaves in subtree 2. Suppose it is the first case (see Figure 5, centre). Since leaves have distinct labels from X, there are 3 2 ways to chose two leaves that are children of that last node. We can partition all the resolving trees for which the last node is in subtree 1 in 3 2 groups. The number of trees in each group is the number of trees that resolve a constraint tree defined by (2,3,{(2,1)}) and shown on the right of Figure 5. A similar argument holds if the last node in a resolving tree is a parent to nodes from subtree 2.

Figure 5
figure 5

Recursive approach. The last interior node in a resolving tree locates in subtree 2.

So in the general case, there are k subtrees and if the last node is in subtree i we have n i 2 ways to choose two lineages that coalesce and then we should count the number of trees resolving the tree defined by (n 1 ,…,n i −1,…,n k ,f). Note that in the example above, we consider the tree with more than 2 leaves in each subtree. However, the last interior node of a resolving tree can not be in subtree i for i<k if there is not enough leaves in this subtree. This can happen either if there are less than 2 leaves in subtree i or if there are 2 leaves in subtree i and node i has only these two leaves as its children. Both cases imply that any parent to leaves of subtree i in a resolving tree has a lesser rank than the rank of node k. This explains why we sum only over the elements of the set C.

Finally, we should consider one more case which explains why there is one more summand in equation (7). If the last node in a constraint tree has only two children, i.e., there are 2 leaves in subtree k, then there is one more group of resolving trees, the group that consists of resolving trees that have this node as the last node. □

Dynamic counting

We will calculate Rr(n 1 ,…,n k ,f) for the corresponding constraint tree. In order to find Rr(n 1 ,…,n k ,f), at each step s, we will calculate numbers R r ( x 1 ,, x t ,f | t 1 ) with i t x i =s, t −1 ={2,…,t}, and t≤k. Note that we do not have to calculate all such numbers. To determine which numbers are required we define two upper triangular matrices m and M of size k×k.

Suppose we draw a horizontal line which is strictly below the line that passes through node j and strictly above the line that passes through node j+1 (or all the leaves if j=k). Then m i,j is the minimal possible number of intersections of this line with branches of subtree i in a resolving tree and M i,j is the maximal possible number. An example is given in Figure 6.

Figure 6
figure 6

Defining matrices m and M . Two trees that resolve a constraint tree from Figure 4 (only resolutions of subtree 2 are shown) and three ways to draw a horizontal line. The yellow lines correspond to the minimal number of intersections and the red line, to maximal. Thus, m 2,2 =2 and M 2,2 =3. We can also note that m 1,2 =M 1,2 =1.

Let a i,j =|{x≤j f(x)=i}| for i≤j. So a i,j is the number of children of node i with ranks at most j. Then

M i , j = n i + α i a i , j and m i , j = 2 if a i , j = 0 , 1 if a i , j > 0 and M i , j > 0 , 0 otherwise.

Let t≤k and x 1 ,, x t N. We call a tuple ( x 1 ,, x t ,f | t 1 ) eligible if m i,t ≤x i M i,t for 1≤it.

We now turn to Algorithm 2 to count resolutions. At each step s≤n, we construct a set S s . A unique element of S n is Rr(n 1 ,…,n k ,f) and calculating elements of S s only requires elements of S s−1 .

Algorithm 2 Calculating the number of resolutions of a constraint tree
Proposition 1

When k is fixed Algorithm 2 does at most O(nk) steps.

Proof

The algorithm does O(k) steps for each eligible tuple and, since we assume k is a constant, it is O(1). For given j there are

i = 1 j ( M i , j m i , j + 1 )

eligible tuples of size j. Since M i,j m i,j +1≤n i i −1, for the total number of eligible tuples we have

j = 1 k i = 1 j ( M i , j m i , j + 1 ) i = 1 k ( n 1 + α 1 1 ) · · ( n i + α i 1 ) < k ( n 1 + k 1 ) · · ( n k + k 1 ) k ( n k + k 1 ) k = O ( n k )

The number of resolutions of a fully ranked constraint tree

We can generalise the results of the previous section to fully ranked trees. Now we replace the word “binary” with “multifurcated” in the definition of a fully ranked tree to get

Definition 5

A fully ranked constraint X‐tree is a pair (T,h) , where is a multifurcated rooted phylogenetic X‐tree and h is a function such that h:V→{1,…,l}, where | V |<l|V| , and

  • v 1 T v 2 implies h(v 1 )≤h(v 2 ),

  • h(v 1 )=h(v 2 ) implies v 1 =v 2 or v 1 , v 2 V V .

We say that a fully ranked X‐tree T 1 =( T 1 , h 1 )resolves a fully ranked constraint X‐tree T 2 =( T 2 , h 2 ) if there is an isomorphic embedding of T 2 into T 1 , i.e., there is an injective mapping f:V 2 →V 1 such that

  • ϕ 1 (x)=f(ϕ 2 (x)) for each xX;

  • v T 2 u

    iff f(v) T 1 f(u) for each u,vV 2 ,

  • h 2 (v)≤h 2 (u) implies h 1 (f(v))≤h 1 (f(u)) for each u,vV 2 , and

  • h 2 (v)=h 2 (u) iff h 1 (f(v))=h 1 (f(u)) for each u,vV 2 .

The problem is to count the number of fully ranked resolutions of a fully ranked constraint tree.

Let T=(T,h) be a fully ranked constraint tree with h:V→{1,…,l} and| V |=k. Again we label interior nodes with numbers from {1,…,k}. However, labels and ranks of interior nodes do not necessary coincide now. Let r:{1,,k}h( V ) be an injective increasing function which maps the kth interior node to its rank. Note that r(1) is always equal to 1 because the root node always has rank 1 and that r(k)<l because only leaves can have rank l. Now, node i is the node that has rank r(i) and subtree i is the subtree which is induced by node i and its children. Since leaves in subtrees are now ranked, we need more parameters to encode the tree. Let n= [ ni,j] be a matrix of sizek×l, whereni,j is the number of leaves of rankj in subtreei or the number of children of nodei with rankj. Note that if nodei is a parent of nodej thenni,r(j)=1 andni,x=0 forx r(j) and this means that the parent function is uniquely defined by the matrix n and functionr and the number of resolutions ofT depends only on (n,r).

If a constraint tree has only 2 leaves then there is only 1 tree resolving it:

F r ( ( 2 ) , r 0 ) = F r ( ( 1 , 1 ) , r 0 ) = 1

withr 0 mapping 1 to 1.

For the main recursion we need to determine the location of the last interior node in a resolving tree. As before, this node can be a parent to leaves in different subtrees. Also, since leaves now may have different ranks, the last interior node in a resolving tree may be ranked in different ways with respect to ranks of the leaves. LetNc,x denote the number of children of nodec with ranks at leastx, that is, N c , x = j = x l n c , j . For eachp{r(k)+1,…,l}, we define a setCp of candidate subtrees in which the last node can be placed at levelp, i.e. between the (p−1)th andpth time points.

C p = { c N c , p 2 and N c , r ( c ) + 1 > 2 }

See Figure 7 for an example of subtree where the last node can be placed between the (p−1)th andpth time points.

Figure 7
figure 7

Location of a new node between the (p−1)‐th andp‐th time points. A new node can be placed in subtreec between the (p−1)‐th andp‐th time points if the number of green branches is greater or equal to 2 and the number of all branches in subtreec is greater than 2.

For eachp andc such thatcCp, there is a distinct group of resolving trees. The quantity of trees in this group is equal to the number of resolutions of a constraint tree which is defined by matrix n c , p =[ n i , j c , p ] of sizek×p such that

  • n i , j c , p = n i , j

    for 1≤ik andj<p,

  • n c , p c , p = N c , p 1

    ,

  • n i , p c , p = N i , p

    forik andic.

If nodek has only 2 children then, as before, there is one more group of resolving trees. This group consists of the resolving trees in which the last interior node coincides with nodek. Let n =[ n i , j ] be a matrix of size (k−1)×r(k) such that

  • n i , j = n i , j

    for 1≤ik andj<r(k),

  • n i , r ( k ) = N i , r ( k )

    for 1≤ik−1.

This matrix defines a constraint tree for which the number of resolutions is equal to the number of trees in the last group.

Then the main recursion is as follows:

F r ( n , r ) = p = r ( k ) + 1 l c C p N c , p 2 F r ( n c , p , r ) + F r ( n , r | { 1 , , k 1 } ) if N k , r ( k ) + 1 = 2 ,
(9)
F r ( n , r ) = p = r ( k ) + 1 l c C p N c , p 2 F r ( n c , p , r ) otherwise ,
(10)

Using these equations, we can calculateFr(n,r) forO(m2nk) steps, wherem is the number of sampling points, i.e.,m=lk. The calculation is described in Appendix Appendix 2: Algorithm for counting fully ranked resolutions of a fully ranked constraint tree.

Conclusions

In Table 1, we summarised the complexity (upper bounds) of the counting algorithms. We can see that counting fully ranked and FRS trees is reasonably fast, particularly when the number of sampling points is small. Counting of resolutions of a constraint tree is an expensive procedure but for smallk it remains possible. In practice,k is typically small so this algorithm will be of practical use. Counting resolutions of an FRS constraint tree, using the same methods, has a greater cost and is less elegant, but its practicality and bounds on its complexity remain to be assessed.

Table 1 The complexity of algorithms

These algorithms can be implemented in software for phylogenetic analysis that involves serial sampling scheme or limited prior knowledge about ancestors of particular clades for calculating tree prior distributions.

Appendix 1: Algorithm for counting FRS trees

Rewrite equation (2) as follows:

S ( n 1 , , n m ) = i = 0 n m A ( i , n m 1 , n m ) S ( n 1 , , n m 1 + i )

where A(i,a,b)= x = 0 min { a , b i } R ( b ) R ( i + x ) i + x x a x . Then we can calculateS(n 1 ,…, n m ) recursively using Algorithm 3. We use the notation N j = i = j m n i .

Algorithm 3 Calculating the number of FRS trees

At each stepj>1 of the algorithm, we calculateS(n 1 ,…, n j +α) for 0≤αNj+1. We skip stepj=1 and do not calculateS ( n 1 + α) for 0≤α N 2 becauseS ( x )= R ( x) and we have already calculated all the necessaryR(x). Further, for calculating eachS(n 1 ,…, n j + α), we need to calculate the coefficientsA and this is the most expensive part of the algorithm. So, at each stepj>1, we need to calculateA ( i , nj−1,n j + α) for 0≤i n j + α and 0≤α Nj+1.

First, we calculate these values forα=0. Denote

B ( i , a , b , x ) = R ( b ) R ( i + x ) i + x x a x , for x a and i + x b

then

A ( i , a , b ) = x = 0 min { a , b i } B ( i , a , b , x )

where 0≤i b,a=nj−1, andb = n j . Wheni is fixed, calculation ofA ( i,nj−1,n j ) requires values ofB(i,nj−1,n j ,x) for 0≤x m i n { nj−1,n j i}. Rewrite this asB ( n j β,nj−1,n j ,x) for 0≤x m i n{nj−1,β} withβ=n j i. Now we can calculate these values for 0≤βn j and, hence, for 0≤i n j using recursive equations:

B( n j , n j 1 , n j ,0)=1
((11))
B ( n j ( β + 1 ) , n j 1 , n j , x ) = ( n j ( β + 1 ) + x ) ( n j β ) B ( n j β , n j 1 , n j , x ) 2
(11)
B ( n j ( β + 1 ) , n j 1 , n j , ( β + 1 ) ) = ( n j β ) ( n j 1 β ) B ( n j β , n j 1 , n j , β ) ( β + 1 ) 2
(12)

We use equation (11) for 0≤xm i n { nj−1,β}. Ifm i n{nj−1,β+1}≠m i n { nj−1,β} and thereforem i n { nj−1,β+1}=m i n { nj−1,β}+1=β+1 then we also use equation (12). The cost of this calculation is dominated by the number of the summandsB ( n j β,nj−1,n j ,x) for 0≤x m i n { nj−1,β} and 0≤β n j . This number is O( n j 2 ).

Now we need to calculateA(0,nj−1,n j +α),…,A(n j + α,nj−1,n j + α) for 1≤α Nj+1. HavingA(0,nj−1,n j +α),…,A(n j +α,nj−1,n j +α) calculated (note that we have already calculated these values forα=0), we can calculateA(0,nj−1,n j +(α+1)),…,A(n j +(α+1),nj−1,n j +(α+1)) using equations

A ( i , a , b + 1 ) = ( b + 1 ) b 2 A ( i , a , b ) + b + 1 i a b + 1 i if b i < a , ( b + 1 ) b 2 A ( i , a , b ) if a b i and A ( b + 1 , a , b + 1 ) = 1
(13)

where 0≤i b,a = nj−1, andb=n j +α.

Moreover, for eachα, we can optimize calculation of the second summands in the first case of equation (13) using recursion

n j + α i + 1 n j 1 n j + α ( i + 1 ) = ( n j + α i ) 2 ( i + 1 ) ( n j 1 n j α + i + 1 ) n j + α i n j 1 n j + α i
(14)

To apply this recursion we need an initial value. This value depends onnj−1,n j , andα and, since only the first case of equation (13) contains the second summand, it is not necessary the value of this summand fori=0. There are 3 cases for calculation of all the initial values that are necessary at stepj.

Case 1: N j < nj−1. That means that we use the first case of equation (13) for allα. In this case, for allα, the initial value for recursion (14) is the value of the second summand in (13) wheni=0. So we need to calculate n j + α 0 n j 1 n j + α + 0 for 1≤α Nj+1 and those are n j 1 n j + 1 ,, n j 1 n j + N j + 1 . This takesO(N j ) steps.

Case 2: n j +α 0 = nj−1 for someα 0 {0,…, Nj+1}. Ifα 0 >0 then, for allα α 0 , we use the first case of (13) and therefore we need to calculate n j 1 n j + 1 ,, n j 1 n j + α 0 , this takesO ( α 0 ) steps. For allα > α 0 , we use the second case first and, starting fromi = α α 0 , we use only the first case. So we need to calculate n j + α α α 0 n j 1 n j + α ( α α 0 ) forα{α 0 +1,…,Nj+1}, which are n j 1 + 1 1 ,, n j 1 + ( N j + 1 α 0 ) ( N j + 1 α 0 ) . This takesO(Nj+1α 0 ) steps. In total, we haveO ( Nj+1) steps.

Case 3: nj−1<n j . That means that, for allα, we use the second case first and, starting fromi = n j + α nj−1, we use only the first case. Calculate n j + 1 n j 1 ,, n j + N j + 1 n j 1 forO ( nj−1+Nj+1) steps.

Each case costs at mostO ( N j ). Provided that the initial values are calculated, the cost of the rest calculation at stepj is dominated by the number of the coefficientsA ( i,nj−1,n j + α) for 0≤i n j + α and 1≤α Nj+1 and it isO( N j 2 ). Summing upO( N j 2 )X,O ( Nj−1), and O( N j 2 ) gives us the cost of each stepj, which is O( N j 2 ). Since 1<j m and calculation ofR ( i) for 0≤i N 1 = n takesO ( n), the algorithm doesO ( m n2) steps in total.

Appendix 2: Algorithm for counting fully ranked resolutions of a fully ranked constraint tree

The algorithm for counting resolutions of a constraint tree requires a few changes to count resolutions of a fully ranked constraint tree. Recall that, at each steps, we calculated the setS s which consists of the numbers of resolutions of intermediate trees withs leaves. To constructS s , for each element ofSs−1, we proposed a collection of tuples that may define intermediate trees withs leaves. To accept eligible tuples we defined two matrices m and M. The general scheme for counting resolutions of a fully ranked constraint tree is the same. However, we need to define matrices m and M for a fully ranked constraint tree and describe the procedure of proposing new tuples. The procedure becomes more technical because of additional ranking.

We will calculateSr(n,r) for the corresponding fully ranked constraint tree T. Let

a i , j = | { x node i is a parent of node x and r ( x ) j } |

for 1≤i kandr ( i )≤ j l−1, i.e.,ai,j is the number of interior nodes that are children of nodei and have rank at mostj. Define two matrices m and M of sizek ×( l−1) such that

M i , j = N i , j + 1 and m i , j = 2 if a i , j = 0 , 1 if a i , j > 0 and M i , j > 0 , 0 otherwise.

As before, if we consider a horizontal line that is strictly between the horizontal line that passes through the nodes of rankj and the horizontal line that passes through the nodes of rank (j+1) then these matrices determine the minimal and maximal possible numbers of intersections of this line with branches of subtreei.

Let x= [xi,j] be a matrix of sizet × q, wheret k andq l, and t={1,…,t}. A tuple (x,r |t) is eligible if

  • x i , j x=ni,j for 1≤i t and 1≤j < q, and

  • m i , q 1 x q i M i , q 1

for 1≤i t.

Having an intermediate tree withs−1 leaves, we need to consider all the possible ways to transform this tree to a tree withs leaves. First we need to add a new leaf to some subtree. We give the highest rank to this leaf. After that we may add new time points below the last time point. If we add new time points, we should rerank the leaves with the highest rank such that the new tree has enough leaves at each time point. Let x= [xi,j] be a matrix of sizet × q fort k andq l. This matrix represents an intermediate tree withs−1 leaves. Suppose we add a new leaf to subtreei 0 , where 1≤i 0 t. Letq 0 be the number of time points in a new tree and, therefore,q q 0 r(t+1) (orq q 0 l ift=k). Define a functiona d d o n e((x,r| t ),i 0 , q 0 ) which returns a tuple ( x ,r | t ), where x =[ x i , j ] is a matrix of sizet×q 0 such that

  • x i 0 , q 0 = x i 0 , q j = q q 0 1 n i 0 , j +1

    ,

  • x i , q 0 = x i , q j = q q 0 1 n i , j

    forii 0 ,

  • x i , j = x i , j

    for 1≤it andj<q, and

  • x i , j = n i , j

    for 1≤it andqj<q 0 .

If this tuple is eligible then it represents a new intermediate tree withs leaves.

At each steps of the algorithm, for each tuple (x,r | t ) such thatFr(x,r | t ) Ss−1, we need to propose a collection of new tuples. The first part of this procedure is Algorithm 4. To stop the procedure, when we recognise that a new tree can not havex time points, we use a predicatee x t(x) which is true if we can extend the number of time points in a new tree tox. The number of time points in a new tree can not always be extended to any arbitrary number because there may not be enough leaves of the highest rank to rerank them in such a way that there will beni,j leaves of each new rankj. Soe x t(q) is always true because the tree to which we add a new leaf already hasq time points.

Algorithm 4 Working with a tuple (x,r| t ) such thatFr(x,r | t )Ss−1(part 1)

Ift < k then the proposing procedure contains the second part. In this case, we also can increase the number of leaves in an intermediate tree by adding a new subtree. That means that one of the leaves of the tree becomes a parent to two new leaves. Again we may add new time points. Letq 0 be the number of time points in a new tree and, therefore,r(t+1)<q 0 r(t+2) (orr(t+1)<qx0 l whent+1=k). We define another functiona d d c o n s t r a i n t((x,r| t ),q 0 ) which returns a tuple ( x ,r | t + 1 ), where x =[ x i , j ] is a matrix of size (t+1)×q 0 such that

  • x t + 1 , q 0 =2 j = r ( t + 1 ) + 1 q 0 1 n t + 1 , j

    ,

  • x t + 1 , j = n t + 1 , j

    for 1≤j<q 0 ,

  • x i , q 0 = x i , q j = q q 0 1 n i , j

    for 1≤i t,

  • x i , j = x i , j

    for 1≤i t andj < q, and

  • x i , j = n i , j

    for 1≤i t andq j < q 0 .

This tuple defines a new tree if it is eligible. The second part of the procedure is Algorithm 5. The values ofq 0 ande x t(q 0 ) are as after running Algorithm 4.

Algorithm 5 Working with a tuple (x,r | t ) such thatFr(x,r | t )S s −1 (part 2)

Finally, as before, the main algorithm calculates setsS s recursively for 0≤s n, wheren is the number of leaves in a constraint tree T, and a unique element ofS n isSr(n,r).

Letm be a number of sampling points, i.e.,m=lk.

Proposition 2

Ifk is fixed then the algorithm does at mostO(m) steps for each eligible tuple.

Proof

Let (x r | t ) be an eligible tuple. First, having all the necessary summands for equations (9) and (10), we need to calculateFr(x,r | t ). This takes at mostO ( t ×[ r(t+1)−r(t)]) steps. Since we assumek is a constant and sincetk andr ( t+1)−r(t)≤m, it isO(m). Second, we need to perform the procedure of proposing new tuples for this tuple. The most expensive part here is calculation of the last columns of matrices x because it involves calculation of the sums j = q q 0 1 n i , j . Note that, storing the values of x i , q j = q q 0 1 n i , j for 1≤i t at each stepq 0 , we can optimise these calculations. Then it takesO(t×(q r ( t+2)) steps and it isO ( m) again. □

Proposition 3

If k is fixed then the algorithm has O(mnk) time complexity.

Proof

Leti x be such a number thatr(i x )≤x < r(i x +1). Then the number of eligible tuples is

j = 1 l 1 i = 1 i j ( M i , j m i , j + 1 )

Note that

M i , j M i , j + 1 N i , j + 1 N i , r ( i ) + 1 + 1 = b i + 1

2whereb 1 +…+ b k = n + k−1. Then

j = 1 l 1 i = 1 i j ( M i , j M i , j + 1 ) < ( l 1 ) ( b 1 + 1 ) ( b k + 1 ) < ( l 1 ) ( n k ) k = O ( m n k )

From this and Proposition 2, it follows that the algorithm does at mostO(m2nk) steps. □

References

  1. Yang Z, Rannala B: Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol. 1997, 14 (7): 717-724. 10.1093/oxfordjournals.molbev.a025811.

    Article  CAS  PubMed  Google Scholar 

  2. Yang Z, Rannala B: Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Mol Biol Evol. 2005, 23: 212-226. 10.1093/molbev/msj024.

    Article  PubMed  Google Scholar 

  3. Heled J, Drummond A: Tree priors for relaxed phylogenetics and divergence time estimation. Syst Biol. 2012, 61: 138-149. 10.1093/sysbio/syr087.

    Article  PubMed Central  PubMed  Google Scholar 

  4. Rodrigo AG, Felsenstein J: The Evolution of HIV. 1999, Baltimore: Johns Hopkins Univ Press

    Google Scholar 

  5. Drummond A, Pybus O, Rambaut A, Forsberg R, Rodrigo A: Measurably evolving populations. Trends Ecol Evol. 2003, 18: 481-488. 10.1016/S0169-5347(03)00216-7.

    Article  Google Scholar 

  6. Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates

    Google Scholar 

  7. Murtagh F: Counting dendograms: a survey. Discrete Appl Math. 1984, 7: 191-199. 10.1016/0166-218X(84)90066-0.

    Article  Google Scholar 

  8. Gordon AD: A review of hierarchical classification. J R Stat Soc A. 1987, 150 (2): 119-137. 10.2307/2981629.

    Article  Google Scholar 

  9. Semple C, Steel M: Phylogenetics. 2003, New York: Oxford University Press

    Google Scholar 

  10. Griffiths RC: Counting genealogical trees. J Math Biol. 1987, 25: 423-431. 10.1007/BF00277166.

    Article  CAS  PubMed  Google Scholar 

  11. Felsenstein J: The number of evolutionary trees. Syst Zool. 1978, 23: 27-33.

    Article  Google Scholar 

  12. Stadler T: Sampling‐through‐time in birth‐death trees. J Theor Biol. 2010, 267 (3): 396-404. 10.1016/j.jtbi.2010.09.010.

    Article  PubMed  Google Scholar 

  13. Stadler T, Kouyos RD, von Wyl, Yerly S, Böni J, Bürgisser P, Klimkait T, Joos B, Rieder P, Xie D, Günthard HF, Drummond A, Bonhoeffer S, the Swiss HIV Cohort Study: Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 2012, 29: 347-357. 10.1093/molbev/msr217.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

AJD was partially funded by a Rutherford Discovery Fellowship from the Royal Society of New Zealand.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Alexandra Gavryushkina.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The authors equally contributed to conceive the work. AG drafted the manuscript and DW and AJD revise it. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Gavryushkina, A., Welch, D. & Drummond, A.J. Recursive algorithms for phylogenetic tree counting. Algorithms Mol Biol 8, 26 (2013). https://doi.org/10.1186/1748-7188-8-26

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1748-7188-8-26

Keywords