
Research

Open

 Published:
Polynomial algorithms for the Maximal Pairing Problem: efficient phylogenetic targeting on arbitrary trees
Algorithms for Molecular Biologyvolume 5, Article number: 25 (2010)
Abstract
Background
The Maximal Pairing Problem (MPP) is the prototype of a class of combinatorial optimization problems that are of considerable interest in bioinformatics: Given an arbitrary phylogenetic tree T and weights ω_{ xy }for the paths between any two pairs of leaves (x, y), what is the collection of edgedisjoint paths between pairs of leaves that maximizes the total weight? Special cases of the MPP for binary trees and equal weights have been described previously; algorithms to solve the general MPP are still missing, however.
Results
We describe a relatively simple dynamic programming algorithm for the special case of binary trees. We then show that the general case of multifurcating trees can be treated by interleaving solutions to certain auxiliary Maximum Weighted Matching problems with an extension of this dynamic programming approach, resulting in an overall polynomialtime solution of complexity (n^{4} log n) w.r.t. the number n of leaves. The source code of a C implementation can be obtained under the GNU Public License from http://www.bioinf.unileipzig.de/Software/Targeting. For binary trees, we furthermore discuss several constrained variants of the MPP as well as a partition function approach to the probabilistic version of the MPP.
Conclusions
The algorithms introduced here make it possible to solve the MPP also for large trees with highdegree vertices. This has practical relevance in the field of comparative phylogenetics and, for example, in the context of phylogenetic targeting, i.e., data collection with resource limitations.
Background
Comparisons among species are fundamental to elucidate evolutionary history. In evolutionary biology, for example, they can be used to detect character associations [1–3]. In this context, it is important to use statistically independent comparisons, i.e., any two comparisons must have disjoint evolutionary histories (phylogenetic independence). The Maximal Pairing Problem (MPP) is the prototype of a class of combinatorial optimization problems that models this situation: Given an arbitrary phylogenetic tree T and weights ω_{ xy }for the paths between any two pairs of leaves (x, y) (representing a particular comparison), what is the collection of pairs of leaves with maximum total weight so that the connecting paths do not intersect in edges?
Algorithms for special cases of the MPP that are restricted to binary trees and equal weights (which thus simply maximizes the number of pairs) have been described, but not implemented [2]. Since different pairs of taxa may contribute different amounts of information depending on various factors (e.g., their phylogenetic distance or the difference of particular character states), the weighted version is of considerable practical interest. A particular question of this type is addressed by phylogenetic targeting, where one seeks to optimize the choice of species for which (usually expensive and timeconsuming) data should be collected [4]. Phylogenetic targeting boils down to two separate tasks: (1) estimation of the weight ω_{ xy }that measures the benefit or our amount of information contributed by including the comparison of species x with species y and (2) the identification of an optimal collection of pairs of species such that they represent independent measurements, i.e., the solution of the corresponding MPP. To date, the only publicly available software package for phylogenetic targeting [5] can handle multifurcating trees; however, the implementation uses a brute force enumeration of subsets of children and hence scales exponentially in the maximal degree.
As a consequence of the everincreasing amount of available sequence data, phylogenetic trees of interest continue to increase in size, and large trees with hundreds or even thousands of vertices are not an exception any more [6–9]. Most large phylogenies contain a substantial number of multifurcations that represent uncertainties in the actual phylogenetic relationships. It appears worthwhile, therefore, to extend previous approaches to efficiently solve the MPP for multifurcating trees and arbitrary weights.
Algorithms
Definitions and Preliminaries
Let T(V, E) be a rooted (unordered) tree with a vertex set V = L ∪ J (where L are the leaves of T, J its interior vertices, L the number of leaves, and J the number of interior vertices) and an edge set E = V × V.
Every vertex x, with the exception of the root r, has a unique father, fa(x), which is the neighbor of x closest to the root. We set fa(r) = ∅. Note that, given an unrooted tree without vertices with no father, we can obtain a rooted tree by subdividing an arbitrary edge with r. Furthermore, for each u ∈ J, let chd(u) be the set of children of v (i.e., its descendants). Obviously, y ∈chd(u) if and only if fa(y) = u and chd(u) = ∅ if and only if v ∈ L. We write T[v] for the subtree rooted at v. Furthermore, we assume that chd(u) ≠ 1 throughout this contribution. A tree is binary if chd(u) = 2 for all v ∈ J, and multifurcating if chd(u) > 2 holds for some interior vertices. Finally, let T[v, C] be the subtree of T rooted at an interior vertex v ∈ J, but with only a subset C of its children. All subtrees T[v] with v ∈ chd( v)\C are thus excluded from T[v, C].
For the purpose of this contribution, we interpret a path π in T as a sequence {e_{1},...,e_{ l }} of edges e_{ i }∈ E such that e_{ i }= e_{ j }implies i = j and e_{ i }∩ e_{i+1}= {x_{ i }} are single vertices for all 1 ≤ i < l. The vertices x_{ 0 }∈e_{ 1 }and x_{ l }∈e_{ l }are the endpoints of π. For two vertices x, y ∈ V, we denote the unique path with endpoints x and y by π_{ xy }. In the following, we will frequently be concerned with paths connecting an interior vertex u ∈ J with a leaf x ∈ L. This path contains exactly one child of u, which we denote by u_{ x }(u, x). In the following, the array n(u, x) will be used to allow efficient navigation in T.
A pathsystem ϒ on T is a set of paths π such that

1.
If π = π_{ xy }∈ ϒ, then x, y ∈ L and x ≠ y, i.e., every path connects two distinct leaves.

2.
If π' ≠ π'', then π' ∩ π'' = ∅, i.e., any two paths in ϒ are edgedisjoint.
Note that two paths in ϒ have at most one vertex in common (otherwise they would also share the subpath, and therefore edges, between two common vertices). In binary trees, two edgedisjoint paths are also vertexdisjoint, since two edgedisjoint paths can only run through an interior vertex u with chd(u) ≥ 3 (see Fig. 1). Two edgedisjoint paths can share a vertex u in two distinct situations: (1) if both paths have u as the last common ancestor of their respective leaves, u must have at least four children, (2) if u is the last common ancestor for one path, while the other path also includes an ancestor of u, three children of u are sufficient. These two situations will also lead to distinct cases in the algorithms that are presented next.
Furthermore, let ω_{ xy }: L × L → ℝ be an arbitrary weight function on pairs of leaves of T. We define the weight of a pathsystem ϒ as
A pathsystem ϒ that maximizes ω(ϒ), i.e., a solution of the MPP, will in the following be called optimal pathsystem. It conceptually corresponds to Maddison's "maximal pairing" [2], although we describe here a more general problem (see Background and Variants). In the following sections, our main objective is to compute optimal pathsystems.
The Maximal Pairing Problem for binary trees
Forward recursion
In this section we reconsider the approach of [4] for the special case of binary trees. This subsumes also Maddison's [2] discussion of the special unweighted case (see section Variants). We develop the dynamic programming solution for this class of MPP using a presentation that readily leads itself to the desired generalization to multifurcating trees.
For a given interior vertex u ∈ J we use the abbreviation C_{ x }= C_{ x }(u) = chd(u)\u_{ x }for the set of children of u that are not contained in the path that connects u with the leaf x. Since T is binary by assumption in this subsection, C_{ x }contains a unique vertex .
We will need two arrays (S, R) to store optimal solutions of partial problems. For each u ∈ V, let S_{ u }be the score of an optimal pathsystem on the subtree T[u]. For each u ∈ V and leaf x ∈ T[u], we furthermore define R_{ ux }as the score of an optimal pathsystem on T [u] that is edgedisjoint with the path π_{ ux }. R_{ ux }can be decomposed as follows:
For completeness, we set S_{ x }= R_{ xx }= 0 for all leaves x ∈ L.
An optimal pathsystem on T [u] either consists of optimal pathsystems on each of the two trees T [v] and T[w] rooted at the two children v, w ∈ chd(u), or it contains a path π_{ xy }with endpoints x ∈ T[v] and y ∈T[w]. Thus, S_{ u }can be calculated as follows:
Recursion (3) can then be evaluated from the leaves towards the root.
In order to facilitate the backtracing part of the algorithm, it is convenient to introduce an auxiliary variable F_{ u }. If an optimal score in eq.(3) is obtained by the second alternative, the pair (x, y) that led to the highest score is recorded in F_{ u }; otherwise, we set F_{ u }= ∅.
Backtracing
A computed optimal pathsystem ϒ_{max} on T = T [r] from the forward recursions can be reconstructed by backtracing. For binary trees, this is straightforward. We start at the root r. In the general set, at an interior vertex u with v, w ∈ chd(u), we first check whether F_{ u }= ∅. If this is the case, all paths π_{ xy }∈ ϒ_{max} are contained within the subtrees T[v] and T[w], and we continue to backtrace in both T[v] and T[w]. If F_{ u }= (x, y), then π_{ xy }is added to ϒ_{max}, and we need to backtrace an optimal pathsystem for each of the subtrees "hanging off" π_{ xy }. In other words, we need optimal pathsystems for the subtrees rooted at the vertices and for u ∈ π_{ xy }. These can be obtained recursively by following the decompositions of R_{ vx }and R_{ wy }, respectively, given in eq.(2).
Time and Space complexity
All entries S_{ u }for interior vertices u can be computed in (n^{3}) time, because a total of n(n  1) ∈ (n^{2}) pairs of leaves have to be considered in eq.(3) and computation of each S_{ u }entry takes at most (n) time. Since we need to store the quadratic arrays R_{ ux }and n(u, x) as well as the linear arrays S_{ u }and F_{u}, we need (n^{2}) memory.
The Maximal Pairing Problem for multifurcating trees
Forward recursion
In trees with multifurcations, for a pathsystem ϒ, more than one path can run through each vertex m ∈ J with chd(m) > 2 without violating phylogenetic independence. In addition to an optimal score S_{ u }, we also define an optimal score Q_{ ux }of all pathsystems on T[u]\T[u_{ x }], i.e., of all pathsystems that avoid not only the path π_{ ux }but the entire subtree T[u_{ x }], where u_{ x }is as usual the child of u along π_{ ux }. We therefore have
The computation of S_{ u }and Q_{ ux }are analogous problems. In general, consider an (interior) vertex u ∈ J and a subset C ⊆ chd(u) of children of u. Our task is to compute an optimal pathsystem on the subtree T[u, C] of T. We first observe that any pathsystem on T[u, C] contains 0 ≤ k ≤ ⌊C/2⌋ paths π_{ k }through u. Each of these paths runs through exactly two distinct children and of u. For fixed and , the path ends in leaves and (Fig. 1). The best possible score contribution for the path π_{ x'x'' }is
and the best possible score for a particular pair of children v', v'' ∈ C is therefore
For the purpose of backtracing, it will be convenient to record the path π_{ xy }, or rather its pair of end points (x, y), that maximized in eq.(6) in an auxiliary variable F_{ v',v'' }.
Since there are k paths through u covering 2k of the C subtrees, there are C  2k children v_{ l }of u, with 1 ≤ l ≤ C  2k, each of which contributes to an optimal pathsystem with a subpathsystem that is contained entirely within the subtree T[v_{ l }]. Since these contributions are independent of each other, they are obtained by solving the MPP on T[v_{ l }], i.e., their contribution to the total score of an optimum pathsystem is S_{ vl }.
For each subtree T[u, C] we therefore face the problem of determining the optimal combination of pairs and isolated children. This task can be reformulated as a weighted matching problem on an auxiliary graph Γ(C) whose vertex set consists of two copies of the elements of C, denoted v and v*. Within one copy of C, there is an edge between any two elements. The remaining C edges of Γ(C) connect each v with its copy v*. The associated edge weights are ω_{ v',v'' }= and ω_{v,v*}= S_{ v }, respectively. An example is shown in Fig. 2.
Clearly, an optimal path of the form x',...,v', u, v'',...,x'' is represented by the edge (v', v'') of Γ(C), while a selfcontained subtree T[v] is represented by an edge of the form (v, v*). It remains to show that every maximum matching of the auxiliary graph Γ(C) corresponds to a legal conformation of paths, i.e., we have to demonstrate that in a maximum matching ℳ, each vertex v ∈ C is contained in an edge. First, note that v* covered by an edge of ℳ if and only if (v, v*) ∈ ℳ. Suppose v is not covered in ℳ. Since ω_{ v,v* }is nonnegative, we can exclude matchings that do not cover all edges of C from the solution set. We can thus compute the entries of S_{ u }and Q_{ ux }, respectively, in polynomial time by solving maximum weighted matching problems with nonnegative weights. Introducing the symbol MWM(Γ) for the maximum weight of a matching on the auxiliary graph Γ, we can write this as
Here we make use of the fact that the weight of a matching equals the sum of the weights of the pathsystems that correspond to the edges of the auxiliary graphs. In order to facilitate backtracing, we keep tabulated not only the weights but also the corresponding maximum matchings for each Γ(chd(u)) and Γ(chd(u)\{u_{ x }})).
Backtracing
Backtracing for multifurcating trees proceeds in analogy to the binary case. Again we start from the root towards the leaves, treating each interior vertex u. If chd(u) = 2, see the backtracing for the binary case. If chd(u) > 2, we first need the solution ℳ of the MWM for chd(u). For each edge (v, v*) ∈ ℳ, v is called recursively to determine its optimal pathsystem. Each edge (v', v'') ∈ ℳ, however, represents a path π_{ xy }that belongs to an optimal pathsystem. Each of these paths π_{ xy }maximizes for a particular pair of children v', v'' ∈ chd(u) and therefore has been stored in F_{ v'v'' }during the forward recursion. Thus, each of these paths π_{ xy }can be added to the optimal pathsystem.
As in the binary case, it remains to add the solutions from an optimal pathsystems from the subtrees that are not on the path from x to v' and y to v″, respectively, for each particular edge (v', v'') ∈ ℳ. This can be done as follows. According to eqns.(2) and (4), R_{ v'x }can be decomposed into and either Q_{ v'x }or . If chd(v') = 2, the child node that is not on the path from v' to x is called recursively to obtain an optimal pathsystem in T[k]. If chd(v') > 2, however, the solution of the MWM for Q_{ v'x }is needed to determine an optimal pathsystem on the subtree , because multiple paths may go through V'. can then be further decomposed until R_{ xx }is reached. The same procedure is employed for R_{ v''y }.
Time and Space complexity
A maximum weighted matching on arbitrary graphs with V vertices and E edges can be computed in (VE log E) time and (E) space by Gabow's classical algorithm [10] or one of several more recent alternatives [11, 12]. In our setting, E ∈ (chd(u)^{2}), hence the total memory complexity of our dynamic programming algorithm is (n^{2}).
All entries for (the edge weights for the matching problems) can be computed in (n^{3}) time, because a total of (n  1) ∈ (n^{2}) pairs of leaves have to be considered in eq.(6) and computation of each entry takes at most (n) time. The effort for one of the (chd(u)) maximum weighted matching problems for a given interior vertex u with more than two children is bounded by
(chd(u)^{3}log(chd(u))^{2}). The total effort for all MWMs is therefore bounded by
which dominates the overall time complexity of the algorithm (see Appendix for a derivation).
As in the binary case, (n^{2}) space is necessary and sufficient to store the arrays R and S. Furthermore, (n^{2}) space is needed to save the array Q and the endpoints (x, y) of the path π_{ xy }that maximized each Q entry. The latter is needed for the backtracing. In addition, we keep the quadratic array n(u, x) to allow efficient navigation in T. For each interior vertex u with chd(u) > 2, chd(u) + 1 different maximal matchings have to be stored: one that corresponds to S_{ u }and chd(u) that correspond to Q_{ ux }. Each of these solutions requires (chd(u)) space. The total space complexity of all MWM solutions is therefore ∑_{ u }chd(u)^{2} ∈ (n^{2}) (see Appendix).
Algorithmic variants
Several variants and special cases of the general MPP algorithm are readily derived for related problems. In the following, we briefly touch upon some of them.
Special weight functions
It is worth noting that finding a pathsystem that simply maximizes the number of pairs, as presented in [2] and applied in [13], for example, constitutes a special case of the MPP with unit weights. (Of course the same result is obtained by setting ω_{ xy }to any fixed positive weight.) This case may be of practical use under certain circumstances, as it maximizes the number of independent measurements, thus improving power of subsequent statistical tests. Specifically, this weight function selects a pathsystem with pairs. In order to maximize the number of edges that are covered by an optimal pathsystem, we simply set ω_{ xy }= d(x, y), where d(x, y) is the graphtheoretic distance, i.e., we interpret the edge lengths in the tree as unity. Alternatively, instead of assigning weights for pairs of leaves directly, edges e ∈ E can be weighted, and the weight for a particular pair of leaves (x, y) can then be simply defined as .
Fixed number of paths
A variant of practical interest is to limit an optimal pathsystem to κ leafpairs. This may be relevant in a phylogenetic targeting setting, for example, in cases where resources are limiting data acquisition efforts to a small number taxa so that it pays to make every effort to choose them optimally (see also [4]). Typically, κ will be small in this setting.
For binary trees, this variant can be implemented by conditioning the matrices R and S to a given number of paths. Eq.(2) thus becomes
for a given number k ≤ k in the partial solutions. If an optimal pathsystem on T[u] is composed of optimal pathsystems on the two trees rooted at its children v and w, respectively, then the k paths are arbitrarily contained within T [v] and T [w]. Thus, k + 1 different cases have to be considered, and the case with the highest score has to be identified. This yields to the following extension of eq.(3) for S_{ u,k }:
We set S_{ x }= R_{ xx,l }= R_{ux,0}= 0 for all x ∈ L, u ∈ J, and l ∈ {0, k}. The latter condition ensures that if no path can be selected anymore in a particular subtree, its score must be 0.
As mentioned above, however, eq.(9) only holds for binary trees. For multifurcating trees, the auxiliary maximum weighted matching problems are replaced by the task of finding matchings that maximize the weight for a fixed number k of edges. We are, however, not aware that this variant of matching problems has been studied in detail so far. For small κ, it could of course be solved by brute force enumeration.
Selecting paths or taxa in addition to already selected paths or taxa
In some applications it may be the case that a subset of taxa or paths is already given, e.g. because the corresponding data have already been acquired in the past. The question then becomes how additional resources should be allocated.
In the simpler case, we are given a partial pathsystem ∏. It then suffices to remove or mark the corresponding leaves from T (to ensure that they are not selected again) and to set the weight of all paths that have edges in common with ∏ to  ∞ to enforce independence from the prescribed pairs.
The situation is less simple if only the taxa are given and the pairs are not prescribed. Here, the goal is to find an optimal pathsystem that includes all z ∈ Z, where Z ⊂ L denotes the taxa that are required to appear in the output. First, we note that such a solution not necessarily exists, e.g. if Z = L and L is odd. As a simple example, consider a binary tree with three leaves. In that case, only one path and thus two leaves can be selected. This constraint also holds for the subtree rooted at any interior vertex u and the z ∈ Z in T[u], i.e., partial solutions of the MPP (see below).
For binary trees, this variant can be implemented by conditioning the matrices R and S to a subset of all possible paths and leaves. This is achieved by setting the score to ∞ for a particular interior vertex if one of the preconditions cannot be met in eqns.(2) and (3). For example, if two leaves x, y ∈ Z have the same father u, an optimal pathsystem of both T[u] and T must contain the path π_{ xy }, because otherwise, either x or y would not belong to the optimal pathsystem due to the requirement of independence. Similarly, if a particular path π_{ xy }in the second alternative achieves the highest score in eq.(3), π_{ xy }must not be selected if this conflicts with the possibility to select other prescribed leaves z ∈ Z (Fig. 3).
To derive the recursions for this variant, let Z_{ u }denote the leaves z ∈ Z with z ∈ T[u] and let L be the leaves of T[u]. It is convenient to first check whether a solution exists for T[u]. If L = Z_{ u }and L is odd, S_{ u }= ∞ (i.e., no pathsystems exists that selects all z ∈ Z_{ u }in T[u]). Otherwise, an optimal pathsystem for T[u] with v, w ∈ chd(u) can be calculated as follows:
Furthermore,
and
for any x ∈ L. In analogy to the algorithm for the unconstrained MPP, we initialize the recursions by R_{ xx }= 0 for x ∈ L. This variant does not change the overall time and space complexity, and backtracing is also identical to the unconstrained version of the MPP.
For multifurcating trees, the maximum weighted matching problems are replaced by finding matchings that maximize the weight with the constraint that particular vertices must be included in the matching. Similarly to the variant introduced above, however, we are not aware that this particular problem has been studied in detail.
Probabilistic version
Sometimes, not only an optimal solution is of interest. As in the case of sequence alignments [14] or biopolymer structures [15], one may analyze the entire ensemble of solutions. Both for physical systems such as RNA, and for alignments with a logodds based scoring system, one can show that individual configurations ϒ with score S(ϒ), in our case pathsystems, contribute to the ensemble proportional its Boltzmann weight exp(βS(ϒ)), where the "inverse temperature" β defines a natural scale that is implicitly given by the scoring or energy model. In the case of physical systems β = 1/kT is linked to the ambient temperature T; for logodds scores, β = 1; if the scoring scheme is rescaled, as e.g. in the case of the Dayhoff matrix in protein alignments, then β is the inverse of this scaling factor. In cases where schemes without a probabilistic interpretation are used, suitable values of β have to be determined empirically. The larger β, the more an optimal pathsystem is emphasized in the ensemble. The partition function of the system is
The probability p_{ϒ} to pick ϒ from the ensemble is p_{ϒ} = exp(βS(ϒ))/Z.
The recursion in eq.(3) can be converted into a corresponding recursion for the partition functions Z_{ u }of pathsystems on subtrees T = T[u], because the decomposition of the scoremaximization is unambiguous in the sense that every conformation falls into exactly of the case of recursion. This is a generic feature of dynamic programming algorithms that is explored in some depth in the theory of Algebraic Dynamic Programming[16]. We find
with Z_{ u }= 1 if u ∈ L and
for k ∈ J. Note that these recursions are completely analogous to the score optimization in eqns.(2) and (3): the max operator is replaced by a sum, and addition of scores is replaced by multiplication of partition functions and Boltzmann factors.
In order to compute the probability P_{ xy }of a particular path π_{ xy }in the ensemble we have to add up the contributions p ϒ of all pathsystems that contain π_{ xy }
and compute the ratio P_{ xy }= Z(π_{ xy })/Z. The recursions for the restricted partition function Z(π_{ xy }) can be computed in analogy to eq.(14), but with two additional constraints. First, since π_{ xy }∈ ϒ by definition, the leaves i ∈ T[v] and j ∈ T[w] are constrained in eq.(14), because only paths π_{ ij }that are edgedisjoint with π xy can be considered. The recursion for the partition function of the last common ancestor node of x and y, denoted k, is also constrained, because π_{ xy }must go through k. Calculation of the partition functions for the children of k is therefore not needed to compute Z_{ k }. Thus,
In resource requirements, this backward recursion is comparable to the forward recursion in eq.(3): Z(π_{ xy }) and thus also P_{ xy }can be calculated in (n^{3}) time, because the number of leafpairs that have to be considered is still in (n^{2}). There is an additional factor (n) arising from the need to determine if the path π_{ xy }is edgedisjoint with another path, which however does not increase overall time complexity. Furthermore, (n^{2}) space is needed.
The computation of partition functions is a much more complex problem for trees with multifurcations since it would require us in particular to compute partition functions for the interleaved matching problems. These are not solved by means of dynamic programming; instead, they use a greedy algorithm acting on augmenting paths in the auxiliary graphs. These algorithms therefore do not appear to give rise to efficient partition function versions.
The TARGETING software
We implemented the polynomial algorithms for the MPP in the program TARGETING. The TARGETING program is written in C and uses Ed Rothberg's implementation [17] of the Gabow algorithm [10] to solve the Maximum Weight Matching Problem on general graphs. The software also provides an userfriendly interface and can solve the special weight variants as well. The source code can be obtained under the GNU Public License at http://www.bioinf.unileipzig.de/Software/Targeting/.
Concluding Remarks
In this contribution, we introduced a polynomial algorithm for the Maximal Pairing Problem (MPP) as well as some variants. The efficient generalization of the dynamic programming approach to trees with multifurcations is nontrivial, since a straightforward approach yields runtimes that are exponential in the maximal degree of the input tree. A polynomialtime algorithm can be constructed by interleaving the dynamic programming steps with the solution of auxiliary maximum weighted matching problems. This generalized algorithm for the MPP is implemented in the software package TARGETING, providing a userfriendly and efficient way to solve the MPP as well as some of its variants.
Future work in this area is likely to focus on developing algorithms for the variants of the MPP on multifurcating trees. In particular, the interleaving of dynamic programming for the MPP and the greedy approach for the auxiliary matching problems does not readily generalize to a partition function algorithm for multifurcating trees. The concept of unique matchings as discussed in [18] may be of relevance in this context.
The MPP solver presented here has applications in a broad variety of research areas. The method of phylogenetically independent comparisons relies on relatively few assumptions [1–3] and is frequently used in evolutionary biology, in particular in anthropology, comparative phylogenetics and, more generally, in studies that test evolutionary hypotheses [19–22]. As highlighted earlier, another application area lies in the design of studies in which tedious and expensive data collection is the limiting factor, so that a careful selection (phylogenetic targeting) becomes an economic necessity [5]. As noted in [13], alternative applications can be found in molecular phylogenetics, for example in the context of estimating relative frequencies of different nucleotide substitutions or the determination of the fraction of invariant sites in a particular gene.
Appendix
Pseudocode
Below, we include some pseudocode for the computation of an optimal pathsystem for an arbitrary tree T.
Require: ω_{ xy }≥ 0 ∀ pairs x, y ∈ L and precomputed array n(u, x) n(u, x) ∀ u ∈ J and x ∈ L
1: S_{ x }= R_{ xx }= Q_{x,x}= 0 ∀_{ x }∈ L
if chd(u) = 2 and if chd(u) > 2 ∀_{ u }∈ J and x ∈ L
2: for all u ∈ J in postorder tree traversal do
3: if chd(u) = 2 then
4: {v, ω} ← chd(u)
5: S_{u 1}= S_{ v }+ S_{ w }
6: for all paths π_{ xy }with x ∈ T[v] and y ∈ T[w] do
7: determine the path π_{ xy }that maximizes
8: S_{u 2}= ω_{ xy }+ R_{ v,x }+ R_{ w,y }
9: end for
10: if S_{u 2}>S_{u 1}then
11: F_{ u }= (x, y)
12: else
13: F_{ u }= ∅
14: end if
15: S_{ u }= max(S_{u 1}, S_{u 2})
16: else
17: for all pairs v', v'' ∈ chd(u) do
18: determine the path π_{ xy }that maximizes and set F_{ v'v'' }= (x, y) and ω_{ v',v'' }=
19: end for
20: for all pairs v, v* ∈ chd(u) do
21: ω_{ v,v* }= S_{ v }
22: end for
23: use computed edge weights for the following MWM problems
24: S_{ u }= MWM(Γ(chd(u)))
25: for i = 1 to chd(u) do
26: k ← ith child from u
27: compute δ = MWM(Γ(chd(u)\k))
28: for all leaves x ∈ T[k] do
29: Q_{ ux }= δ
30: end for
31: end for
32: tabulate solution of all MWM problems
33: end if
34: end for
The following algorithm summarizes backtracing. It starts at the root of the tree, but consider any vertex u:
1: if chd(u) = 0 then
2: return
3: end if
4: if chd(u) = 2 then
5: {v, w} ← chd(u)
6: if F_{ u }= ∅ then
7: call backtracing for T[v] (using the solution of the MWM for S_{ v }if chd(v) > 2)
8: repeat for T[w]
9: else
10: add F_{ u }= (x, y) = π_{ xy }to solution set
11: k = v {path from v to x}
12: while k ≠ x do
13: *
14: if chd(k) = 2 then
15: call backtracing for
16: else
17: call backtracing for T[k]\T[k_{ x }] (using the solution of the MWM for Q_{ kx })
18: end if
19: *
20: k = k_{ x }
21: end while
22: repeat for k = w {path from w to y}
23: end if
24: else
25: {v_{1}, v_{2},...,v_{ n }} ← chd(u)
26: take the appropriate tabulated MWM M
27: for all edges (v_{ i }, v_{ j })of M do
28: add = (x, y) = π_{ xy }to solution set
29: k = v_{ i }{path from v_{ i }to x}
30: while k ≠ x do
31: see case differentiation for the binary case (lines between *)
32: k = k_{ x }
33: end while
34: repeat for k = v_{ j }{path from v_{ j }to y
35: end for
36: for all edges (v_{ i }, v_{ l }*)of M do
37: call backtracing for T[v_{ i }] (using the solution of the MWM for if chd(v_{ i }) > 2)
38: end for
39: end if
A useful inequality
Consider an algorithm that operates on a rooted tree with n leaves requiring ((d_{ u })^{α}) time for each interior vertex with d_{ u }children. A naive estimate immediately yields the upper bound (n^{α+1}). Using the following lemma, however, we can obtain a better upper bound. Although Lemma 0.1 is probably known, we could not find a reference and hence include a proof for completeness.
Lemma 0.1 Let T be a phylogenetic tree with n leaves, u an interior vertex, d_{ u }= chd(u) the outdegree of u, and α > 1. Then
Proof Let h denote the total number of interior vertices. Each leaf or interior vertex except the root is a child of exactly one interior vertex. Thus ∑_{ u }d_{ u }= n + (h  1). For fixed h, we can employ the method of Lagrange multipliers to maximize the objective function subject to the constraint ∑_{ u }d_{ u }= n + (h  1) = c ≤ 2n  1. The Lagrange function is then
Setting the partial derivatives of Λ = 0 yields the following system of equations:
This system of equations is solved by for all i ∈ {1, h}. The above sum is maximal when T is a full dary tree for some d. The constraint can thus be expressed as h · d = n + h  1 and F = hd^{α} which is maximized by making d as large as possible (i.e., n) and hence minimizing the number h of interior vertices (i.e., 1). Hence, F(n)=n^{α}.
References
 1.
Felsenstein J: Phylogenies and the comparative method. Amer Nat. 1985, 125: 115. 10.1086/284325
 2.
Maddison WP: Testing Character Correlation using Pairwise Comparisons on a Phylogeny. J Theor Biol. 2000, 202: 195204. 10.1006/jtbi.1999.1050
 3.
Ackerly DD: Taxon sampling, correlated evolution, and independent contrasts. Evolution. 2000, 54: 14801492.
 4.
Arnold C, Nunn CL: Phylogenetic Targeting of Research Effort in Evolutionary Biology. American Naturalist. 2010, In review
 5.
Arnold C, Nunn CL: Phylogenetic Targeting Website. 2010,http://phylotargeting.fas.harvard.edu
 6.
BinindaEmonds OR, Cardillo M, Jones KE, MacPhee RD, Beck RM, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of presentday mammals. Nature. 2007, 446: 507512. 10.1038/nature05634
 7.
Burleigh JG, Hilu KW, Soltis DE: Inferring phylogenies with incomplete data sets: a 5gene, 567taxon analysis of angiosperms. BMC Evol Biol. 2009, 9: 61 10.1186/14712148961
 8.
Arnold C, Matthews LJ, Nunn CL: The 10kTrees Website: A New Online Resource for Primate Phylogeny. Evol Anthropology. 2010,
 9.
Sanderson MJ, Driskell AC: The challenge of constructing large phylogenetic trees. Trends Plant Sci. 2003, 8: 374379. 10.1016/S13601385(03)001651
 10.
Gabow H: Implementation of Algorithms for Maximum Matching on Nonbipartite Graphs. PhD thesis. 1973, Stanford University
 11.
Galil Z, Micali S, Harold G: An O(EV log V) algorithm for finding a maximal weighted matching in general graphs. SIAM J Computing. 1986, 15: 120130. 10.1137/0215009
 12.
Gabow HN, Tarjan RE: Faster scaling algorithms for general graph matching problems. J ACM. 1991, 38: 815853. 10.1145/115234.115366
 13.
Purvis A, Bromham L: Estimating the transition/transversion ratio from independent pairwise comparisons with an assumed phylogeny. J Mol Evol. 1997, 44: 112119. 10.1007/PL00006117
 14.
Mückstein U, Hofacker IL, Stadler PF: Stochastic Pairwise Alignments. Bioinformatics. 2002, S153S160: 18
 15.
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structures. Biopolymers. 1990, 29: 11051119. 10.1002/bip.360290621
 16.
Steffen P, Giegerich R: Versatile and declarative dynamic programming using pair algebras. BMC Bioinformatics. 2005, 6: 224 10.1186/147121056224
 17.
Rothenberg E: Solver for the Maximum Weight Matching Problem. 1999, http://elib.zib.de/pub/Packages/mathprog/matching/weighted/
 18.
Gabow HN, Kaplan H, Tarjan RE: Unique Maximum Matching Algorithms. J Algorithms. 2001, 40: 159183. 10.1006/jagm.2001.1167
 19.
Nunn CL, Baton RA: Comparative Methods for Studying Primate Adaptation and Allometry. Evol Anthropology. 2001, 10: 8198. 10.1002/evan.1019
 20.
Goodwin NB, Dulvy NK, Reynolds JD: Lifehistory correlates of the evolution of live bearing in fishes. Phil Trans R Soc B: Biol Sci. 2002, 357: 259267. 10.1098/rstb.2001.0958
 21.
Vinyard CJ, Wall CE, Williams SH, Hylander WL: Comparative functional analysis of skull morphology of treegouging primates. Am J Phys Anthropology. 2003, 120: 153170. 10.1002/ajpa.10129
 22.
Poff NLR, Olden JD, Vieira NKM, Finn DS, Simmons MP, Kondratieff BC: Functional trait niches of North American lotic insects: traitsbased ecological applications in light of phylogenetic relationships. J North Am Benthological Soc. 2006, 25: 730755. 10.1899/08873593(2006)025[0730:FTNONA]2.0.CO;2
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Both authors designed the study and developed the algorithms. CA implemented the TARGETING software. Both authors collaborated in writing the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Partition Function
 Binary Tree
 Dynamic Programming Algorithm
 Interior Vertex
 Binary Case