Polynomial algorithms for the Maximal Pairing Problem: efficient phylogenetic targeting on arbitrary trees

Arnold, Christian; Stadler, Peter F

doi:10.1186/1748-7188-5-25

Research
Open access
Published: 02 June 2010

Christian Arnold^1,2 &
Peter F Stadler^1,3,4,5,6

Algorithms for Molecular Biology volume 5, Article number: 25 (2010)

Abstract

Background

The Maximal Pairing Problem (MPP) is the prototype of a class of combinatorial optimization problems that are of considerable interest in bioinformatics: Given an arbitrary phylogenetic tree T and weights ω_xy for the paths between each pair of leaves (x, y), what is the collection of edge-disjoint paths between pairs of leaves that maximizes the total weight? Special cases of the MPP for binary trees and equal weights have been described previously; algorithms to solve the general MPP are still missing, however.

Results

We describe a relatively simple dynamic programming algorithm for the special case of binary trees. We then show that the general case of multifurcating trees can be treated by interleaving solutions to certain auxiliary Maximum Weighted Matching problems with an extension of this dynamic programming approach, resulting in an overall polynomial-time solution of complexity O(n⁴ log n) w.r.t. the number n of leaves. The source code of a C implementation can be obtained under the GNU Public License from http://www.bioinf.uni-leipzig.de/Software/Targeting. For binary trees, we furthermore discuss several constrained variants of the MPP as well as a partition function approach to the probabilistic version of the MPP.

Conclusions

The algorithms introduced here make it possible to solve the MPP also for large trees with high-degree vertices. This has practical relevance in the field of comparative phylogenetics and, for example, in the context of phylogenetic targeting, i.e., data collection with resource limitations.

Background

Comparisons among species are fundamental to elucidate evolutionary history. In evolutionary biology, for example, they can be used to detect character associations [1–3]. In this context, it is important to use statistically independent comparisons, i.e., any two comparisons must have disjoint evolutionary histories (phylogenetic independence). The Maximal Pairing Problem (MPP) is the prototype of a class of combinatorial optimization problems that models this situation: Given an arbitrary phylogenetic tree T and weights ω_xy for the paths between each pair of leaves (x, y) (each representing a particular comparison), what is the collection of pairs of leaves with maximum total weight such that the connecting paths do not share any edges?

Algorithms for special cases of the MPP that are restricted to binary trees and equal weights (which thus simply maximize the number of pairs) have been described, but not implemented [2]. Since different pairs of taxa may contribute different amounts of information depending on various factors (e.g., their phylogenetic distance or the difference of particular character states), the weighted version is of considerable practical interest. A particular question of this type is addressed by phylogenetic targeting, where one seeks to optimize the choice of species for which (usually expensive and time-consuming) data should be collected [4]. Phylogenetic targeting boils down to two separate tasks: (1) estimation of the weight ω_xy that measures the benefit or the amount of information contributed by including the comparison of species x with species y and (2) the identification of an optimal collection of pairs of species such that they represent independent measurements, i.e., the solution of the corresponding MPP. To date, the only publicly available software package for phylogenetic targeting [5] can handle multifurcating trees; however, the implementation uses a brute force enumeration of subsets of children and hence scales exponentially in the maximal degree.

As a consequence of the ever-increasing amount of available sequence data, phylogenetic trees of interest continue to increase in size, and large trees with hundreds or even thousands of vertices are not an exception any more [6–9]. Most large phylogenies contain a substantial number of multifurcations that represent uncertainties in the actual phylogenetic relationships. It appears worthwhile, therefore, to extend previous approaches to efficiently solve the MPP for multifurcating trees and arbitrary weights.

Algorithms

Definitions and Preliminaries

Let T(V, E) be a rooted (unordered) tree with a vertex set V = L ∪ J (where L are the leaves of T, J its interior vertices, |L| the number of leaves, and |J| the number of interior vertices) and an edge set E ⊆ V × V.

Every vertex x, with the exception of the root r, has a unique father, fa(x), which is the neighbor of x closest to the root. We set fa(r) = ∅. Note that, given an unrooted tree, we can obtain a rooted tree by subdividing an arbitrary edge with a new vertex r. Furthermore, for each u ∈ V, let chd(u) be the set of children of u (i.e., its direct descendants). Obviously, y ∈ chd(u) if and only if fa(y) = u, and chd(u) = ∅ if and only if u ∈ L. We write T[v] for the subtree rooted at v. Furthermore, we assume that |chd(u)| ≠ 1 throughout this contribution. A tree is binary if |chd(u)| = 2 for all u ∈ J, and multifurcating if |chd(u)| > 2 holds for some interior vertices. Finally, let T[v, C] be the subtree of T rooted at an interior vertex v ∈ J, but with only a subset C ⊆ chd(v) of its children. All subtrees T[w] with w ∈ chd(v)\C are thus excluded from T[v, C].

For the purpose of this contribution, we interpret a path π in T as a sequence {e₁, ..., e_l} of edges e_i ∈ E such that e_i = e_j implies i = j and e_i ∩ e_{i+1} = {x_i} is a single vertex for all 1 ≤ i < l. The vertices x₀ ∈ e₁ and x_l ∈ e_l are the endpoints of π. For two vertices x, y ∈ V, we denote the unique path with endpoints x and y by π_xy. In the following, we will frequently be concerned with paths connecting an interior vertex u ∈ J with a leaf x ∈ L. This path contains exactly one child of u, which we denote by u_x. In the following, the array n(u, x) = u_x will be used to allow efficient navigation in T.
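The navigation array can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' C implementation: the dictionary-based tree representation `chd`, the function names, and the example tree are assumptions made here.

```python
def leaves_below(chd, u):
    """Return the leaves of the subtree T[u] (chd maps each vertex to its list of children)."""
    if not chd[u]:
        return [u]
    return [x for v in chd[u] for x in leaves_below(chd, v)]


def navigation_array(chd, root):
    """Build n[(u, x)]: the child of u on the path from u to leaf x, for every leaf x in T[u]."""
    n = {}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in chd[u]:
            for x in leaves_below(chd, v):
                n[(u, x)] = v          # the path from u to x leaves u through child v
            stack.append(v)
    return n


# Hypothetical example tree: r has children u and c; u has children a and b.
chd = {"r": ["u", "c"], "u": ["a", "b"], "a": [], "b": [], "c": []}
n = navigation_array(chd, "r")
```

In this example `n[("r", "a")]` is `"u"` and `n[("u", "b")]` is `"b"`. The table has one entry per (vertex, leaf-below-it) pair, which is why it is quadratic in the worst case.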

A path-system ϒ on T is a set of paths π such that

1. If π = π_xy ∈ ϒ, then x, y ∈ L and x ≠ y, i.e., every path connects two distinct leaves.
2. If π' ≠ π'', then π' ∩ π'' = ∅, i.e., any two paths in ϒ are edge-disjoint.

Note that two paths in ϒ have at most one vertex in common (otherwise they would also share the sub-path, and therefore edges, between two common vertices). In binary trees, two edge-disjoint paths are also vertex-disjoint, since two edge-disjoint paths can only run through a common interior vertex u if |chd(u)| ≥ 3 (see Fig. 1). Two edge-disjoint paths can share a vertex u in two distinct situations: (1) if both paths have u as the last common ancestor of their respective leaves, u must have at least four children; (2) if u is the last common ancestor for one path, while the other path also includes an ancestor of u, three children of u are sufficient. These two situations will also lead to distinct cases in the algorithms that are presented next.

Furthermore, let ω_xy: L × L → ℝ be an arbitrary weight function on pairs of leaves of T. We define the weight of a path-system ϒ as

$$\omega(\Upsilon) = \sum_{\pi_{xy} \in \Upsilon} \omega_{xy} \tag{1}$$

A path-system ϒ that maximizes ω(ϒ), i.e., a solution of the MPP, will in the following be called an optimal path-system. It conceptually corresponds to Maddison's "maximal pairing" [2], although we describe here a more general problem (see Background and Variants). In the following sections, our main objective is to compute optimal path-systems.
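The two defining conditions of a path-system and its weight can be checked directly. The following Python sketch assumes a father map `fa` (with no entry for the root) and a weight table `omega` keyed by unordered leaf pairs; these names and the four-leaf star tree are illustrative choices, not part of the original work.

```python
def path_edges(fa, x, y):
    """Edges of the unique path between x and y, each edge written as (child, father)."""
    ancestors = set()
    v = x
    while v is not None:               # x and all of its ancestors up to the root
        ancestors.add(v)
        v = fa.get(v)
    edges = set()
    v = y
    while v not in ancestors:          # climb from y to the last common ancestor
        edges.add((v, fa[v]))
        v = fa[v]
    lca = v
    v = x
    while v != lca:                    # climb from x to the last common ancestor
        edges.add((v, fa[v]))
        v = fa[v]
    return edges


def is_path_system(fa, leaves, pairs):
    """Check conditions 1 and 2: distinct leaf endpoints and pairwise edge-disjoint paths."""
    used = set()
    for x, y in pairs:
        if x == y or x not in leaves or y not in leaves:
            return False
        edges = path_edges(fa, x, y)
        if used & edges:
            return False
        used |= edges
    return True


def weight(omega, pairs):
    """Total weight of a path-system: the sum of omega over its leaf pairs."""
    return sum(omega[frozenset(p)] for p in pairs)


# Hypothetical star tree: root r with the four leaves a, b, c, d.
fa = {"a": "r", "b": "r", "c": "r", "d": "r"}
leaves = {"a", "b", "c", "d"}
omega = {frozenset(("a", "b")): 2.0, frozenset(("c", "d")): 1.5}
ok = is_path_system(fa, leaves, [("a", "b"), ("c", "d")])       # share only the vertex r
bad = is_path_system(fa, leaves, [("a", "b"), ("a", "c")])      # share the edge (a, r)
total = weight(omega, [("a", "b"), ("c", "d")])
```

The star tree illustrates the first situation described above: the paths π_ab and π_cd are edge-disjoint but both pass through the root, which is possible only because the root has four children.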

The Maximal Pairing Problem for binary trees

Forward recursion

In this section we reconsider the approach of [4] for the special case of binary trees. This also subsumes Maddison's [2] discussion of the special unweighted case (see section Variants). We develop the dynamic programming solution for this class of MPP using a presentation that readily lends itself to the desired generalization to multifurcating trees.

For a given interior vertex u ∈ J we use the abbreviation C_x = C_x(u) = chd(u) \ {u_x} for the set of children of u that are not contained in the path that connects u with the leaf x. Since T is binary by assumption in this subsection, C_x contains a unique vertex ū_x.

We will need two arrays (S, R) to store optimal solutions of partial problems. For each u ∈ V, let S_u be the score of an optimal path-system on the subtree T[u]. For each u ∈ V and leaf x ∈ T[u], we furthermore define R_ux as the score of an optimal path-system on T[u] that is edge-disjoint with the path π_ux. R_ux can be decomposed as follows:

$$R_{ux} = S_{\bar{u}_x} + R_{u_x x}, \qquad \{\bar{u}_x\} = C_x(u) \tag{2}$$

For completeness, we set S_x = R_xx = 0 for all leaves x ∈ L.

An optimal path-system on T[u] either consists of optimal path-systems on each of the two trees T[v] and T[w] rooted at the two children v, w ∈ chd(u), or it contains a path π_xy with endpoints x ∈ T[v] and y ∈ T[w]. Thus, S_u can be calculated as follows:

$$S_u = \max\left\{ S_v + S_w,\ \max_{x \in T[v],\; y \in T[w]} \left( \omega_{xy} + R_{vx} + R_{wy} \right) \right\} \tag{3}$$

Recursion (3) can then be evaluated from the leaves towards the root.

In order to facilitate the backtracing part of the algorithm, it is convenient to introduce an auxiliary variable F_u. If an optimal score in eq. (3) is obtained by the second alternative, the pair (x, y) that led to the highest score is recorded in F_u; otherwise, we set F_u = ∅.
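The forward recursion for binary trees can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published C implementation: the tree is given as a child map `chd`, the weights as a function `omega(x, y)`, ties are resolved in favour of the first alternative (an arbitrary choice made here), and the example tree and weights are hypothetical.

```python
def forward(chd, root, omega):
    """Return the arrays (S, R, F) for a binary tree; F[u] is None when F_u is empty."""
    S, R, F, leaves = {}, {}, {}, {}

    def visit(u):
        if not chd[u]:                       # leaf: S_x = R_xx = 0
            S[u], R[(u, u)], leaves[u] = 0.0, 0.0, [u]
            return
        v, w = chd[u]                        # binary tree assumed
        visit(v)
        visit(w)
        leaves[u] = leaves[v] + leaves[w]
        best, arg = S[v] + S[w], None        # first alternative: no path through u
        for x in leaves[v]:                  # second alternative: a path x .. u .. y
            for y in leaves[w]:
                cand = omega(x, y) + R[(v, x)] + R[(w, y)]
                if cand > best:
                    best, arg = cand, (x, y)
        S[u], F[u] = best, arg
        for x in leaves[v]:                  # R_ux = S of the sibling subtree + R below
            R[(u, x)] = S[w] + R[(v, x)]
        for y in leaves[w]:
            R[(u, y)] = S[v] + R[(w, y)]

    visit(root)
    return S, R, F


# Hypothetical example: r -> (u, v), u -> (a, b), v -> (c, d); unlisted pairs weigh 0.
chd = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"],
       "a": [], "b": [], "c": [], "d": []}
W = {frozenset(("a", "b")): 1.0, frozenset(("c", "d")): 1.0, frozenset(("a", "c")): 5.0}


def omega(x, y):
    return W.get(frozenset((x, y)), 0.0)


S, R, F = forward(chd, "r", omega)
```

Here the heavy pair (a, c) wins at the root, so `S["r"]` is 5.0 and `F["r"]` is `("a", "c")`, even though the two cherries (a, b) and (c, d) are individually optimal within their subtrees. The recursive traversal is adequate for small trees; very deep trees would call for an explicit post-order instead.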

Backtracing

An optimal path-system ϒ_max on T = T[r] computed by the forward recursion can be reconstructed by backtracing. For binary trees, this is straightforward. We start at the root r. In the general step, at an interior vertex u with chd(u) = {v, w}, we first check whether F_u = ∅. If this is the case, all paths π_xy ∈ ϒ_max within T[u] are contained within the subtrees T[v] and T[w], and we continue to backtrace in both T[v] and T[w]. If F_u = (x, y), then π_xy is added to ϒ_max, and we need to backtrace an optimal path-system for each of the subtrees "hanging off" π_xy. In other words, we need optimal path-systems for the subtrees rooted at the unique vertices in C_x(u') and C_y(u') for the interior vertices u' ≠ u on π_xy. These can be obtained recursively by following the decompositions of R_vx and R_wy, respectively, given in eq. (2).
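The backtracing step can be sketched in Python as follows. The function takes the recorded array F as input, so it is independent of how the forward recursion was implemented; the child map `chd`, the use of `None` for F_u = ∅, and the hand-specified example values of F are assumptions made for illustration.

```python
def backtrace(chd, root, F):
    """Collect the leaf pairs of an optimal path-system from the recorded F (binary trees)."""
    leaves = {}

    def collect(u):                          # leaf set of every subtree T[u]
        if not chd[u]:
            leaves[u] = {u}
        else:
            leaves[u] = set().union(*(collect(v) for v in chd[u]))
        return leaves[u]

    collect(root)
    pairs = []
    todo = [root]                            # roots of subtrees still to be processed
    while todo:
        u = todo.pop()
        if not chd[u]:
            continue                         # a leaf contributes nothing
        if F.get(u) is None:                 # F_u empty: continue in both child subtrees
            todo.extend(chd[u])
            continue
        pairs.append(F[u])                   # the path between F[u] = (x, y) runs through u
        for z in F[u]:
            p = next(c for c in chd[u] if z in leaves[c])
            while p != z:                    # walk down from u towards the endpoint z
                on, off = chd[p] if z in leaves[chd[p][0]] else chd[p][::-1]
                todo.append(off)             # subtree hanging off the path
                p = on
    return pairs


# Hypothetical example: r -> (u, c), u -> (a, t), t -> (b, e).
chd = {"r": ["u", "c"], "u": ["a", "t"], "t": ["b", "e"],
       "a": [], "b": [], "c": [], "e": []}
# F as a forward recursion might have recorded it (values chosen by hand here).
F = {"r": ("a", "c"), "u": None, "t": ("b", "e")}
pairs = backtrace(chd, "r", F)
```

In this example the path between a and c passes through u, so F at u is never consulted; only the subtree rooted at t, which hangs off the path, is processed further, yielding the pairs (a, c) and (b, e).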

Time and Space complexity

All entries S_u for interior vertices u can be computed in O(n³) time, because a total of n(n - 1) ∈ O(n²) pairs of leaves have to be considered in eq. (3) and computation of each S_u entry takes at most O(n) time. Since we need to store the quadratic arrays R_ux and n(u, x) as well as the linear arrays S_u and F_u, we need O(n²) memory.