Terminology and basics
A phylogenetic tree can be represented as a tree T with leaves labeled by some set of organisms S. If each leaf label is unique, then the phylogenetic tree is singly-labeled. Unless noted otherwise, the phylogenetic trees we describe throughout this paper are singly-labeled and unrooted.
Each edge e in an unrooted, singly-labeled phylogenetic tree defines a bipartition \(\pi _e\) (also sometimes referred to as a split) on the set of leaf labels induced by the deletion of e from the tree, but not its endpoints. Each bipartition splits the leaf set into two non-empty disjoint parts, A and B, and is denoted by A|B. The set of bipartitions of a tree T is given by C(T) = {\(\pi _e\) : \(e \in E(T)\)}, where E(T) is the edge set for T. Tree \(T'\) is a refinement of T if T can be obtained from \(T'\) by contracting a set of edges in \(E(T')\). A tree T is fully resolved (i.e., binary) if there is no tree that refines T other than itself.
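The definition of \(C(T)\) can be illustrated with a minimal sketch. It assumes (this is not from the paper) that the unrooted tree is stored as an adjacency list keyed by node name and that leaves are exactly the degree-one nodes; the names `leaves_beyond` and `bipartitions` are hypothetical helpers, and the traversal is a simple quadratic one rather than an optimized routine.

```python
from typing import Dict, FrozenSet, List, Set

Tree = Dict[str, List[str]]               # adjacency list of an unrooted tree (assumed encoding)
Bipartition = FrozenSet[FrozenSet[str]]   # unordered pair {A, B}


def leaves_beyond(tree: Tree, start: str, blocked: str) -> FrozenSet[str]:
    """Leaves reachable from `start` without crossing the edge (start, blocked)."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        v = stack.pop()
        if len(tree[v]) == 1:          # degree-one nodes are the leaves
            found.add(v)
        for w in tree[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return frozenset(found)


def bipartitions(tree: Tree) -> Set[Bipartition]:
    """C(T): one bipartition A|B per edge, trivial (leaf) edges included."""
    all_leaves = frozenset(v for v in tree if len(tree[v]) == 1)
    result = set()
    for u in tree:
        for v in tree[u]:
            if u < v:                  # visit each undirected edge once
                a = leaves_beyond(tree, v, u)
                result.add(frozenset({a, all_leaves - a}))
    return result


# Quartet tree with internal edge x-y separating {a, b} from {c, d}.
quartet = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}
splits = bipartitions(quartet)
```

Here `splits` holds the four trivial bipartitions (one per leaf edge) together with the single non-trivial bipartition ab|cd.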
A set Y of bipartitions on some leaf set S is compatible if there exists an unrooted tree T leaf-labeled by S such that Y \(\subseteq\) C(T). A bipartition \(\pi\) of a set S is said to be compatible with a tree T with leaf set S if and only if there is a tree \(T'\) such that \(C(T') = C(T) \cup \{\pi \}\) (i.e., \(T'\) is a refinement of T that includes the bipartition \(\pi\)). Similarly, two trees on the same leaf set are said to be compatible if they share a common refinement. An important result on compatibility is that pairwise compatibility of a set of bipartitions over a leaf set ensures setwise compatibility [19, 20]; it then follows that two trees are compatible if and only if the union of their sets of bipartitions is compatible. Furthermore, by [21] (and see discussion in [22, 23]), a set \(\mathcal {C}\) of bipartitions is compatible if and only if there is a tree T such that \(C(T)=\mathcal {C}.\)
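The pairwise-implies-setwise result above suggests a simple (quadratic) compatibility test, sketched here under stated assumptions. The sketch relies on the standard characterization that two bipartitions A|B and C|D of the same leaf set are compatible exactly when at least one of the four intersections \(A\cap C\), \(A\cap D\), \(B\cap C\), \(B\cap D\) is empty; the function names and the tuple encoding of a bipartition are illustrative choices, not part of the paper.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Tuple

Bipartition = Tuple[FrozenSet[str], FrozenSet[str]]   # (A, B) stands for A|B


def split(left: str, right: str) -> Bipartition:
    """Build A|B from two strings of single-character leaf labels."""
    return (frozenset(left), frozenset(right))


def pair_compatible(p: Bipartition, q: Bipartition) -> bool:
    """True if at least one of the four side-by-side intersections is empty."""
    (a, b), (c, d) = p, q
    return any(not (x & y) for x in (a, b) for y in (c, d))


def set_compatible(biparts: Iterable[Bipartition]) -> bool:
    """Setwise compatibility, checked pairwise as justified by [19, 20]."""
    return all(pair_compatible(p, q) for p, q in combinations(list(biparts), 2))


nested = [split("ab", "cde"), split("abc", "de")]      # both fit one caterpillar tree
conflicting = [split("ab", "cd"), split("ac", "bd")]   # two different quartets
```

With these inputs, `set_compatible(nested)` holds while `set_compatible(conflicting)` does not, since no single quartet tree displays both ab|cd and ac|bd.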
The Robinson−Foulds (RF) distance [17] between two trees T and \(T'\) on the same set of leaves is defined as the minimum number of edge-contractions and refinements required to transform T into \(T'\) (where each such operation changes the number of edges in the tree by exactly one, so contracting a single edge or refining a polytomy to add a single edge). For singly-labeled trees, the RF distance equals the number of bipartitions present in only one tree (i.e., the symmetric difference). The normalized RF distance is the RF distance divided by \(2n-6\), where n is the number of leaves in each tree; this produces a value between 0 and 1 since the two trees can only disagree with respect to internal edges, and \(n-3\) is the maximum number of internal edges in an unrooted tree with n leaves.
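For singly-labeled trees, the symmetric-difference form of the RF distance can be sketched directly on bipartition sets. The encoding below (each bipartition as an unordered pair of frozensets, single-character leaf labels) and the helper names are assumptions made for illustration only.

```python
from typing import FrozenSet, Set

Bipartition = FrozenSet[FrozenSet[str]]   # unordered pair {A, B}


def split(left: str, right: str) -> Bipartition:
    """Build A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(left), frozenset(right)})


def rf_distance(c1: Set[Bipartition], c2: Set[Bipartition]) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    return len(c1 ^ c2)


def normalized_rf(c1: Set[Bipartition], c2: Set[Bipartition], n: int) -> float:
    """RF distance divided by 2n - 6; needs n >= 4 so the divisor is positive."""
    if n < 4:
        raise ValueError("normalization is undefined for fewer than 4 leaves")
    return rf_distance(c1, c2) / (2 * n - 6)


# Trivial bipartitions are shared by any two trees on leaves a..e and cancel out.
trivial = {split(x, "abcde".replace(x, "")) for x in "abcde"}
tree1 = trivial | {split("ab", "cde"), split("abc", "de")}
tree2 = trivial | {split("ab", "cde"), split("abd", "ce")}

distance = rf_distance(tree1, tree2)
fraction = normalized_rf(tree1, tree2, n=5)
```

The two trees share ab|cde and differ in one internal bipartition each, so the distance is 2 out of a maximum of \(2n-6 = 4\).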
Given a phylogenetic tree T on taxon set S, T restricted to \(R \subseteq S\) is the minimal subgraph of T connecting elements of R and suppressing nodes of degree two. We denote this as \(T|_R\). If T and \(T'\) are two trees with R as the intersection of their leaf sets, their shared edges are edges whose bipartitions restricted to R are in the set \(C(T|_R)\cap C(T'|_R)\). Correspondingly, their unique edges are edges whose bipartitions restricted to R are not in the set \(C(T|_R)\cap C(T'|_R)\). See Fig. 1 for a pictorial depiction of unique and shared edges.
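At the level of bipartitions, restriction can be sketched as follows, using the standard fact that \(C(T|_R)\) consists of the bipartitions of T intersected with R, keeping only those with both sides non-empty. The set-of-frozensets encoding and the function name `restrict` are illustrative assumptions rather than the paper's implementation.

```python
from typing import FrozenSet, Set

Bipartition = FrozenSet[FrozenSet[str]]   # unordered pair {A, B}


def split(left: str, right: str) -> Bipartition:
    """Build A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(left), frozenset(right)})


def restrict(biparts: Set[Bipartition], R: FrozenSet[str]) -> Set[Bipartition]:
    """Bipartitions of T|_R obtained by intersecting each side with R."""
    restricted = set()
    for p in biparts:
        a, b = tuple(p)
        ra, rb = a & R, b & R
        if ra and rb:                  # drop edges whose bipartition collapses on R
            restricted.add(frozenset({ra, rb}))
    return restricted


# Caterpillar tree on a..e with internal bipartitions ab|cde and abc|de.
T = {split(x, "abcde".replace(x, "")) for x in "abcde"}
T |= {split("ab", "cde"), split("abc", "de")}

T_R = restrict(T, frozenset("abde"))
```

Both internal edges of T restrict to the same bipartition ab|de, which mirrors how suppressing degree-two nodes merges edges when a leaf is removed; shared edges between two trees can then be read off from the intersection of their restricted sets.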
RF-optimal tree refinement and completion (RF-OTRC) problem
We now turn our attention to the optimization problem of interest to this paper. This section is limited to the context of singly-labeled trees; we postpone the extension to cases where the gene tree can have multiple copies of a species at the leaves, which are referred to as multi-labeled trees (i.e., MUL-trees [24]), until a later section.
If the trees t and T have the same set of taxa, then the RF-OTRC problem becomes the RF-optimal tree refinement (RF-OTR) problem, while if t is already binary but can be missing taxa, then the RF-OTRC problem becomes the RF-optimal tree completion (RF-OTC) problem. OCTAL, presented in [25], solves the RF-OTC problem in \(O(n^2)\) time, and an improved approach presented by Bansal [26] solves the RF-OTC problem in linear time. We refer to this faster approach as Bansal’s algorithm. In this paper we present an algorithm that solves the RF-OTR problem exactly in polynomial time and show that the combination of this algorithm with Bansal’s algorithm solves the RF-OTRC problem exactly in \(O(n^{1.5} \log n)\) time, where T has n leaves. We refer to the two steps together as Tree Refinement And CompleTION (TRACTION).
TRACTION algorithm
The input to TRACTION is a pair of unrooted, singly-labeled trees (t, T), where t is the estimated gene tree on set R of species and T is the binary reference tree on S, with \(R \subseteq S\). Note that we allow t to not be binary (e.g., if low support edges have already been collapsed) and to be missing species (i.e., \(R \subset S\) is possible).
- Step 1: Refine t so as to produce a binary tree \(t^*\) that maximizes shared bipartitions with T.
- Step 2: Add the missing species from T into \(t^*\), minimizing the RF distance.
Step 1: Greedy refinement of t
To compute \(t^*\), we first refine t by adding all bipartitions from \(T|_{R}\) that are compatible with t; this produces a unique tree \(t'\). If \(t'\) is not fully resolved, then there are multiple optimal solutions to the RF-OTR problem, as we will later prove. The algorithm selects one of these optimal solutions as follows. First, we add edges from t that were previously collapsed (if such edges are available). Next, we randomly refine the tree until we obtain a fully resolved refinement, \(t^*\). Note that if \(t'\) is not binary, then \(t^*\) is not unique. We now show that the first step of TRACTION solves the RF-OTR problem.
Theorem 1
Let T be an unrooted, singly-labeled tree on leaf set S, and let t be an unrooted, singly-labeled tree on leaf set \(R \subseteq S\). A fully resolved (i.e. binary) refinement of t minimizes the RF distance to \(T|_{R}\) if and only if it includes all compatible bipartitions from \(T|_{R}\).
Proof
Let \(C_0\) denote the set of bipartitions in \(T|_R\) that are compatible with t. By the theoretical properties of compatible bipartitions (see “Terminology and basics” section), this means the set \(C_0 \cup C(t)\) is a compatible set of bipartitions that defines a unique tree \(t'\) where \(C(t')=C_0 \cup C(t)\) (since the trees are singly-labeled).
We now prove that for any binary tree B refining t, B minimizes the RF distance to \(T|_R\) if and only if B refines \(t'\).
Consider a sequence of trees \(t=t_0, t_1, t_2, \ldots , t_k\), each on leaf set R, where \(t_i\) is obtained from \(t_{i-1}\) by adding one edge to \(t_{i-1}\), and thus adds one bipartition to \(C(t_{i-1})\). Let \(\delta _i=RF(t_{i},T|_R) - RF(t_{i-1},T|_R)\), so that \(\delta _i\) indicates the change in RF distance produced by adding a specific edge to \(t_{i-1}\) to get \(t_i\). Hence,
$$\begin{aligned} RF(t_i,T|_R) = RF(t_0,T|_R) + \sum _{j \le i} \delta _j. \end{aligned}$$
A new bipartition \(\pi _i\) added to \(C(t_{i-1})\) is in \(C(T|_R)\) if and only if \(\pi _i \in C_0\). If this is the case, then the RF distance will decrease by one (i.e., \(\delta _i =-1\)). Otherwise, \(\pi _i \not \in C_0\), and the RF distance to \(T|_R\) will increase by one (i.e., \(\delta _i =1\)).
Now suppose B is a binary refinement of t. We can partition the bipartitions in \(C(B){\backslash}C(t)\) into two sets, X and Y, where X contains the bipartitions in \(C_0\) and Y contains the bipartitions not in \(C_0\). By the argument just provided, it follows that \(RF(B,T|_R) = RF(t,T|_R) - |X| + |Y|\). Note that \(|X \cup Y|\) must be the same for all binary refinements of t, because all binary refinements of t have the same number of edges. Thus, \(RF(B,T|_R)\) is minimized when |X| is maximized, so B minimizes the RF distance to \(T|_R\) if and only if C(B) contains all the bipartitions in \(C_0\). In other words, \(RF(B,T|_R)\) is minimized if and only if B refines \(t'\). \(\square\)
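The counting step in this proof can be written out explicitly. As a worked restatement (assuming \(|R| \ge 3\), so that every binary unrooted tree on R has exactly \(2|R|-3\) edges), for any binary refinement B of t:

$$\begin{aligned} |X| + |Y|&= (2|R|-3) - |E(t)|, \\ RF(B,T|_R)&= RF(t,T|_R) + (2|R|-3) - |E(t)| - 2|X|. \end{aligned}$$

Only \(|X|\) depends on the choice of B, and \(|X| \le |C_0 {\backslash} C(t)|\) with equality exactly when \(C_0 \subseteq C(B)\), which is the condition stated in the theorem.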
Corollary 1
TRACTION finds an optimal solution to the RF-OTR problem.
Proof
Given input gene tree t and reference tree T on the same leaf set, TRACTION produces a tree \(t^*\) that refines t and contains every bipartition in T compatible with t; hence by Theorem 1, TRACTION solves the RF-OTR problem. \(\square\)
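The first part of Step 1 (building \(t'\) with \(C(t') = C_0 \cup C(t)\)) can be sketched on bipartition sets. This is a naive quadratic illustration, not the faster procedure used in the running time analysis; the encoding and the names `compatible` and `refine_with_reference` are assumptions made for the example. It uses the fact that a bipartition is compatible with t exactly when it is pairwise compatible with every bipartition in \(C(t)\).

```python
from typing import FrozenSet, Set

Bipartition = FrozenSet[FrozenSet[str]]   # unordered pair {A, B}


def split(left: str, right: str) -> Bipartition:
    """Build A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(left), frozenset(right)})


def compatible(p: Bipartition, q: Bipartition) -> bool:
    """Pairwise compatibility: some pair of sides has an empty intersection."""
    return any(not (x & y) for x in p for y in q)


def refine_with_reference(c_t: Set[Bipartition],
                          c_ref: Set[Bipartition]) -> Set[Bipartition]:
    """Return C(t') = C(t) united with C_0, the reference bipartitions compatible with t."""
    c0 = {p for p in c_ref if all(compatible(p, q) for q in c_t)}
    return c_t | c0


trivial = {split(x, "abcde".replace(x, "")) for x in "abcde"}
# Unresolved gene tree t with one internal edge ac|bde.
c_t = trivial | {split("ac", "bde")}
# Reference tree restricted to R, with internal edges ab|cde and abc|de.
c_ref = trivial | {split("ab", "cde"), split("abc", "de")}

c_t_prime = refine_with_reference(c_t, c_ref)
```

In this example abc|de is added (it is compatible with ac|bde) while ab|cde is rejected, so \(t'\) happens to be binary already and no random refinement is needed.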
Step 2: Adding in missing species
The second step of TRACTION can be performed using OCTAL or Bansal’s algorithm, each of which finds an optimal solution to the RF-OTC problem in polynomial time. Indeed, we show that any method that optimally solves the RF-OTC problem can be used as an intermediate step to solve the RF-OTRC problem.
To prove this, we first restate several prior theoretical results. In [25] we showed the minimum achievable RF distance between T and \(T'\) is given by:
$$\begin{aligned} RF(T,T')&= RF(T|_R,t) +2m \qquad (1) \end{aligned}$$
where m is the number of Type II superleaves in T relative to t, which we define:
Definition 1
Let T be a binary tree on leaf set S and t be a tree on leaf set \(R \subseteq S\). The superleaves of T with respect to t are defined as follows (see Fig. 1). The set of edges in T that are on a path between two leaves in R define the backbone; when this backbone is removed, the remainder of T breaks into pieces. The components of this graph that contain vertices from \(S \setminus R\) are the superleaves. Each superleaf is rooted at the node that was incident to one of the edges in the backbone, and is one of two types:
- Type I superleaves: the edge e in the backbone to which the superleaf was attached is a shared edge in \(T|_R\) and t
- Type II superleaves: the edge e in the backbone to which the superleaf was attached is a unique edge in \(T|_R\) and t
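The backbone and superleaf construction in Definition 1 can be sketched as below. The sketch only identifies the superleaves (as sets of leaves from \(S \setminus R\)); it does not classify them as Type I or Type II, which would additionally require comparing backbone edges with t. The adjacency-list encoding, the assumption that leaves are the degree-one nodes, and the function names are all illustrative, and the traversal is deliberately simple rather than efficient.

```python
from typing import Dict, FrozenSet, List, Set

Tree = Dict[str, List[str]]   # adjacency list of an unrooted tree (assumed encoding)


def leaves_beyond(tree: Tree, start: str, blocked: str) -> Set[str]:
    """Leaves reachable from `start` without crossing the edge (start, blocked)."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        v = stack.pop()
        if len(tree[v]) == 1:
            found.add(v)
        for w in tree[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return found


def superleaves(tree: Tree, R: Set[str]) -> List[FrozenSet[str]]:
    """Leaf sets of the components left after deleting the backbone edges."""
    # An edge lies on a path between two leaves of R iff both sides contain an R leaf.
    backbone = {
        frozenset({u, v})
        for u in tree for v in tree[u]
        if leaves_beyond(tree, v, u) & R and leaves_beyond(tree, u, v) & R
    }
    seen, result = set(), []
    for root in tree:
        if root in seen:
            continue
        seen.add(root)
        stack, component = [root], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in tree[v]:
                if w not in seen and frozenset({v, w}) not in backbone:
                    seen.add(w)
                    stack.append(w)
        missing = {v for v in component if len(tree[v]) == 1 and v not in R}
        if missing:                    # component contains leaves from S \ R
            result.append(frozenset(missing))
    return result


# Caterpillar tree: ((a,b),c,(d,(e,f))) with internal nodes n1..n4.
T = {
    "a": ["n1"], "b": ["n1"], "c": ["n2"], "d": ["n3"], "e": ["n4"], "f": ["n4"],
    "n1": ["a", "b", "n2"], "n2": ["n1", "c", "n3"],
    "n3": ["n2", "d", "n4"], "n4": ["n3", "e", "f"],
}
one_big = superleaves(T, {"a", "b", "c"})
two_small = superleaves(T, {"a", "b", "e", "f"})
```

With \(R=\{a,b,c\}\) the missing leaves d, e, f form a single superleaf, whereas with \(R=\{a,b,e,f\}\) the leaves c and d hang off the backbone separately and form two superleaves.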
Theorem 2
(Restatement of Theorem 9 in [25]) Given unrooted, singly-labeled binary trees t and T with the leaf set of t a subset of the leaf set S of T, OCTAL(T, t) solves the RF-OTC problem and runs in \(O(n^2)\) time, where T has n leaves.
Proof of correctness for TRACTION
Lemma 1
Let T be an unrooted, singly-labeled, binary tree on leaf set S with \(|S|=n\), and let t be an unrooted, singly-labeled tree on leaf set \(R \subseteq S\). TRACTION returns a binary unrooted tree \(T'\) on leaf set S such that \(RF(T',T)\) is minimized subject to \(T'|_{R}\) refining t.
Proof
By construction TRACTION outputs a tree \(T'\) that, when restricted to the leaf set of t, is a refinement of t. Hence, it is clear that \(T'|_{R}\) refines t. Now, it is only necessary to prove that RF(\(T'\), T) is minimized by TRACTION. Since the intermediate tree \(t^*\) produced in the first step of TRACTION is binary, Theorem 2 gives that TRACTION using OCTAL (or any method exactly solving the RF-OTC problem) will add leaves to \(t^*\) in such a way as to minimize the RF distance to T; hence it suffices to show that \(t^*\) computed by TRACTION has the smallest RF distance to T among all binary refinements of t.
As given in Eq. 1, the optimal RF distance between \(T'\) and T is the sum of two terms: (1) RF(\(t^*\), \(T|_R\)) and (2) twice the number of Type II superleaves in T relative to \(t^*\). Theorem 1 shows that TRACTION produces a refinement \(t^*\) that minimizes the first term. All that remains to be shown is that \(t^*\) is a binary refinement of t minimizing the number of Type II superleaves in T relative to \(t^*\).
Consider a superleaf X in T with respect to t. If t were already binary, then every superleaf X is either a Type I or a Type II superleaf. Also, note that every Type I superleaf in T with respect to t will be a Type I superleaf for any refinement of t. However, when t is not binary, it is possible for a superleaf X in T to be a Type II superleaf with respect to t but a Type I superleaf with respect to a refinement of t. This happens when the refinement of t introduces a new shared edge with T to which the superleaf X is attached in T. Notice that since the set of all possible shared edges that could be created by refining t is compatible, any refinement that maximizes the number of shared edges with T also minimizes the number of Type II superleaves. Theorem 1 shows that TRACTION produces such a refinement \(t^*\) of t. Thus, TRACTION finds a binary unrooted tree \(T'\) on leaf set S such that RF(\(T'\), T) is minimized subject to the requirement that \(T'|_{R}\) refine t. \(\square\)
Theorem 3
TRACTION solves the RF-OTRC problem and runs in \(O(n^{1.5}\log n)\) time if used with Bansal’s algorithm and \(O(n^2)\) time if used with OCTAL, where n is the number of leaves in the species tree.
Proof
The above lemma shows that TRACTION solves the RF-OTRC problem. Let t, T, S, and R be as defined in the RF-OTRC problem statement. What remains to be shown is a running time analysis for the first stage of TRACTION (refining t). We claim this step takes \(O(|S|+ |R|^{1.5} \log (|R|))\) time.
Constructing \(T|_R\) takes O(|S|) time. Checking compatibility of a single bipartition with a tree on K leaves, and then adding the bipartition to the tree if compatible, can be performed in only \(O(K^{0.5} \log K)\) time after a fast preprocessing step (see Lemmas 3 and 4 from [27]). Since \(T|_R\) has O(|R|) edges, determining the set of edges of \(T|_R\) that are compatible with t takes only \(O(|S| + |R|^{1.5} \log (|R|))\) time. Therefore, the first stage of TRACTION takes \(O(|S|+ |R|^{1.5} \log (|R|))\) time. Hence, if used with OCTAL, TRACTION takes \(O(|S|^{2})\) time and if used with Bansal’s algorithm TRACTION takes \(O(|S|^{1.5} \log |S|)\) time. \(\square\)
Extending TRACTION to MUL-trees
Up to this point, we have formulated gene tree correction problems only in the context where the input trees are each singly-labeled (i.e., have at most one leaf for each species). However, in the context of GDL, a gene tree may have multiple copies of a species at its leaves (i.e., it can be a “MUL-tree”). We now generalize the RF-OTR problem to allow the input unresolved tree t to be a MUL-tree, although we still require the species tree T to be singly-labeled.
Recall that the RF distance between two trees is the minimum number of contractions and refinements that suffice to transform one tree into the other, and that this is equal to the bipartition distance for singly-labeled trees. This definition requires that the two trees have the same number of copies of each species (also referred to as “label-multiplicity”), since otherwise there is no such edit transformation. However, even when the two MUL-trees have the same number of copies of each species, we cannot rely on the use of the bipartition distance, as two MUL-trees can have identical sets of bipartitions but not be isomorphic [28].
In the context we will address, we are given a MUL-tree \(\mathcal {R}\) (i.e., the gene family tree) and a singly-labeled tree T (i.e., the species tree). To extend the RF-OTR problem so that we can use it for such an input pair, we will draw on some definitions and results from [11, 28].
Definition 2
Let r and t be given with r a MUL-tree and t a singly-labeled tree, and both with the same set of species labeling the leaves. We construct the MUL-tree Ext(t, r) from t as follows: for each species s and the unique leaf x in t labeled by s, we replace x by a node \(v_s\) that is attached to k leaves, each labeled by s, where k is the number of leaves in r that are labeled by s. We refer to Ext(t, r) as the extension of t relative to r. Note that Ext(t, r) and r have the same number of copies of each species.
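Definition 2 can be sketched on a toy tree encoding. The sketch assumes trees are written as nested tuples whose strings are leaf labels (a rooted-looking nesting used only as a convenient encoding of an unrooted tree), and it leaves a leaf untouched when its species occurs once in r, since attaching a single leaf to a new node would only create a degree-two node; both choices, and the names `leaf_labels` and `extend`, are assumptions of this example.

```python
from collections import Counter
from typing import List, Tuple, Union

Node = Union[str, Tuple["Node", ...]]   # a leaf label or a tuple of subtrees


def leaf_labels(tree: Node) -> List[str]:
    """All leaf labels of the tree, with repetitions for MUL-trees."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaf_labels(child)]


def extend(t: Node, r: Node) -> Node:
    """Ext(t, r): replace each leaf s of t by a node carrying k copies of s."""
    copies = Counter(leaf_labels(r))

    def visit(node: Node) -> Node:
        if isinstance(node, str):
            k = copies[node]
            # Assumption: a single copy is kept as a plain leaf.
            return node if k <= 1 else (node,) * k
        return tuple(visit(child) for child in node)

    return visit(t)


species_tree = (("a", "b"), "c", "d")
gene_family_tree = (("a", "a"), ("b", "c"), ("a", "d"))   # three copies of a

extended = extend(species_tree, gene_family_tree)
```

The result is `((("a", "a", "a"), "b"), "c", "d")`, which has the same label multiplicities as the MUL-tree.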
Before we present TRACTION-MT (i.e., TRACTION for MUL-trees), we need one more definition.
Definition 3
Let \(r_1\) and \(r_2\) be MUL-trees, both leaf-labeled by the same set of species, with the same number of copies of each species labeling the leaves. We construct \(r_1'\) from \(r_1\) (and similarly \(r_2'\) from \(r_2\)) by relabeling the leaves of \(r_1\) so that it is singly-labeled by replacing the k leaves labeled by s with \(s_1, s_2, \ldots , s_k\). Note that \(r_1'\) and \(r_2'\) are now singly-labeled trees and that \(L(r_1')=L(r_2')\). We say the pair \((r_1',r_2')\) is a consistent full differentiation of \((r_1,r_2)\).
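The relabeling in Definition 3 can be sketched as follows, again assuming a nested-tuple encoding of trees with string leaf labels. The label format `s_i` and the order in which copies are numbered (the order in which leaves are encountered) are arbitrary choices of this sketch; the function name `differentiate` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

Node = Union[str, Tuple["Node", ...]]   # a leaf label or a tuple of subtrees


def leaf_labels(tree: Node) -> List[str]:
    """All leaf labels of the tree, with repetitions for MUL-trees."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaf_labels(child)]


def differentiate(tree: Node) -> Node:
    """Relabel the k leaves labeled s as s_1, ..., s_k (in traversal order)."""
    seen: Dict[str, int] = defaultdict(int)

    def visit(node: Node) -> Node:
        if isinstance(node, str):
            seen[node] += 1
            return f"{node}_{seen[node]}"
        return tuple(visit(child) for child in node)

    return visit(tree)


r1 = (("a", "a"), ("b", "c"), ("a", "d"))
r2 = ((("a", "a", "a"), "b"), "c", "d")   # same multiplicities as r1

r1_prime, r2_prime = differentiate(r1), differentiate(r2)
```

Because the two inputs have the same number of copies of each species, the relabeled trees are singly-labeled and share the same leaf set, as the definition requires.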
We now present TRACTION-MT. The input to TRACTION-MT is a pair \((\mathcal {R},T)\) where \(\mathcal {R}\) is a MUL-tree and T is a singly-labeled tree, and they are both leaf-labeled by a set S of species.
- Step 1: Compute \(Ext(T,\mathcal {R})\) (i.e., the extended version of T with respect to \(\mathcal {R}\), see Definition 2).
- Step 2: Relabel the leaves in \(\mathcal {R}\) and \(Ext(T,\mathcal {R})\) in a mutually consistent fashion (see Definition 3), thus producing trees \(\mathcal {R}'\) and \(T'\).
- Step 3: Apply TRACTION to the pair \(\mathcal {R}'\) and \(T'\), producing tree \(\mathcal {R}^*\) on leafset \(S'\). For every species \(s \in S\) and leaf in \(\mathcal {R}^*\) labeled \(s_i\), replace the label \(s_i\) by s, thus producing a tree \(\mathcal {R}^{**}\) on leaf-set S that is isomorphic to \(\mathcal {R}^*\).
- Step 4: Return \(\mathcal {R}^{**}\).
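The final relabeling in Step 3 (mapping each \(s_i\) back to s) can be sketched as below. The sketch assumes a nested-tuple tree encoding and that differentiated labels were written as the species name, an underscore, and a copy number (e.g. `a_2`); this label format and the function names are conventions of the example, not of TRACTION-MT itself.

```python
from typing import Tuple, Union

Node = Union[str, Tuple["Node", ...]]   # a leaf label or a tuple of subtrees


def strip_copy_index(label: str) -> str:
    """Map a differentiated label such as 'a_2' back to its species 'a'."""
    species, _, index = label.rpartition("_")
    if not species or not index.isdigit():
        raise ValueError(f"label {label!r} is not of the form species_index")
    return species


def undifferentiate(tree: Node) -> Node:
    """Replace every leaf label s_i by s, keeping the tree shape unchanged."""
    if isinstance(tree, str):
        return strip_copy_index(tree)
    return tuple(undifferentiate(child) for child in tree)


refined = (("a_1", "b_1"), ("a_2", "c_1"), "d_1")
relabeled = undifferentiate(refined)
```

The output `(("a", "b"), ("a", "c"), "d")` has the same topology as the input and is again a MUL-tree on the original species set.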
Theorem 4
TRACTION-MT solves the RF-OTR-MT problem exactly and has running time \(O(|\mathcal {R}|^{1.5} \log |\mathcal {R}|)\).
Proof
Let MUL-tree \(\mathcal {R}\) and singly-labeled tree T be given, and let \(\mathcal {R}^{**}\) be the tree returned by TRACTION-MT for this pair. We will show that \(\mathcal {R}^{**}\) is a refinement of \(\mathcal {R}\) that has minimum RF distance to \(Ext(T,\mathcal {R})\) among all binary refinements, thus establishing that TRACTION-MT solves the RF-OTR-MT problem optimally [28].
Steps 1 and 2 together take the input pair \(\mathcal {R}\) and T and create two new trees \(\mathcal {R}'\) and \(T'\) that form a pair of consistent full differentiations of \(\mathcal {R}\) and \(Ext(T,\mathcal {R})\). By Theorem 3 in [11], \(RF(\mathcal {R},Ext(T,\mathcal {R})) = RF(\mathcal {R}',T')\). Since \(\mathcal {R}'\) and \(T'\) are singly-labeled, Step 3 produces a tree \(\mathcal {R}^*\) that is a refinement of \(\mathcal {R}'\) and minimizes the RF distance to \(T'\). Therefore the tree \(\mathcal {R}^{**}\) is a refinement of \(\mathcal {R}\) that minimizes the RF distance to \(Ext(T,\mathcal {R})\). Hence, TRACTION-MT finds an optimal solution to the RF-OTR-MT problem on this input pair.
Finally, for the running time analysis, the creation of the two trees \(\mathcal {R}'\) and \(T'\) takes \(O(|\mathcal {R}|)\) time. Then running TRACTION on this pair takes an additional \(O(|\mathcal {R}|^{1.5} \log |\mathcal {R}|)\) time, as noted in Theorem 3. \(\square\)
Figure 2 provides an example of a MUL-tree, an extended species tree, and TRACTION’s solution to the RF-OTR problem for MUL-trees.