The algorithm begins with input tree t and adds leaves one at a time from the set \(S \setminus R\) until a tree on the full set of taxa S is obtained. To add the first leaf, we choose an arbitrary taxon x to add from the set \(S \setminus R\). We root the tree \(T|_{R \cup \{x\}}\) (i.e., T restricted to the leaf set of t plus the new leaf being added) at x, and then remove x and the incident edge; this produces a rooted binary tree we will refer to as \(T^{(x)}\) that has leaf set R.
We perform a depth-first traversal down \(T^{(x)}\) until a shared edge e (i.e., an edge where the clade below it appears in tree t) is found. Since every edge incident with a leaf in \(T^{(x)}\) is a shared edge, every path from the root of \(T^{(x)}\) to a leaf has a distinct first edge e that is a shared edge. Hence, the other edges on the path from the root to e are unique edges.
After we identify the shared edge e in \(T^{(x)}\), we identify the edge \(e'\) in t defining the same bipartition, and we add a new node \(v(e')\) into t so that we subdivide \(e'\). We then make x adjacent to \(v(e')\). Note that since t is binary, the modification \(t'\) of t that is produced by adding x is also binary and that \(t'|_R = t\). These steps are then repeated until all leaves from \(S \setminus R\) are added to t. This process is shown in Fig. 1 and given in pseudocode below.
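The steps described above can be sketched in Python as follows. This is a naive illustration, not the authors' pseudocode and not the \(O(n^2)\) implementation analysed later: it recomputes leaf sets for every edge. The adjacency-dictionary representation and the helper names (`leaves`, `side`, `restrict`, `add_leaf`, `octal`, `bipartitions`, `rf`) are assumptions made for this example.

```python
from itertools import count

def leaves(adj):
    """Leaf labels of an unrooted tree stored as {node: set(neighbours)}."""
    return {v for v, nb in adj.items() if len(nb) == 1}

def side(adj, u, v):
    """Leaves on the v-side of edge (u, v)."""
    seen, stack, out = {u, v}, [v], set()
    while stack:
        w = stack.pop()
        if len(adj[w]) == 1:
            out.add(w)
        for y in adj[w]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return frozenset(out)

def restrict(adj, keep):
    """Copy of the tree restricted to leaf set `keep`, suppressing degree-2 nodes."""
    adj = {v: set(nb) for v, nb in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            nb = adj[v]
            if v in keep or len(nb) > 2:
                continue
            for u in nb:                      # detach v from its 1 or 2 neighbours
                adj[u].discard(v)
            if len(nb) == 2:                  # splice the two neighbours together
                a, b = nb
                adj[a].add(b)
                adj[b].add(a)
            del adj[v]
            changed = True
    return adj

def add_leaf(T, t, x, fresh):
    """Attach leaf x to t (in place) on the first shared edge found from x in T."""
    Tx = restrict(T, leaves(t) | {x})
    # Every edge of t, keyed by the leaf set on each of its two sides.
    t_side = {side(t, u, v): (u, v) for u in t for v in t[u]}
    stack = [(x, w) for w in Tx[x]]           # start at the edge incident to x
    while stack:                              # depth-first, away from x
        u, v = stack.pop()
        hit = t_side.get(side(Tx, u, v))
        if hit is not None:                   # first shared edge on this path
            break
        stack.extend((v, w) for w in Tx[v] if w != u)
    a, b = hit
    new = next(fresh)                         # subdivide (a, b) and hang x off it
    t[a].discard(b)
    t[b].discard(a)
    t[new] = {a, b, x}
    t[a].add(new)
    t[b].add(new)
    t[x] = {new}

def octal(T, t):
    """Return a completion of t on the leaf set of T (t itself is not modified)."""
    t = {v: set(nb) for v, nb in t.items()}
    fresh = (("new", i) for i in count())     # hypothetical labels for new nodes
    for x in sorted(leaves(T) - leaves(t)):
        add_leaf(T, t, x, fresh)
    return t

def bipartitions(adj):
    """Non-trivial bipartitions of the leaf set induced by the edges."""
    L = frozenset(leaves(adj))
    return {frozenset({A, L - A}) for u in adj for v in adj[u]
            for A in [side(adj, u, v)] if 1 < len(A) < len(L) - 1}

def rf(a, b):
    return len(bipartitions(a) ^ bipartitions(b))

# T = ((a,b),c,(d,(e,f))) and t = ((a,d),(b,e)); leaves are strings.
T = {"a": {1}, "b": {1}, 1: {"a", "b", 2}, "c": {2}, 2: {1, "c", 3},
     "d": {3}, 3: {2, "d", 4}, "e": {4}, "f": {4}, 4: {3, "e", "f"}}
t = {"a": {10}, "d": {10}, 10: {"a", "d", 11},
     "b": {11}, "e": {11}, 11: {10, "b", "e"}}
completed = octal(T, t)
print(rf(restrict(T, leaves(t)), t), rf(T, completed))  # prints: 2 4
```

In this small instance c hangs off a unique edge of \(T|_R\) and f hangs off a shared edge, so the RF distance grows from 2 to 4 whichever shared edge the traversal reaches first.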
Proof of correctness
In what follows, let T be an arbitrary binary tree on taxon set S and t be an arbitrary binary tree on taxon set R \(\subseteq\) S. Let \(T'\) denote the tree returned by OCTAL given T and t. We set \(r=RF(T|_R,t)\). As we have noted, OCTAL returns a binary tree \(T'\) that is an S-completion of t. Hence, to prove that OCTAL solves the RF Optimal Tree Completion problem exactly, we only need to establish that \(RF(T,T')\) is the smallest possible of all binary trees on leaf set S that are S-completions of t. While the algorithm works by adding a single leaf at a time, we use two types of subtrees, denoted as superleaves (see Fig. 2), to aid in the proof of correctness.
Definition 1
The backbone of T with respect to t is the set of edges in T that are on a path between two leaves in R.
Definition 2
A superleaf of T with respect to t is a rooted group of leaves from \(S \setminus R\) that is attached to an edge in the backbone of T. In particular, each superleaf is rooted at the node that is incident to one of the edges in the backbone.
Definition 3
There are exactly two types of superleaves, Type I and Type II:
1. A superleaf is a Type I superleaf if the edge e in the backbone to which the superleaf is attached is a shared edge in \(T|_R\) and t. It follows then that a superleaf X is a Type I superleaf if and only if there exists a bipartition A|B in \(C(t) \cap C(T|_R)\) where \(A|(B \cup X)\) and \((A \cup X)|B\) are both in \(C(T|_{R \cup X})\).
2. A superleaf is a Type II superleaf if the edge e in the backbone to which the superleaf is attached is a unique edge in \(T|_R\) and t. It follows that a superleaf X is a Type II superleaf if and only if for any bipartition A|B such that \(A|(B \cup X)\) and \((A \cup X)|B\) are both in \(C(T|_{R \cup X})\), \(A|B \not \in C(t)\).
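Definitions 1 to 3 can be made concrete with a short Python sketch. It assumes trees are stored as adjacency dictionaries with string leaf labels shared between T and t, and it treats each maximal pendant subtree hanging off a backbone node as one superleaf; the function names are hypothetical and the code is illustrative only.

```python
def leaves(adj):
    """Leaf labels of an unrooted tree stored as {node: set(neighbours)}."""
    return {v for v, nb in adj.items() if len(nb) == 1}

def side(adj, u, v):
    """Leaves on the v-side of edge (u, v)."""
    seen, stack, out = {u, v}, [v], set()
    while stack:
        w = stack.pop()
        if len(adj[w]) == 1:
            out.add(w)
        for y in adj[w]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return frozenset(out)

def classify_superleaves(T, t):
    """Map each superleaf of T with respect to t (a frozenset) to 'I' or 'II'."""
    R = leaves(t)
    # Backbone: repeatedly prune leaves outside R, keeping degree-2 nodes.
    B = {v: set(nb) for v, nb in T.items()}
    todo = [v for v in B if len(B[v]) == 1 and v not in R]
    while todo:
        v = todo.pop()
        (u,) = B.pop(v)
        B[u].discard(v)
        if len(B[u]) == 1 and u not in R:
            todo.append(u)
    # Both sides of every bipartition of t (trivial ones included).
    t_sides = {side(t, u, v) for u in t for v in t[u]}
    types = {}
    for p in B:
        for q in T[p]:
            if q in B:
                continue                      # (p, q) is a backbone edge
            X = side(T, p, q)                 # leaves of the pendant subtree
            # Split of R induced by the backbone edge the superleaf sits on.
            A = side(T, p, next(iter(B[p]))) & R
            types[X] = "I" if A in t_sides else "II"
    return types

# T = ((a,b),c,(d,(e,f))) and t = ((a,d),(b,e)).
T = {"a": {1}, "b": {1}, 1: {"a", "b", 2}, "c": {2}, 2: {1, "c", 3},
     "d": {3}, 3: {2, "d", 4}, "e": {4}, "f": {4}, 4: {3, "e", "f"}}
t = {"a": {10}, "d": {10}, 10: {"a", "d", 11},
     "b": {11}, "e": {11}, 11: {10, "b", "e"}}
print(classify_superleaves(T, t))
```

Here {c} is attached to the edge ab|de of \(T|_R\), which is not in C(t), so it is Type II, while {f} is attached to the leaf edge of e, which is always shared, so it is Type I.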
Now we begin our proof by establishing a lower bound on the RF distance to T for all binary S-completions of t.
Lemma 4
Let Y be a Type II superleaf for the pair (T, t), and let \(x \in S \setminus R\). Let \(t^*\) be the result of adding x into t arbitrarily (i.e., we do not attempt to minimize the resulting RF distance). If \(x \not \in Y\), then Y is a Type II superleaf for the pair \((T,t^*)\). Furthermore, if \(x \in Y\), then \(RF(T|_{R \cup \{x\}}, t^*) \ge RF(T|_R,t) +2\).
Proof
It is easy to see that if \(x \not \in Y\), then Y remains a Type II superleaf after x is added to t. Now suppose \(x \in Y\). We will show that we cannot add x into t without increasing the RF distance by at least 2. Since Y is a Type II superleaf, it is attached to a unique edge in \(T|_{R \cup Y}\), and this is the same edge that x is attached to in \(T|_{R \cup \{x\}}\). So suppose that x is added to t by subdividing an arbitrary edge \(e'\) in t with bipartition C|D; note that we do not require that x is added to a shared edge in t. After adding x to t we obtain tree \(t^*\) whose bipartition set includes \(C|(D \cup \{x\})\) and \((C \cup \{x\})|D\). If C|D corresponds to a unique edge relative to t and \(T|_R\), then both of these bipartitions correspond to unique edges relative to \(t^*\) and \(T|_{R \cup \{x\}}\). If C|D corresponds to a shared edge, then at most one of the two new bipartitions can correspond to a shared edge, as otherwise we can derive that Y is a Type I superleaf. Hence, the number of unique edges in t must increase by at least one no matter how we add x to t, where x belongs to a Type II superleaf. Since t is binary, the tree that is created by adding x is binary, so that \(RF(T|_{R \cup \{x\}},t^*) \ge RF(T|_R,t) +2\). \(\square\)
Lemma 5
Let \(T^*\) be an unrooted binary tree that is an S-completion of t. Then \(RF(T^*,T) \ge r+2m\), where \(r=RF(T|_R,t)\) and m is the number of Type II superleaves for the pair (T, t).
Proof
We note that adding a leaf can never reduce the total RF distance. The proof follows from Lemma 4 by induction. \(\square\)
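One way to spell out this induction is sketched below. The ordering \(x_1,\dots,x_k\) of the leaves in \(S \setminus R\), the intermediate trees \(t_i\), and the distances \(d_i\) are notation introduced here for illustration only.

```latex
\begin{align*}
t_0 &= t, \qquad t_i = T^*|_{R \cup \{x_1,\dots,x_i\}}, \qquad
d_i = RF\bigl(T|_{R \cup \{x_1,\dots,x_i\}},\, t_i\bigr), \\
d_i &\ge d_{i-1}
  && \text{(adding a leaf never reduces the RF distance)}, \\
d_i &\ge d_{i-1} + 2
  && \text{if } x_i \text{ is the first leaf added from some Type II superleaf (Lemma 4)}, \\
RF(T^*, T) &= d_k \;\ge\; d_0 + 2m \;=\; r + 2m .
\end{align*}
```

The first statement of Lemma 4 is what keeps each Type II superleaf of (T, t) of Type II until one of its own leaves is added, so each of the m superleaves accounts for one increase of at least 2.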
Now that we have established a lower bound on the best achievable RF distance (i.e., the optimality criterion for the RF Optimal Tree Completion problem), we show OCTAL outputs a tree \(T'\) that is guaranteed to achieve this lower bound. We begin by noting that when we add x to t by subdividing some edge \(e'\), creating a new tree \(t'\), all the edges other than \(e'\) in t continue to “exist” in \(t'\) although they define new bipartitions. In addition, \(e'\) is split into two edges, which can be considered new. Thus, we can consider whether edges that are shared between t and T remain shared after x is added to t.
Lemma 6
Let \(t'\) be the tree created by AddLeaf given input tree t on leaf set R and tree T on leaf set \(R \cup \{x\}\). If x is added to tree t by subdividing edge \(e'\) (thus creating tree \(t'\)), then all edges in t other than \(e'\) that are shared between t and T remain shared between \(t'\) and T.
Proof
Let \(T^{(x)}\) be the rooted tree obtained by rooting T at x and then deleting x. Let e be the edge in \(T^{(x)}\) corresponding to \(e'\), and let \(\pi _e=A|B\); without loss of generality assume A is a clade in \(T^{(x)}\). Note that C(T) contains bipartition \(A|(B \cup \{x\})\) (however, C(T) may not contain \((A \cup \{x\})|B\), unless e is incident with the root of \(T^{(x)}\)). Furthermore, for subclade \(A' \subseteq A\), \(A'|(R \setminus A') \in\) \(C(T|_R)\) and \(A'|(R \setminus A' \cup \{x\}) \in\) C(T). Now suppose \(e^*\) in t is a shared edge between t and \(T|_R\) that defines bipartition \(C|D \ne A|B\). Since A|B and C|D are both bipartitions of t, without loss of generality either \(C \subset A\) or \(A \subset C\). If \(C \subset A\), then C is a clade in \(T^{(x)}\), and so \(e^*\) defines bipartition \(C|(D \cup \{x\})\) within \(t'\). But since \(C \subset A\), the previous analysis shows that \(C|(D \cup \{x\})\) is also a bipartition of T, and so \(e^*\) is shared between T and \(t'\). Alternatively, suppose \(A \subset C\). Then within \(t'\), \(e^*\) defines bipartition \((C \cup \{x\})|D\), which also appears as a bipartition in T. Hence, \(e^*\) is also shared between T and \(t'\). Therefore, any edge \(e^*\) other than \(e'\) that is shared between t and T remains shared between \(t'\) and T, for all leaves x added by AddLeaf. \(\square\)
Lemma 7
OCTAL(T, t) preserves the topology of superleaves in T (i.e., for any superleaf with some subset of leaves \(Q \subseteq S\), OCTAL(T, t)\(|_Q\) equals \(T|_Q\)).
Proof
We will show this by induction on the number of leaves added. The lemma is trivially true for the base case when just one leaf is added to t. Let the inductive hypothesis be that the lemma holds for adding up to n leaves to t for some arbitrary \(n \in \mathbb {N}^+\). Now consider adding \(n+1\) leaves, and choose an arbitrary subset of n leaves to add to t, creating an intermediate tree \(t'\) on leaf set K using the algorithm OCTAL. Let x be the next additional leaf to be added by OCTAL.
If x is the first element of a new superleaf to be added, it is trivially true that the topology of its superleaf is preserved, but we need to show that x will not break the monophyly of an existing superleaf in \(t'\). By the inductive hypothesis, the topology of each superleaf already placed in \(t'\) has been preserved. Thus, each superleaf placed in \(t'\) has some shared edge in \(t'\) and \(T|_{K}\) incident to that superleaf. If x were placed onto an edge contained in some existing superleaf, that edge would change its status from being shared to being unique, which contradicts Lemma 6.
The last case is where x is part of a superleaf for the pair (T, t) that already has been added in part to t. AddLeaf roots \(T|_{K \cup \{x\}}\) at x and removes the edge incident to x, creating rooted tree \(T^{(x)}\). The edge incident to the root in \(T^{(x)}\) must be a shared edge by the inductive hypothesis. Thus, OCTAL will add x to this shared edge and preserve the topology of the superleaf. \(\square\)
Lemma 8
OCTAL(T, t) returns binary tree \(T'\) such that \(RF(T,T')=r+2m\), where m is the number of Type II superleaves for the pair (T, t) and \(r=RF(T|_R,t)\).
Proof
We will show this by induction on the number of leaves added.
Base case. Assume \(|S\setminus R| = 1\). Let x be the leaf in \(S\setminus R\). AddLeaf adds x to a shared edge of t corresponding to some bipartition A|B, which also exists in \(T^{(x)}\).
1. First we consider what happens to the RF distance on the edge x is attached to.
   - If x is a Type I superleaf, the edge incident to the root in \(T^{(x)}\) will be a shared edge by the definition of Type I superleaf, so AddLeaf adds x to the corresponding edge \(e'\) in t. The two new bipartitions that are created when subdividing \(e'\) will both exist in T by the definition of Type I superleaf so the RF distance does not change.
   - If x is a Type II superleaf, either \((A \cup \{x\})|B\) or \(A|(B\cup \{x\})\) must not exist in C(T). Since AddLeaf adds x to a shared edge, exactly one of those new bipartitions must exist in C(T).
2. Now we consider what happens to the RF distance on the edges x is not attached to. Lemma 6 shows that AddLeaf (and therefore OCTAL) preserves existing shared edges between t and \(T|_R\), possibly excluding the edge where x is added.
Thus, the RF distance will only increase by 2 if x is a Type II superleaf, as claimed.
Inductive step. Let the inductive hypothesis be that the lemma holds for up to n leaves for some arbitrary \(n \in \mathbb {N}^+\). Assume \(|S\setminus R| = n+1\). Now choose an arbitrary subset of leaves \(Q\subseteq S \setminus R\), where \(|Q|=n\), to add to t, creating an intermediate tree \(t'\) using the algorithm OCTAL. By the inductive hypothesis, assume \(t'\) is a binary tree with the RF distance between \(T|_{Q\cup R}\) and \(t'\) equal to \(r+2m\), where m is the number of Type II superleaves in Q. AddLeaf adds the remaining leaf \(x \in S\setminus R\) to a shared edge of \(t'\) and \(T|_{Q\cup R}\).
1. Lemma 6 shows that AddLeaf (and therefore OCTAL) preserves existing shared edges between \(t'\) and \(T|_{Q\cup R}\), possibly excluding the edge where x is added.
2. Now we consider what happens to the RF distance on the edge x is attached to. There are three cases: (i) x is not the first element of a superleaf, (ii) x is the first element of a Type I superleaf, or (iii) x is the first element of a Type II superleaf.
   - Case (i): If x is not the first element of a superleaf to be added to t, it directly follows from Lemma 7 that OCTAL will not change the RF distance when adding x.
   - Case (ii): If x is the first element of a Type I superleaf to be added, then x is attached to a shared edge in the backbone corresponding to some bipartition A|B existing in both C(t) and \(C(T|_R)\). Let \(e'\) be the edge in t such that \(\pi _{e'}=A|B\). Note there must exist an edge e in \(T|_{Q\cup R}\) producing A|B when restricted to just R. Hence, the bipartition \(\pi _e\) has the form M|N where \((M \cap R) = A\) and \((N \cap R) = B\). We need to show that \(M|N \in C(t')\).
     - By Lemma 6, any leaves from Q not attached to \(e'\) by OCTAL will preserve this shared edge in \(t'\).
     - Now consider when leaves from Q are added to \(e'\) by OCTAL. We decompose M and N into the subsets of leaves existing in either R or Q: let \(M = A \cup W\) and \(N = B \cup Z\). OCTAL will not cross a leaf from W with a leaf from Z along \(e'\) because this would require crossing the shared edge dividing these two groups: any leaf \(w \in W\) has the property that \((A\cup \{w\}) | B\) is a shared edge and any leaf \(z \in Z\) has the property that \(A | (B \cup \{z\})\) is a shared edge. Hence, any leaves added from Q that subdivide \(e'\) will always preserve an edge between leaves contained in W and Z on \(e'\).

     Thus, \(M|N \in C(t')\). Moreover, \((M\cup \{x\})| N\) and \(M | (N \cup \{x\})\) are bipartitions in C(T). AddLeaf roots T at x and removes the edge incident to x, creating rooted tree \(T^{(x)}\). We have shown that the edge incident to the root in \(T^{(x)}\) must be a shared edge, so adding x does not change the RF distance.
   - Case (iii): If x is the first element of a Type II superleaf to be added, we have shown in Lemma 4 that the RF distance must increase by at least two. Since AddLeaf always attaches x to some shared edge \(e'\), the RF distance increases by exactly 2 when subdividing \(e'\).
Thus, OCTAL will only increase the RF distance by 2 if x is a new Type II superleaf.
\(\square\)
Combining the above results, we establish our main theorem:
Theorem 9
Given unrooted binary trees t and T with the leaf set of t a subset of the leaf set of T, OCTAL(T, t) returns an unrooted binary tree \(T'\) that is a completion of t and that has the smallest possible RF distance to T. Hence, OCTAL finds an optimal solution to the RF Optimal Tree Completion problem. Furthermore, OCTAL runs in \(O(n^2)\) time, where T has n leaves.
Proof
To prove that OCTAL solves the RF Optimal Tree Completion problem optimally, we need to establish that OCTAL returns an S-completion of the tree t, and that the RF distance between the output tree \(T'\) and the reference tree T is the minimum among all S-completions. Since OCTAL always returns a binary tree and only adds leaves into t, by design it produces a completion of t and so satisfies the first property. By Lemma 8, the tree \(T'\) output by OCTAL has an RF score that matches the lower bound established in Lemma 5. Hence, OCTAL returns a tree with the best possible score among all S-completions.
We now show that OCTAL can be implemented to run in \(O(n^2)\) time, as follows. The algorithm has two stages: a preprocessing stage that can be completed in \(O(n^2)\) time and a second stage that adds all the leaves from \(S \setminus R\) into t that also takes \(O(n^2)\) time.
In the preprocessing stage, we annotate the edges of T and t as either shared or unique, and we compute a set A of pairs of shared edges (one edge from each tree that define the same bipartition on R). We pick \(r \in R\), and we root both t and T at r. We begin by computing, for each of these rooted trees, the LCA (least common ancestor) matrix for all pairs of nodes (leaves and internal vertices) and the number \(n_u\) of leaves below each node u; both can be computed easily in \(O(n^2)\) time using dynamic programming. (For example, to calculate the LCA matrix, first calculate the set of leaves below each node using dynamic programming, and then calculate the LCA matrix in the second step using the set of leaves below each node.) The annotation of edges in t and T as shared or unique, and the calculation of the set A, can then be computed in \(O(n^2)\) time as follows. Given an edge \(e \in E(T)\), we denote the bipartition defined by e as X|Y, where X is the set of leaves below e in the rooted version of T. We then let u denote the LCA of X in t, which we compute in O(n) time (using O(n) LCA queries of pairs of vertices, including internal nodes, each of which uses O(1) time since we already have the LCA matrix). Once we identify u, we note the edge \(e'\) above u in t. It is easy to see that e is a shared edge if and only if e and \(e'\) induce the same bipartition on R, and furthermore this holds if and only if \(n_u = |X|\). Hence, we can determine if e is a shared edge, and also its paired edge \(e'\) in t, in O(n) time. Each edge in T is processed in O(n) time, and hence the preprocessing stage can be completed in \(O(n^2)\) time.
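The shared-edge test \(n_u = |X|\) can be illustrated with the following Python sketch. It makes several assumptions: both trees are already on the same leaf set R and rooted at the same leaf, each is stored as a dictionary from a node to its list of children, an edge is named by its lower endpoint, and the LCA is found by a naive walk up parent pointers instead of the precomputed LCA matrix described above. All function names are hypothetical.

```python
from functools import reduce

def parents_and_depths(children, root):
    """Parent pointer and depth of every node of a rooted tree."""
    parent, depth, stack = {root: None}, {root: 0}, [root]
    while stack:
        u = stack.pop()
        for c in children.get(u, ()):
            parent[c], depth[c] = u, depth[u] + 1
            stack.append(c)
    return parent, depth

def leaves_below(children, root):
    """Set of leaves below each node (a leaf is below itself)."""
    below = {}
    def visit(u):
        kids = children.get(u, ())
        below[u] = (frozenset([u]) if not kids
                    else frozenset().union(*[visit(c) for c in kids]))
        return below[u]
    visit(root)
    return below

def lca(u, v, parent, depth):
    """Naive LCA: repeatedly lift the deeper node (not the O(1) matrix lookup)."""
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        u = parent[u]
    return u

def pair_shared_edges(T, t, root):
    """Map each shared edge of T to its partner edge in t."""
    below_T = leaves_below(T, root)
    below_t = leaves_below(t, root)
    parent, depth = parents_and_depths(t, root)
    pairs = {}
    for w, X in below_T.items():
        if w == root:
            continue                          # the root has no edge above it
        u = reduce(lambda p, q: lca(p, q, parent, depth), X)
        if len(below_t[u]) == len(X):         # n_u = |X|: same bipartition on R
            pairs[w] = u
    return pairs

# Both trees are rooted at leaf "r"; T has clades {a,b} and {a,b,d},
# t has clades {b,d} and {a,b,d}.
T = {"r": [1], 1: [2, "e"], 2: [3, "d"], 3: ["a", "b"]}
t = {"r": [10], 10: [11, "e"], 11: [12, "a"], 12: ["b", "d"]}
pairs = pair_shared_edges(T, t, "r")
print(pairs.get(2), pairs.get(3))  # prints: 11 None
```

The edge above node 2 in T is paired with the edge above node 11 in t because both have the clade {a, b, d} below them, whereas the edge above node 3 (clade {a, b}) is unique: the LCA of a and b in t is node 11, which has three leaves below it.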
After the preprocessing, the second stage inserts the leaves from \(S \setminus R\) into t using AddLeaf, and each time we add a leaf into t we have to update the set of edges of t (since it grows through the addition of the new leaf) and the set A. Recall that when we add \(s \in S \setminus R\) into t, we begin by rooting T at s, and then follow a path towards the leaves until we find a first shared edge; this first shared edge may be the edge incident with s in T or may be some other edge, and we let e denote the first shared edge we find. We then use the set A to identify the edge \(e' \in E(t)\) that is paired with e. We subdivide \(e'\) and make s adjacent to the newly created node. We then update A, the set of bipartitions for each tree, and the annotations of the edges of t and T as shared or unique. By Lemma 6, AddLeaf preserves all existing shared edges other than the edge the new leaf s is placed on, and these specific edges can each be updated in O(1) time. Furthermore, OCTAL places s on a shared edge, bifurcating it to create two new edges. Thus, just two edges need to be checked for being shared, which again can be done in O(n) as claimed. Thus, adding s to t and updating all the data structures can be completed in O(n) time. Since there are at most n leaves to add, the second stage can be completed in \(O(n^2)\) time. Hence, OCTAL runs in \(O(n^2)\) time, since both stages take \(O(n^2)\) time. \(\square\)