Linear-time algorithms for phylogenetic tree completion under Robinson–Foulds distance

Background We consider two fundamental computational problems that arise when comparing phylogenetic trees, rooted or unrooted, with non-identical leaf sets. The first problem arises when comparing two trees where the leaf set of one tree is a proper subset of the other. The second problem arises when the two trees to be compared have only partially overlapping leaf sets. The traditional approach to handling these problems is to first restrict the two trees to their common leaf set. An alternative approach that has shown promise is to first complete the trees by adding missing leaves, so that the resulting trees have identical leaf sets. This requires the computation of an optimal completion that minimizes the distance between the two resulting trees over all possible completions. Results We provide optimal linear-time algorithms for both completion problems under the widely-used Robinson–Foulds (RF) distance measure. Our algorithm for the first problem improves the time complexity of the current fastest algorithm from quadratic (in the size of the two trees) to linear. No algorithms have yet been proposed for the more general second problem where both trees have missing leaves. We advance the study of this general problem by proposing a useful restricted version of the general problem and providing optimal linear-time algorithms for the restricted version. Our experimental results on biological data sets suggest that completion-based RF distances can be very different compared to traditional RF distances.


Background
A phylogenetic tree, or phylogeny, is a uniquely leaf-labeled tree that shows the evolutionary relationships between different biological entities, generally either species or genes. Phylogenies may be either rooted or unrooted. The leaf nodes of a phylogeny represent the extant set of entities on which the phylogeny is built, while internal nodes represent hypothetical ancestors. The comparison of different phylogenetic trees is one of the most fundamental tasks in evolutionary biology and computational phylogenetics. Many biologically relevant distance or similarity measures have been defined in the literature for the case when the two phylogenies to be compared have the same leaf set. These include the widely used Robinson-Foulds distance [1], triplet and quartet distances [2,3], nearest neighbor interchange (NNI) and subtree prune and regraft (SPR) distances [4][5][6], maximum agreement subtrees [7][8][9], nodal distance [10], geodesic distance [11] and several others. Often, however, this comparison involves two trees that have non-identical leaf sets. The need to compare trees that do not have identical leaf sets arises naturally in several situations. For instance, algorithms for computing phylogenetic supertrees are typically based on comparing input trees on partial leaf sets with candidate supertrees on the complete leaf set [12][13][14][15][16]. Likewise, searching for phylogenies similar to a query tree in a phylogenetic database [17][18][19][20] requires comparing trees with non-identical leaf sets.
The traditional approach to comparing two phylogenies on non-identical leaf sets is to first restrict the two phylogenies to their common leaf set and then apply one of the distance or similarity measures that compare two trees on the same leaf set. However, an alternative, and perhaps more useful, approach to comparing trees with non-identical taxa is to fill-in or complete the two trees to be compared with the leaves missing from each, resulting in two trees on the same leaf set, and then apply the distance or similarity measure. This completion based approach is especially desirable when used with the Robinson-Foulds (RF) distance measure [1], the most commonly used distance measure in evolutionary biology. Indeed, several important biological applications would directly benefit from the use of this completion-based RF distance, such as the construction of majority-rule(+) supertrees [22][23][24][25], construction of Robinson-Foulds supertrees [13,14,26], phylogenetic database search [17][18][19][20], and clustering of phylogenetic trees [21]. To distinguish between the two methods for computing RF distance between two trees with non-identical leaf sets, we refer to the completion-based RF distance as RF(+) distance and to the traditional pruning-based RF distance as RF(−). Figure 1 shows an example of two trees with partially overlapping leaf sets and these two ways of computing the RF distance between them.

Previous work
The idea of a completion-based RF(+) distance was proposed at least a decade ago. Cotton and Wilkinson were among the first to propose such a distance measure in their seminal paper describing majority-rule supertrees [22]. Specifically, they defined two types of majority-rule supertrees: majority-rule(−) and majority-rule(+) supertrees. The majority-rule(−) supertrees were based on traditional RF(−) distances between trees, while majority-rule(+) supertrees were based on completion-based RF(+) distances. Majority-rule(+) supertrees and their variants have been shown to have many desirable properties [27] and there have been efforts to develop exact (ILP-based) and heuristic methods for computing majority-rule(+) supertrees [23,25]. Though these methods only work for small datasets, they have been shown to result in biologically meaningful supertrees [23]. The paper by Kupczok [25] characterizes the RF(+) distance, in the case when the leaf set of one tree is a subset of the leaf set of the other, in terms of incompatible splits between the two trees, but does not provide an efficient algorithm for computing this distance or for computing an actual completion. More recently, Christensen et al. [28] provided an $O(n^2)$ time algorithm for the case when the leaf set of one tree is a subset of the leaf set of the other and applied the algorithm to compute optimal completions for gene trees with respect to a species tree. To the best of our knowledge, no algorithms (polynomial time or otherwise) currently exist for the general problem where the two trees have only partially overlapping leaf sets, or for any of its variants.

Fig. 1 The RF(−) and RF(+) distance measures when applied to trees with partially overlapping leaf sets. In this example, the leaf sets of $T_1$ and $T_2$ are a subset of the leaf set of $S$. To compute the RF(−) distance between $T_1$ and $S$, we must first restrict $S$ to the leaf set of $T_1$, resulting in tree $S_1$. The RF(−) distance between $S$ and $T_1$ is thus $RF(S_1, T_1)$, which is 2. Likewise, to compute the RF(−) distance between $T_2$ and $S$, we must first restrict $S$ to the leaf set of $T_2$, resulting in tree $S_2$. The RF(−) distance between $S$ and $T_2$ is thus $RF(S_2, T_2)$, which is also 2. In contrast, to compute the RF(+) distance between $T_1$ and $S$, we must first compute an optimal completion of $T_1$ on the leaf set of $S$ (denoted by the dashed red lines), resulting in tree $T'_1$. The RF(+) distance between $S$ and $T_1$ is thus $RF(S, T'_1)$, which is 2. Likewise, to compute the RF(+) distance between $T_2$ and $S$, we must first compute an optimal completion of $T_2$ on the leaf set of $S$, resulting in tree $T'_2$. The RF(+) distance between $S$ and $T_2$ is thus $RF(S, T'_2)$, which is 4. Observe that $T_1$ and $T_2$ are equidistant from $S$ under RF(−) (distance 2 in both cases), whereas their RF(+) distances to $S$ differ (2 and 4).

Bansal Algorithms Mol Biol (2020) 15:6

Our contribution
In this work, we address an important gap in the algorithmics of phylogenetic tree comparison. Specifically, we provide the first optimal, linear-time algorithms for two fundamental computational problems that arise when comparing phylogenetic trees with non-identical leaf sets. For the first problem, which arises when computing the RF(+) distance between two binary trees where the leaf set of one tree is a proper subset of the other, we improve upon the time complexity of the previous fastest algorithm for this problem by a factor of n, where n is the number of leaves in the larger of the two trees. For the second problem, which is a generalization of the first and arises when computing the RF(+) distance between two binary trees that have only partially overlapping leaf sets, we show that the default problem formulation can result in tree completions that are unsupported by the original input trees, propose a modification of the problem formulation that corrects this deficiency, and provide optimal linear-time algorithms for the modified problem. Crucially, no polynomial time algorithms currently exist for the default formulation of the second problem, and our modified problem formulation can be viewed as a useful restricted version of the general problem. Our algorithms are easy to understand and implement, work for both rooted and unrooted trees, and are scalable to the entire tree of life. These algorithms can be applied wherever phylogenetic distances must be computed between trees with non-identical leaf sets and enable new kinds of phylogenetic and comparative analyses that have been computationally infeasible.
We implemented our algorithm for the first problem and applied it to three published biological supertree data sets to study how RF(+) distances differ from RF(−) distances in practice. For each data set, we ordered the input trees according to their RF(+) and RF(−) distances to a precomputed supertree and measured how often the relative pairwise ranking between any pair of input trees differs between the two rankings. We found a large number of such pairs for each data set, demonstrating, for the first time, that using the RF(+) distance can result in very different relative estimates of phylogenetic distances compared to using the RF(−) distance. RF(+) distances have several desirable properties compared to RF(−) distances. For instance, the set of possible values RF(+) distance can take ranges from 0 to about twice the size of the union of the leaf sets of the two trees, while for RF(−) distance this range is only from 0 to about twice the size of the intersection of the two leaf sets. Thus, RF(+) distances have significantly more discriminatory power than RF(−) distances. In applications such as median supertree construction, RF(+) distance has the distinct advantage that each input tree gets an equal "vote" in the supertree construction since all input trees contribute an RF distance within the same range. With RF(−) distances, larger trees can contribute much more to the total distance than smaller trees. Finally, in computing RF(−) distances we ignore the additional topological information provided by leaves that are present in only one tree, while RF(+) distance makes complete use of the information in the topologies of the two trees. RF(+) distances thus make more efficient use of the available information. Despite these advantages, RF(+) distances have not been applied in practice due to unavailability of efficient algorithms. In contrast, RF(−) distances can be computed in time linear in the sizes (number of leaves) of the input trees. 
Our new algorithms address this discrepancy by making it equally computationally efficient to compute RF(+) distances.

Preliminaries and problem definitions
Given a tree T, we denote its node set, edge set, and leaf set by V(T), E(T), and Le (T ) , respectively. The set of all non-leaf (i.e., internal) nodes of T is denoted by I(T).
If T is rooted, the root node of T is denoted by rt(T), the parent of a node v ∈ V(T) by pa_T(v), its set of children by Ch_T(v), and the (maximal) subtree of T rooted at v by T(v). If two nodes in T have the same parent, they are called siblings of each other. The least common ancestor, denoted lca_T(L), of a set L ⊆ Le(T) in T is defined to be the node v ∈ V(T) such that L ⊆ Le(T(v)) and L ⊄ Le(T(u)) for any child u of v. A rooted tree is binary if all of its internal nodes have exactly two children, while an unrooted tree is binary if all its nodes have degree either 1 or 3. Throughout this work, the term tree refers to binary trees with uniquely labeled leaves.
Let T be a rooted or unrooted tree. Given a set L ⊆ Le(T), let T′ be the minimal subtree of T with leaf set L. We define the leaf-induced subtree T[L] of T on leaf set L to be the tree obtained from T′ by successively removing each non-root node of degree two and adjoining its two neighbors.
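The leaf-induced subtree can be sketched in a few lines for rooted trees. The sketch below assumes a representation that is not part of the original text: a rooted tree is a nested tuple whose leaves are labels, and the hypothetical helper `restrict` suppresses every node left with a single child.

```python
def restrict(tree, keep):
    """Return the leaf-induced subtree of `tree` on the leaf set `keep`.

    Assumed representation: a leaf is any non-tuple label, an internal
    node is a tuple of its children. Returns None if no leaf survives.
    """
    if not isinstance(tree, tuple):              # leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:                                 # whole subtree removed
        return None
    if len(kids) == 1:                           # degree-two node: suppress it
        return kids[0]
    return tuple(kids)


S = (("a", "b"), ("c", ("d", "e")))
restricted = restrict(S, {"a", "c", "e"})
```

This is the restriction step used by the traditional pruning-based RF(−) comparison: both trees are first restricted to their common leaf set.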

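Definition 1 (Completion of a tree) Given a tree $T$ and a set $L'$ such that $Le(T) \subseteq L'$, a completion of $T$ on $L'$ is a tree $T'$ such that $Le(T') = L'$ and $T'[Le(T)] = T$.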
If T is a rooted tree, for each node v ∈ V (T ) , the clade C T (v) is defined to be the set of all leaf nodes in T(v); i.e. C T (v) = Le (T (v)) . We denote the set of all clades of a rooted tree T by Clade (T ) . This concept can be extended to unrooted trees as follows. If T is an unrooted tree, each edge (u, v) ∈ E(T ) defines a partition of the leaf set of T into two disjoint subsets Le (T u ) and Le (T v ) , where T u is the subtree containing node u and T v is the subtree containing node v, obtained when edge (u, v) is removed from T. The partition induced by any edge (u, v) ∈ E(T ) is called a split and is represented by the set { Le (T u ), Le (T v )} . The set of all splits in an unrooted tree T is denoted by Split (T ).
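As an illustration of clade sets, the following sketch collects the clades of a rooted tree and compares two trees on the same leaf set. It assumes the nested-tuple representation (an assumption of this example, not of the paper) and takes the RF distance to be the size of the symmetric difference of the two clade sets, i.e., the unnormalized convention; the function names are hypothetical.

```python
def clades(tree):
    """Return the set of clades of a rooted tree, each as a frozenset of leaves."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):              # internal node
            leaves = frozenset().union(*(walk(c) for c in node))
        else:                                    # leaf
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def rf(a, b):
    """Unnormalized RF distance between two rooted trees on the same leaf set."""
    return len(clades(a) ^ clades(b))


A = (("a", "b"), ("c", "d"))
B = ((("a", "b"), "c"), "d")
distance = rf(A, B)   # clades {c,d} and {a,b,c} are each found in only one tree
```

This direct version recomputes leaf sets at every node and is therefore not linear-time; it is only meant to make the definition concrete.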
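The symmetric difference of two sets $A$ and $B$, denoted by $A \,\Delta\, B$, is the set $(A \setminus B) \cup (B \setminus A)$.

Definition 2 (Robinson–Foulds distance) Given two rooted trees $S$ and $T$ on the same leaf set, the RF distance between them is $RF(S, T) = |Clade(S) \,\Delta\, Clade(T)|$. For two unrooted trees on the same leaf set, $RF(S, T) = |Split(S) \,\Delta\, Split(T)|$.

Let $S$ and $T$ be two trees. Without loss of generality, we will assume that $|Le(T)| \le |Le(S)|$. When $Le(S) \ne Le(T)$, there are two possible scenarios: (1) $Le(T) \subset Le(S)$, i.e., the leaf set of $T$ is a proper subset of the leaf set of $S$, and (2) $Le(S) \cap Le(T) \subset Le(T)$, i.e., each of $S$ and $T$ contains leaves not found in the other. Based on these two scenarios, and depending on whether the two trees are rooted or unrooted, we define the following four problems.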
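Problem 1 (Rooted One-Tree RF(+) (ROT-RF(+))) Given two rooted trees $S$ and $T$ such that $Le(T) \subset Le(S)$, compute a completion $T'$ of $T$ on $Le(S)$ such that $RF(S, T')$ is minimized.

Problem 2 (Unrooted One-Tree RF(+) (UOT-RF(+))) Given two unrooted trees $S$ and $T$ such that $Le(T) \subset Le(S)$, compute a completion $T'$ of $T$ on $Le(S)$ such that $RF(S, T')$ is minimized.

Problem 3 (Rooted RF(+) (R-RF(+))) Given two rooted trees $S$ and $T$, compute a completion $S'$ of $S$ on $Le(S) \cup Le(T)$ and a completion $T'$ of $T$ on $Le(S) \cup Le(T)$ such that $RF(S', T')$ is minimized.

Problem 4 (Unrooted RF(+) (U-RF(+))) Given two unrooted trees $S$ and $T$, compute a completion $S'$ of $S$ on $Le(S) \cup Le(T)$ and a completion $T'$ of $T$ on $Le(S) \cup Le(T)$ such that $RF(S', T')$ is minimized.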
We show how to solve Problems 1 and 2 in O(|V(S)|) time. As we will see later, Problems 3 and 4 can actually lead to unsupported completions. We will therefore define meaningful variants of Problems 3 and 4 (requiring only a slight variation on the original problems) and show how to solve them in O(|V (S)| + |V (T )|) time. For the purposes of complexity analysis, we will assume that the leaves of S and T are labeled by integers from the set {1, . . . , | Le (S) ∪ Le (T )|} . However, our algorithms work even if the leaf labels are arbitrary, and universal hashing [29] or perfect hashing [30] can be used to guarantee expected O(|V (S)| + |V (T )|) time complexity.

A linear-time algorithm for ROT-RF(+)
To solve the ROT-RF(+) problem, our algorithm starts with the trees S and T and modifies T by adding to it, according to a particular scheme, the leaves from Le (S) \ Le (T ) . The completed tree thus produced, denoted by T ′ , will be such that RF (S, T ′ ) is minimized.
We define Tree-Add(T , v, X) to be the tree obtained from T by attaching to it a tree X, where Le (X) ∩ Le (T ) = ∅ , as follows: If v is not the root of T, then attach X onto the edge ( pa (v), v) (by subdividing ( pa (v), v) into two edges) such that rt (X) becomes the sibling of the node v ∈ V (T ) . If v is the root of T, then Tree-Add(T , v, X) is the tree obtained by creating a new root node and setting v and rt (X) as its two children.
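The Tree-Add operation can be sketched as follows. The `Node` class, with explicit parent pointers, and the helper `to_tuple` are assumptions made for this illustration; only the behavior of `tree_add` follows the definition above.

```python
class Node:
    """Hypothetical rooted-tree node with parent pointers."""

    def __init__(self, label=None, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for c in self.children:
            c.parent = self


def tree_add(root, v, x_root):
    """Attach the tree rooted at x_root so that it becomes the sibling of v.

    Returns the root of the resulting tree (a new root if v was the root).
    """
    p = v.parent                          # remember the old parent first
    new = Node(children=[v, x_root])      # subdivides the edge (pa(v), v)
    if p is None:                         # v was the root: `new` is the new root
        return new
    p.children[p.children.index(v)] = new
    new.parent = p
    return root


def to_tuple(n):
    """Nested-tuple view of a tree, used only for display and testing."""
    return n.label if not n.children else tuple(to_tuple(c) for c in n.children)


a, b = Node("a"), Node("b")
t = Node(children=[a, b])
t = tree_add(t, b, Node("x"))     # x becomes the sibling of b
t = tree_add(t, t, Node("y"))     # grafting at the root creates a new root
```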
The main idea behind our algorithm can be illustrated by the following simple example. Suppose the given trees $S$ and $T$ are such that $Le(S) = Le(T) \cup \{l\}$. The goal is to add this leaf $l$ to $T$ so as to minimize the RF distance. Let $v$ denote the sibling of $l$ in $S$. Let $u$ denote the node $lca_T(Le(S(v)))$. As we will prove later, $T' = \text{Tree-Add}(T, u, l)$ must be an optimal completion for $T$. Our algorithm, Algorithm OneTreeCompletion, extends this idea to the case when $T$ has multiple missing leaves. Figure 2 illustrates the algorithm through an example, and its caption summarizes the steps.

Fig. 2 Illustration of Algorithm OneTreeCompletion. A node is colored green if all leaves in the subtree rooted at that node are present in both $S$ and $T$, red if all leaves in that subtree are present only in $S$, and blue if that subtree has both green and red descendants. If a blue node $v$ has exactly one red child, then it is "marked". In this example, $s_1$ and $s_4$ are marked nodes, highlighted in the figure by the double perimeter around the blue (square) nodes. The algorithm then computes the LCA mapping, defined to be $M_S(v) = lca_T(Le(S(v)) \cap Le(T))$, for each green or blue node $v$ of $S$. These LCA mappings appear in the square boxes on $S$ in the middle column. The algorithm then performs a pre-order traversal of $S$, grafting copies of the red subtrees at each marked node onto the appropriate edges of $T$. The grafted subtrees are shown using dashed red lines on $T'$ in the right column. Tree $T'$ is an optimal completion of $T$ on $Le(S)$.

Next, we prove the correctness and analyze the time complexity of this algorithm. We need the following additional definitions:

Definition 3 (Matched clade) Given any two rooted trees $A$ and $B$ on the same leaf set, and $v \in V(A)$, we say that the clade $C_A(v)$ has a match in $B$ if $C_A(v) \in Clade(B)$.

Definition 4 (Matchable clade of S) Given any $v \in V(S)$, we say that the clade $C_S(v)$ is matchable if there exists a completion of $T$ on $Le(S)$ in which $C_S(v)$ has a match, and unmatchable otherwise.

The correctness of Algorithm OneTreeCompletion follows from the following lemma.

Lemma 1 Let $T'$ denote the completion of $T$ returned by Algorithm OneTreeCompletion on trees $S$ and $T$. Let $T^*$ denote an optimal completion of $T$ on $Le(S)$. Then $RF(S, T') = RF(S, T^*)$.

Proof It suffices to show that $T'$ maximizes the number of matched clades $C_S(v)$, for $v \in V(S)$.

Observe that Algorithm OneTreeCompletion partitions $V(S)$ into three sets according to the color assigned to each node: red, green, or blue. We will consider these three sets of nodes separately.

Case 1: Red nodes. All maximal subtrees in $S$ that contain only red nodes are included as is in the completed tree $T'$. Thus, if $v$ is a red node then $C_S(v)$ has a match in $T'$, and so $T'$ maximizes the number of matched clades $C_S(v)$ over all red $v$.

Case 2: Green nodes. We claim that if $v$ is green and $C_S(v)$ does not have a match in $T'$ then it must be unmatchable. Suppose $C_S(v)$ has a match in $T$, and let $u \in V(T)$ be such that $C_S(v) = C_T(u)$. Observe that the clade $C_T(u)$ must also appear in $T'$ since no blue node $x \in V(S)$ will be such that $M_S(x) \in V(T(u))$. This implies that if $C_S(v)$ has a match in $T$ then $C_S(v)$ must also have a match in $T'$. In other words, if $C_S(v)$ does not have a match in $T'$ then $C_S(v)$ cannot have a match in $T$. Now, since $C_S(v)$ only contains leaves that are already present in $T$, no completion of $T$ can contain the clade $C_S(v)$ if it is not already present in $Clade(T)$. Thus, if $C_S(v)$ has no match in $T$ then $C_S(v)$ must be unmatchable. This proves our claim, and so $T'$ must maximize the number of matched clades $C_S(v)$ for green $v$.

Case 3: Blue nodes. We claim that if $v$ is blue and $C_S(v)$ does not have a match in $T'$ then it must be unmatchable. Let $C'_S(v)$ denote the set containing only the green nodes from $C_S(v)$. We will say that clade $C_S(v)$ has a partial-match in $T$ if $C'_S(v) \in Clade(T)$. Suppose $C_S(v)$ has a partial-match in $T$, and let $u$ be the node from $T$ for which $C_T(u) = C'_S(v)$ (note that, in fact, $u = M_S(v)$). Observe that any marked node $x \in V(S(v))$ must be such that $M_S(x) \in V(T(u))$. This implies that Algorithm OneTreeCompletion adds all the maximal red subtrees within $S(v)$ (i.e., subtrees rooted at a red child of a marked node in $S(v)$) to one or more of the edges in the set $\{(pa(t), t) \mid t \in V(T(u))\}$. Moreover, since $C_T(u) = C'_S(v)$, the red subtrees at the other marked nodes, i.e., those in $V(S) \setminus V(S(v))$, are grafted outside the resulting clade. Thus, there must be a node $u' \in V(T')$ for which $C_{T'}(u') = C_S(v)$. Conversely, $C'_S(v)$ only contains leaves that are already present in $T$, and so if there exists no node $u \in V(T)$ for which $C_T(u) = C'_S(v)$, i.e., if $C_S(v)$ has no partial-match in $T$, then $C_S(v)$ must be unmatchable. This proves our claim, and so $T'$ must maximize the number of matched clades $C_S(v)$ for blue $v$.

In summary, the tree $T'$ maximizes the number of matched clades for each of the three sets into which $V(S)$ is partitioned, thereby maximizing the number of matched clades over all of $V(S)$. Hence, $T'$ must be a solution for the ROT-RF(+) problem.

Theorem 1 Algorithm OneTreeCompletion solves the ROT-RF(+) problem in $O(|V(S)|)$ time.

Proof Lemma 1 establishes that Algorithm OneTreeCompletion solves the ROT-RF(+) problem. It therefore suffices to show that this algorithm can be implemented in $O(|V(S)|)$ time. We consider the complexity of each of the three 'for' loops separately.

Note that Algorithm OneTreeCompletion computes a single optimal completion, and that optimal completions need not be unique.
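The steps described for Algorithm OneTreeCompletion (coloring, LCA mapping, and pre-order grafting at marked nodes) can be sketched as below. This is a minimal illustration under stated assumptions, not the author's implementation: the `Node` class, the nested-tuple input format, and all helper names are hypothetical, and the LCA is computed by naive parent walking, so the sketch does not achieve the linear running time of Theorem 1.

```python
class Node:
    """Hypothetical rooted-tree node with parent pointers."""
    def __init__(self, label=None, children=()):
        self.label, self.parent, self.children = label, None, list(children)
        for c in self.children:
            c.parent = self

def build(t):
    """Nested tuples -> Node tree (assumed input format)."""
    return Node(children=[build(c) for c in t]) if isinstance(t, tuple) else Node(t)

def to_tuple(n):
    return n.label if not n.children else tuple(to_tuple(c) for c in n.children)

def copy(n):
    return Node(n.label, [copy(c) for c in n.children])

def depth(n):
    d = 0
    while n.parent is not None:
        n, d = n.parent, d + 1
    return d

def lca(a, b):
    """Naive least common ancestor by walking parent pointers."""
    da, db = depth(a), depth(b)
    while da > db:
        a, da = a.parent, da - 1
    while db > da:
        b, db = b.parent, db - 1
    while a is not b:
        a, b = a.parent, b.parent
    return a

def tree_add(root, v, x):
    """Graft x onto the edge above v (or above the root); return the root."""
    p = v.parent
    new = Node(children=[v, x])
    if p is None:
        return new
    p.children[p.children.index(v)] = new
    new.parent = p
    return root

def one_tree_completion(s_tuple, t_tuple):
    S, T = build(s_tuple), build(t_tuple)
    t_leaf, stack = {}, [T]
    while stack:                       # index the leaves of T by label
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            t_leaf[n.label] = n
    mapping, red_children = {}, {}     # LCA mapping; red children of marked nodes

    def color(v):                      # post-order coloring of S
        if not v.children:
            if v.label in t_leaf:
                mapping[v] = t_leaf[v.label]
                return "green"
            return "red"
        cols = [color(c) for c in v.children]
        if all(c == "red" for c in cols):
            return "red"
        kept = [ch for ch, c in zip(v.children, cols) if c != "red"]
        m = mapping[kept[0]]
        for ch in kept[1:]:
            m = lca(m, mapping[ch])
        mapping[v] = m
        red_children[v] = [ch for ch, c in zip(v.children, cols) if c == "red"]
        return "green" if all(c == "green" for c in cols) else "blue"

    color(S)                           # all mappings are fixed before any graft

    def graft(v, root):                # pre-order traversal of S
        for r in red_children.get(v, []):
            root = tree_add(root, mapping[v], copy(r))
        for ch in v.children:
            if ch in mapping:          # red subtrees are copied whole, not visited
                root = graft(ch, root)
        return root

    return to_tuple(graft(S, T))


completed = one_tree_completion(((("a", "x"), "b"), ("c", "y")), (("a", "b"), "c"))
nested = one_tree_completion(("r1", ("r2", ("g1", "g2"))), ("g1", "g2"))
```

In the first call the two missing leaves are grafted next to `a` and `c`, reproducing the topology of S. In the second call two red leaves map to the same node of T, and the pre-order traversal places the graft for the ancestor above the graft for its descendant.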

Solving UOT-RF(+) in linear time
An unrooted tree can be converted into a rooted tree by adding a root node on a chosen edge (thereby splitting the chosen edge into two edges, with the two end points of the chosen edge becoming the two children of the root node). Thus, if the unrooted tree has e edges then there are e ways to root that tree, with each of the e ways resulting in a different rooted tree.
If S and T are unrooted trees then we will show how to compute an optimal completion of T on Le (S) by using Algorithm OneTreeCompletion on appropriately rooted versions of S and T. The following observation establishes a direct relationship between the RF distance between two unrooted trees on the same leaf set and the RF distance between appropriately rooted versions of the two unrooted trees. This observation is also proved in [14].
Observation 1 Let $P$ and $Q$ be unrooted trees on the same leaf set, and $l$ be any leaf node (common to $P$ and $Q$). Let $\hat{P}$ be obtained by rooting $P$ on the edge connecting $l$ to the rest of $P$, and $\hat{Q}$ be obtained by rooting $Q$ on the edge connecting $l$ to the rest of $Q$. Then, $RF(P, Q) = RF(\hat{P}, \hat{Q})$.

Proof Consider any edge $(u, v) \in E(P)$. We will use $P_u$ to denote the subtree containing node $u$ and $P_v$ to denote the subtree containing node $v$, obtained when edge $(u, v)$ is removed from $P$. Edge $(u, v)$ defines the split $\{Le(P_u), Le(P_v)\}$ in $P$. We define a bijection $f : Split(P) \to Clade(\hat{P}) \setminus \{C_{\hat{P}}(l), C_{\hat{P}}(rt(\hat{P}))\}$ from splits in $P$ to clades in $\hat{P}$ as follows. Given any split $\{Le(P_u), Le(P_v)\}$, without loss of generality, we assume that the leaf $l$ occurs in the $P_u$ side of this split, i.e., $l \in Le(P_u)$, and set $f(\{Le(P_u), Le(P_v)\}) = Le(P_v)$. The corresponding bijection for $Q$ is defined analogously. Now consider any split $\{X, Y\} \in Split(P)$ with $l \in X$. If $\{X, Y\} \in Split(Q)$, then $Y$ is a clade of both $\hat{P}$ and $\hat{Q}$. Conversely, suppose $\{X, Y\} \notin Split(Q)$. Again, without loss of generality, we may assume that $l \in X$ and so $f(\{X, Y\}) = Y$. There cannot be any edge $(u, v) \in E(Q)$ for which either $Le(Q_u)$ or $Le(Q_v)$ is equal to $Y$. Thus, there cannot be any node $q$ in $V(\hat{Q})$ for which $C_{\hat{Q}}(q) = Y$. The splits on which $P$ and $Q$ differ therefore correspond one-to-one to the clades on which $\hat{P}$ and $\hat{Q}$ differ.

Lemma 2 Let $S$ and $T$ be unrooted trees such that $Le(T) \subset Le(S)$, and let $l$ be any leaf from $Le(T)$. Let $\hat{S}$ and $\hat{T}$ be obtained by rooting $S$ and $T$, respectively, on the edge connecting $l$ to the rest of the tree. Let $T'$ denote an optimal completion of $T$ on $Le(S)$ and $\hat{T}'$ an optimal completion of $\hat{T}$ on $Le(\hat{S})$. Then $RF(S, T') = RF(\hat{S}, \hat{T}')$.

Proof Observe that $S$ and $T'$ are on the same leaf set. Let $T''$ be obtained by rooting $T'$ on the edge connecting $l$ to the rest of $T'$. The tree $T''$ must be a valid (not necessarily optimal) completion of the tree $\hat{T}$ on $Le(\hat{S})$. Thus, by Observation 1, $RF(S, T') = RF(\hat{S}, T'')$.

Likewise, observe that $\hat{S}$ and $\hat{T}'$ are on the same leaf set. Let $\hat{T}''$ be the unrooted tree obtained by suppressing the root node of $\hat{T}'$. The tree $\hat{T}''$ must be a valid (not necessarily optimal) completion of the tree $T$ on $Le(S)$. Thus, by Observation 1, $RF(\hat{S}, \hat{T}') = RF(S, \hat{T}'')$.

We claim that $T''$ must be an optimal completion of $\hat{T}$ on $Le(\hat{S})$. If not, then $RF(\hat{S}, \hat{T}') < RF(\hat{S}, T'')$, implying that $RF(S, \hat{T}'') < RF(S, T')$, which is a contradiction since $T'$ is an optimal completion of $T$ on $Le(S)$. Thus, we must have $RF(\hat{S}, \hat{T}') = RF(\hat{S}, T'')$, implying that $RF(S, T') = RF(\hat{S}, \hat{T}')$.

Based on the lemma above, we solve the UOT-RF(+) problem as follows:

Algorithm for UOT-RF(+) on input trees $S$ and $T$:
1. Let $l$ be any leaf from $Le(T)$. Construct $\hat{S}$ by rooting $S$ on the edge connecting $l$ to the rest of $S$, and $\hat{T}$ by rooting $T$ on the edge connecting $l$ to the rest of $T$.
2. Call Algorithm OneTreeCompletion with trees $\hat{S}$ and $\hat{T}$ as input. Let $\hat{T}'$ be the tree returned.
3. Convert $\hat{T}'$ into an unrooted tree by suppressing the root node and output the resulting tree.

Theorem 2 The UOT-RF(+) problem can be solved in $O(|V(S)|)$ time.

Proof Let $T^*$ denote the output of the algorithm described above, and let $T'$ denote an optimal completion of $T$ on $Le(S)$. Since $\hat{S}$ and $\hat{T}$ are rooted at a common leaf-edge, $l$, of $S$ and $T$, and since the tree $\hat{T}'$ minimizes $RF(\hat{S}, \hat{T}')$, Lemma 2 implies that $RF(S, T') = RF(\hat{S}, \hat{T}')$.

Now, observe that $S$ and $T^*$ have the same leaf set, and that $l$ is a leaf node common to $S$ and $T^*$. Furthermore, $\hat{S}$ is obtained by rooting $S$ on the edge connecting $l$ to the rest of $S$, and $\hat{T}'$ is obtained by rooting $T^*$ on the edge connecting $l$ to the rest of $T^*$. Thus, by Observation 1, we must have $RF(S, T^*) = RF(\hat{S}, \hat{T}')$. Thus, $RF(S, T^*)$ must be equal to $RF(S, T')$, implying that $T^*$ is an optimal completion of $T$ on $Le(S)$.
The previous fastest algorithm for solving the UOT-RF(+) problem [28] has quadratic time complexity. Our algorithm is able to find edges on which to graft the missing subtrees more efficiently than the algorithm from [28] because we use appropriately rooted versions of the unrooted input trees and then use simple post-order and pre-order tree traversals of the trees coupled with efficient least common ancestor computations.
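Observation 1 can be checked on small examples with the sketch below. It assumes that an unrooted tree is stored as an adjacency dictionary in which leaves are the nodes of degree one; the internal node names (`p1`, `q1`, ...) and all function names are hypothetical, and the RF distance is taken as the size of the symmetric difference of the split or clade sets.

```python
def make_adj(edges):
    """Build an adjacency dictionary from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def root_at_leaf(adj, leaf):
    """Root the unrooted tree on the edge joining `leaf` to the rest of the tree."""
    def down(node, parent):
        kids = [n for n in adj[node] if n != parent]
        return node if not kids else tuple(down(k, node) for k in kids)
    (neighbor,) = adj[leaf]            # a leaf has exactly one neighbor
    return (leaf, down(neighbor, leaf))

def side(adj, node, parent):
    """Leaves on the `node` side of the edge (node, parent)."""
    kids = [n for n in adj[node] if n != parent]
    if not kids:
        return frozenset([node])
    return frozenset().union(*(side(adj, k, node) for k in kids))

def splits(adj):
    """All splits of the unrooted tree, one per edge."""
    return {frozenset([side(adj, u, v), side(adj, v, u)]) for u in adj for v in adj[u]}

def clades(tree):
    """All clades of a rooted nested-tuple tree."""
    found = set()
    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(c) for c in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves
    walk(tree)
    return found


P = make_adj([("a", "p1"), ("b", "p1"), ("p1", "p2"), ("c", "p2"),
              ("p2", "p3"), ("d", "p3"), ("e", "p3")])
Q = make_adj([("a", "q1"), ("c", "q1"), ("q1", "q2"), ("b", "q2"),
              ("q2", "q3"), ("d", "q3"), ("e", "q3")])

rf_unrooted = len(splits(P) ^ splits(Q))
rf_rooted = len(clades(root_at_leaf(P, "a")) ^ clades(root_at_leaf(Q, "a")))
```

Both quantities agree on this example, as the observation predicts. The same rooting step is what allows the unrooted problem to be handed to the rooted completion algorithm.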

The R-RF(+) problem
Observe how an optimal completion of T in the ROT-RF(+) problem maximizes the number of clades that have a match in S. This ensures a meaningful completion of T. However, in the R-RF(+) problem, where both trees may have missing leaves, it is possible that optimal completions of the two trees contain "extraneous" clades that contain leaves from both S and T but do not contain any leaves common to S and T. Extraneous clades are created by pairing a subtree containing only missing leaves from one tree with a subtree containing only missing leaves from the other tree. Such clades can help to lower the RF distance between the two completed trees, but are completely unsupported by the topologies of S and T. This phenomenon is illustrated through an example in Fig. 3. We therefore define a variant of the R-RF(+) problem that only allows completions that do not result in extraneous clades. Crucially, this restriction to only non-extraneous clades also makes the underlying completion problem easier to solve. Note that extraneous clades could indeed be "correct" in some cases, so restricting to non-extraneous clades could sometimes prevent us from considering certain correct clades when computing completions.
Definition 5 (Extraneous clade) Suppose S and T are rooted trees. Given completions S ′ and T ′ of S and T, respectively, on Le (S) ∪ Le (T ) , we define a clade of S ′ or T ′ to be an extraneous clade if it contains leaves from both S and T but no leaves that are common to S and T.
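A direct check of this definition is sketched below. It reads "contains leaves from both S and T" as "contains at least one leaf found only in S and at least one leaf found only in T", which is an interpretation made for this example; the function name is hypothetical.

```python
def is_extraneous(clade, leaves_s, leaves_t):
    """Return True if `clade` mixes S-only and T-only leaves without any common leaf."""
    clade = set(clade)
    only_s = leaves_s - leaves_t       # leaves missing from T
    only_t = leaves_t - leaves_s       # leaves missing from S
    common = leaves_s & leaves_t
    return bool(clade & only_s) and bool(clade & only_t) and not (clade & common)


leaves_s = {"a", "b", "x"}
leaves_t = {"a", "b", "y"}
pairs_missing_leaves = is_extraneous({"x", "y"}, leaves_s, leaves_t)
anchored_by_common_leaf = is_extraneous({"a", "x", "y"}, leaves_s, leaves_t)
```

Here the clade {x, y} is extraneous because it pairs a leaf missing from T with a leaf missing from S, whereas {a, x, y} is not, since it contains the common leaf a.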

Problem 5 (Extraneous-Clade-Free R-RF(+) (EF-R-RF(+))) Given two rooted trees $S$ and $T$, compute a completion $S'$ of $S$ on $Le(S) \cup Le(T)$ and a completion $T'$ of $T$ on $Le(S) \cup Le(T)$ such that $S'$ and $T'$ do not contain any extraneous clades and $RF(S', T')$ is minimized.
An example of an optimal EF-R-RF(+) completion appears in Fig. 3. Next, we show how to solve the EF-R-RF(+) problem in linear time. In the following, we will show that when Algorithm TwoTreeCompletion terminates, the trees S ′ and T ′ returned by the algorithm must be such that they do not contain any extraneous clades, and that RF (S ′ , T ′ ) is the smallest possible for any completion of S and T that does not have extraneous clades. We will assume, without any loss of generality, that S and T have at least one leaf in common; if there are no leaves in common between S and T then the EF-R-RF(+) problem has no solution since any completion of S and T would necessarily contain extraneous clades.
For brevity, in the remainder of this section, we will implicitly assume that all completions of S and T are on the leaf set Le (S) ∪ Le (T ) . Next, we define the notions of original nodes, grafted nodes, and grafted subtrees in tree completions.

Definition 7 (Grafted nodes) Let S ′ and T ′ denote any completions of S and T. Observe that any node u ∈ I(S ′ ) \ O(S ′ ) is either a node that was already present in a subtree from T (consisting of leaves missing from S) as that subtree was grafted into S, or a new node that was created as a subtree from T (consisting of leaves missing from S) was grafted into S. We refer to the new nodes created by the grafting of a subtree from T into S ′ as the grafted nodes of S ′ , denoted G(S ′ ) . Analogously, the set of nodes in I(T ′ ) \ O(T ′ ) that were newly created through the process of grafting a subtree from S into T are called the grafted nodes of T ′ , denoted G(T ′ ).
Definition 8 (Grafted subtrees) If S ′ denotes any completion of S and u ∈ G(S ′ ) , then u is created by the grafting of a subtree of T (consisting of leaves missing from S) at that node u in S ′ . We denote the grafted subtree of T at u by graft(u) . Similarly, if T ′ denotes any completion of T and v ∈ G(T ′ ) , then v is created by the grafting of a subtree of S at that node v in T ′ . We denote the grafted subtree of S at v by graft(v).

Node colorings
For convenience, we will color the nodes of S and T according to the coloring scheme used in Algorithm OneTreeCompletion. Thus, each node of S and T is colored either red, green, or blue. We will assume that these colored nodes maintain their original colors in the completed trees S ′ and T ′ , and thus both S ′ and T ′ contain nodes that are red, green, and blue, as well as nodes that are uncolored.
We now show that the completed trees S ′ and T ′ returned by Algorithm TwoTreeCompletion must be free of extraneous clades.

Lemma 3 The trees S ′ and T ′ returned by Algorithm TwoTreeCompletion do not have any extraneous clades.
Proof Let us first consider the tree T ′ . Any non-original node in T ′ is either a node from a maximal red subtree of S or is a grafted node created by grafting a maximal red subtree of S into T ′ using the Tree-Add operation. Based on Algorithm OneTreeCompletion, each grafted node created through the Tree-Add operation has at least one green descendant, and so it cannot be extraneous. Moreover, any node inside a maximal red subtree of S only has descendants from S, not from T. Thus, since T did not contain any extraneous clades to begin with, neither can T ′ . An analogous argument applies to S ′ .

The next lemma identifies an important property of optimal completions.

Lemma 4 Let S * and T * be any optimal completions of S and T, respectively, under the EF-R-RF(+) problem. Then, for any u ∈ G(S * ) , graft(u) must be a maximal red subtree of T and, for any v ∈ G(T * ) , graft(v) must be a maximal red subtree of S.
Proof Observe that any maximal red subtree of T must appear as is in the tree T * , since grafting a red leaf or subtree from S into any of the red subtrees of T would result in an extraneous clade. We will show that if there exists a node u ∈ G(S * ) for which graft(u) is not a maximal red subtree of T, it is possible to modify the tree S * so that the modified tree has more matched clades than S * , a contradiction. An analogous argument applies to T * . Suppose there exists such a node u. Then, there must exist a red internal node r of T such that the two subtrees, denoted R ′ and R ′′ , rooted at the two children of r appear as is in the tree S * but not as siblings of each other (i.e., their roots do not have the same parent in S * ). Let r ′ and r ′′ denote the root nodes of R ′ and R ′′ , respectively, and s ′ and s ′′ denote the parents of r ′ and r ′′ in S * . Thus, R ′ = graft(s ′ ) and R ′′ = graft(s ′′ ) . Now, observe that all clades of S * rooted either at a node on the path from lca S * (s ′ , s ′′ ) to s ′ or on the path from lca S * (s ′ , s ′′ ) to s ′′ , except for the node lca S * (s ′ , s ′′ ) itself, must be mismatched clades (since all maximal red subtrees of T appear as is in the tree T * ). Also, note that if S * is modified by pruning out the subtree R ′ and regrafting it on the edge (s ′′ , r ′′ ) then the only matched clades that can become mismatched are the ones whose roots lie on the path from lca S * (s ′ , s ′′ ) to s ′ or from lca S * (s ′ , s ′′ ) to s ′′ , except for node lca S * (s ′ , s ′′ ) . Thus, modifying the tree S * in this fashion does not result in any additional mismatched clades, but results in a new matched clade rooted at the node where R ′ is regrafted. Thus, the modified tree has a larger number of matched clades than S * , which is a contradiction.
We also have the following simple observation about optimal completions.
Observation 2 Let S * and T * be optimal completions of S and T, respectively, that satisfy the property described in Lemma 4. Then any u ∈ G(S * ) and any v ∈ G(T * ) must have at least one green leaf as a descendant.
Proof This follows immediately from the fact that, under EF-R-RF(+), each clade must contain at least one green leaf (otherwise it would be an extraneous clade).
Finally, the following lemma proves the correctness of Algorithm TwoTreeCompletion.
Lemma 5 Let S ′ and T ′ denote the completions of S and T, respectively, returned by Algorithm TwoTreeCompletion. Let S * and T * denote optimal completions of S and T, respectively, under the EF-R-RF(+) problem. Then, RF (S ′ , T ′ ) = RF (S * , T * ).
Proof Based on Lemma 4, we know that S * and T * are such that, for any u ∈ G(S * ) , graft(u) is a maximal red subtree of T, and for any v ∈ G(T * ) , graft(v) is a maximal red subtree of S. Furthermore, observe that, given the tree T * , we can compute an optimal completion for S with respect to T * by using Algorithm OneTreeCompletion. Thus, without any loss of generality, we will assume that S * has the topology that would be computed by Algorithm OneTreeCompletion on input (T * , S).
To prove this lemma, it suffices to show that the number of matched clades in T ′ (with respect to S ′ ) is no less than the number of matched clades in T * (with respect to S * ). We first define a one-to-one correspondence between the internal nodes of T ′ and the internal nodes of T * . Consider any node t ∈ I(T ′ ) . There are three possibilities: (i) t ∈ O(T ′ ) , (ii) t ∈ G(T ′ ) , and (iii) t is a node from a maximal red subtree of S. For case (i), observe that O(T ′ ) = O(T * ) , and so if t ∈ O(T ′ ) then a counterpart of t also exists in T * . For case (ii) observe that each t ∈ G(T ′ ) is created by grafting graft(t) into T ′ . We will associate t with that unique node of T * that is created by grafting the same maximal red subtree of S, graft(t) , into T * . For case (iii), since the same maximal red subtree of S also appears in T * , the node associated with t is the same node from the same maximal red subtree of S in T * . We denote the node of I(T * ) corresponding to node t ∈ I(T ′ ) by f(t). It is not difficult to see that f : I(T ′ ) → I(T * ) is one-to-one and onto.
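The reduction to counting matched clades can be made concrete with a small sketch. It assumes rooted binary trees on the same leaf set, encoded as nested 2-tuples of leaf labels (an encoding chosen here for illustration, not one used in the paper), and it computes the RF distance as the size of the symmetric difference of the two clade sets. It is a simple set-based illustration and not the linear-time procedure; the names `clades`, `matched_clades`, and `rf_distance` are hypothetical.

```python
from typing import FrozenSet, Set, Tuple, Union

# Assumed encoding: a tree is a leaf label or a pair (left, right).
Tree = Union[str, Tuple["Tree", "Tree"]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf set below every internal node in post order."""
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def matched_clades(s: Tree, t: Tree) -> int:
    """Number of clades that appear in both trees."""
    return len(clades(s) & clades(t))


def rf_distance(s: Tree, t: Tree) -> int:
    """Clades present in exactly one of the two trees."""
    return len(clades(s) ^ clades(t))


# Two illustrative trees on the leaf set {a, b, c, d, e}.
S = ((("a", "b"), "c"), ("d", "e"))
T = ((("a", "c"), "b"), ("d", "e"))
print(matched_clades(S, T))  # 3
print(rf_distance(S, T))     # 2
```

Because both trees have the same number of internal nodes, the distance here equals (4 − 3) + (4 − 3) = 2, so more matched clades means a smaller RF distance, which is the quantity the argument compares.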
We now traverse the nodes of T ′ in post order and identify the first node t ∈ I(T ′ ) for which C T ′ (t) is a mismatch in S ′ but C T * (f (t)) is a match in S * . If no such node exists then the number of matched clades in T * could not be more than the number of matched clades in T ′ , completing our proof. Thus, suppose such a node t exists. We again have three possible cases depending on whether (i) t ∈ O(T ′ ) , (ii) t ∈ G(T ′ ) , or (iii) t is a node from a maximal red subtree of S. We consider each of these cases separately.
Case (i): t ∈ O(T ′ ) . In this case, C T ′ (t) must be a proper subset of C T * (f (t)) . This is because if C T ′ (t) = C T * (f (t)) then both clades would either be matches or both would be mismatches, while if T ′ (t) contains a grafted subtree not present in T * (f (t)) then C T * (f (t)) could not possibly be a matched clade. Thus, there must be at least one maximal red subtree of S that is grafted on an edge in T * (f (t)) but not on an edge in T ′ (t) . We let X denote the set of such maximal red subtrees, and let G * denote the set of grafted nodes corresponding to these maximal red subtrees from X in the tree T * .
Let a be any node on the path from f(t) to any g ∈ G * in T * , except for the node f(t) itself. Since T ′ is computed by executing Algorithm OneTreeCompletion on input (S, T), and no subtree from X is grafted inside the subtree T ′ (t) , T * (a) cannot be a matched clade. We can therefore modify T * by cutting out all subtrees of X from T * (f (t)) and grafting them onto the parent edge of f(t) (in any arbitrary order if |X| > 1 ). Let T * M denote this modified version of T * , and let g * denote the newly created grafted node that is closest to rt (T * M ) along the path from rt (T , and so C T * M (g * ) must be a matched clade in T * M , while C T * M (f (t)) is no longer a matched clade. Thus, overall, the number of matched clades in T * M is the same as the number of matched clades in T * . If we now assign T * to be T * M then node t is no longer such that C T ′ (t) is a mismatch in S ′ but C T * (f (t)) is a match in S * . Moreover, observe that any grafted node corresponding to a maximal red subtree from X in the tree T ′ must lie along the path from rt (T ′ ) to t (otherwise C T * (f (t)) could not be a matched clade). Thus, the nodes of I(T ′ ) that have already been considered so far in the post-order traversal remain unaffected by the change in the topology of T * .
Case (ii): t ∈ G(T ′ ) . The argument in this case is similar to that from case (i). As before, C T ′ (t) must be a proper subset of C T * (f (t)) . This is because if C T ′ (t) = C T * (f (t)) then both clades would either be matches or both would be mismatches, while if T ′ (t) contains a grafted subtree not present in T * (f (t)) then C T * (f (t)) could not possibly be a matched clade. There are therefore two possibilities: T * (f (t)) either includes an original node r ∈ O(T * ) for which the corresponding original node in T ′ is an ancestor of t, or T * (f (t)) does not include such an original node.
For the first possibility, T * (f (t)) includes an original node r ∈ O(T * ) for which the corresponding original node in T ′ is an ancestor of t. Without loss of generality, let r denote that original node from T * (f (t)) that is closest to f(t). Let a be any node along the path from f(t) to r, except for f(t) itself. Observe that a must be a grafted node and that C T * (a) cannot be a match since it does not include the subtree graft(t) . We can therefore modify T * by cutting out all grafted subtrees along the path from
We will show that the completed unrooted trees, S ′ and T ′ , returned by the above algorithm must be extraneous-split-free and minimize RF (S ′ , T ′ ).

Lemma 6 The trees S ′ and T ′ returned by the above algorithm do not have any extraneous splits.
Proof Since the trees S ′ and T ′ computed in the above algorithm do not have any extraneous clades, each clade in Clade ( S ′ ) and Clade ( T ′ ) must have at least one leaf from Le (S) ∩ Le (T ) . Now, consider any leaf l ′ ∈ Le (S) ∩ Le (T ) , and let S ′′ be obtained by rooting S ′ on the edge connecting l ′ to the rest of S ′ , and T ′′ be obtained by rooting T ′ on the edge connecting l ′ to the rest of T ′ . Observe that any clade in Clade ( S ′′ ) must either be a clade from Clade ( S ′ ) or must contain the leaf l ′ . Likewise, any clade in Clade ( T ′′ ) must either be a clade from Clade ( T ′ ) or must contain the leaf l ′ . Thus, neither S ′′ nor T ′′ contains any extraneous clades, and so, by the definition of an extraneous split, the trees S ′ and T ′ must be free of any extraneous splits.
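As an illustration of the rooting step used in this argument, the sketch below roots two unrooted trees on the edge attached to a shared leaf and checks that the clade-based RF distance of the rooted trees equals the split-based RF distance of the unrooted trees. The adjacency-list encoding, the internal node labels `x`, `y`, `z`, and all function names are assumptions made for this example; the traversal is deliberately simple and not linear-time.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Adjacency = Dict[str, List[str]]  # node -> neighbours; leaves have degree 1


def from_edges(edges: Iterable[Tuple[str, str]]) -> Adjacency:
    """Build an undirected adjacency list from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return dict(adj)


def side_leaves(adj: Adjacency, start: str, blocked: str) -> FrozenSet[str]:
    """Leaves reachable from `start` without crossing the edge to `blocked`."""
    seen = {blocked, start}
    stack = [start]
    found = set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:
            found.add(node)
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return frozenset(found)


def rooted_clades(adj: Adjacency, leaf: str) -> Set[FrozenSet[str]]:
    """Clades after rooting on the edge at `leaf` (root clade itself omitted)."""
    result = set()
    stack = [(adj[leaf][0], leaf)]
    while stack:
        node, parent = stack.pop()
        if len(adj[node]) == 1:
            continue  # a leaf contributes no internal clade
        result.add(side_leaves(adj, node, parent))
        stack.extend((nb, node) for nb in adj[node] if nb != parent)
    return result


def splits(adj: Adjacency) -> Set[FrozenSet[FrozenSet[str]]]:
    """Non-trivial splits, one per internal edge, as unordered pairs of sides."""
    all_leaves = frozenset(n for n in adj if len(adj[n]) == 1)
    result = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[u]) > 1 and len(adj[v]) > 1:
                side = side_leaves(adj, u, v)
                result.add(frozenset({side, all_leaves - side}))
    return result


# Two illustrative unrooted trees on the leaf set {a, b, c, d, e}.
A = from_edges([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
B = from_edges([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
                ("y", "z"), ("d", "z"), ("e", "z")])
print(len(splits(A) ^ splits(B)))                          # 2
print(len(rooted_clades(A, "e") ^ rooted_clades(B, "e")))  # 2
```

Rooting both trees at the same leaf turns each internal edge into exactly one clade (the side away from that leaf), which is why the two counts agree in this example.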

Lemma 7 Let S and T be unrooted trees with partially overlapping leaf sets and | Le (S) ∩ Le (T )| ≥ 2 . Let S ′ be an optimal completion of S and T ′ be an optimal completion of T, on Le (T ) ∪ Le (S) , such that S ′ and T ′ do not contain any extraneous splits and minimize RF (S ′ , T ′ ) . Let l be any leaf node common to S and T. Let Ŝ be obtained by rooting S on the edge connecting l to the rest of S, and T̂ be obtained by rooting T on the edge connecting l to the rest of T. If Ŝ ′ and T̂ ′ are optimal completions of Ŝ and T̂ , respectively, under the EF-R-RF(+) problem, then RF (S ′ , T ′ ) = RF (Ŝ ′ , T̂ ′ ).
Proof Observe that S ′ and T ′ are on the same leaf set. Let T ′′ be obtained by rooting T ′ on the edge connecting l to the rest of T ′ , and S ′′ be obtained by rooting S ′ on the edge connecting l to the rest of S ′ . The trees T ′′ and S ′′ must be valid (but not necessarily optimal) completions of the trees T̂ and Ŝ under the EF-R-RF(+) problem. Thus, by Observation 1, RF (S ′ , T ′ ) = RF (S ′′ , T ′′ ).
Likewise, observe that Ŝ ′ and T̂ ′ are on the same leaf set. Let Ŝ ′′ and T̂ ′′ be the unrooted trees obtained by suppressing the root nodes of Ŝ ′ and T̂ ′ , respectively. As shown in Lemma 6, the trees Ŝ ′′ and T̂ ′′ must be valid (not necessarily optimal) completions of S and T under the EF-U-RF(+) problem. Thus, by Observation 1, RF (Ŝ ′ , T̂ ′ ) = RF (Ŝ ′′ , T̂ ′′ ).
We claim that S ′′ and T ′′ must be optimal completions of Ŝ and T̂ , respectively, on Le (T ) ∪ Le (S) . If not, then RF (Ŝ ′ , T̂ ′ ) < RF (S ′′ , T ′′ ) , implying that RF (Ŝ ′′ , T̂ ′′ ) < RF (S ′ , T ′ ) , which is a contradiction since S ′ and T ′ are optimal completions of S and T under the EF-U-RF(+) problem. Thus, we must have RF (S ′ , T ′ ) = RF (S ′′ , T ′′ ) = RF (Ŝ ′ , T̂ ′ ).
Lemma 7 proves that the algorithm described above correctly solves the EF-U-RF(+) problem. Furthermore, note that the time complexity of the algorithm above is dominated by the time complexity of Algorithm TwoTreeCompletion, which is O(|V (S)| + |V (T )|) . Thus, we immediately have the following theorem.

Experimental evaluation
We implemented our algorithm for the ROT-RF(+) problem and applied it to three large biological supertree data sets with the goal of assessing the impact of using RF(+) distance instead of the traditional RF(−) distance in practice. Specifically, we computed a supertree (using a standard supertree method; RFS [13] in this case) for each of the supertree data sets, and computed the RF(+) and RF(−) distances between the supertree and the input trees for each data set. Let the RF(+) distance between a supertree S and an input tree I be denoted by RF + (S, I) , and the RF(−) distance between those two trees by RF − (S, I) . For each data set, we ordered the input trees according to their RF(+) and RF(−) distances to the supertree and measured how often the relative ranking between any pair of input trees differs between the two rankings. More precisely, given a supertree S and its set of input trees I , we computed RF − (S, I) and RF + (S, I) for each I ∈ I , and counted the number of Type-1, Type-2, and Type-3 pairs.
The three data sets, marsupials [32], placental mammals [33], and legumes [34], contain 272, 116, and 571 species, and 158, 726, and 22 input trees, respectively. We observed that for the 158 input trees of the marsupial data set, there were 521 Type-1 pairs, 619 Type-2 pairs, and 376 Type-3 pairs. For the 726 input trees of the placental mammals data set, there were 5,816 Type-1 pairs, 14,344 Type-2 pairs, and 6,238 Type-3 pairs. Likewise, for the 22 input trees in the legumes data set, we observed 8 Type-1 pairs, 3 Type-2 pairs, and no Type-3 pairs. These results, summarized in Table 1, show that there can be a substantial difference between RF(−) and RF(+) distances.
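A pairwise comparison of two rankings of this kind could be sketched as follows. The three categories used here (same relative order, tied under exactly one measure, reversed order) are an assumption made for illustration and are not claimed to coincide with the Type-1, Type-2, and Type-3 pairs counted above; the distance values are made up, and `compare_rankings` is a hypothetical helper.

```python
from itertools import combinations
from typing import Dict, Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def compare_rankings(rf_minus: Sequence[float],
                     rf_plus: Sequence[float]) -> Dict[str, int]:
    """Classify every pair of input trees by how the two measures order it."""
    counts = {"same_order": 0, "tie_in_one": 0, "reversed": 0}
    for i, j in combinations(range(len(rf_minus)), 2):
        a = sign(rf_minus[i] - rf_minus[j])  # order under RF(-)
        b = sign(rf_plus[i] - rf_plus[j])    # order under RF(+)
        if a == b:
            counts["same_order"] += 1
        elif a == 0 or b == 0:
            counts["tie_in_one"] += 1
        else:
            counts["reversed"] += 1
    return counts


# Made-up distances from four input trees to one supertree.
rf_minus = [2, 4, 4, 6]
rf_plus = [10, 8, 12, 12]
print(compare_rankings(rf_minus, rf_plus))
# {'same_order': 3, 'tie_in_one': 2, 'reversed': 1}
```

The loop examines all pairs and is therefore quadratic in the number of input trees, which is unproblematic at the scale of a few hundred trees.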

Conclusion
In this work, we provide the first optimal, linear-time algorithms for two fundamental computational problems that arise when comparing phylogenetic trees with non-identical leaf sets. For the first problem, which arises when computing the RF(+) distance between two trees where the leaf set of one tree is a proper subset of the other, we improved upon the time complexity of the previous fastest algorithm by a factor of n, where n is the number of leaves in the larger of the two trees. For the second problem, which arises when computing the RF(+) distance between two trees that have only partially overlapping leaf sets, and for which there are no existing algorithms, we defined a useful restriction of the problem and provided an optimal linear-time algorithm for it. These algorithms make it as computationally efficient to compute RF(+) distances as RF(−) distances. The algorithms work for both rooted and unrooted trees, and can be directly applied wherever phylogenetic distances must be computed between trees with non-identical leaf sets. Furthermore, our experiments with three large biological supertree data sets suggest that using the RF(+) distance can result in very different relative estimates of phylogenetic distances compared to using the RF(−) distance.
The algorithms presented here have several important, well-established applications, including construction of majority-rule(+) supertrees and supertree construction in general, phylogenetic database search, and clustering of phylogenetic trees, and these applications should be studied and developed further. A more detailed experimental study is needed to properly assess the impact of using RF(+) distances and to systematically study the effect of factors such as fraction of leaf set overlap and degree of discordance between trees. This work also motivates several theoretical questions for future investigation. For instance, our algorithms for the EF-R-RF(+) and EF-U-RF(+) problems cannot be easily extended to solve the R-RF(+) and U-RF(+) problems. In particular, if optimal completions are allowed to contain extraneous clades then inferring the number and composition of these extraneous clades (to attain overall optimality) appears to be computationally challenging. It would be interesting to determine if linear or near-linear time algorithms exist for R-RF(+) and U-RF(+).