Terminology and basics
A phylogenetic tree can be represented as a tree T with leaves labeled by some set of organisms S. If each leaf label is unique, then the phylogenetic tree is singly-labeled. Unless noted otherwise, the phylogenetic trees we describe throughout this paper are singly-labeled and unrooted.
Each edge e in an unrooted, singly-labeled phylogenetic tree defines a bipartition \(\pi _e\) (also sometimes referred to as a split) on the set of leaf labels, induced by the deletion of e from the tree, but not its endpoints. Each bipartition splits the leaf set into two non-empty disjoint parts, A and B, and is denoted by A|B. The set of bipartitions of a tree T is given by C(T) = {\(\pi _e\) : \(e \in E(T)\)}, where E(T) is the edge set for T. Tree \(T'\) is a refinement of T if T can be obtained from \(T'\) by contracting a set of edges in \(E(T')\). A tree T is fully resolved (i.e., binary) if there is no tree that refines T other than itself.
A set Y of bipartitions on some leaf set S is compatible if there exists an unrooted tree T leaf-labeled by S such that Y \(\subseteq\) C(T). A bipartition \(\pi\) of a set S is said to be compatible with a tree T with leaf set S if and only if there is a tree \(T'\) such that \(C(T') = C(T) \cup \{\pi \}\) (i.e., \(T'\) is a refinement of T that includes the bipartition \(\pi\)). Similarly, two trees on the same leaf set are said to be compatible if they share a common refinement. An important result on compatibility is that pairwise compatibility of a set of bipartitions over a leaf set ensures setwise compatibility [19, 20]; it then follows that two trees are compatible if and only if the union of their sets of bipartitions is compatible. Furthermore, by [21] (and see discussion in [22, 23]), a set \(\mathcal {C}\) of bipartitions is compatible if and only if there is a tree T such that \(C(T)=\mathcal {C}.\)
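The pairwise criterion can be checked directly. The following is a minimal sketch, not code from the paper: it assumes each bipartition is stored as a `frozenset` of its two sides and uses the standard fact that two bipartitions A|B and A'|B' of the same leaf set are compatible exactly when at least one of the four intersections of a side of one with a side of the other is empty. The helper names `split`, `compatible`, and `setwise_compatible` are hypothetical.

```python
from itertools import combinations

def split(side, leaf_set):
    """Build the bipartition side | (leaf_set - side) as a frozenset of two sides."""
    side = frozenset(side)
    return frozenset([side, frozenset(leaf_set) - side])

def compatible(p, q):
    """True if at least one of the four side-by-side intersections is empty."""
    return any(not (x & y) for x in p for y in q)

def setwise_compatible(bipartitions):
    """Pairwise compatibility implies setwise compatibility [19, 20]."""
    return all(compatible(p, q) for p, q in combinations(bipartitions, 2))

S = "abcde"
ab, abc, ac = split("ab", S), split("abc", S), split("ac", S)

print(setwise_compatible([ab, abc]))  # ab|cde and abc|de can share a tree
print(setwise_compatible([ab, ac]))   # ab|cde and ac|bde cannot
```

Here `ab` and `abc` are compatible because the side {a, b} does not meet {d, e}, whereas every side of `ab` meets every side of `ac`.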
The Robinson-Foulds (RF) distance [17] between two trees T and \(T'\) on the same set of leaves is defined as the minimum number of edge-contractions and refinements required to transform T into \(T'\) (where each such operation changes the number of edges in the tree by exactly one, so contracting a single edge or refining a polytomy to add a single edge). For singly-labeled trees, the RF distance equals the number of bipartitions present in only one tree (i.e., the symmetric difference). The normalized RF distance is the RF distance divided by \(2n-6\), where n is the number of leaves in each tree; this produces a value between 0 and 1 since the two trees can only disagree with respect to internal edges, and \(n-3\) is the maximum number of internal edges in an unrooted tree with n leaves.
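For singly-labeled trees the symmetric-difference form of the RF distance is easy to compute. The sketch below is illustrative only: it assumes trees are written as nested Python tuples of leaf-name strings (the top-level tuple is an arbitrary rooting of the unrooted tree), and it collects only non-trivial bipartitions, since the bipartitions of pendant edges appear in every tree on the same leaf set and cancel in the symmetric difference. The function names are hypothetical.

```python
def leaves(tree):
    """Leaf set of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions (both sides of size >= 2) of the tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        rest = all_leaves - clade
        if len(clade) > 1 and len(rest) > 1:
            found.add(frozenset([clade, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return found

def rf(t1, t2):
    """RF distance: size of the symmetric difference of the bipartition sets."""
    return len(bipartitions(t1) ^ bipartitions(t2))

def normalized_rf(t1, t2):
    """RF distance divided by 2n - 6, for trees on the same n >= 4 leaves."""
    n = len(leaves(t1))
    return rf(t1, t2) / (2 * n - 6)

tree1 = (("a", "b"), "c", ("d", "e"))   # bipartitions ab|cde and de|abc
tree2 = (("a", "c"), "b", ("d", "e"))   # bipartitions ac|bde and de|abc

print(rf(tree1, tree2), normalized_rf(tree1, tree2))
```

The two example trees share de|abc and differ in one bipartition each, so the RF distance is 2 and, with n = 5, the normalized distance is 2/4.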
Given a phylogenetic tree T on taxon set S, T restricted to \(R \subseteq S\) is the minimal subgraph of T connecting elements of R and suppressing nodes of degree two. We denote this as \(T_R\). If T and \(T'\) are two trees with R as the intersection of their leaf sets, their shared edges are edges whose bipartitions restricted to R are in the set \(C(T_R)\cap C(T'_R)\). Correspondingly, their unique edges are edges whose bipartitions restricted to R are not in the set \(C(T_R)\cap C(T'_R)\). See Fig. 1 for a pictorial depiction of unique and shared edges.
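Shared and unique edges can be identified from bipartition sets alone. The following sketch assumes the same `frozenset`-of-two-sides representation as a convenience (it is not the paper's data structure) and relies on the observation that the non-trivial bipartitions of \(T_R\) are exactly the restrictions to R of bipartitions of T that remain non-trivial. The names `split`, `restrict`, and the example trees are hypothetical.

```python
def split(side, leaf_set):
    """Build the bipartition side | (leaf_set - side)."""
    side = frozenset(side)
    return frozenset([side, frozenset(leaf_set) - side])

def restrict(bipartitions, R):
    """Restrict each bipartition to R, keeping only non-trivial results."""
    R = frozenset(R)
    kept = set()
    for bip in bipartitions:
        a, b = (part & R for part in bip)
        if len(a) > 1 and len(b) > 1:
            kept.add(frozenset([a, b]))
    return kept

S = "abcdef"   # leaf set of the reference tree T
R = "abdef"    # leaf set of the gene tree t (species c is missing)

C_T = {split("ab", S), split("abc", S), split("abcd", S)}  # caterpillar tree
C_t = {split("ab", R), split("abe", R)}

C_TR = restrict(C_T, R)
shared = C_TR & restrict(C_t, R)

# Edges of T whose restriction to R is non-trivial and not shared with t.
unique_in_T = {
    bip for bip in C_T
    if restrict({bip}, R) and not restrict({bip}, R) <= shared
}
print(len(C_TR), len(shared), len(unique_in_T))
```

In this example two edges of T (ab|cdef and abc|def) restrict to the same shared bipartition ab|def, while abcd|ef restricts to abd|ef, which t does not contain.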
RF-optimal tree refinement and completion (RF-OTRC) problem
We now turn our attention to the optimization problem of interest to this paper. This section is limited to the context of singly-labeled trees; we postpone the extension to cases where the gene tree can have multiple copies of a species at the leaves, which are referred to as multi-labeled trees (i.e., MUL-trees [24]), until a later section.
In the RF-OTRC problem, we are given an unrooted, singly-labeled tree t on leaf set \(R \subseteq S\) and an unrooted, singly-labeled binary tree T on leaf set S, and we seek a binary tree \(T'\) on S that minimizes \(RF(T',T)\) subject to \(T'_R\) refining t. If the trees t and T have the same set of taxa, then the RF-OTRC problem becomes the RF-optimal tree refinement (RF-OTR) problem, while if t is already binary but can be missing taxa, then the RF-OTRC problem becomes the RF-optimal tree completion (RF-OTC) problem. OCTAL, presented in [25], solves the RF-OTC problem in \(O(n^2)\) time, and an improved approach presented by Bansal [26] solves the RF-OTC problem in linear time. We refer to this faster approach as Bansal’s algorithm. In this paper we present an algorithm that solves the RF-OTR problem exactly in polynomial time and show that the combination of this algorithm with Bansal’s algorithm solves the RF-OTRC problem exactly in \(O(n^{1.5} \log n)\) time, where T has n leaves. We refer to the two steps together as Tree Refinement And CompleTION (TRACTION).
TRACTION algorithm
The input to TRACTION is a pair of unrooted, singly-labeled trees (t, T), where t is the estimated gene tree on set R of species and T is the binary reference tree on S, with \(R \subseteq S\). Note that we allow t to be non-binary (e.g., if low-support edges have already been collapsed) and to be missing species (i.e., \(R \subset S\) is possible).

Step 1: Refine t so as to produce a binary tree \(t^*\) that maximizes shared bipartitions with T.

Step 2: Add the missing species from T into \(t^*\), minimizing the RF distance.
Step 1: Greedy refinement of t
To compute \(t^*\), we first refine t by adding all bipartitions from \(T_{R}\) that are compatible with t; this produces a unique tree \(t'\). If \(t'\) is not fully resolved, then there are multiple optimal solutions to the RF-OTR problem, as we will later prove. The algorithm selects one of these optimal solutions as follows. First, we add edges from t that were previously collapsed (if such edges are available). Next, we randomly refine the tree until we obtain a fully resolved refinement, \(t^*\). Note that if \(t'\) is not binary, then \(t^*\) is not unique. We now show that the first step of TRACTION solves the RF-OTR problem.
Theorem 1
Let T be an unrooted, singly-labeled tree on leaf set S, and let t be an unrooted, singly-labeled tree on leaf set \(R \subseteq S\). A fully resolved (i.e., binary) refinement of t minimizes the RF distance to \(T_{R}\) if and only if it includes all compatible bipartitions from \(T_{R}\).
Proof
Let \(C_0\) denote the set of bipartitions in \(T_R\) that are compatible with t. By the theoretical properties of compatible bipartitions (see “Terminology and basics” section), this means the set \(C_0 \cup C(t)\) is a compatible set of bipartitions that defines a unique tree \(t'\) where \(C(t')=C_0 \cup C(t)\) (since the trees are singly-labeled).
We now prove that for any binary tree B refining t, B minimizes the RF distance to \(T_R\) if and only if B refines \(t'\).
Consider a sequence of trees \(t=t_0, t_1, t_2, \ldots , t_k\), each on leaf set R, where \(t_i\) is obtained from \(t_{i-1}\) by adding one edge to \(t_{i-1}\), and thus adds one bipartition to \(C(t_{i-1})\). Let \(\delta _i=RF(t_{i},T_R) - RF(t_{i-1},T_R)\), so that \(\delta _i\) indicates the change in RF distance produced by adding a specific edge to \(t_{i-1}\) to get \(t_i\). Hence,
$$\begin{aligned} RF(t_i,T_R) = RF(t_0,T_R) + \sum _{j \le i} \delta _j. \end{aligned}$$
A new bipartition \(\pi _i\) added to \(C(t_{i-1})\) is in \(C(T_R)\) if and only if \(\pi _i \in C_0\). If this is the case, then the RF distance will decrease by one (i.e., \(\delta _i =-1\)). Otherwise, \(\pi _i \not \in C_0\), and the RF distance to \(T_R\) will increase by one (i.e., \(\delta _i =1\)).
Now suppose B is a binary refinement of t. We can partition the bipartitions in \(C(B) \setminus C(t)\) into two sets, X and Y, where X contains the bipartitions in \(C_0\) and Y contains the bipartitions not in \(C_0\). By the argument just provided, it follows that \(RF(B,T_R) = RF(t,T_R) - |X| + |Y|\). Note that \(|X \cup Y|\) must be the same for all binary refinements of t, because all binary refinements of t have the same number of edges. Thus, \(RF(B,T_R)\) is minimized when \(|X|\) is maximized, so B minimizes the RF distance to \(T_R\) if and only if C(B) contains all the bipartitions in \(C_0\). In other words, \(RF(B,T_R)\) is minimized if and only if B refines \(t'\). \(\square\)
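As a worked restatement of the counting step in this proof (a sketch, assuming that \(C(\cdot )\) contains one bipartition per edge, pendant edges included, so that a binary unrooted tree on \(|R|\) leaves has \(2|R|-3\) bipartitions), the quantity \(|X|+|Y|\) is fixed by t alone, and the RF distance depends on B only through \(|X|\):
$$\begin{aligned} |X| + |Y|&= |C(B)| - |C(t)| = (2|R| - 3) - |C(t)|, \\ RF(B,T_R)&= RF(t,T_R) - |X| + |Y| = RF(t,T_R) + \left( 2|R| - 3 - |C(t)|\right) - 2|X|. \end{aligned}$$
Each additional bipartition from \(C_0\) therefore lowers the RF distance by two relative to a refinement that uses a bipartition outside \(C_0\) in its place.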
Corollary 1
TRACTION finds an optimal solution to the RF-OTR problem.
Proof
Given input gene tree t and reference tree T on the same leaf set, TRACTION produces a tree \(t''\) that refines t and contains every bipartition in T compatible with t; hence by Theorem 1, TRACTION solves the RF-OTR problem. \(\square\)
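The deterministic part of Step 1, computing \(C(t') = C(t) \cup C_0\), can be sketched on bipartition sets as follows. This is an illustrative quadratic-time sketch under assumed conventions (bipartitions stored as `frozenset`s of two sides, only non-trivial bipartitions tracked); it is not the faster procedure used in the running time analysis, and it omits the final random resolution of any remaining polytomies. All function names are hypothetical.

```python
def split(side, leaf_set):
    """Build the bipartition side | (leaf_set - side)."""
    side = frozenset(side)
    return frozenset([side, frozenset(leaf_set) - side])

def compatible(p, q):
    """True if some side of p is disjoint from some side of q."""
    return any(not (x & y) for x in p for y in q)

def greedy_refine(c_t, c_TR):
    """Add every bipartition of T_R that is compatible with all of C(t).

    Bipartitions of T_R are mutually compatible, so checking each one
    against C(t) alone suffices (pairwise implies setwise compatibility).
    """
    c_0 = {p for p in c_TR if all(compatible(p, q) for q in c_t)}
    return set(c_t) | c_0

def is_fully_resolved(bips, n_leaves):
    """A binary unrooted tree on n leaves has n - 3 internal edges."""
    return len(bips) == n_leaves - 3

R = "abcdef"
c_TR = {split("ab", R), split("abc", R), split("abcd", R)}

refined_1 = greedy_refine({split("ac", R)}, c_TR)  # gains abc|def and abcd|ef
refined_2 = greedy_refine({split("ad", R)}, c_TR)  # gains only abcd|ef

print(is_fully_resolved(refined_1, len(R)))
print(is_fully_resolved(refined_2, len(R)))
```

In the second example the refined tree still has a polytomy, which is the case where TRACTION would go on to restore collapsed edges and then resolve randomly.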
Step 2: Adding in missing species
The second step of TRACTION can be performed using OCTAL or Bansal’s algorithm, each of which finds an optimal solution to the RF-OTC problem in polynomial time. Indeed, we show that any method that optimally solves the RF-OTC problem can be used as an intermediate step to solve the RF-OTRC problem.
To prove this, we first restate several prior theoretical results. In [25] we showed the minimum achievable RF distance between T and \(T'\) is given by:
$$\begin{aligned} RF(T,T') = RF(T_R,t) + 2m \end{aligned} \tag{1}$$
where m is the number of Type II superleaves in T relative to t, which we define:
Definition 1
Let T be a binary tree on leaf set S and t be a tree on leaf set \(R \subseteq S\). The superleaves of T with respect to t are defined as follows (see Fig. 1). The set of edges in T that are on a path between two leaves in R define the backbone; when this backbone is removed, the remainder of T breaks into pieces. The components of this graph that contain vertices from \(S \setminus R\) are the superleaves. Each superleaf is rooted at the node that was incident to one of the edges in the backbone, and is one of two types:

Type I superleaves: the edge e in the backbone to which the superleaf was attached is a shared edge in \(T_R\) and t

Type II superleaves: the edge e in the backbone to which the superleaf was attached is a unique edge in \(T_R\) and t
Theorem 2
(Restatement of Theorem 9 in [25]) Given unrooted, singly-labeled binary trees t and T with the leaf set of t a subset of the leaf set S of T, OCTAL(T, t) solves the RF-OTC problem and runs in \(O(n^2)\) time, where T has n leaves.
Proof of correctness for TRACTION
Lemma 1
Let T be an unrooted, singly-labeled, binary tree on leaf set S with \(|S|=n\), and let t be an unrooted, singly-labeled tree on leaf set \(R \subseteq S\). TRACTION returns a binary unrooted tree \(T'\) on leaf set S such that \(RF(T',T)\) is minimized subject to \(T'_{R}\) refining t.
Proof
By construction, TRACTION outputs a tree \(T'\) that, when restricted to the leaf set of t, is a refinement of t. Hence, it is clear that \(T'_{R}\) refines t. Now, it is only necessary to prove that RF(\(T'\), T) is minimized by TRACTION. Since the intermediate tree \(t^*\) produced in the first step of TRACTION is binary, Theorem 2 gives that TRACTION using OCTAL (or any method exactly solving the RF-OTC problem) will add leaves to \(t^*\) in such a way as to minimize the RF distance to T; hence it suffices to show that \(t^*\) computed by TRACTION has the smallest RF distance to T among all binary refinements of t.
As given in Eq. 1, the optimal RF distance between \(T'\) and T is the sum of two terms: (1) RF(\(t^*\), \(T_R\)) and (2) twice the number of Type II superleaves in T relative to \(t^*\). Theorem 1 shows that TRACTION produces a refinement \(t^*\) that minimizes the first term. All that remains to be shown is that \(t^*\) is a binary refinement of t minimizing the number of Type II superleaves in T relative to \(t^*\).
Consider a superleaf X in T with respect to t. If t were already binary, then every superleaf X is either a Type I or a Type II superleaf. Also, note that every Type I superleaf in T with respect to t will be a Type I superleaf for any refinement of t. However, when t is not binary, it is possible for a superleaf X in T to be a Type II superleaf with respect to t but a Type I superleaf with respect to a refinement of t. This happens when the refinement of t introduces a new shared edge with T to which the superleaf X is attached in T. Notice that since the set of all possible shared edges that could be created by refining t is compatible, any refinement that maximizes the number of shared edges with T also minimizes the number of Type II superleaves. Theorem 1 shows that TRACTION produces such a refinement \(t^*\) of t. Thus, TRACTION finds a binary unrooted tree \(T'\) on leaf set S such that RF(\(T'\), T) is minimized subject to the requirement that \(T'_{R}\) refine t. \(\square\)
Theorem 3
TRACTION solves the RF-OTRC problem and runs in \(O(n^{1.5}\log n)\) time if used with Bansal’s algorithm and \(O(n^2)\) time if used with OCTAL, where n is the number of leaves in the species tree.
Proof
The above lemma shows that TRACTION solves the RF-OTRC problem. Let t, T, S, and R be as defined in the RF-OTRC problem statement. What remains to be shown is a running time analysis for the first stage of TRACTION (refining t). We claim this step takes \(O(|S| + |R|^{1.5} \log |R|)\) time.
Constructing \(T_R\) takes \(O(|S|)\) time. Checking compatibility of a single bipartition with a tree on K leaves, and then adding the bipartition to the tree if compatible, can be performed in only \(O(K^{0.5} \log K)\) time after a fast preprocessing step (see Lemmas 3 and 4 from [27]). Hence, determining the set of edges of \(T_R\) that are compatible with t takes only \(O(|S| + |R|^{1.5} \log |R|)\) time. Therefore, the first stage of TRACTION takes \(O(|S| + |R|^{1.5} \log |R|)\) time. Hence, if used with OCTAL, TRACTION takes \(O(|S|^{2})\) time, and if used with Bansal’s algorithm, TRACTION takes \(O(|S|^{1.5} \log |S|)\) time. \(\square\)
Extending TRACTION to MUL-trees
Up to this point, we have formulated gene tree correction problems only in the context where the input trees are each singly-labeled (i.e., have at most one leaf for each species). However, in the context of GDL, a gene tree may have multiple copies of a species at its leaves (i.e., it can be a “MUL-tree”). We now generalize the RF-OTR problem to allow the input unresolved tree t to be a MUL-tree, although we still require the species tree T to be singly-labeled.
Recall that the RF distance between two trees is the minimum number of contractions and refinements that suffice to transform one tree into the other, and that this is equal to the bipartition distance for singly-labeled trees. This definition requires that the two trees have the same number of copies of each species (also referred to as “label-multiplicity”), since otherwise there is no such edit transformation. However, even when the two MUL-trees have the same number of copies of each species, we cannot rely on the use of the bipartition distance, as two MUL-trees can have identical sets of bipartitions but not be isomorphic [28].
In the context we will address, we are given a MUL-tree \(\mathcal {R}\) (i.e., the gene family tree) and a singly-labeled tree T (i.e., the species tree). To extend the RF-OTR problem so that we can use it for such an input pair, we will draw on some definitions and results from [11, 28].
Definition 2
Let r and t be given with r a MUL-tree and t a singly-labeled tree, and both with the same set of species labeling the leaves. We construct the MUL-tree Ext(t, r) from t as follows: for each species s and the unique leaf x in t labeled by s, we replace x by a node \(v_s\) that is attached to k leaves, each labeled by s, where k is the number of leaves in r that are labeled by s. We refer to Ext(t, r) as the extension of t relative to r. Note that Ext(t, r) and r have the same number of copies of each species.
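The construction in Definition 2 can be sketched as follows. This is a hedged illustration, assuming trees are nested Python tuples of species-name strings; the functions `leaf_labels` and `extend` are hypothetical names. As a simplifying assumption, a species with a single copy in r is left as a plain leaf, which matches the definition once the resulting degree-two node \(v_s\) is suppressed.

```python
from collections import Counter

def leaf_labels(tree):
    """List of leaf labels (with repeats) of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaf_labels(child)]

def extend(t, r):
    """Ext(t, r): replace each leaf s of t by a node with k leaves labeled s,
    where k is the number of leaves labeled s in the MUL-tree r."""
    copies = Counter(leaf_labels(r))

    def visit(node):
        if isinstance(node, str):
            k = copies[node]
            # Assumption: a single copy stays a plain leaf (v_s suppressed).
            return node if k == 1 else tuple([node] * k)
        return tuple(visit(child) for child in node)

    return visit(t)

species_tree = (("a", "b"), "c", ("d", "e"))
gene_family_tree = (("a", "d"), ("a", "b"), ("c", ("d", "e")))  # a and d twice

extended = extend(species_tree, gene_family_tree)
print(extended)
```

After the call, `extended` and `gene_family_tree` carry the same number of copies of each species, as the definition requires.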
Before we present TRACTION-MT (i.e., TRACTION for MUL-trees), we need one more definition.
Definition 3
Let \(r_1\) and \(r_2\) be MUL-trees, both leaf-labeled by the same set of species, with the same number of copies of each species labeling the leaves. We construct \(r_1'\) from \(r_1\) (and similarly \(r_2'\) from \(r_2\)) by relabeling the leaves of \(r_1\) so that it is singly-labeled by replacing the k leaves labeled by s with \(s_1, s_2, \ldots , s_k\). Note that \(r_1'\) and \(r_2'\) are now singly-labeled trees and that \(L(r_1')=L(r_2')\). We say the pair \((r_1',r_2')\) is a consistent full differentiation of \((r_1,r_2)\).
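The relabeling in Definition 3 can be sketched in a few lines. This is illustrative only: trees are assumed to be nested Python tuples of species-name strings, the copies of each species are numbered in traversal order (an arbitrary choice made for this sketch), and the suffix format `s_i` is assumed not to collide with existing species names. The function names are hypothetical.

```python
from collections import Counter

def differentiate(tree):
    """Relabel the k leaves labeled s as s_1, ..., s_k in traversal order."""
    seen = Counter()

    def visit(node):
        if isinstance(node, str):
            seen[node] += 1
            return f"{node}_{seen[node]}"
        return tuple(visit(child) for child in node)

    return visit(tree)

def leaf_set(tree):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))

r1 = (("a", "a"), "b", ("c", "d"))
r2 = (("a", "b"), ("a", "c"), "d")

r1_prime, r2_prime = differentiate(r1), differentiate(r2)
print(r1_prime)
print(r2_prime)
```

Because both inputs have the same number of copies of each species, the two relabeled trees are singly-labeled and have identical leaf sets.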
We now present TRACTION-MT. The input to TRACTION-MT is a pair \((\mathcal {R},T)\) where \(\mathcal {R}\) is a MUL-tree and T is a singly-labeled tree, and they are both leaf-labeled by a set S of species.

Step 1: Compute \(Ext(T,\mathcal {R})\) (i.e., the extended version of T with respect to \(\mathcal {R}\), see Definition 2).

Step 2: Relabel the leaves in \(\mathcal {R}\) and \(Ext(T,\mathcal {R})\) in a mutually consistent fashion (see Definition 3), thus producing trees \(\mathcal {R}'\) and \(T'\), respectively.

Step 3: Apply TRACTION to the pair \(\mathcal {R}'\) and \(T'\), producing tree \(\mathcal {R}^*\) on leaf set \(S'\). For every species \(s \in S\) and leaf in \(\mathcal {R}^*\) labeled \(s_i\), replace the label \(s_i\) by s, thus producing a tree \(\mathcal {R}^{**}\) on leaf set S that is isomorphic to \(\mathcal {R}^*\).

Step 4: Return \(\mathcal {R}^{**}\).
Theorem 4
TRACTION-MT solves the RF-OTR-MT problem exactly and has running time \(O(|\mathcal {R}|^{1.5} \log |\mathcal {R}|)\).
Proof
Let MUL-tree \(\mathcal {R}\) and singly-labeled tree T be given, and let \(\mathcal {R}^{**}\) be the tree returned by TRACTION-MT for this pair. We will show that \(\mathcal {R}^{**}\) is a refinement of \(\mathcal {R}\) that has minimum RF distance to \(Ext(T,\mathcal {R})\) among all binary refinements, thus establishing that TRACTION-MT solves the RF-OTR-MT problem optimally [28].
Steps 1 and 2 together take the input pair \(\mathcal {R}\) and T and create two new trees \(\mathcal {R}'\) and \(T'\) that form a pair of consistent full differentiations of \(\mathcal {R}\) and \(Ext(T,\mathcal {R})\). By Theorem 3 in [11], \(RF(\mathcal {R},Ext(T,\mathcal {R})) = RF(\mathcal {R}',T')\). Since \(\mathcal {R}'\) and \(T'\) are singly-labeled, Step 3 produces a tree \(\mathcal {R}^*\) that is a refinement of \(\mathcal {R}'\) and minimizes the RF distance to \(T'\). Therefore the tree \(\mathcal {R}^{**}\) is a refinement of \(\mathcal {R}\) that minimizes the RF distance to \(Ext(T,\mathcal {R})\). Hence, TRACTION-MT finds an optimal solution to the RF-OTR-MT problem on this input pair.
Finally, for the running time analysis, the creation of the two trees \(\mathcal {R}'\) and \(T'\) takes \(O(|\mathcal {R}|)\) time. Then running TRACTION on this pair takes an additional \(O(|\mathcal {R}|^{1.5} \log |\mathcal {R}|)\) time, as noted in Theorem 3. \(\square\)
Figure 2 provides an example of a MUL-tree, an extended species tree, and TRACTION’s solution to the RF-OTR problem for MUL-trees.