Overview of the algorithm
\(\mathrm{BuildNT}\) (Algorithm 1) is our algorithm for testing compatibility of semi-labeled trees. Its argument, U, is a valid position in \(\mathcal {P}\) such that \(H_{\mathcal {P}}(U)\) is connected. \(\mathrm{BuildNT}\) relies on the fact—proved later, in Theorem 1—that if \(\mathcal {P}|\mathrm{Desc}_\mathcal {P}(U)\) is compatible, then U must contain a nonempty set S of semi-universal labels. If such a set S exists, the algorithm replaces U by its successor \(U'\) with respect to S. It then processes each connected component of \(H_{\mathcal {P}}(U')\) recursively, to determine if the associated sub-profile is compatible. If all the recursive calls are successful, then their results are combined into a supertree for \(\mathcal {P}|\mathrm{Desc}_\mathcal {P}(U)\).
In detail, \(\mathrm{BuildNT}\) proceeds as follows. Line 1 computes the set S of semi-universal labels in U. If S is empty, then \(\mathcal {P}|\mathrm{Desc}_\mathcal {P}(U)\) is incompatible and, thus, so is \(\mathcal {P}\). This fact is reported in Line 3. Line 4 creates a tentative root \(r_U\), labeled by S, for the tree \(\mathcal {T}_U\) for L(U). Line 5 checks if S contains exactly one label \(\ell\), with no proper descendants. If so, by the connectivity assumption, \(\ell\) must be the sole member of \(\mathrm{Desc}_\mathcal {P}(U)\); that is, \(L(U) = \{\ell \}\). Therefore, Line 6 simply returns the tree with a single node, labeled by \(S = \{\ell \}\). Line 7 updates U, replacing it by its successor with respect to S. Let \(W_1, W_2, \dots , W_p\) be the connected components of \(H_{\mathcal {P}}(U)\) after updating U. By Lemma 6, \(U | W_j\) is a valid position, for each \(j \in [p]\). Lines 8–12 recursively invoke \(\mathrm{BuildNT}\) on \(U | W_j\) for each \(j \in [p]\), to determine if there is a tree \(t_j\) that ancestrally displays \(\mathcal {P}| \mathrm{Desc}_\mathcal {P}(U \cap W_j)\). If any subproblem is incompatible, Line 12 reports that \(\mathcal {P}\) is incompatible. Otherwise, Line 13 returns the tree obtained by making the \(t_j\)s the subtrees of root \(r_U\).
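The recursive steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that every node of every input tree carries exactly one label, that a tree is a dictionary mapping each label to the list of its children's labels, that a position is a tuple of frozensets (one per tree), that the root position consists of the singleton root labels, and that \(H_{\mathcal {P}}(U)\) is the subgraph induced by \(\mathrm{Desc}_\mathcal {P}(U)\) on the parent-child edges of all trees. All function names are hypothetical.

```python
def reach(adj, start):
    """All keys reachable from `start` by following adj[x] (start included)."""
    seen, stack = set(), list(start)
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x])
    return seen

def root_position(profile):
    """Assumed form of U_root: per tree, the singleton holding its root label."""
    return tuple(frozenset(set(t) - {c for kids in t.values() for c in kids})
                 for t in profile)

def semi_universal(profile, U):
    """Labels l such that U[i] == {l} for every tree i that contains l."""
    return {l for l in set().union(*U)
            if all(U[i] == {l} for i, t in enumerate(profile) if l in t)}

def successor(profile, U, S):
    """Replace each singleton U[i] = {l} with l in S by the children of l in tree i."""
    return tuple(frozenset(t[min(u)]) if len(u) == 1 and u <= S else u
                 for t, u in zip(profile, U))

def components(profile, U):
    """Vertex sets of the connected components of H_P(U), ordered by smallest label."""
    nodes = set().union(*(reach(t, u) for t, u in zip(profile, U)))
    adj = {v: set() for v in nodes}
    for t in profile:
        for parent, kids in t.items():
            for child in kids:
                if parent in nodes and child in nodes:
                    adj[parent].add(child)
                    adj[child].add(parent)
    comps, seen = [], set()
    for v in sorted(nodes):
        if v not in seen:
            comp = reach(adj, [v])
            seen |= comp
            comps.append(frozenset(comp))
    return comps

def build_nt(profile, U):
    """Return (label_set, [subtrees]) for L(U), or None if incompatible."""
    S = semi_universal(profile, U)                      # Line 1
    if not S:
        return None                                     # Lines 2-3
    if len(S) == 1 and not any(t.get(l) for t in profile for l in S):
        return (frozenset(S), [])                       # Lines 5-6: single leaf
    U = successor(profile, U, S)                        # Line 7
    subtrees = []
    for W in components(profile, U):                    # Lines 8-12
        t_j = build_nt(profile, tuple(u & W for u in U))
        if t_j is None:
            return None
        subtrees.append(t_j)
    return (frozenset(S), subtrees)                     # Line 13

# Hypothetical two-tree profile (not the profile of Fig. 1).
P = [{"r": ["a", "b"], "a": ["x", "y"], "b": [], "x": [], "y": []},
     {"r": ["a", "c"], "a": ["x"], "c": [], "x": []}]
print(build_nt(P, root_position(P)))
```

Returning `None` plays the role of the answer incompatible; the nested pairs play the role of the tree whose root \(r_U\) is labeled by S.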
Next, we argue the correctness of \(\mathrm{BuildNT}\).
Correctness
Lemma 7
Let U be a valid position in \(\mathcal {P}\). If \(\mathrm{BuildNT}(U)\) returns a tree \(\mathcal {T}_U\), then \(\mathcal {T}_U\) is a phylogenetic tree such that \(L(\mathcal {T}_U) = L(U)\).
Proof
We use induction on |L(U)|. The base case, where \(|L(U)| = 1\), is handled by Lines 5–6. In this case, \(S = L(U) = \{\ell \}\) and \(\mathrm{BuildNT}(U)\) correctly returns the tree consisting of a single node, labeled by \(\{\ell \}\). Otherwise, let \(W_1, \ldots , W_p\) be the connected components of \(H_{\mathcal {P}}(U)\) in Line 8. Since \(\mathrm{BuildNT}(U)\) returns tree \(\mathcal {T}_U\), it must be the case that, for each \(j \in [p]\), the result \(t_j\) returned by the recursive call to \(\mathrm{BuildNT}(U|W_j)\) in Line 10 is a tree. Since \(|S| \ge 1\), we have \(|L(W_j)| < |L(U)|\), for each \(j \in [p]\). Thus, we can assume inductively that \(t_j\) is a phylogenetic tree for \(L(W_j)\). Since \(S \cup \bigcup _{j \in [p]} L(W_j) = L(U)\), the tree returned in Line 13 is a phylogenetic tree with label set L(U). \(\square\)
Theorem 1
Let \(\mathcal {P}= \{\mathcal {T}_1, \mathcal {T}_2, \ldots , \mathcal {T}_k\}\) be a profile and let \(U_\mathrm{root}\) be the root position, as defined in Eq. (1). Then, \(\mathrm{BuildNT}(U_\mathrm{root})\) returns either (i) a semi-labeled tree \(\mathcal {T}\) that ancestrally displays \(\mathcal {P}\), if \(\mathcal {P}\) is ancestrally compatible, or (ii) incompatible otherwise.
Proof
\(\mathrm{BuildNT}(U_\mathrm{root})\) either returns a tree or incompatible. We consider each case separately.
(i) Suppose that \(\mathrm {BuildNT}(U_\mathrm{root})\) returns a semi-labeled tree \(\mathcal {T}\). By Lemma 7, \(L(\mathcal {T}) = L(\mathcal {P})\). We prove that \(\mathcal {T}\) ancestrally displays \(\mathcal {P}\). By Lemma 1, it suffices to show that \(D(\mathcal {T}_i) \subseteq D(\mathcal {T})\) and \(N(\mathcal {T}_i) \subseteq N(\mathcal {T})\), for each \(i \in [k]\). Consider any \((\ell ,\ell ') \in D(\mathcal {T}_i)\). Then, \(\ell\) has a child \(\ell ''\) in \(\mathcal {T}_i\) such that \(\ell '' \le _{\mathcal {T}_i} \ell '\) (note that we may have \(\ell '' = \ell '\)). There must be a recursive call to \(\mathrm {BuildNT}(U)\), for some valid position U, where \(\ell\) belongs to the set S of semi-universal labels obtained in Line 1. By Observation 2, label \(\ell ''\), and thus \(\ell '\), lies in one of the connected components of the graph obtained by deleting all labels in S, including \(\ell\), and their incident edges from \(H_{\mathcal {P}}(U)\). It now follows from the construction of \(\mathcal {T}\) that \((\ell , \ell ') \in D(\mathcal {T})\). Thus, \(D(\mathcal {T}_i) \subseteq D(\mathcal {T})\). Now, consider any \(\{\ell ,\ell ' \} \in N(\mathcal {T}_i)\). Let v be the lowest common ancestor of \(\phi _i(\ell )\) and \(\phi _i(\ell ')\) in \(\mathcal {T}_i\) and let \(\ell _v\) be the label of v. Then, \(\ell _v\) has a pair of children, \(\ell _1\) and \(\ell _2\) say, in \(\mathcal {T}_i\) such that \(\ell _1 \le _{\mathcal {T}_i} \ell\) and \(\ell _2 \le _{\mathcal {T}_i} \ell '\). Because \(\mathrm {BuildNT}(U_\mathrm{root})\) returns a tree, there are recursive calls \(\mathrm {BuildNT}(U_1)\) and \(\mathrm {BuildNT}(U_2)\) for valid positions \(U_1\) and \(U_2\) such that \(\ell _1\) is semi-universal for \(U_1\) and \(\ell _2\) is semi-universal for \(U_2\). We must have \(U_1 \ne U_2\); otherwise, \(|U_1(i)| = |U_2(i)| \ge 2\), and, thus, neither \(\ell _1\) nor \(\ell _2\) is semi-universal, a contradiction. Further, it follows from the construction of \(\mathcal {T}\) that we must have \(\mathrm{Desc}_\mathcal {P}(U_1) \cap \mathrm{Desc}_\mathcal {P}(U_2) = \emptyset\). Hence, \(\ell \parallel _\mathcal {T}\ell '\), and, therefore, \(\{\ell ,\ell '\} \in N(\mathcal {T})\).
(ii) Assume, by way of contradiction, that \(\mathrm{BuildNT}(U_\mathrm{root})\) returns incompatible, but that \(\mathcal {P}\) is ancestrally compatible. By assumption, there exists a semi-labeled tree \(\mathcal {T}\) that ancestrally displays \(\mathcal {P}\). Since \(\mathrm {BuildNT}(U_\mathrm{root})\) returns incompatible, there is a recursive call to \(\mathrm{BuildNT}(U)\) for some valid position U such that U has no semi-universal label, and the set S of Line 1 is empty. By Lemma 2, \(\mathcal {T}| \mathrm{Desc}_\mathcal {P}(U)\) ancestrally displays \(\mathcal {P}|\mathrm{Desc}_\mathcal {P}(U)\). Thus, by Lemma 4, \(\mathcal {T}| \mathrm{Desc}_\mathcal {P}(U)\) ancestrally displays \(\mathcal {T}_i | \mathrm{Desc}_i(U)\), for every \(i \in [k]\). Let \(\ell\) be any label in the label set of the root of \(\mathcal {T}| \mathrm{Desc}_\mathcal {P}(U)\). Then, for each \(i \in [k]\) such that \(\ell \in L(\mathcal {T}_{i})\), \(\ell\) must be the label of the root of \(\mathcal {T}_i|\mathrm{Desc}_i(U)\). Thus, for each such i, \(U(i) = \{\ell \}\). Hence, \(\ell\) is semi-universal in U, a contradiction.
\(\square\)
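Part (i) of the proof reduces ancestral display to the two containments \(D(\mathcal {T}_i) \subseteq D(\mathcal {T})\) and \(N(\mathcal {T}_i) \subseteq N(\mathcal {T})\). The following Python sketch checks these containments directly on small inputs. It assumes, as the proof's notation suggests, that \(D\) collects the ordered pairs (proper ancestor, descendant) and \(N\) collects the unordered pairs of incomparable labels. Input trees are dictionaries from a label to the list of its children's labels (one label per node), and the supertree is a nested pair (label set, list of subtrees). These representations and all function names are hypothetical.

```python
from itertools import combinations

def d_and_n(tree):
    """D and N for an input tree given as {label: [child labels]}."""
    def below(x):
        out = set()
        for c in tree[x]:
            out |= {c} | below(c)
        return out
    proper = {x: below(x) for x in tree}          # proper descendants of each label
    D = {(a, b) for a in tree for b in proper[a]}
    N = {frozenset((a, b)) for a, b in combinations(tree, 2)
         if a not in proper[b] and b not in proper[a]}
    return D, N

def d_and_n_super(node):
    """D and N for a supertree given as (label_set, [subtrees])."""
    D, N = set(), set()
    def visit(n):
        labels, kids = n
        per_kid = [visit(k) for k in kids]        # labels found in each child subtree
        for sub in per_kid:
            D.update((a, b) for a in labels for b in sub)
        for X, Y in combinations(per_kid, 2):     # labels in different subtrees
            N.update(frozenset((a, b)) for a in X for b in Y)
        return set(labels).union(*per_kid)
    visit(node)
    return D, N

def ancestrally_displays(supertree, profile):
    """True if D(T_i) and N(T_i) are contained in D(T) and N(T) for every i."""
    D, N = d_and_n_super(supertree)
    return all(Di <= D and Ni <= N for Di, Ni in map(d_and_n, profile))
```

Labels that share a node of the supertree contribute to neither relation, which is why a root labeled by several semi-universal labels does not violate either containment.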
An iterative version
We now present \(\mathrm {BuildNT}_\mathrm {N}\) (Algorithm 2), an iterative version of \(\mathrm {BuildNT}\), which lends itself naturally to an efficient implementation. \(\mathrm{BuildNT_N}\) performs a breadth-first traversal of \(\mathrm{BuildNT}\)’s recursion tree, using a first-in first-out queue Q that stores pairs of the form \((U, \mathrm{pred})\), where U is a valid position in \(\mathcal {P}\) and \(\mathrm{pred}\) is a reference to the parent of the node corresponding to U in the supertree built so far. \(\mathrm{BuildNT_N}\) simulates recursive calls in \(\mathrm{BuildNT}\) by enqueuing pairs corresponding to subproblems. We explain this in more detail next.
\(\mathrm {BuildNT_N}\) initializes its queue to contain the starting position, \(U_\mathrm{root}\), with a null parent. It then proceeds to the while loop of Lines 3–14. Each iteration of the loop starts by dequeuing a valid position U, along with a reference \(\mathrm{pred}\) to the potential parent for the subtree for L(U) in the supertree. The body of the loop closely follows the steps performed by a call to \(\mathrm{BuildNT}(U)\). Line 5 computes the set S of semi-universal labels in U. If S is empty, the algorithm reports that \(\mathcal {P}\) is incompatible and terminates (Lines 6–7). Otherwise, the algorithm creates a tentative root \(r_U\) labeled by S for the tree \(\mathcal {T}_U\) for L(U), and links \(r_U\) to its parent (Line 8). If S consists of exactly one element that has no proper descendants, the algorithm skips the rest of the current iteration of the while loop and continues to the next iteration (Lines 9–10). Line 11 replaces U by its successor with respect to S. Lines 13–14 enqueue each of \(U|W_1, U|W_2, \ldots , U|W_p\), where \(W_1, W_2, \ldots , W_p\) are the connected components of \(H_{\mathcal {P}}(U)\) after the update, along with \(r_U\), for processing in a subsequent iteration. If the while loop terminates without any incompatibility being detected, the algorithm returns the tree with root \(r_{U_\mathrm{root}}\).
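The loop just described can be sketched with a standard double-ended queue. As before, this is an illustrative sketch under assumptions, not the paper's Algorithm 2: trees are dictionaries from a label to its children's labels (one label per node), positions are tuples of frozensets, the root position is taken to be the tuple of singleton root labels, and supertree nodes are plain dictionaries with hypothetical keys `labels` and `children`. The helper definitions are included so that the block runs on its own.

```python
from collections import deque

def reach(adj, start):
    """All keys reachable from `start` by following adj[x] (start included)."""
    seen, stack = set(), list(start)
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x])
    return seen

def root_position(profile):
    """Assumed form of U_root: per tree, the singleton holding its root label."""
    return tuple(frozenset(set(t) - {c for kids in t.values() for c in kids})
                 for t in profile)

def semi_universal(profile, U):
    return {l for l in set().union(*U)
            if all(U[i] == {l} for i, t in enumerate(profile) if l in t)}

def successor(profile, U, S):
    return tuple(frozenset(t[min(u)]) if len(u) == 1 and u <= S else u
                 for t, u in zip(profile, U))

def components(profile, U):
    """Vertex sets of the connected components of H_P(U), ordered by smallest label."""
    nodes = set().union(*(reach(t, u) for t, u in zip(profile, U)))
    adj = {v: set() for v in nodes}
    for t in profile:
        for parent, kids in t.items():
            for child in kids:
                if parent in nodes and child in nodes:
                    adj[parent].add(child)
                    adj[child].add(parent)
    comps, seen = [], set()
    for v in sorted(nodes):
        if v not in seen:
            comp = reach(adj, [v])
            seen |= comp
            comps.append(frozenset(comp))
    return comps

def build_nt_n(profile):
    """Return the supertree root node, or None if the profile is incompatible."""
    root = None
    queue = deque([(root_position(profile), None)])     # (U, pred) pairs
    while queue:
        U, pred = queue.popleft()                       # dequeue the front pair
        S = semi_universal(profile, U)
        if not S:
            return None                                 # incompatible
        r_U = {"labels": frozenset(S), "children": []}  # tentative root for L(U)
        if pred is None:
            root = r_U
        else:
            pred["children"].append(r_U)                # link r_U to its parent
        if len(S) == 1 and not any(t.get(l) for t in profile for l in S):
            continue                                    # single leaf: nothing to enqueue
        U = successor(profile, U, S)
        for W in components(profile, U):
            queue.append((tuple(u & W for u in U), r_U))
    return root

def as_tuple(node):
    """Convert a node dictionary into a nested (label_set, [subtrees]) pair."""
    return (node["labels"], [as_tuple(c) for c in node["children"]])
```

Because `deque.popleft` and `deque.append` both run in constant time, the queue itself adds no overhead beyond the work done per position; the cost of `semi_universal` and `components` in this sketch is deliberately naive.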
Although the order in which \(\mathrm{BuildNT_N}\) processes connected components differs from that of \(\mathrm {BuildNT}\) (breadth-first instead of depth-first), it is straightforward to see that the effect is equivalent, and the proof of correctness of \(\mathrm{BuildNT}\) (Theorem 1) applies to \(\mathrm{BuildNT_N}\) as well. We thus state the following without proof.
Theorem 2
Let \(\mathcal {P}= \{\mathcal {T}_1, \mathcal {T}_2, \ldots , \mathcal {T}_k\}\) be a profile. Then, \(\mathrm{BuildNT_N}(\mathcal {P})\) returns either (i) a semi-labeled tree \(\mathcal {T}\) that ancestrally displays \(\mathcal {P}\), if \(\mathcal {P}\) is ancestrally compatible, or (ii) incompatible otherwise.
Let Q be \(\mathrm{BuildNT_N}\)’s first-in first-out queue. In the rest of the paper, we will say that a valid position U is in Q if \((U,\mathrm{pred}) \in Q\) for some \(\mathrm{pred}\). Let \(H_Q\) be the subgraph of \(H_{\mathcal {P}}\) induced by \(\bigcup \{\mathrm{Desc}(U): U \text { is in } Q\}\). By Observation 1, \(H_Q\) is obtained from \(H_{\mathcal {P}}\) through edge and node deletions.
Lemma 8
At the start of any iteration of \(\mathrm{BuildNT_N}\)’s while loop, the set of connected components of \(H_Q\) is \(\{V(H_{\mathcal {P}}(U)) : U \text { is in } Q\}\).
Proof
The property holds at the outset, since, by Assumption 2, \(H_{\mathcal {P}}= H_{\mathcal {P}}(U_\mathrm{root})\) is a connected graph, and the only element of Q is \((U_\mathrm{root},\mathtt {null})\). Assume that the property holds at the beginning of iteration l. Let \((U, \mathrm{pred})\) be the element dequeued from Q in Line 4. Then, \(H_{\mathcal {P}}(U)\) is connected. In place of \((U, \mathrm{pred})\), Lines 13–14 enqueue \((U|W_j, r_U)\), for each \(j \in [p]\), where, by construction, \(H_{\mathcal {P}}(U|W_j)\) is a connected component of \(H_{\mathcal {P}}(U)\). Thus, the property holds at the beginning of iteration \(l+1\). \(\square\)
In other words, Lemma 8 states that each iteration of \(\mathrm{BuildNT_N}(\mathcal {P})\) deals with a subgraph of \(H_{\mathcal {P}}\), whose connected components are in one-to-one correspondence with the valid positions stored in Q. This is illustrated by the next example.
An example
Figures 4, 5, 6, 7 and 8 illustrate the execution of \(\mathrm{BuildNT_N}\) on the profile \(\mathcal {P}= (\mathcal {T}_1, \mathcal {T}_2, \mathcal {T}_3)\) of Fig. 1. The figures show how the graph \(H_Q\) —initially equal to \(H_{\mathcal {P}}= H_{\mathcal {P}}(U_\mathrm{root})\) (Fig. 3)—evolves as its edges and nodes are deleted.
In each figure, \(H_Q\) is shown on the left and the current supertree is shown on the right. For brevity, the figures only exhibit the state of \(H_Q\) and the supertree after all the nodes at each level are generated. The various valid positions processed by \(\mathrm{BuildNT_N}(\mathcal {P})\) are denoted by \(U_\alpha\), for different subscripts \(\alpha\); \(S_\alpha\) denotes the semi-universal labels in \(U_\alpha\), and \(U_\alpha '\) denotes the successor of \(U_\alpha\) with respect to \(S_\alpha\). We write \(L_\alpha\) as an abbreviation for \(L(U_\alpha )\). The root of the tree for \(L_\alpha\) is \(r_{U_\alpha }\), abbreviated to \(r_\alpha\) where convenient, and is labeled by \(S_\alpha\).
Initially, \(Q = ((U_\mathrm{root},\mathtt {null}))\). In what follows, the elements of Q are listed from front to rear.
Level 0. Refer to Fig. 4. As seen earlier, the set of semi-universal labels of \(U_\mathrm{root}\) is \(S_\mathrm{root}= \{1, 2\}\). Thus, \(H_{\mathcal {P}}(U_\mathrm{root}')\) has two components \(W_1\) and \(W_2\). Let \(U_1 = U_\mathrm{root}'|W_1\) and \(U_2 = U_\mathrm{root}'|W_2\). Then,
$$\begin{aligned} U_1 = (\{3\}, \{e,g\},\{g\}) \quad \text {and} \quad U_2 = (\{f\}, \{f\},\emptyset ). \end{aligned}$$
After level 0 is processed, \(Q = ((U_1,r_{U_\mathrm{root}}), (U_2,r_{U_\mathrm{root}}))\). Thus, the roots of the subtrees for \(L_1\) and \(L_2\) will be children of \(r_{U_\mathrm{root}}\).
Level 1. Refer to Fig. 5. We have \(S_1 = \{3\}\), so \(H_{\mathcal {P}}(U_1')\) has two components \(W_{11}\) and \(W_{12}\). Let \(U_{11} = U_1'|W_{11}\) and \(U_{12} = U_1'|W_{12}\). Then,
$$\begin{aligned} U_{11} = (\{a,d\}, \{g\},\{g\}) \quad \text {and} \quad U_{12} = (\{e\}, \{e\},\emptyset ). \end{aligned}$$
We have \(S_2 = \{f\}\), so \(H_{\mathcal {P}}(U_2')\) has two components \(W_{21}\) and \(W_{22}\). Let \(U_{21} = U_2'|W_{21}\) and \(U_{22} = U_2'|W_{22}\). Then,
$$\begin{aligned} U_{21} = (\emptyset , \{h\},\emptyset ) \quad \text {and} \quad U_{22} = (\emptyset , \{i\},\emptyset ). \end{aligned}$$
After level 1 is processed, \(Q = ((U_{11},r_{1}), (U_{12},r_{1}), (U_{21},r_{2}), (U_{22},r_{2}))\).
Level 2. Refer to Fig. 6. We have \(S_{11} = \{g\}\), so \(H_{\mathcal {P}}(U_{11}')\) has two components \(W_{111}\) and \(W_{112}\). Let \(U_{111} = U_{11}'|W_{111}\) and \(U_{112} = U_{11}'|W_{112}\). Then,
$$\begin{aligned} U_{111} = (\{a\}, \{a\},\{4\}) \quad \text {and} \quad U_{112} = (\emptyset , \{d\},\{d\}). \end{aligned}$$
The only semi-universal labels in \(U_{12}\), \(U_{21}\), and \(U_{22}\) are, respectively, e, h, and i. Since none of these labels have proper descendants, each of them is a leaf in the supertree.
After level 2 is processed, \(Q = ((U_{111},r_{11}), (U_{112},r_{11}))\).
Level 3. Refer to Fig. 7. We have \(S_{111} = \{4, a\}\), so \(H_{\mathcal {P}}(U_{111}')\) has two components \(W_{1111}\) and \(W_{1112}\). Let \(U_{1111} = U_{111}'|W_{1111}\) and \(U_{1112} = U_{111}'|W_{1112}\). Then,
$$\begin{aligned} U_{1111} = (\{b\}, \emptyset ,\{b\}) \quad \text {and} \quad U_{1112} = (\{c\}, \emptyset ,\{c\}). \end{aligned}$$
The only semi-universal label in \(U_{112}\) is d. Since d has no proper descendants, it becomes a leaf in the supertree.
After level 3 is processed, \(Q = ((U_{1111},r_{111}), (U_{1112},r_{111}))\).
Level 4. Refer to Fig. 8. The only semi-universal labels in \(U_{1111}\) and \(U_{1112}\) are, respectively, b and c. Since neither of these labels has proper descendants, each of them is a leaf in the supertree.
After level 4 is processed, Q is empty, and \(\mathrm{BuildNT_N}(\mathcal {P})\) terminates.
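The level-by-level bookkeeping above can be reproduced mechanically. The Python sketch below records the semi-universal set found for each position, grouped by level, and checks at the start of every iteration that the connected components of \(H_Q\) coincide with the vertex sets \(V(H_{\mathcal {P}}(U))\) for U in Q, as Lemma 8 states. Since the trees of Fig. 1 are not reproduced here, the usage shown in the note after the code relies on a small hypothetical profile. The representation (trees as dictionaries from a label to its children's labels, positions as tuples of frozensets, the root position as the tuple of singleton root labels) and all function names are assumptions of this sketch.

```python
from collections import deque

def reach(adj, start):
    """All keys reachable from `start` by following adj[x] (start included)."""
    seen, stack = set(), list(start)
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x])
    return seen

def root_position(profile):
    return tuple(frozenset(set(t) - {c for kids in t.values() for c in kids})
                 for t in profile)

def semi_universal(profile, U):
    return {l for l in set().union(*U)
            if all(U[i] == {l} for i, t in enumerate(profile) if l in t)}

def successor(profile, U, S):
    return tuple(frozenset(t[min(u)]) if len(u) == 1 and u <= S else u
                 for t, u in zip(profile, U))

def desc_p(profile, U):
    """Desc_P(U): every label at or below some label of U, over all trees."""
    return frozenset().union(*(reach(t, u) for t, u in zip(profile, U)))

def components_of(profile, nodes):
    """Connected components of the subgraph of H_P induced by `nodes`."""
    adj = {v: set() for v in nodes}
    for t in profile:
        for parent, kids in t.items():
            for child in kids:
                if parent in nodes and child in nodes:
                    adj[parent].add(child)
                    adj[child].add(parent)
    comps, seen = [], set()
    for v in sorted(nodes):
        if v not in seen:
            comp = reach(adj, [v])
            seen |= comp
            comps.append(frozenset(comp))
    return comps

def trace(profile):
    """Return (levels, ok): the sets S per level, and whether Lemma 8's property held."""
    queue = deque([(root_position(profile), 0)])        # (U, level) pairs
    levels, ok = [], True
    while queue:
        verts = [desc_p(profile, U) for U, _ in queue]  # V(H_P(U)) for each U in Q
        h_q = frozenset().union(*verts)                 # vertex set of H_Q
        ok = ok and set(components_of(profile, h_q)) == set(verts)
        U, depth = queue.popleft()
        S = semi_universal(profile, U)
        if not S:
            return None, ok                             # incompatible
        if depth == len(levels):
            levels.append([])
        levels[depth].append(sorted(S))
        if len(S) == 1 and not any(t.get(l) for t in profile for l in S):
            continue                                    # leaf of the supertree
        U = successor(profile, U, S)
        for W in components_of(profile, desc_p(profile, U)):
            queue.append((tuple(u & W for u in U), depth + 1))
    return levels, ok
```

For the hypothetical profile `[{"r": ["a", "b"], "a": ["x", "y"], "b": [], "x": [], "y": []}, {"r": ["a", "c"], "a": ["x"], "c": [], "x": []}]`, the trace has three levels, with semi-universal sets {r}, then {a}, {b}, {c}, then {x}, {y}.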