In this section, we present the main theoretical and algorithmic methods on the inference of an optimal tree displayed by a network problem (ODT). We mainly focus on the details related to the variant of the problem under the deep coalescence cost (ODT-DC). At the end of the section, we show how the theory can be adopted to solve the problem under the duplication cost (ODT-DUP).
Trees and networks
A network on a set of species X is a directed acyclic graph \(N=(V(N),E(N))\) with a single root such that: (1) its leaves, i.e., nodes of indegree 1 and outdegree 0, are labeled by the species from X, and (2) there is a directed path from the root to any other vertex. A network is binary if its leaves, root, and the remaining nodes have degrees 1, 2 and 3, respectively. A node is called a reticulation if it has indegree two and outdegree one, and a tree node if it has indegree at most one and outdegree two. A network is semi-binary, if additionally, it may contain semi-binary nodes of indegree at most one and outdegree one, which includes the root having exactly one child. We can contract a semi-binary node v of indegree one as follows: (1) remove v, (2) remove both edges incident with v, and (3) insert a new directed edge connecting the unique parent of v with the only child of v. Similarly, if v has indegree zero we remove v, and the child of v becomes a new root. If a directed graph \(G'\) is obtained from a graph G by a sequence of contract operations, then G is called a subdivision of \(G'\).
If \(\langle v,w \rangle \in E(N)\), then v is a parent of w and w is a child of v, denoted \(w.\mathsf {parent}=v\) if w is a non-root tree node or a leaf. We write \(v.{\mathsf {sibling}}=w\) if \(v \ne w\) have the same parent. We write \(v \succeq w\) if there is a directed path from v to w, and \(v \succ w\) if \(v \succeq w\) and \(v \ne w\). The set of all leaves in a network is denoted L(N), by \(R(N) \subset V(N)\) we denote the set of reticulation nodes in N, by \(T(N) \subset V(N)\) we denote the set of all tree nodes in N, and by \(E_R(N) \subset E(N)\) we denote the set of all reticulation edges in N, that is, edges \(\langle v,r \rangle \in E(N)\) with \(r \in R(N)\). We say that a reticulation edge e is a sibling of a reticulation edge \(e'\) if they share the same bottom reticulation node. By \(\deg _N(v)\) we denote the outdegree of v in N.
A phylogenetic network is a binary network on X in which the leaves are labeled one-to-one with the species from XFootnote 1. A species tree is a phylogenetic network without reticulation nodes. A gene tree, or in short a tree, is a binary network without reticulation nodes. Note that the leaf labeling in a gene tree does not have to be one-to-one. Such labelled trees are called multi-labelled trees or MUL-trees [31]. A phylogenetic network is tree-child network, if each non-leaf node has a child that is either a tree node or a leaf [32,33,34,35].
Deep coalescence cost: embedding a tree into a (displayed) tree
Given a gene tree G and a species tree S on X, the lca-mapping \(\mathsf {M}:V(G) \rightarrow V(S)\) is defined as follows: (1) if g is a leaf labeled \(x \in X\) then \(\mathsf {M}(g)\) is the unique leaf labeled x in S, and (2) if g has two children \(g'\) and \(g''\), then \(\mathsf {M}(g)\) is the lowest common ancestor of \(\mathsf {M}(g')\) and \(\mathsf {M}(g'')\) in S. Embedding G into S is performed by mapping each edge \(\langle v,w \rangle \in E(G)\) to a path connecting \(\mathsf {M}(v)\) and \(\mathsf {M}(w)\) in S. We say that the gene edge visits edges from that path. Let ||v, w|| denote the number of edges on the path connecting v and w. Then, the visited edges contribute to the deep coalescence cost, denoted \(\mathsf {DC}(G,S)\), as follows:
$$\begin{aligned} \mathsf {DC}(G,S) = \sum _{\langle v,w \rangle \in E(G)} (||\mathsf {M}(v),\mathsf {M}(w)||-1). \end{aligned}$$
(1)
Given a phylogenetic network N on X, we say that a species tree T on X is displayed by N, if N contains a subgraph \(T'\) that is a subdivision of T [36].
We now define the Optimal Displayed Tree under Deep Coalescence problem (ODT-DC) in the parsimony framework:
Problem 1
(ODT-DC) Given a tree G and a phylogenetic network N. Find an optimal tree \(S^*\) displayed by N that minimizes \(\mathsf {DC}(G,S)\) in the set of all trees S displayed by N.
The cost of an optimal displayed tree, we denote \(\mathsf {DC}(G,N)\). While the complexity of ODT-DC remains unknown for the class of tree-child networks, we claim that the problem is NP-hard in a general class of networks. The proof is similar to the NP-hardness proof of the best switching problem from [9]. See also [8] for the related problem of RF-embedding. Figure 1 depicts an example of DC costs.
Scenarios between gene trees and phylogenetic networks
In the previous Section, we showed how a gene tree is embedded into a species tree. Here, we propose to embed a gene tree into a phylogenetic network using a more general approach than embedding through a displayed tree. We start with the notion of unfolded network (see also [37, 38]), then we define scenarios between gene trees and unfolded networks.
For a phylogenetic network N on X with k reticulations, the unfolded network \(\hat{N}\) is the tree \(N_k\) obtained from N by a sequence of k unfolding operations defined on pairs \((N_i,\sigma _i)\), such that \(N_i\) is a semi-binary network on X and \(\sigma _i :V(N_i) \rightarrow V(N)\) defines the origin of a node from \(N_i\). Let \((N_0,\sigma _0)\) be a pair such that \(N_0=N\) and \(\sigma _0(v)=v\) for each \(v \in V(N)\). Then, for a sequence of all reticulation nodes \(r_1, r_2, \dots , r_k\) from N in a reversed topological order, \((N_i,\sigma _i)\) is obtained from \((N_{i-1},\sigma _{i-1})\) by unfolding the reticulation \(r_i\) as follows:
-
Let \(S_i\) be a copy of the subtree of \(N_{i-1}\) rooted at \(r_i\).
-
\(V(N_{i}):=V(N_{i-1}) \cup V(S_i)\) and \(E(N_{i}):=(E(N_{i-1}) \setminus \{ \langle p,r_i \rangle \}) \cup E(S_i) \cup \{ \langle p,r_i' \rangle \}\), where p is an arbitrary parent of \(r_i\) and \(r_i'\) is the root of \(S_i\).
-
\(\sigma _i(v)\) is \(\sigma _{i-1}(v)\) if \(v \in V(N_{i-1})\); otherwise, it is \(\sigma _{i-1}(t)\), if v is a copy of t from \(N_{i-1}\).
Informally, for each reticulation node, we copy its subtree, detach the original subtree from one parent, and attach the copy to the same parent, without changing the labels. To avoid using k directly, we set \(\sigma\) to be \(\sigma _k\).Footnote 2 Figure 2 depicts an unfolded network.
Lemma 2
(Correctness of unfolding) The unfolded network \(\hat{N}\) of N is a semi-binary tree.
Proof
The proof follows by induction on \(i=0,1,\dots ,k\), by showing that \(N_{i}\) is a semi-binary network on X with the reticulation nodes \(r_{i+1},\dots ,r_{k}\) such that there is no reticulation node below \(r_{i+1}\) in \(N_i\). For \(i=0\) the above statement holds trivially. For each \(i>0\), \(N_{i}\) is obtained from \(N_{i-1}\) by unfolding \(r_i\). Next, it follows from the topological order, and the inductive assumption, that there is no reticulation below \(r_{i}\), thus the set of all nodes below \(r_{i}\) induces a rooted subtree in \(N_{i-1}\). \(\square\)
Let a root-leaf path be a directed path connecting the root with a leaf in a network.
Theorem 3
(Unfolding Soundness) There is a one-to-one correspondence between root-leaf paths in N and root-leaf paths in \(\hat{N}\).
Proof
The bijection is established by \(\sigma\), i.e., if \(P=p_1,p_2,\dots ,p_m\) is a root-leaf path in \(\hat{N}\), then \(\sigma (P)=\sigma (p_1),\sigma (p_2),\dots ,\sigma (p_m)\) is the corresponding root-leaf path in the network N. \(\square\)
It follows from Theorem 3 that N and \(\hat{N}\) have the same structure of root-leaf paths. A scenario for G and N is a function \(\xi :L(G) \rightarrow L(\hat{N})\) that preserves the leaf labeling: for every \(g \in L(G)\), the labels of g and \(\xi (g)\) are equal. A scenario \(\xi\) can be extended to the lca-mapping \(\mathsf {M}_\xi :V(G) \rightarrow V(\hat{N})\) such that for \(g \in V(G)\), \(\mathsf {M}_\xi (g)\) is the lowest node v in \(\hat{N}\) such that \(\xi (g') \preceq v\), for each leaf \(g' \preceq g\). Note that \(M_\xi (g)\) is either a leaf or a tree node.
Deep coalescence score of scenarios
Having the lca-mapping determined by a scenario, we are ready to define the deep coalescence score, denoted \(\tilde{\mathsf {DC}}\), to approximate deep coalescence events induced by scenarios in phylogenetic networks. Our first goal is to deduce properties allowing us to approximate the DC cost to solve ODT-DC in the class of tree-child networks. In particular, our approach differs from the approaches from [7, 9, 10], e.g., in the way in which a cost of a path is defined, although the general concept of mapping a gene tree into a network is analogous.
For a scenario \(\xi\) for G and N, we say that \(\langle v,w \rangle \in E(G)\) visits \(\langle a,b \rangle \in E(\hat{N})\) if \(\mathsf {M}_\xi (v) \succeq a \succ b \succeq \mathsf {M}_\xi (w)\). Then, \(\langle a,b \rangle\) has exactly one of the following types.
-
Type I: \(\mathsf {M}_\xi (v)=a\), i.e., it is the first edge.
-
Type II: \(\mathsf {M}_\xi (v) \succ a\), \(\deg_{\hat{N}}(a)=2\) and \(\sigma (b.{\mathsf {sibling}}) \notin R(N)\).
-
Type III: \(\mathsf {M}_\xi (v) \succ a\), \(\deg _{\hat{N}}(a)=2\) and \(\sigma (b.{\mathsf {sibling}}) \in R(N)\); we say that \(\xi\) bypasses the reticulation edge \(\sigma (\langle a,b.{\mathsf {sibling}} \rangle\)).
-
Type IV: \(\deg _{\hat{N}}(a)=1\) (only if \(\sigma (a) \in R(N)\)).
In the above definition, type (I) is only for the first (i.e., the closest to the root) edge visited by a given edge from G, while for the remaining visited edges from \(\hat{N}\) an edge has: Type (II) if the sibling of its bottom node is a tree node, Type (III) if the sibling of its bottom node is a reticulation, and Type (IV) if the top node of the edge is a reticulation.
By \(\kappa _\xi (v,w)\) we denote the set of all edges of Type I or II visited by \(\langle v,w \rangle\). Then, the deep coalescence score for G, N and a scenario \(\xi\) is
$$\begin{aligned} \tilde{\mathsf {DC}}(G,N,\xi )=\sum _{\langle v,w \rangle \in E(G)} (|\kappa _\xi (v,w)|-1). \end{aligned}$$
(2)
Examples of scenarios and \(\tilde{\mathsf {DC}}\) scores are depicted in Figure 2. Finally, we can define Optimal Scenario Inference problem, DC-MinRec.
Problem 4
(DC-MinRec) Given a gene tree G and a phylogenetic network N. Find an optimal scenario \(\xi ^*\) that minimizes \(\tilde{\mathsf {DC}}(G,N,\xi )\) in the set of all scenarios \(\xi\) for G and N.
In the next sections, we propose a dynamic programming algorithm that solves DC-MinRec in O(|G||N|) time where N is a tree-child network. Note that the time complexity depends on the size of N (not on the potentially exponential size of \(\hat{N}\)).
In a trivial case, the solution to DC-MinRec is induced by the classical \(\mathsf {DC}\) cost.
Lemma 5
If N is a phylogenetic network with no reticulation node, there is only one scenario \(\xi\) for G and N. Moreover, \(\tilde{\mathsf {DC}}(G,N,\xi )=\mathsf {DC}(G,N)\).
Proof
N is a species tree and the scenario is determined by \(\xi :=\mathsf {M}|_{L(N)}\). In this case, \(\mathsf {M}_\xi =\mathsf {M}\), all visited edges in \(\hat{N}\) are of Type I or II. Thus, \(|\kappa _\xi (v,w)|=||\mathsf {M}(v),\mathsf {M}(w)||\) and the proof is straightforward from (1) and (2). \(\square\)
Displayed trees in tree-child networks
Here, we present several important properties of displayed trees in tree-child networks.
Given a tree-child network N on X, a set \(Y \subseteq E_R(N)\) is called perfect if, for each \(r \in R(N)\), Y contains exactly one edge whose bottom node is r. Given a perfect Y, the graph denoted \(N\setminus Y\), obtained from N by removing all edges from \(E_R(N) \setminus Y\) is a semi-binary tree on X, i.e., semi-binary network with no reticulations.
Lemma 6
Let N be a tree-child network on X and \(Y \subseteq E_R(N)\). Then, Y is perfect if and only if \(N \setminus Y\) is a semi-binary tree on X.
Proof
(\(\Rightarrow\)) In a tree-child network a node cannot have all descendands being reticulations. Therefore, \(N \setminus Y\) contains no unlabelled leaf. Next, every reticulation node r from N has exactly one parent in \(N \setminus Y\). Also, \(N \setminus Y\) is a connected graph on X, which follows by showing that each node is connected with the root. We omit easy inductive proof. (\(\Leftarrow\)) Let \(r \in R(N)\). Then, Y must contain exactly one reticulation edge whose bottom node is r. Otherwise, \(N \setminus Y\) is not a tree. \(\square\)
Since \(N \setminus Y\) is a semi-binary tree on X, contracting all semi-binary nodes from \(N \setminus Y\) yields a species tree \(N_Y\) on X. Next, the subgraph \(N \setminus Y\) of N is a subdivision of a tree \(N_Y\) on X. We conclude that \(N_Y\) is a displayed tree of N. We also have the following property.
Lemma 7
Let T be a displayed tree of a tree-child network N. Then, there is a perfect set Y such that \(N_Y=T\).
Proof
Let \(T'\) be a subgraph of N such that \(T'\) is a subdivision of T. Let \(Y=E_R(N) \setminus E(T')\). It remains to show that Y is perfect. Note that \(T'\) is a semi-binary tree on X and the rest follows similarly to the case \((\Leftarrow )\) from Lemma 6. \(\square\)
We say that the perfect set Y is induced by a tree T displayed by N if \(N_Y=T\). Note that different perfect sets may induce the same displayed tree. E.g., a tree child network with one reticulation and two leaves has two perfect sets each one inducing the same displayed tree.
Note that in more general cases of networks (see relaxed networks in Section Beyond tree-child networks) additional removal of non-labeled vertices with out-degree zero (i.e., unlabelled leaves) from \(N \setminus Y\) is required to obtain a semi-binary tree on X.
DC scores of scenarios vs. DC costs of displayed trees
This section presents several theoretical results connecting our scoring functions in the class of tree-child networks. Note that the notion of a cost will be used only with the \(\mathsf {DC}\) cost defined in (1) for trees and for phylogenetic networks in Problem 1, while for scenarios, we will use the notion of a \((\tilde{\mathsf {DC}})\) score.
To establish the correspondence between DC scores and DC costs, we first show that each perfect set Y determines a scenario. Recall that \(N_Y\) is obtained from N\Y by contracting semi-binary nodes. Let \(\hat{N}_Y\) be the graph obtained from \(\hat{N}\) by removing all edges e such that \(\sigma (e) \in E_R(N)\setminus Y\) and all subtrees whose root is the bottom node of e.
Lemma 8
Let N be a tree-child network and \(Y \subseteq E_R(N)\) be perfect. Then, \(\hat{N}_Y\) and N\Y are isomorphic, and the isomorphism is established by \(\sigma |_{V(\hat{N}_Y)}\).
Proof
The proof is by induction with unfolding steps. Using the same notation, we construct N/Y iteratively using the sequence of reticulation nodes from the construction of \(\hat{N}\). Let \(B_0=N\). For each \(i=0,1,\dots ,k\), \(B_i\) is inferred from \(B_{i-1}\) by removing the reticulation edge from \(E_R(N)\setminus Y\) adjacent to \(r_i\). It is not difficult to see that \(B_k=N \setminus Y\) (as we removed only edges from \(E_R(N)\setminus Y\)). \(\hat{N}_Y\) can be equivalently obtained by modification of the original unfolding step by removing the copy \(S_i\) or the original subtree rooted at \(r_i\) depending on whether the corresponding reticulation edge is in Y. \(\square\)
For a tree-child network N, a gene tree G and a perfect set \(Y \subseteq E_R(N)\), we define a scenario \(\xi _Y\) for G and N, such that for each gene leaf g labeled x, \(\xi _Y(g)\) is the only leaf in \(L(\hat{N}_Y) \subseteq L(\hat{N})\) labeled x. Correctness follows from Lemma 8. For example, in Fig. 2, if \(Y=\{ \langle u,p \rangle , \langle x,q \rangle \}\), then Y is perfect and \(N_Y=S_1\) from Figure 1. Moreover, for \(G=((a,(b,c)),d)\), \(\xi _Y\) maps a to \(a_1\), b to \(b_1\) c to \(c_1\) and d to \(d_1\) as depicted in \(E_1\).
We say that \(e \in E_R(N)\) is directly used by scenario \(\xi\) if there is a visited edge \(e'\) of Type I or II such that \(\sigma (e')=e\). Similarly, we say that reticulation edge e is potentially used by \(\xi\) if the sibling edge of e is bypassed by \(\xi\). By \(\Upsilon _\xi \subseteq E_R(N)\) we denote the set of reticulation edges used directly or potentially by \(\xi\) (see Fig. 2).
We say that \(Y \subseteq E_R(N)\) has a conflict if Y contains two sibling edges. We say that \(\xi\) is regular if \(\Upsilon _\xi\) has no conflict. For instance, \(\Upsilon _{E_4}\) for \(E_4\) from Fig. 2 has two possible conflicts in N. Observe that \(\Upsilon _\xi\) may not be perfect in general, even if \(\xi\) is regular. For instance, if \(G=(c,d)\) and \(\xi\) maps c to \(c_3\) in the network from Fig. 2, then \(\Upsilon _{\xi }=\{ \langle y,q \rangle \}\).
Now, we can state the crucial proposition that establishes a correspondence between regular scenarios and embedding to trees displayed by a tree-child network.
Proposition 9
(Scenario-Displayed Tree Correspondence) Let N be a tree-child network and let G be a gene tree. A scenario \(\xi\) for G and N is regular, if and only if for every perfect set Y such that \(\Upsilon _\xi \subseteq Y\), \(\tilde{\mathsf {DC}}(G,N,\xi )=\mathsf {DC}(G,N_Y)\).
Proof
(\(\Leftarrow\).) If \(\Upsilon _\xi\) is a subset of a perfect set Y, then \(\Upsilon _\xi\) has no conflict. Thus, \(\xi\) is regular. (\(\Rightarrow\)). If \(\xi\) is regular, then there is at least one perfect Y such that \(\Upsilon _\xi \subseteq Y\). Based on the definitions of \(\mathsf {DC}\) and \(\tilde{\mathsf {DC}}\), it is sufficient to prove
$$\begin{aligned} |\kappa _\xi (v,w)|=||\mathsf {M}(v),\mathsf {M}(w)||, \end{aligned}$$
(3)
for every edge \(\langle v,w \rangle \in E(G)\), where \(\mathsf {M}\) is the lca-mapping between G and the species tree \(N_Y\), for one fixed perfect set \(Y \supseteq \Upsilon _\xi\).
We have \(V(N/Y)=V(N) \supseteq V(N_Y)\), and \(\mathsf {M}(g) \in V(N_Y)\) is a leaf or a tree node. Let \(d=||\mathsf {M}(v),\mathsf {M}(w)||\) (in \(N_Y\)). Note that no removed edge from \(\hat{N}\) is visited by scenario \(\xi\), we conclude that \(\sigma (\mathsf {M}_\xi (g))=\mathsf {M}(g)\) for every g. If \(\mathsf {M}(w)=\mathsf {M}(v)\) then \(|\kappa _\xi (v,w)|=d=0\). Otherwise, assume \(\mathsf {M}(v) \succ \mathsf {M}(w)\). Let \(P=p_1,p_2,\dots ,p_m\) be the directed path from \(\mathsf {M}_\xi (v)\) to \(\mathsf {M}_\xi (w)\) in \(\hat{N}_Y\) (and in \(\hat{N}\)). Then, by Lemma 8, \(\sigma (P)\) is the unique directed path in N/Y from \(\mathsf {M}(v)\) to \(\mathsf {M}(w)\). Note that d equals one plus the number of nodes of outdegree 2 located strictly between \(\mathsf {M}(v)\) and \(\mathsf {M}(w)\) in \(N_Y\). The same statement holds in N/Y. We show that d equals the number of Type I and II edges in \(\hat{N}\). For an edge \(e_i=\langle p_i,p_{i+1} \rangle\), with \(0<i<m\) we have, the following types of edges:
-
Type I: the edge exists since \(m>1\).
-
Type II: \(\deg _{\hat{N}}(p_i)=\deg _{\hat{N}_Y}(p_i)=2\).
-
Type III: \(\deg _{\hat{N}}(p_i)=2\) and \(e_i\) bypasses the reticulation edge \(e'=\sigma (\langle p_i,p_{i+1}.{\mathsf {sibling}}) \rangle )\). The sibling of \(e'\) is in \(\Upsilon _\xi \subseteq Y\). Y is perfect, so \(e' \in E_R(N)\setminus Y\). Thus, \(\deg _{\hat{N}_Y}(p_i)=1\).
-
Type IV: \(\deg _{\hat{N}}(p_i)=\deg _{\hat{N}_Y}(p_i)=1\).
As \(\deg_{\hat{N}_Y}(p_i)=\deg_{N/Y}(\sigma (p_i))\), we see that \(\deg _{N/Y}(\sigma (p_i))=2\) if and only if \(i>1\) and \(e_i\) has Type II. Moreover, the directed path contains one edge of Type I. Thus, \(|\kappa _\xi (v,w)|=d\). This completes the proof of (3) and (\(\Rightarrow\)) implication. \(\square\)
In the following proposition, we show that the cost of a tree displayed by a network using a perfect set is bounded from below by the cost of its corresponding scenario.
Proposition 10
Let N be a tree-child network and let G be a gene tree. If \(Y \subseteq E_R(N)\) is perfect, then \(\mathsf {DC}(G,N_Y) \ge \tilde{\mathsf {DC}}(G,N,\xi _Y)\).
Proof
The proof is similar to the proof of Proposition 9, with the difference that we show \(||\mathsf {M}(v),\mathsf {M}(w)|| \ge |\kappa _{\xi _Y}(v,w)|\), for any gene tree edge \(\langle v,w \rangle\), where \(\mathsf {M}\) is the lca-mapping between G and \(N_Y\). The only difference is in the edges of Type III in the last part of the proof. Here, we have \(\deg _{\hat{N}}(p_i)=2\) and \(e_i\) bypasses the reticulation edge \(e'\). As we do not have the assumption that \(Y_\xi \subseteq Y\), \(e'\) may be present in Y. In such a case, \(\deg _{\hat{N}}(p_i)=2\). Thus, the node \(\sigma (p_i)\) has outdegree 2 in \(\hat{N}_Y\). We conclude that \(||\mathsf {M}(v),\mathsf {M}(w)||-|\kappa _{\xi _Y}(v,w)|\) is the number of edges of Type III on the directed path from \(\mathsf {M}_\xi (v)\) to \(\mathsf {M}_\xi (w)\) that bypass an edge from Y. This completes the first part of the proof. \(\square\)
Finally, we show that the equality between the score and the cost holds only if the induced scenario is regular.
Proposition 11
Let N be a tree-child network and let G be a gene tree. If \(Y \subseteq E_R(N)\) is perfect, then \(\mathsf {DC}(G,N_Y) = \tilde{\mathsf {DC}}(G,N,\xi _Y)\) if and only if \(\xi _Y\) is regular.
Proof
(\(\Leftarrow\)). It follows from Proposition 9. (\(\Rightarrow\)). From the proof of Proposition 10, we conclude that equality holds only if there is no edge in Y bypassed by \(\xi _Y\). Thus, each edge potentially used by \(\xi _Y\) must be in Y. As every directly used edge is also in Y, by the construction of \(\xi _Y\), we have \(\Upsilon _\xi \subseteq Y\). Thus, \(\Upsilon _\xi\) is regular. \(\square\)
The next theorem states that the cost of an optimal tree displayed by a network is bounded from below by the score of an optimal scenario.
Theorem 12
(Lower Bound Property) Let N be a tree-child network and let G be a gene tree. If \(S^*\) is an optimal tree displayed by N, and \(\xi ^*\) is an optimal scenario for G and N then \(\mathsf {DC}(G,S^*) \ge \tilde{\mathsf {DC}}(G,N,\xi ^*)\).
Proof
If \(S^*\) is a tree displayed by N then there is a perfect Y such that \(S^*=N_Y\). Thus, we have \(\mathsf {DC}(G,N_Y) \ge \tilde{\mathsf {DC}}(G,N,\xi _Y)\) from Proposition 10 and \(\tilde{\mathsf {DC}}(G,N,\xi _Y) \ge \tilde{\mathsf {DC}}(G,N,\xi ^*)\) from the definition of \(\xi ^*\). \(\square\)
In our example from Figs. 1 and 2, the cost of \(S_1\) and the score of \(E_1\) are equal. However, in general, a regular scenario may not exist. For instance, if \(G=(a,d)\), there is only one scenario \(\xi\) for N from Fig. 2, where a and d are mapped to \(a_1\) and \(d_1\), respectively. Then, \(\xi\) is not regular, and \(0=\tilde{\mathsf {DC}}(G,N,\xi )<\mathsf {DC}(G,N)=1\) (for \(S_1\) or \(S_2\)).
Finally, we present a crucial theoretical property used to solve ODT-DC in class of tree-child networks using solutions to instances of DC-MinRec.
Theorem 13
(Regularity) Let d be the score of an optimal scenario of a gene tree G and a tree-child network N. A tree S displayed by N with \(\mathsf {DC}(G,S)=d\) exists, if and only if there is an optimal regular scenario of G and N.
Proof
(\(\Leftarrow\)). Take any perfect set Y such that \(Y_{\xi ^*} \subseteq Y\) and \(S:=N_Y\). The equality follows from Proposition 11. (\(\Rightarrow\)). By Theorem 12, S is an optimal tree displayed by N, since d is a lower bound for the cost of a displayed tree. Now, we take the perfect set Y induced by S. The scenario \(\xi _Y\) has score d. Hence, it is optimal. By Proposition 11, \(\xi _Y\) is also regular. \(\square\)
Dynamic programming (DP) algorithms to solve DC-MinRec
Dynamic programming algorithms are commonly used in tree reconciliation, including models based on directed acyclic graphs (DAGs) [7, 9, 17, 23, 27], where a gene tree is mapped to a tree or a DAG through the lca-mapping or general mapping based on concepts close to our scenarios. Such approaches often lead to polynomial time solutions with square time complexity in the best case. Here, we present two dynamic programming solutions to Problem 4 by providing formulas to compute the score of an optimal scenario. We start with a simplified and computationally demanding DP formulation. Then, we show an efficient approach running in square time.
Additional notation: By \(v'\) and \(v''\), we denote the children of \(v \in T(N)\), and by \(r'\) the child of a reticulation node r. For simplicity, instead of \(\sigma (M_\xi (g))\) for a gene tree node g, we write \(\xi _g\) (i.e., \(\xi _{[.]}\) is a mapping from G to N).
Dynamic programming formulation in \(O(|G||N^3|)\) time: the first approach
We can express the formula for \(\delta\) in the following way:
$$\begin{aligned} \delta(g,s):={\left\{ \begin{array}{ll} \min _{s \succeq t,u} \delta (g',t)+\pi (s,t)+\delta (g'',u)+\pi (s,u) &{} g \in T(G)\text{ and }s \notin R(N), \\ 0 &{} g, s\ \text{are leaves with the same label,}\\ +\infty &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(4)
where, for \(s \succeq t\) in N,
$$\begin{aligned} \pi(s,t):={\left\{ \begin{array}{ll} ||s,t|| &{} ||s,t|| \le 1\ \ \text{(the empty path or Type I)}, \\ 1+\pi (s,t.\mathsf {parent}) &{} \{t, t.{\mathsf {sibling}}, t.\mathsf {parent}\} \cap R(N) = \emptyset \ \ \text{(Tp. II)}, \\ 1 + \min (\pi (s,\dot{t}),\pi (s,\ddot{t})) &{} t \in R(N)\ \ \text{(Type II)}, \\ \pi (s,t.\mathsf {parent}) &{} t.{\mathsf {sibling}} \text{ or } t.\mathsf {parent}\in R(N)\ \text{(Type III/IV)}, \\ +\infty &{}\text{otherwise.}\end{array}\right. } \end{aligned}$$
(5)
Here \(\dot{r}\) and \(\ddot{r}\) denote the parents of a reticulation node r.
The correctness of the above formulas follows from the following two Lemmas.
Lemma 14
(Correctness of \(\pi\)) For \(s \succeq t\), \(\pi (s,t)\) is the minimal number of Type I or II edges between two nodes \(a \succeq b\) in \(\hat{N}\) such that \(\sigma (a)=s\) and \(\sigma (b)=t\).
Proof
It follows from Theorem 3, that \(\pi\) can be computed directly from N. The proof is by induction on the length of the directed path. The cases in \(\pi\) formulas correspond directly to the types of edges (see comments in (5)), where we add/set 1 if the visited edge has Type I or II. Note that there is only one branching when \(t \in R(N)\). In such a case, the formula will choose the directed path to s with the lower cost. We omit technical details. \(\square\)
Lemma 15
(Correctness of \(\delta\)) For g and s, \(\delta (g,s)\) from Equation (4) is the minimal number of Type I/II edges visited by edges from E(G|g) in a scenario \(\xi\), in the set of all scenarios \(\xi\) between G and N such that \(\xi _g=s\).
Proof
The proof follows by the induction on the structure of G and N, where Lemma 14 is applied to prove the induction hypothesis in each step. We omit easy details. \(\square\)
Computing \(\delta\) using the above formulas requires \(O(|N|^2+|G||N|^3)\) time and \(O(|G||N|+|N|^2)\) space. Therefore, this approach is rather prohibitive for larger instances.
Efficient DP solution in O(|G||N|) time
By G|g, we denote the subtree of G rooted at g. The main component of dynamic programming is \(\delta\) such that for \(g \in V(G)\) and \(s \in V(N)\), \(\delta (g,s)\) is the minimum score for G|g in the set of all scenarios \(\xi\) between G|g and \(\hat{N}\) such that \(\xi _g=s\). For simplicity, we ignore \(-1\) from the \(\tilde{\mathsf {DC}}\) formula in the partial costs in \(\delta\) as this yields a constant term dependent on the size of G.
Let \(\tau (s)\) be 0 if s is a reticulation, and 1 otherwise. Then, we have the following dynamic programming formula that solves DC-MinRec:
In the next Lemma, we express properties satisfied by the above formulas.
Lemma 16
Let \(g \in V(G)\), \(s \in V(N)\) and all scenarios below are for G and N.
- D1:
-
\(\delta (g,s)\) is equal to minimum number of Type I/II edges visited by edges from E(G|g) among scenarios \(\xi\) satisfying \(\xi _g=s\).
- D2:
-
If c is a child of g and t is not a reticulation. Then, \(\delta ^\uparrow (c,t)\) is equal to minimum number of Type I/II edges visited by edges from E(G|c) plus the number of edges \(e'=\langle a,b \rangle\) of Type II visited by \(\{\langle c,g \rangle \}\) with \(t \succeq \sigma (a)\) among scenarios \(\xi\) such that \(\xi _{g} = s \succ t \succeq \xi _c\).
- D3:
-
If c is a child of g and s is a tree node. Then, \(\delta ^f(c,s)\) is equal to minimum number of Type I/II edges visited by edges from \(E(G|c) \cup \{\langle c,g \rangle \}\) among scenarios \(\xi\) satisfying \(\xi _{g}=s\).
Proof
We follow with D1, where the proofs for D2-D3 are included as internal statements. The proof is by induction on the structure of G and N. The base is when g and s are leaves, for which D1 is obvious. Inductive assumption: D1 holds, for every x, y such that \(g \succeq x\), \(s \succeq y\), and either \(x \ne g\) or \(y \ne s\). Inductive hypothesis: D1 holds for g and s, where at least one of g and s is not a leaf.
The proof for D2. Let \(\xi\) be the scenario having the minimal number of I/II-edges as defined in D2 and let k be the number of these edges. We follow by induction by assuming that \(\delta ^\uparrow (c,u)\) satisfies D2 for all u, such that \(t \succ u \notin R(N)\). We prove D2 for t. The base step is when \(\xi _c=t\). Then, \(k=\delta (c,t)\) and D2 follows from the induction assumption for D1 with cases (11) (the first expression when t is a tree node) and (13) (when t is a leaf). Assume that \(t \succ \xi _c\), then \(\langle g,c \rangle\) visits the edge \(e=\langle a,b \rangle\) such that \(\sigma (e)=\langle t,t' \rangle\). We have three cases. (Case D2.i) If both children of t are tree nodes, then e has Type II. Note that \(\xi\) also has the minimal number of edges satisfying the inductive assumption with nodes c and \(t'\). Otherwise, if there is a scenario \(\xi '\) with a score \(<k-1\) for c and \(t'\) then by visiting e, we will have a scenario with \(<k\) edges for D2. This contradicts the assumption that k is minimal. Thus, \(k=1+\delta ^\uparrow (c,t')\), by the inductive assumption. As \(\tau (t')\tau (t'')=1\), this matches the second expression in (11). (Case D2.ii) If \(t'\) is a tree node and \(t''\) is a reticulation then \(\xi\) bypasses the reticulation edge \(\langle t,t'' \rangle\). Similarly to the previous case, we show that \(\xi\) satisfies inductive assumption with c and \(t'\) (we omit details). Thus, \(k=\delta ^\uparrow (c,t')\) and \(\tau (t')\tau (t'')=0\), again this matches the second expression in (11). (Case D2.iii) If \(t'\) is a tree node and \(t'\) is a reticulation then \(\xi\) directly uses reticulation edge \(\langle t,t' \rangle\), i.e., e has Type II. Again, we show that \(\xi\) satisfies inductive assumption with c and a tree node v being the child of reticulation \(t'\) (we omit details). By the inductive assumption, we have \(k=1+\delta ^\uparrow (c,v)\), which equals \(\delta ^\uparrow (c,t')=1+\delta ^\uparrow (c,v)\), by (11) with \(\tau (t')\tau (t'')=0\), then by (12).
The proof for D3. Let \(\xi\) be the scenario having the minimal number of I/II-edges as defined in D3 and let k be the number of these edges. If \(\xi _c=s\), then there is no edge visited by \(\langle c,g \rangle\). Thus, \(k=\delta (c,s)\) by the induction assumption, which is the first expression in \(\min\) of (6). Otherwise, assume that for a child \(s'\) of s, \(s' \succeq \xi _c=t\). Then, there is one edge of Type I visited by \(\langle c,g \rangle\). We have two cases. (Case D3.i) If \(s'\) is a reticulation, then \(\tau (s')=0\) and \(k=\delta ^\uparrow (g,s')=1+\delta ^\uparrow (g,t)\) where t is the child of \(s'\). The latter follows from (12) and D2 (with \(s:=t\)). Note that \(\xi\) has the minimal number of edges \(k-1\) satisfying the corresponding assumptions of D2 (see a similar argument in the proof of case D2.i). (Case D3.ii) If \(s'\) is a tree node or a leaf then \(k=1+\delta ^\uparrow (g,s')\) by D2 (with \(s'\)). In both cases Type I edge is included. The rest is similar to case D3.i. This completes the proof of D3.
The proof of D1. It follows from D3, that \(\delta ^f(c,s)=\min _{s\succeq t} \delta (c,t) + \pi (s,t))\) (see def. of \(\pi\) in section Dynamic programming formulation in \(O(|G||N^3|)\) time: the first approach), for a child c of g. Thus, if g and s are tree nodes we show that \(\delta (g,s)=(\min _{s\succeq t} \delta (g',t) + \pi (s,t)) + (\min _{s\succeq u} \delta (g'',u) + \pi (s,u))\). The proof follows similarly to the previous cases by analysing \(\xi\) with the minimal number of edges satisfying constraints from D1 (see also the recursion from (4) and Lemma 15). The case relates to (6). We skip details. Finally, we have two remaining cases. If g is a leaf and s is a tree node, then there is no scenario \(\xi\) satisfying \(\xi _g=s\). Then, the number is \(+\infty\) (case (9)). If s is a leaf and g is a tree node, we have \(\delta (g,s)=0\) if all leaves below g are labeled by the label of s, and \(+\infty\) otherwise. This agrees with the number of visited Type I/II edges, where, in the second case, the set of scenarios satisfying the assumptions is empty. \(\square\)
The optimal score is given by the following theorem, whose proof follows immediately from the definitions of \(\delta\), \(\tilde{\mathsf {DC}}\) and Lemma 16.
Theorem 17
Given a gene tree G and a tree-child network N. The score of an optimal scenario \(\xi ^*\) is \(\tilde{\mathsf {DC}}(G,N,\xi ^*) = -|E(G)| + \min _{s \in V(N)} \delta (G.{{\,\mathrm{\mathsf {root}}\,}},s).\)
To infer an optimal scenario, we apply standard backtracking based on values of \(\delta\) array. Since there are three arrays, each of size |G||N| and every cell of an array can be computed in O(1) time, DP has O(|G||N|) time and space complexity. Note that in implementation \(\delta ^f\) can be embedded into \(\delta\) computation. Thus, the space may be reduced to two arrays.
Inferring used reticulations edges from DP
An optimal scenario can be inferred from DP formulas using standard backtracking. However, this scenario may not be perfect. To further utilize the results of DP, we infer the set of used reticulation edges. For two nodes v and w, let \(\rho (v,w)=\{ \langle v,w \rangle \}\) denote the one-element set with \(\langle v,w \rangle\) if this edge is a reticulation edge in N, and \(\rho (v,w)=\emptyset\) otherwise. Similarly, by \(\bar{\rho }(v,w)\) we denote the one-element set with the sibling edge of \(e=\langle v,w \rangle\) if e is a reticulation edge in N, and \(\bar{\rho }(v,w)=\emptyset\), otherwise. Then, DP components \(\delta\), \(\delta ^f\) and \(\delta ^\uparrow\) are associated with reticulation edge usage rules u, \(u^f\), and \(u^\uparrow\), resp., as follows:
$$\begin{aligned} u(g,s)&={\left\{ \begin{array}{ll} u^f(g',s) \cup u^f(g'',s) &{} \text{in (6),} \\ \emptyset &{} \text{in (7)-(9),} \end{array}\right. }\\ u^f(g,s)&={\left\{ \begin{array}{ll} u(g,s) &{} \text{if}\ \delta ^f(g,s)=\delta (g,s)\ \text{in}\ (10),\\ u^\uparrow (g,c) \cup \rho (s,c) & \text{if}\ \delta^f(g,s)=\tau (c)+\delta^\uparrow(g,c)\ \text{for some}\ c \in \{s',s''\} \ \text{in}\ (10), \end{array}\right.}\\ u^\uparrow(g,s)&={\left\{ \begin{array}{ll} u(g,s) &{} \text{if}\ \delta ^\uparrow (g,s)=\delta (g,s)~\text{in}~(11)~\text{or}~(13),\\ u^\uparrow (g,c) &{}\text{in}~(12), \\ u^\uparrow (g,c) \cup \rho (s,c) \cup \bar{\rho }(s,c.{\mathsf {sibling}}) &{} \delta ^\uparrow (g,s)=\tau (s')\tau (s'')+\delta ^\uparrow (g,c) \\ &{} \text {for some}~c \in \{s',s''\}~\text{in}~(11). \end{array}\right. } \end{aligned}$$
The correctness of above formulas follows from the next lemma.
Lemma 18
If the backtracking of DP results in a scenario \(\xi\), then \(\Upsilon _\xi =u(G.{{\,\mathrm{\mathsf {root}}\,}},\xi _{G.{{\,\mathrm{\mathsf {root}}\,}}})\).
Proof
The proof follows by analysis of cases when reticulation edges are directly or potentially used by \(\xi\) and it is based on the details from the proof of Lemma 16. There are three main cases when a reticulation edge is inserted using \(\rho\) or \(\bar{\rho }\).
Case I. When the first edge (Type I) on the visited, directed path is a reticulation edge, then, its corresponding reticulation edge from N is inserted in \(u^f(g,s)\) using \(\rho (s,c)\) in the second case. See also (D3.ii) in the proof of Lemma 16.
Case II. When the visited reticulation edge has Type II, then the corresponding reticulation edge is inserted in \(u^\uparrow (g,s)\) using \(\rho (s,c)\) in the last case. See also (D2.iii) in the proof of Lemma 16.
Case III. When the scenario bypasses a reticulation edge e, then e inserted in \(u^\uparrow (g,s)\) using \(\bar{\rho }(s,c.{\mathsf {sibling}})\) in the last case. See also (D2.ii) in the proof of Lemma 16. \(\square\)
Inferring optimal displayed trees under deep coalescence cost
In this Section, we propose algorithms to solve ODT-DC in the class of tree-child networks. We also show how to adopt the solution to use structural properties of tree-child networks (e.g., level-k tree-child networks). Also, we answer whether the problem can be analogously solved when the class of networks is broader than tree-child.
Solution to ODT-DC in the class of tree-child networks
Theorem 13 motivates the following general branching algorithm to solve ODT-DC. Suppose DP returns a solution with a conflict. Then, such a conflict can be resolved by branching and solving two sub-instances of the problem with phylogenetic networks induced from the input phylogenetic network by removing exactly one edge from the conflict. Let \(N_e\) be the tree-child network obtained from \(N/\{e\}\) by contracting all semi-binary nodesFootnote 3. Algorithm 1 details the procedure to infer an optimal tree displayed by a given network. Here, branching occurs when there is a conflict in the set of used reticulation edges. Thus, if the number of conflicts is low, e.g., when G and N are similar, we expect a small number of DP invocations.
Correctness of Algorithm 1 follows from Theorem 13 and the following theorem.
Theorem 19
If e, \(e' \in E_R(N)\) are sibling edges then \(\mathsf {DC}(G,N)=\min \{ \mathsf {DC}(G,N_{e}), \mathsf {DC}(G,N_{e'}) \}\). Moreover, T is an optimal tree displayed by N if and only if T is an optimal tree displayed by a network \(N_e\) or \(N_{e'}\) with minimum cost.
Proof
Let \(\Delta (N)\) be the set of all trees displayed by N. Then, the first statement follows from the fact that for tree-child networks, \(\Delta (N)=\Delta (N_e) \cup \Delta (N_{e'})\) and \(\Delta (N_e) \cap \Delta (N_{e'})=\emptyset\). The second statement follows easily from the above observation. \(\square\)
In the worst case, we need to branch for every reticulation twice, which gives \(2^{r+1}-1\) invocations of DP. Thus, Algorithm 1 has time complexity \(O(2^r |G||N|)\) in the worst case. However, as mentioned previously, we expect Algorithm 1 to behave better than worst complexity in practice. See also our experimental evaluation in Section Results.
Lower and upper bounds of the optimal cost of a displayed tree
[3]Recall that N/X is the network obtained from N by removing all edges from X.
In applications where only the optimal cost is needed, for instance, in problems of network inference, we can use the Lower Bound Theorem 12. As the cost of an optimal displayed tree is bounded below by the score from DP, we can also compute the upper bound using regular scenarios returned from multiple invocations of DP. See details in Algorithm 2.
Lemma 20
For a gene tree G and a tree-child network N, Algorithm 2 returns l and u such that \(l \le \mathsf {DC}(G,N) \le u\).
Proof
The proof follows by induction on the number of reticulation nodes in a network. If N is a tree, then the statement is obvious, as the scenario has no conflicts \(l=u=DC(G,N)\). Otherwise, we have several cases. If the scenario from DP has no conflict, then we have the exact solution (see Line 4). Otherwise, there is a conflict, and if the recursion depth is reached, then the computation is completed in Line 5 with proper bounds (see Theorem 12). In the final case, we have two pairs of bounds from two invocations. By the inductive assumption, the bounds are correct for \(N_e\) and \(N_{e'}\). For the lower bound of N, we have to take a minimum of l and \(l'\), as there may exist the optimal scenario for the network \(N_e\) or \(N_e'\) with the cost \(\min (l,l')\) in the “worst” case. Such a scenario is optimal for N. Similarly, we proceed with the upper bound. \(\square\)
Inferring optimal trees displayed by level-k tree-child networks
Our results can also be extended to level-k tree-child networks. The definition and properties are adopted from [9, 39, 40]. A level-k network is a phylogenetic network in which every biconnected component has at most k reticulation nodes [39]. If B is a biconnected component of N, then by \(B.{{\,\mathrm{\mathsf {root}}\,}}\) we denote the unique node in B with no ancestors in B. Using the notation from [9], by bc(N) we denote the tree obtained from N by contracting all its biconnected components. Let \(\mathsf {Lab}(N)\) denote the set of species present in N as leaf labels.
In Algorithm 3, edges visited by subtrees of G have to be connected in the embedding. Therefore, for each non-root component B in b(N), we minimize the score using the additional costs of a path to the root of B. Formally, \({\mathsf {DC}^{\uparrow }}(G,N)\) is the minimum value of \(\mathsf {DC}(G,S)+||M(G.{{\,\mathrm{\mathsf {root}}\,}}),S.{{\,\mathrm{\mathsf {root}}\,}}||\) in the set of all displayed trees S of N. Computing the value (almost) does not require modification of our algorithms. Here, instead of the formula from Theorem 17, we compute \({\mathsf {DC}^{\uparrow }}(G,N)\) using \(-|E(G)|+\delta ^\uparrow (G.{{\,\mathrm{\mathsf {root}}\,}},S.{{\,\mathrm{\mathsf {root}}\,}})\). The correctness follows from Lemma 16 case D2. The formula can be easily embedded into Algorithm 1. The time complexity of Algorithm 3 is \(O(2^k|G||N|)\).
Beyond tree-child networks
DP can be extended to analyse a broader class of networks, which is more beneficial from a practical point of view. Assume that instead of a tree-child network condition, our class of networks satisfies a relaxed condition: each node has at most one reticulation child. This assumption admits the child of reticulation to be a reticulation, which is not allowed in tree-child networks. Such networks, we call relaxed. We did not find an equivalent class in the literature. Note that the relaxed class is incomparable with a well-known class of Tree-Sibling networks (see networks \(N_1\) and \(N_2\) in Fig. 3), characterized by the condition: each reticulation has a tree-node sibling. Also, relaxed networks are not stable [41] in general, since the relaxed condition admits non-compressed networks (see Theorem 1 from [38]). For example, \(N_1\) from Fig. 3 is not stable.
For the relaxed class, we modify DP in (12): \(\tau (s')+\delta ^\uparrow (g,s')\), and in usage rules in the 2nd case of \(u^\uparrow\) referring to (12): \(u^\uparrow (g,s) \cup \rho (s,c)\), which is needed when the child is also a reticulation. Under this modification, Algorithm 1 returns a correct optimal displayed tree. We omit details for brevity.
We also analysed a general class of binary networks, i.e., in which a tree node may have two reticulation children. However, DP cannot correctly analyse such networks. When embedding a gene tree (a, b) into the network \(N_3\) from Fig. 3, we see that the optimal displayed tree is \(S=((a,b),c)\) with the cost 0. Here, S is constructed by removing a node x and all three incident edges, and a tree node \(x.\mathsf {parent}\) with two children is also contracted. In the current DP, when a gene edge \(\langle a.\mathsf {parent},a \rangle\) from G visits \(x.\mathsf {parent}\) DP will increase the cost. Therefore, the lower bound property is not satisfied in this case unless a solution in which such removed tree nodes are detected is implemented. It remains open whether it can be done in polynomial time without checking all variants of displayed trees.
Optimal displayed trees under gene duplication cost (ODT-DUP)
Algorithms presented in the previous Section can naturally by modified to operate on cost functions such as gene duplication or gene duplication and loss [16]. The main difference is the way an optimal scenario is computed. Here, we present a dynamic programming solution for optimal scenario problem under duplication cost. Since the results are analogous to the deep coalescence cost, we omit most of the theoretical details for brevity.
Given previously defined lca-mapping between a gene tree G and a species tree S, \(\mathsf {M}:V(G) \rightarrow V(S)\), we define duplication contribution of vertex \(g \in V(G)\), which has two children \(g', g''\) as
$$\begin{aligned} \mathsf {dup}(g) = {\left\{ \begin{array}{ll} 1 &{} \text {if } \mathsf {M}(g) = \mathsf {M}(g') \text { or } \mathsf {M}(g) = \mathsf {M}(g'') \text {,}\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(14)
If g is a leaf, then \(\mathsf {dup}(g) = 0\). Then, the duplication cost between G and S (denoted by \(\mathsf {DUP}(G, S)\)) is defined as \(\mathsf {DUP}(G, S) = \sum _{g \in T(G)} \mathsf {dup}(g)\) [22].
In this section we solve the following problem.
Problem 21
(DUP-ODT) Given a rooted tree G and a phylogenetic network N. Find a tree S displayed by N with the minimum \(\mathsf {DUP}(G,S)\).
Similarly to deep coalescence score, given a scenario \(\xi :V(G) \rightarrow V(S)\), we define duplication score contribution of a vertex \(g \in V(G)\) as follows. If there is a child \(g'\) of g such that \(\mathsf {M}_\xi (g) = \mathsf {M}_\xi (g')\), then \(\tilde{\mathsf {dup}}(g)=1\). Otherwise, \(\tilde{\mathsf {dup}}(g)=0\). Then, the duplication score for G, N and a scenario \(\xi\) is defined as \(\tilde{\mathsf {DUP}}(G,N,\xi )=\sum _{g \in T(G)} \tilde{\mathsf {dup}}(g)\). Duplication score has analogous properties to the ones proved in Theorem 12 and Theorem 13, thus the score can be applied in our branch-and-bound framework. We omit details for brevity.
Similar to the DC case, we have two dynamic programming arrays, \(\delta\) and \(\delta ^\uparrow\). Recall, that \(\delta (g,s)\) is the minimum score for G|g in the set of all scenarios \(\xi\) between G|g and \(\hat{N}\) such that \(\xi _g=s\), and \(\delta ^\uparrow (g, s)\) is the minimum score for G|g in the set of all scenarios \(\xi\) between G|g and \(\hat{N}\) such that \(\xi _g=y\), where \(s \succeq y\). Dynamic programming formulation is as follows
DP components \(\delta\) and \(\delta ^\uparrow\) are associated with usage rules u, \(u^\uparrow\) respectively, as follows:
$$\begin{aligned} u(g,s)&={\left\{ \begin{array}{ll} u(g', s) \cup u^\uparrow (g'', s) &{} \text {for first case in (15),} \\ u^\uparrow (g', s) \cup u(g'', s) &{} \text {for second case in (15),} \\ u^\uparrow (g', s') \cup u^\uparrow (g'', s'') \cup \rho (s, s') \cup \rho (s, s'') &{} \text {for third case in (15),} \\ u^\uparrow (g'', s') \cup u^\uparrow (g', s'') \cup \rho (s, s') \cup \rho (s, s'') &{} \text {for fourth case in (15),} \\ \emptyset &{} \text {in (16)-(18),} \end{array}\right. }\\ u^\uparrow (g,s)&={\left\{ \begin{array}{ll} u(g,s) &{}\text{if}~\delta ^\uparrow (g,s)=\delta (g,s)~\text{in}~(19)~\text{or}~(21),\\ u^\uparrow (g,c) &{} \text {in (20),}\\ u^\uparrow (g,c) \cup \rho (s,c) &{} \text{if}~\delta ^\uparrow (g,s)=\delta ^\uparrow (g,c)~\text{for some c}~\in \{s',s''\}~\text{in}~(19). \end{array}\right. } \end{aligned}$$