In this section, we consider the MaxES and the MaxEC problems. Although sharing the same objectives, the minimization and maximization variants are not equivalent from an approximation point of view.
Given a relation graph R, the value of a solution \(R'\) for MaxES (MaxEC, respectively) over instance R [over instance (R, S), respectively] is called the agreement value of \(R'\) and it is denoted by \(A(R',R)\). Moreover, given a gene tree G, we denote by A(G, R) the agreement between the relation graph associated with G and R.
Next, we give a lower bound on the agreement value of an optimal solution of MaxES and of MaxEC.
Lemma 6
Given a relation graph R (a relation graph R and a species tree S, respectively), an optimal solution of MaxES over instance R [an optimal solution of MaxEC over instance (R, S), respectively] has an agreement value of at least \(\frac{n^2}{8}\).
Proof
Consider a relation graph R and a species tree S for the MaxEC problem. Let \(R'=(V(R), \emptyset )\) and \(R''=\left(V(R), {V(R) \atopwithdelims ()2}\right)\) be two solutions for MaxES over instance R [for MaxEC over instance (R, S), respectively]. It is easy to see that \(R'\) and \(R''\) are both feasible solutions of MaxES and of MaxEC. Since for each pair \(\{ u,v \}\), with \(u,v \in V(R)\), \(u \ne v\), exactly one of \(R'\) and \(R''\) agrees with R, it holds that
$$\begin{aligned} A(R',R)+ A(R'',R) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \end{aligned}$$
Then at least one of \(R'\), \(R''\) must have an agreement value of at least \(\frac{1}{2}\left( {\begin{array}{c}n\\ 2\end{array}}\right)\), hence an optimal solution of MaxES and MaxEC has an agreement value of at least \(\frac{1}{2}\left( {\begin{array}{c}n\\ 2\end{array}}\right) \ge \frac{n^2}{8}\). \(\square\)
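The counting argument can be checked numerically. The following is a minimal sketch, assuming a relation graph is represented as a set of unordered vertex pairs over vertices 0..n-1; this representation and the helper names `agreement` and `best_trivial_agreement` are illustrative assumptions, not part of the original text. It covers only the pair-counting part of the proof and does not check feasibility.

```python
from itertools import combinations


def agreement(n, edges_a, edges_b):
    """Count the vertex pairs {u, v} on which two relation graphs agree.

    Both graphs are over vertices 0..n-1 and given as sets of frozenset pairs.
    """
    return sum(
        (frozenset(p) in edges_a) == (frozenset(p) in edges_b)
        for p in combinations(range(n), 2)
    )


def best_trivial_agreement(n, edges_r):
    """Return the larger agreement of the empty and complete graphs with R."""
    empty = set()
    complete = {frozenset(p) for p in combinations(range(n), 2)}
    a_empty = agreement(n, empty, edges_r)
    a_complete = agreement(n, complete, edges_r)
    # Every pair agrees with exactly one of the two trivial graphs.
    assert a_empty + a_complete == n * (n - 1) // 2
    return max(a_empty, a_complete)


# Hypothetical 5-vertex relation graph with three edges.
R = {frozenset(e) for e in [(0, 1), (1, 2), (3, 4)]}
best = best_trivial_agreement(5, R)
```

On this toy instance the empty graph agrees with R on the seven non-edges, which exceeds \(n^2/8 = 25/8\).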
Since it is possible to compute a solution of MaxES within an additive error of \(\varepsilon n^2\) of the optimum, for each \(\varepsilon > 0\) [16], it follows that MaxES admits a PTAS.
Let OPT(R) be the value of an optimal solution on R, and let c be such that \(OPT(R) = cn^2\). The additive \(\varepsilon n^2\) approximation algorithm for cograph editing [16] yields a solution of value at least \((c - \varepsilon ) n^2\). As \(c \ge 1{/}8\) by Lemma 6, \(\varepsilon\) can be adjusted so that, for any \(0< \varepsilon ' < 1\), \((c - \varepsilon )n^2 \ge (1 - \varepsilon ')cn^2\), hence yielding a PTAS. In the more general case, this algorithm does not ensure that genes from the same species remain paralogs. However, the authors of [16] claim that their approximation algorithm applies to any hereditary graph property (i.e. one preserved under vertex deletion), which holds for satisfiability.
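One admissible choice, given here only as an illustration and not necessarily the one intended in [16], is \(\varepsilon = \varepsilon '/8\). Using \(c \ge 1{/}8\), this gives
$$\begin{aligned} (c - \varepsilon ) n^2 = cn^2 - \frac{\varepsilon '}{8} n^2 \ge cn^2 - \varepsilon ' c n^2 = (1 - \varepsilon ') cn^2 \end{aligned}$$
so the additive guarantee translates into a multiplicative one.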
A PTAS for MaxEC
The PTAS for MaxES does not guarantee that the returned relation graph \(R'\) (and its corresponding gene tree \(G'\)) is S-consistent with the given species tree S. In this section, we present a PTAS for MaxEC based on smooth polynomial integer programming [22], a technique that has been applied to design PTASs for problems like maximum quartet consistency [23] or maximum consensus clustering [24].
As for maximum quartet consistency, the MaxEC problem is reduced to the assignment of leaves in \({\Gamma }\) to a tree, and the resulting tree is then used to reconstruct a gene tree \(G'\) that is consistent with S and whose relation graph requires at most \(\varepsilon n^2\) modifications with respect to the original graph. In order to guarantee the S-consistency of the reconstructed gene tree, we need several technical arguments that are not used for maximum quartet consistency. Recall that we are considering binary trees.
Before giving the details, we present an overview of the PTAS. First, in “The compressed tree \(G^k\)” section, we show that starting from a gene tree \(G'\) we can compute a compressed tree \(G^k\) that has at most k internal nodes and at most k leaves, where \(k > 0\) is a constant. In order to construct such a compressed tree, first in “The unlabeled compressed tree \(T^k\)” section we compute an unlabeled compressed tree \(T^k\), and then in “A PTAS of MaxLA by smooth polynomial integer programming” section we compute a compressed tree \(G^k\) from \(T^k\) by using smooth polynomial integer programming. Finally, we show in “Building a feasible solution” section how to reconstruct an S-consistent gene tree from \(G^k\).
The compressed tree \(G^k\)
First, we will focus on the compressed tree, and we show that, given an optimal solution \(R'\) of MaxEC, there exists a compressed tree that respects a (large) subset of the speciation/duplication relations for \(R'\).
Consider an optimal solution \(R'\) of MaxEC, and let \((G',ev_{G'})\) be a DS-tree, where \(G'\) is the gene tree corresponding to \(R'\). Recall that each internal node of \(G'\) is associated by \(ev_{G'}\) either with a duplication (Dup) or with a speciation (Spec). We present the formal definition of a compressed tree \(G^k\) associated with \((G',ev_{G'})\) (see Fig. 3).
Definition 3
Given a constant \(k > 0\) and a DS-tree \((G',ev_{G'})\), a compressed tree \(G^k\) associated with \((G',ev_{G'})\) is a tree that has at most k internal nodes and at most k leaves, which are called leafsets. An internal node v can be a regular internal node or can belong to a twoset internal node \(\langle u,v \rangle\) such that \(v \in ch(u)\), and both u and v have exactly one leafset as a child. The twoset internal nodes of \(G^k\) are disjoint, that is \(\langle u,v \rangle\) and \(\langle v,w \rangle\) cannot both be twoset internal nodes of \(G^k\). Moreover, the following properties hold:

the leafsets of \(G^k\) induce a partition of \({\Gamma }\) and each leafset contains at most 8n / k elements of \({\Gamma }\)

each internal node of \(G^k\) is associated with one of two possible events, Dup or Spec, by the function \(ev_{G^k}\)

let \(I_{v_1}\), \(I_{v_2}\) be two leafsets connected to nodes \(u_1\) and \(u_2\), respectively, such that \(\langle u_1,u_2 \rangle\) is not a twoset internal node, let \(l_1 \in I_{v_1}\), \(l_2 \in I_{v_2}\), and let \(x=lca_{G'}(l_1,l_2)\) and \(y=lca_{G^k}(I_{v_1},I_{v_2})\); then \(ev_{G'}(x) = ev_{G^k}(y)\).
Note that a leafset \(I_v\) of \(G^k\) is both a set of leaves of \(G'\), and a leaf of \(G^k\). It will sometimes be useful to clarify which one we wish to refer to, and so we denote by \(L(I_v)\) the set of leaves that belong to \(I_v\).
Now, we provide a constructive proof that shows that, starting from a solution \(R'\) (whose corresponding gene tree is \(G'\)) of MaxEC over instance (R, S), there exists such a compressed tree \(G^k\).
Consider the following algorithm. First, the algorithm initializes \(G^k\) to \(G'\) and all internal nodes are unmarked. Then, the algorithm traverses \(G'\) and constructs the tree \(G^k\) as described in Algorithm Compressed Tree (\(G'\)).
When the algorithm stops it follows that each leafset has size at most 8n / k. Notice that, given a twoset internal node \(\langle v_1,v_2 \rangle\), the leaves assigned to the leafsets \(I_{z_1}\), \(I_{z_2}\) connected to \(v_1\) and \(v_2\) are considered as a single leafset with reference to the relation between elements in \(L(I_{z_1}) \cup L(I_{z_2})\).
Next, we show that the algorithm returns a compressed tree \(G^k\), with at most k internal nodes and k leafsets.
Lemma 7
Given a gene tree \(G'\), Algorithm Compressed Tree (\(G'\)) returns a tree \(G^k\) with at most k internal nodes and k leafsets.
Proof
First, consider the set of regular nodes of \(G^k\). Consider the set \(V^k_1\) of those nodes of \(G^k\) that the algorithm defines because the subtree rooted at each such node contains at least 8n / k unassigned leaves. It follows that at most k / 8 such nodes are chosen.
Consider the set \(V^k_2\) of nodes of \(G^k\) defined as internal nodes because they are the least common ancestor of two internal nodes of \(V^k_1\). Now, if we restrict \(G^k\) to \(V^k_1 \cup V^k_2\), we obtain a tree having at most k / 8 leaves, as the leaves by construction are only nodes in \(V^k_1\), where each internal node, except for the root, has degree at least three. Hence \(|V^k_2| \le |V^k_1| \le k{/}8\).
Let v and z be two nodes in \(V^k_1 \cup V^k_2\), such that z is an ancestor of v in \(G^k\), and no node strictly between v and z on the path from v to z belongs to \(V^k_1 \cup V^k_2\). It follows that, by construction, at most one twoset internal node on the path between v and z is defined in \(G^k\). Hence at most two internal nodes are defined on the path between v and z in \(G^k\), and since \(|V^k_1 \cup V^k_2| \le k{/}4\), it follows that \(G^k\) contains at most k / 4 twoset internal nodes. Thus \(G^k\) consists of at most \(k/4+k/2=(3/4) k\) internal nodes.
Now, consider the defined leafsets. For each twoset internal node \(\langle v_1, v_2 \rangle\), there exist at most two leafsets connected to \(v_1\) or \(v_2\); since there are at most k / 4 twoset internal nodes, this gives at most k / 2 leafsets. For each of the at most k / 4 internal nodes \(v \in V^k_1 \cup V^k_2\), at most two leafsets are connected to v, as \(G'\) is binary. Hence there exist at most k / 2 leafsets connected to internal nodes \(v \in V^k_1 \cup V^k_2\). Thus the number of leafsets is bounded by \(k/2+k/2=k\). \(\square\)
In order to prove that \(G^k\) is a compressed tree, in addition to Lemma 7 we need the following result.
Lemma 8
Given a gene tree \(G'\) and a species tree S, let \(G^k\) be the tree computed by Algorithm Compressed Tree (\(G'\)). Given two distinct leafsets \(I_u\) and \(I_w\) of \(G^k\) connected to the internal nodes z and v, such that \(\langle z,v \rangle\) is not a twoset internal node, let \(l_1 \in L(I_u)\) and \(l_2 \in L(I_w)\). Let \(x^k =lca_{G^k}(I_u,I_w)\) and \(x = lca_{G'}(l_1,l_2)\). Then \(ev_{G^k}(x^k) = ev_{G'}(x)\).
Proof
Let \(lca_{G^k}(I_u,I_w) = x^k\). Assume first that \(I_u\) and \(I_w\) are connected to the same internal node of \(G^k\) (which must be \(x^k\)). Then when \(x^k\) is defined by Algorithm Compressed Tree (\(G'\)), its event is the same as that of the corresponding node x of \(G'\). Assume now that \(I_u\) and \(I_w\) are connected to different internal nodes of \(G^k\), \(u^k\) and \(w^k\), respectively, corresponding to nodes u and w of \(G'\). Consider \(x=lca_{G'}(u,w)\); then Algorithm Compressed Tree (\(G'\)) defines a corresponding node \(x^k\) in \(G^k\) such that \(ev_{G^k}(x^k) = ev_{G'}(x)\).
Assume that \(x^k\) belongs to a twoset internal node \(\langle z,v \rangle\). Then, by construction, one of \(I_u\), \(I_w\) (w.l.o.g. \(I_u\)) must be a leafset which is a child of \(x^k\), and the other (\(I_w\)) is a leafset connected to a strict descendant c of \(x^k\), such that \(c \notin \{ z,v \}\). Let \(y = lca_{G'}(l_1,l_2)\), for a leaf \(l_1\) in \(L(I_u)\) and a leaf \(l_2\) in \(L(I_w)\). By construction, \(l_1 \in I_u\) only if \(ev_{G^k}(x^k)=ev_{G'}(y)\), thus concluding the proof. \(\square\)
Lemmas 7 and 8 imply that Algorithm Compressed Tree (\(G'\)) constructs a compressed gene tree \(G^k\), as by construction the leafsets induce a partition of \({\Gamma }\).
Next, we show a lower bound on the agreement value of an optimal assignment of leaves to the leafsets \(I_v\). We denote by \(A(G^k,R)\) the agreement between R and \(G^k\), counted over each pair of leaves \(l_1, l_2 \in {\Gamma }\) that belong to two distinct leafsets \(I_u\) and \(I_w\) of \(G^k\) connected to the internal nodes u and v, respectively, such that \(\langle u,v \rangle\) is not a twoset internal node (notice that u and v may be the same node).
Lemma 9
Given an optimal solution \(G^*\) of MaxEC over instance (R, S) and a constant \(k>0\), let \(G^k\) be the compressed tree computed starting from \(G^*\). Then \(A(G^k,R) \ge A(G^*,R) - \frac{64n^2}{k}\).
Proof
Consider an optimal solution \(G^*\) of MaxEC over instance (R, S) and the compressed tree \(G^k\) constructed from \(G^*\). From Lemma 8, the pairs of leaves that belong to different leafsets (not connected to the same twoset internal node) have the same relations in \(G^*\) and in \(G^k\).
Consider the leaves of the same leafset \(I_v\) or of two leafsets \(I_w\) and \(I_u\) which are connected to the same twoset internal node. Since \(|I_v| \le \frac{8n}{k}\) and \(|I_w \cup I_u| \le \frac{8n}{k}\), the number of relations between two leaves belonging to a common leafset is at most \(\frac{64n^2}{k^2}\). Since there are at most k leafsets, the overall number of relations between pairs of leaves on which \(G^k\) may differ from \(G^*\) is at most \(\frac{64n^2}{k}\), hence \(A(G^k,R) \ge A(G^*,R) - \frac{64n^2}{k}\). \(\square\)
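The counting step can be spelled out as follows, reading the bound \(\frac{64n^2}{k^2}\) as \(\left( \frac{8n}{k}\right) ^2\), a loose upper bound on the number of pairs within a group of at most 8n / k leaves:
$$\begin{aligned} k \cdot \left( \frac{8n}{k}\right) ^2 = \frac{64n^2}{k} \end{aligned}$$
For instance, \(k = 6400\) bounds the loss by \(n^2/100\).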
The unlabeled compressed tree \(T^k\)
The tree \(G^k\) described above is of course not known, and it needs to be found. In this subsection we introduce the unlabeled compressed tree \(T^k\) that is used to construct the compressed tree \(G^k\). An unlabeled compressed tree \(T^k\) is a compressed tree whose leafsets are empty. Here we introduce some properties of \(T^k\) and we reduce the MaxEC problem to a second problem, called MaxLA (to be defined later). The PTAS iterates through the possible unlabeled compressed trees \(T^k\). In particular, the PTAS iterates through (1) the structure of \(T^k\), (2) the events associated with internal nodes of \(T^k\), and (3) a set of labels that are allowed to be assigned to a leafset.
First, consider the structure of \(T^k\). Since by Lemma 7 \(G^k\) consists of at most k internal nodes and k leafsets, it follows that there are at most \(2^{4k^2}\) possible topologies for the unlabeled compressed tree \(T^k\). Indeed, \(T^k\) has at most 2k nodes, so its adjacency matrix has at most \(4k^2\) entries, and there are at most \(2^{4k^2}\) possible adjacency matrices. Moreover, for each topology, we define in time \(O(2^k)\) the twoset internal nodes of \(T^k\).
Now, consider the events associated with the internal nodes of \(T^k\). For each unlabeled compressed tree \(T^k\), there are at most \(2^k\) possible assignments of events to the internal nodes of \(T^k\) (two possible cases, Dup or Spec, for each of the k internal nodes). Overall we iterate through \(O(2^{4k^2})\) possible unlabeled compressed trees \(T^k\).
Consider now an unlabeled compressed tree \(T^k\). In order to ensure that the gene tree \(G'\) constructed from \(T^k\) is S-consistent with the given species tree S, we must ensure that the speciation nodes of \(G'\) are consistent with S. We define a mapping \(s_{T^k}\) of the nodes of \(T^k\), except the leaf nodes connected to twoset internal nodes, to the nodes of S so that the mapping is feasible, that is, the following conditions hold:

if v is an ancestor of u in \(T^k\), then \(s_{T^k}(v)\) is an ancestor (not necessarily proper) of \(s_{T^k}(u)\)

if v is an ancestor of u in \(T^k\) and \(ev_{T^k}(v)=Spec\), then \(s_{T^k}(v)\) is a proper ancestor of \(s_{T^k}(u)\)
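The two feasibility conditions can be checked mechanically. The sketch below is illustrative only: it assumes both trees are given as child-to-parent dictionaries, that events are stored as the strings "Dup" and "Spec", and that nodes left out of the mapping are simply skipped. The names `ancestors` and `is_feasible` are hypothetical.

```python
def ancestors(parent, node):
    """Return the proper ancestors of `node`, nearest first."""
    result = []
    while node in parent:
        node = parent[node]
        result.append(node)
    return result


def is_feasible(tk_parent, events, s_parent, mapping):
    """Check the two feasibility conditions for a mapping from T^k to S.

    tk_parent / s_parent: child -> parent dictionaries (roots are absent keys).
    events: internal node of T^k -> "Dup" or "Spec".
    mapping: node of T^k -> node of S; unmapped nodes are skipped.
    """
    for u in mapping:
        for v in ancestors(tk_parent, u):
            if v not in mapping:
                continue
            su, sv = mapping[u], mapping[v]
            proper = sv in ancestors(s_parent, su)
            if events.get(v) == "Spec":
                if not proper:  # a speciation must map strictly above
                    return False
            elif not (proper or sv == su):
                return False
    return True


# Hypothetical toy instance: S has root r with children a and b;
# T^k has root x with two leafsets I1 and I2.
s_parent = {"a": "r", "b": "r"}
tk_parent = {"I1": "x", "I2": "x"}
good = {"x": "r", "I1": "a", "I2": "b"}
bad = {"x": "r", "I1": "r", "I2": "b"}
```

With x labeled Spec, `good` is feasible and `bad` is not; with x labeled Dup, `bad` becomes feasible because the ancestor relation need not be proper.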
Based on the mapping \(s_{T^k}\), define for each leafset \(I_v\), the allowed set \(A(I_v)\) of labels that can be assigned to a leafset \(I_v\). If \(I_v\) is a leafset not connected to a twoset internal node:
$$\begin{aligned} A(I_v)=\{ l{:}l \in {\mathcal{L}} (S_x) {\text { with }} x = s_{T^k}(I_v) \} \end{aligned}$$
If \(I_v\) is a leafset connected to an internal node u, with \(\langle u,w \rangle\) a twoset internal node (recall that \(ev_{T^k}(u)=Dup\)):
$$\begin{aligned} A(I_v)=\{ l: l \in {\mathcal{L}} (S_x) {\text { with }} x = s_{T^k}(u) \} \end{aligned}$$
If \(I_v\) is a leafset connected to a twoset internal node u, with \(\langle w,u \rangle\) a twoset internal node (recall that \(ev_{T^k}(u)=Spec\)), such that z is the only child of u in \(T^k\) which is an internal node:
$$\begin{aligned} A(I_v)=\{ l: l \in {\mathcal{L}} (S_x) \setminus {\mathcal{L}} (S_y), {\text { with }} x = s_{T^k}(u) {\text { and }} y = s_{T^k}(z) \} \end{aligned}$$
Since \(T^k\) contains at most 2k nodes, the number of feasible mappings \(s_{T^k}\) is \(O(n^{2k})\). Moreover, once the mapping \(s_{T^k}\) is computed, \(A(I_v)\) can be computed in O(nk) time.
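As an illustration of the three cases, the allowed sets can be derived from the leaf sets of subtrees of S. The sketch below assumes S is given as a node-to-children dictionary and that labels are identified with the leaves of S; the helper names `leaves_below` and `allowed_set` are hypothetical.

```python
def leaves_below(children, node):
    """Return the set of leaves of the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for child in kids:
        result |= leaves_below(children, child)
    return result


def allowed_set(s_children, x, y=None):
    """Leaves of S_x, minus the leaves of S_y when y is given.

    The first two cases of the definition correspond to y=None; the third
    case (leafset below the Spec node of a twoset internal node) passes y.
    """
    allowed = leaves_below(s_children, x)
    if y is not None:
        allowed -= leaves_below(s_children, y)
    return allowed


# Hypothetical species tree: r -> (a, m), m -> (b, c).
s_children = {"r": ["a", "m"], "m": ["b", "c"]}
all_labels = allowed_set(s_children, "r")
outside_m = allowed_set(s_children, "r", "m")
```

Here `all_labels` contains a, b and c, while `outside_m` keeps only a, the single leaf of \(S_r\) that is not in \(S_m\).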
Finally, for each leafset \(I_v\), we assign one leaf (denoted by \(P(I_v)\)) of \(A(I_v)\) to \(I_v\), in time \(O(n^k)\). These leaves are called preassigned leaves and are assigned such that for each internal node x of \(T^k\), the lca mapping of the preassigned leaves maps x to a node y of S such that \(y=s_{T^k}(x)\). Notice that, given an optimal solution of MaxEC, there exists a feasible mapping with associated \(A(I_v)\) and \(P(I_v)\).
Now, we are able to define the MaxLA problem that we will solve to obtain the PTAS.
Maximum leaf assignment (MaxLA)
Input:

an unlabeled compressed tree \(T^k\) with a feasible mapping \(s_{T^k}\), a set of preassigned leaves \(P(I_v)\) and a set \(A(I_v)\) for each leafset \(I_v\), a set \({\Gamma }\), a relation graph R, a species tree S;
Output:

a compressed tree \(G^k\) obtained from \(T^k\) by assigning leaves of \({\Gamma }\) to the leafsets of \(T^k\), where for each leafset \(I_v\) only leaves of \(A(I_v)\) are assigned to \(I_v\), such that \(A(G^k,R)\) is maximized and each speciation node of \(G^k\) is S-consistent.
By Lemma 9, it follows that an optimal solution of MaxLA has an agreement value of at least \(A(G^*,R) - \frac{64n^2}{k}\), where \(G^*\) is the optimal solution of MaxEC.
A PTAS of MaxLA by smooth polynomial integer programming
Now, we present a PTAS for MaxLA. Consider an unlabeled compressed tree \(T^k\), with the corresponding allowed sets \(A(I_v)\) and preassigned leaves \(P(I_v)\). We start by introducing the smooth polynomial integer programming technique [22].
A polynomial having degree c is called q-smooth, for a constant \(q>0\), if the coefficient of each degree-\(\ell\) monomial belongs to the interval \([-qn^{c - \ell }, qn^{c - \ell }]\), for each \(\ell\) with \(1 \le \ell \le c\).
First, we define some constants:

given a leafset \(I_v\) of \(T^k\) and \(l \in {\Gamma }\), \(a(I_v,l)=1\) if \(l \in A(I_v)\) and 0 otherwise

given two leafsets \(I_v\), \(I_w\) of \(T^k\), \(r(I_v,I_w)\) is equal to 1 if \(lca_{T^k}(I_v,I_w)\) is a speciation, else (if \(lca_{T^k}(I_v,I_w)\) is a duplication) \(r(I_v,I_w)\) is equal to 0

given two leafsets \(I_v\), \(I_w\) of \(T^k\), \(t(I_v,I_w)\) is a constant equal to 0 if \(I_v\) and \(I_w\) are connected to the same twoset internal node, else it is equal to 1

given \(l_1, l_2 \in {\Gamma }\), \(e(l_1,l_2)=1\) if \(l_1 l_2 \in E(R)\) and \(e(l_1,l_2)=0\) otherwise
For each leafset \(I_v\) of \(T^k\) and each leaf \(l \in {\Gamma }\), define a variable \(x_{I_v,l}\) that has value 1 if l is assigned to \(I_v\), else is 0 (notice that \(x_{I_v,l}=1\) if l is a leaf preassigned to \(I_v\)). Given \(l_1, l_2 \in {\Gamma }\), define
$$\begin{aligned} p(l_1,l_2)&= \sum _{I_v \ne I_w} \Big[ x_{I_v,l_1} a(I_v,l_1) x_{I_w,l_2} a(I_w,l_2) r(I_v,I_w) e(l_1,l_2) t(I_v,I_w)\\ &\quad + x_{I_v,l_1} a(I_v,l_1)x_{I_w,l_2} a(I_w,l_2)(1-r(I_v,I_w)) (1-e(l_1,l_2))t(I_v,I_w) \Big] \end{aligned}$$
Now, assume that \(x_{I_v,l_1}=1\) and \(x_{I_w,l_2}=1\), where \(l_1 \in A(I_v)\), \(l_2 \in A(I_w)\), and \(I_v\), \(I_w\) are not connected to the same twoset internal node (hence \(t(I_v,I_w)=1\)); it holds that \(p(l_1,l_2)=1\) if and only if (1) the lca of \(I_v\) and \(I_w\) is a speciation (hence \(r(I_v,I_w)=1\)) and \(l_1\) and \(l_2\) are connected by an edge in R (hence \(e(l_1,l_2)=1\)) or (2) the lca of \(I_v\) and \(I_w\) is a duplication (hence \(r(I_v,I_w)=0\)) and there is no edge between \(l_1\) and \(l_2\) in R (hence \(e(l_1,l_2)=0\)).
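To make the case analysis concrete, the following sketch evaluates the objective for a fixed 0–1 assignment. It is an illustration under stated assumptions, not the integer program solver: leafsets and genes are plain strings, each unordered pair of leaves is counted once, and the names `pair_score`, `objective`, `spec_lca` and `same_twoset` are hypothetical.

```python
from itertools import combinations


def pair_score(iv, iw, l1, l2, allowed, spec_lca, same_twoset, edges):
    """Value contributed by l1 placed in leafset iv and l2 placed in leafset iw."""
    key = frozenset((iv, iw))
    a1 = l1 in allowed[iv]             # a(I_v, l_1)
    a2 = l2 in allowed[iw]             # a(I_w, l_2)
    t = key not in same_twoset         # t(I_v, I_w)
    r = spec_lca[key]                  # r(I_v, I_w): True if the lca is Spec
    e = frozenset((l1, l2)) in edges   # e(l_1, l_2)
    if not (a1 and a2 and t):
        return 0
    # r*e + (1-r)*(1-e) equals 1 exactly when r and e coincide.
    return int(r == e)


def objective(assignment, allowed, spec_lca, same_twoset, edges):
    """Sum pair_score over unordered pairs of leaves in distinct leafsets."""
    total = 0
    for l1, l2 in combinations(sorted(assignment), 2):
        iv, iw = assignment[l1], assignment[l2]
        if iv != iw:
            total += pair_score(iv, iw, l1, l2, allowed, spec_lca, same_twoset, edges)
    return total


# Hypothetical instance: three leafsets, one gene in each.
genes = {"g1", "g2", "g3"}
allowed = {"A": genes, "B": genes, "C": genes}
spec_lca = {
    frozenset(("A", "B")): True,
    frozenset(("A", "C")): False,
    frozenset(("B", "C")): False,
}
assignment = {"g1": "A", "g2": "B", "g3": "C"}
score_match = objective(assignment, allowed, spec_lca, set(), {frozenset(("g1", "g2"))})
score_clash = objective(assignment, allowed, spec_lca, set(), {frozenset(("g1", "g3"))})
```

In the first call the only edge of R joins the two genes whose leafsets have a speciation as lca, so all three pairs agree; in the second call the edge sits on a duplication pair and only one pair agrees.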
Finally, define p(x) as follows:
$$\begin{aligned} p(x)= \sum _{l_1,l_2 \in {\Gamma }} p(l_1,l_2) \end{aligned}$$
The polynomial integer program is defined as follows:
$$\begin{aligned} p(x) {\text { is maximized}} \end{aligned}$$
$$\begin{aligned} \sum _{v} x_{I_v,l}=1\quad \forall l\in {\Gamma } \end{aligned}$$
$$\begin{aligned} \sum _{l} x_{I_v,l} \le 8n/k\quad \forall I_v \end{aligned}$$
The polynomial p(x) is 1-smooth.
Consider a solution of the smooth polynomial integer program, given the correct unlabeled compressed tree \(T^k\), the correct allowed sets \(A(I_v)\) and the correct sets of preassigned leaves \(P(I_v)\). For each \(\varepsilon > 0\), there is a polynomial time algorithm that produces a 0–1 assignment x to the leafsets of \(T^k\) (hence a compressed tree \(G^k\)), such that \(p(x)\ge OPT - \varepsilon n^2\), where OPT is the maximum value of the smooth polynomial integer program [22, 23].
Now, consider the labels assigned to different sets \(I_{v}\). By Lemma 9, we have that the agreement between \(G^k\) and R is at least \(\frac{n^2}{8} - \frac{64n^2}{k}\). By Lemma 6, \(A(G^*,R) \ge \frac{n^2}{8}\), where \(G^*\) is an optimal solution of MaxEC, hence it holds that
$$\begin{aligned} A(G^k,R) \ge A(G^*,R) - \varepsilon n^2 - \frac{64n^2}{k}= A(G^*,R) \left( 1 - \frac{\varepsilon }{c} - \frac{64}{ck}\right) \end{aligned}$$
where \(c \ge 1{/}8\) is such that \(A(G^*,R) = cn^2\). By choosing \(\varepsilon\) sufficiently small, and k sufficiently large, the PTAS for MaxLA follows.
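One way to pick the two parameters for a target relative error \(\delta\) is sketched below. The specific values \(\varepsilon = \delta /16\) and \(k = \lceil 1024/\delta \rceil\) are an illustrative choice, not taken from the original text: they make the total additive loss \(\varepsilon n^2 + 64n^2/k\) at most \(\delta n^2/8\), which is at most \(\delta \, A(G^*,R)\) by Lemma 6. The function name `ptas_parameters` is hypothetical.

```python
import math


def ptas_parameters(delta):
    """Return (epsilon, k) with epsilon + 64/k <= delta/8.

    The additive loss epsilon*n^2 + 64*n^2/k is then at most delta*n^2/8,
    i.e. at most a delta fraction of any agreement value >= n^2/8.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    epsilon = delta / 16          # half of the budget delta/8
    k = math.ceil(1024 / delta)   # ensures 64/k <= delta/16
    return epsilon, k


epsilon, k = ptas_parameters(0.5)
```

For \(\delta = 0.5\) this gives \(\varepsilon = 0.03125\) and \(k = 2048\), so each of the two loss terms equals \(n^2/32\).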
It remains to show that, starting from a solution \(G^k\) of MaxLA, it is possible to construct in polynomial time a gene tree \(G'\) such that \(G'\) is S-consistent and has an agreement value not smaller than that of \(G^k\).
Building a feasible solution
Consider a compressed tree \(G^k\) returned by the smooth polynomial integer programming. Next we show how to reconstruct a gene tree \(G'\) which is consistent with S.
First, we consider only the set \({\Gamma }'\) of leaves \(l \in {\Gamma }\) that are assigned to a leafset \(I_v\), with \(l \in A(I_v)\). Notice indeed that if a leaf \(l_1\) is assigned to a leafset \(I_v\) with \(l_1 \notin A(I_v)\), then it gives a contribution of 0 to the smooth polynomial integer program, as \(a(I_v,l_1)=0\), hence \(p(l_1,l_2)=0\) for each other leaf \(l_2 \in {\Gamma }\). In this case, we construct a gene tree \(G'\) only for the set of leaves \({\Gamma }'\), then we construct a new gene tree \(G^*\) by joining \(G'\) and a subtree \(G''\) over the leaf set \({\Gamma }\setminus {\Gamma }'\) such that the internal nodes of \(G''\) and the root of \(G^*\) are all associated with a duplication.
We focus now on the set of labels \({\Gamma }'\) and assume that no leaf l is assigned to a leafset \(I_v\) such that \(l \notin A(I_v)\). Starting from \(G^k\) we construct in polynomial time the corresponding gene tree \(G'\). \(G'\) is computed by replacing each leafset \(I_v\) of \(G^k\) with a subtree labeled by the set \(L(I_v)\) of leaves that belong to \(I_v\) (see Fig. 4).
Consider the tree \(G^k\), a leafset \(I_z\) of \(G^k\) connected to a node u of \(G^k\) and the set \(L(I_z)\) of leaves assigned to \(I_z\). We replace \(I_z\) by a subtree \(T'\) isomorphic to \(S|L(I_z)\); each internal node of \(T'\) is labeled as Dup. Notice that the root of \(T'\) is connected to u.
As a last step, if \(d>1\) copies of a label l belong to a leafset \(I_v\), then we construct a subtree with d leaves all labeled by l, whose internal nodes are all associated with duplications.
We prove that the constructed gene tree \(G'\) is S-consistent.
Lemma 10
The tree \(G'\) computed starting from \(G^k\) is S-consistent.
Proof
In order to ensure the S-consistency of \(G'\), we must prove that for each node \(v'\) of \(G'\) with \(ev_{G'}(v')=Spec\), each child of \(v'\) is mapped to a proper descendant of \(s_{G'}(v')\).
Consider a node \(v'\) of \(G'\) corresponding to an internal node v of \(G^k\) such that \(ev_{G'}(v')=Spec\) and \(ev_{G^k}(v)=Spec\) and v is not part of a twoset internal node. We claim that \(v'\) represents a speciation with respect to the species tree S. Let \(ch(v')\) be the set of children of \(v'\). Assume that \(s_{G'}(v') = x'\), and that \(s_{G'}(w') = x\), for some \(w' \in ch(v')\). We show that x is a proper descendant of \(x'\). Assume to the contrary that x and \(x'\) are the same node. We claim that there exists a leaf l that is assigned to \(I_z\), with \(l \notin A(I_z)\), for some leafset \(I_z\) of \(G^k_{w}\), where w is the node of \(G^k\) corresponding to \(w'\). If the claim holds, then by construction \(a(I_z,l)=0\) and this would contradict our earlier remark on such leaves not belonging to \(\Gamma '\).
Hence, we must prove the claim: if x and \(x'\) are the same node of S, then there exists a leaf \(l \in {\Gamma }\) and a leafset \(I_z\) in \(G^k_w\), such that l is assigned to \(I_z\), with \(l \notin A(I_z)\). Assume that this is not the case. Since v is a speciation in \(G^k\), it follows that the preassigned leaves define a mapping \(s_{G^k}\) of v and w to two different nodes of S. Let \(s_{G^k}(w)=y\), where y is a proper descendant of \(x'\). Since \(s_{G'}(v') = s_{G'}(w')\), it follows that there exists a leaf l of \({\Gamma }\) not in \({\mathcal{L}} (S_{y})\) that is assigned to a leafset \(I_z\) in \(G^k_{w}\), otherwise \(w'\) would be mapped to y. Hence the claim holds.
Consider now the case that v belongs to a twoset internal node \(\langle u,v \rangle\). Since \(\langle u,v \rangle\) is a twoset internal node, \(ev_{G^k}(v)=Spec\) and \(ev_{G^k}(u)=Dup\). Moreover, let \(I_z\) be the leafset connected to v. Let \(z'\) be the root of the subtree of \(G'\) isomorphic to \(S|L(I_z)\) that replaced the leafset \(I_z\). Note that \(z'\) is a child of \(v'\). Let \(q'\) be the other child of \(v'\), and let q be the node of \(G^k\) corresponding to \(q'\).
Similarly to the previous case, if \(s_{G'}(v') = s_{G'}(z')\) or \(s_{G'}(v') = s_{G'}(q')\), then we claim that there exists a leaf l that is assigned to \(I_w\) with \(l \notin A(I_w)\) for either \(I_w = I_z\) or for some leafset \(I_w\) in \(G^k_q\). In order to prove the claim, first notice that, by definition, the set \(A(I_z)\) contains only leaves of \({\mathcal{L}} (S_x) \setminus {\mathcal{L}} (S_y)\), where x and y are the nodes of S to which v and q are mapped by \(s_{G^k}\). Therefore, if \(s_{G'}(z')\) is not a proper descendant of x, there must be a leaf \(l \notin A(I_z)\) assigned to \(I_z\). Similarly, if \(s_{G'}(q')\) is not a proper descendant of x, then, because \(s_{G^k}(q) = y\) is a proper descendant of x, there must be a leaf l assigned to some \(I_w\) in \(G^k_q\) such that \(l \notin A(I_w)\) (otherwise, \(q'\) would be mapped to y). We can conclude that the lemma holds. \(\square\)