When only the duplication cost is considered, the DLC optimization problem, DLCOP, can be approximated arbitrarily well using the polynomial-time approximation scheme (PTAS) for Multicut in binary trees [4], since duplications correspond exactly to removed edges in the Multicut problem. However, we now show that DLCOP has no PTAS in general unless P = NP. Specifically, we show that DLCOP is APX-hard when both duplications and losses are considered. We establish this result by a polynomial-time reduction from max3sat(b), whose instances are Boolean formulas in 3-CNF in which each variable appears at most B times in the clauses. Arora [9] showed that, for some \(\epsilon\), \(0< \epsilon < 1\), there exists a constant value of B (\(B = 13\)) and a polynomial-time reduction from any NP-complete problem \(\Pi\) to max3sat(b) that maps yes instances of \(\Pi\) to satisfiable instances of max3sat(b) and no instances of \(\Pi\) to instances of max3sat(b) in which fewer than a \(1-\epsilon\) fraction of the clauses can be satisfied simultaneously.
Our reduction maps an instance of max3sat(b) with n clauses (for sufficiently large values of n) to an instance of DLCOP and a parameter b such that the optimal solution to the DLCOP instance is less than b if the max3sat(b) instance is satisfiable and more than \((1+\alpha )b\) if at most \((1-\epsilon )n\) clauses can be satisfied, for some constant \(\alpha >0\). If a polynomial-time \((1+\alpha )\)-approximation algorithm exists for DLCOP, we can apply our gap-preserving reduction to generate a DLCOP instance from the max3sat(b) instance and then run the putative approximation algorithm to distinguish between satisfiable and at most \((1-\epsilon )\)-satisfiable instances of max3sat(b). Thus, the existence of a polynomial-time \((1+\alpha )\)-approximation algorithm for DLCOP implies that P = NP, and the approximation-hardness of DLCOP follows.
Reduction
Given an instance of max3sat(b) comprising m variables and n clauses, we construct an instance of DLCOP comprising a gene tree, a species tree, a leaf map, and event costs. The reduction is based on the NP-hardness reduction in the previous section but introduces more complex gadgetry and uses nonzero cost for loss events.
Thorn gadget
An \(\ell\)-thorn gadget, depicted in Fig. 6, is a binary tree with \(\ell\) leaves constructed as follows: let the root node be \(u_1\). Each node \(u_i\) has two children: internal node \(u_{i+1}\) and leaf \(t_i\), \(1 \le i \le \ell -2\). Node \(u_{\ell - 1}\) has two leaf children \(t_{\ell -1}\) and \(t_{\ell }\). Leaf \(t_{\ell }\) is denoted the end tip of the thorn gadget.
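The thorn construction can be sketched in a few lines of Python. This is only an illustration: the nested-tuple representation and the helper names `thorn_gadget` and `leaves` are assumptions made here, not part of the reduction itself.

```python
def thorn_gadget(num_leaves):
    """Build an l-thorn gadget as nested 2-tuples; leaves are the strings t1..tl."""
    if num_leaves < 2:
        raise ValueError("a thorn gadget needs at least two leaves")
    # u_{l-1} has the two leaf children t_{l-1} and t_l (the end tip).
    node = (f"t{num_leaves - 1}", f"t{num_leaves}")
    # Walk back up the spine: u_i has children u_{i+1} and t_i.
    for i in range(num_leaves - 2, 0, -1):
        node = (node, f"t{i}")
    return node  # this is u_1, the root


def leaves(tree):
    """Collect leaf labels iteratively (avoids deep recursion on long thorns)."""
    out, stack = [], [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            out.append(node)
        else:
            stack.extend(node)
    return out
```

For instance, `thorn_gadget(4)` yields a caterpillar tree whose leaf set is `t1` through `t4`, with `t4` as the end tip.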
Variable gadgets
Let B(i) and \(\overline{B}(i)\) denote the number of occurrences of literals \(x_i\) and \(\overline{x}_i\), respectively. The variable gadget for variable \(x_i\), illustrated in Fig. 7, consists of a root node, \(\alpha _i\), and two subtrees, one for each of the two literals of this variable. The left subtree has root \(\beta _i\), with two children: internal node \(\beta _i'\) and leaf \(y_i\). In turn, \(\beta _i'\) has two children: internal node \(\beta _{i,1}\) and leaf \(y'_i\). Each node \(\beta _{i, q}\), \(1 \le q \le B(i)-2\), has a child \(\beta _{i, q+1}\) and a second child which is the root of an \((n^2-1)\)-thorn gadget with end tip \(y_{i, q}\). Node \(\beta _{i, B(i)-1}\) has two children, each of which is the root of an \((n^2-1)\)-thorn gadget. The end tips of these thorn gadgets are labeled \(y_{i, B(i)-1}\) and \(y_{i, B(i)}\). This construction introduces a distinct \((n^2-1)\)-thorn gadget for each occurrence of \(x_i\) in the 3SAT instance. We refer to the thorn gadget terminating at end tip \(y_{i, q}\) as the thorn gadget for the qth occurrence of \(x_i\). The right subtree of \(\alpha _i\), representing literal \(\overline{x}_i\), is symmetric to the left subtree, but with \(\beta _i\) and \(\beta '_i\) replaced with \(\overline{\beta }_i\) and \(\overline{\beta }'_i\), respectively, each \(\beta _{i, j}\) replaced by \(\overline{\beta }'_{i, j}\), and each \(y_{i, j}\) replaced by \(\overline{y}_{i, j}\). This construction introduces a distinct \((n^2-1)\)-thorn gadget for each clause containing \(\overline{x}_i\). We refer to the thorn gadget terminating at end tip \(\overline{y}_{i, q}\) as the thorn gadget for the qth occurrence of \(\overline{x}_i\).
Clause gadgets
A clause gadget corresponding to clause \(C_j\), shown in Fig. 8, consists of root node \(\delta _j\) with children \(\delta '_j\) and \(\lambda _{3, j}\). Node \(\delta '_j\) has two children \(\lambda _{1, j}\) and \(\lambda _{2, j}\). Each node \(\lambda _{h, j}\), \(1 \le h \le 3\), is the root of a tree with two children, a leaf \(k_{h, j}\) and a node \(\lambda '_{h, j}\), which in turn has two leaf children \(k'_{h, j}\) and \(k''_{h, j}\).
Gene tree
The gene tree G is constructed as follows: the root of the gene tree is a node \(g_0\) with children \(g_1\) and \(g_2\). Node \(g_1\) is the root of a \((3n-m+1)\)-thorn gadget. Node \(g_2\) is the root of an arbitrary binary subtree with \(n + m\) leaves. Each of the first n of those leaves becomes the root of a clause gadget for clauses \(C_1, \ldots , C_n\) and the remaining m leaves become the roots of m variable gadgets for variables \(x_1, \ldots , x_m\).
Species tree
The species tree, shown in Fig. 9, is rooted at node \(\rho _0\) and is constructed from a path \(\rho _0, \ldots , \rho _{2m}\) followed by \(\sigma _1, \sigma '_1, \ldots , \sigma _n, \sigma '_n\), and finally \(\tau _{1, 1}, \tau _{2, 1}, \tau _{3, 1}, \ldots , \tau _{1, n}, \tau _{2, n}, \tau _{3, n}\). This path is henceforth referred to as the trunk of the tree. Each node \(\rho _i\) has a leaf child \(r_i\), \(1 \le i \le 2m\), and each node \(\sigma _j\) and \(\sigma '_j\) has a leaf child \(s_j\) and \(s'_j\), respectively, \(1 \le j \le n\). Finally, each node \(\tau _{h, j}\), which corresponds to the hth literal in clause \(C_j\), has a child that is the root of an \(n^2\)-thorn with end tip \(t_{h,j}\) (henceforth referred to as the \(n^2\)-thorn for \(\tau _{h, j}\)), \(1 \le h \le 3\), \(1 \le j \le n\). Node \(\tau _{3, n}\) has an additional leaf child so that the tree is binary.
Leaf map and event costs
The leaf map Le is defined as follows:
1. \(Le(y_i)=Le(\overline{y}_i)=r_{2i-1}\) and \(Le(y_i') = Le(\overline{y}_i') = r_{2i}\), \(1 \le i \le m\);
2. \(Le(k_{1,j})=Le(k_{2,j})=Le(k_{3,j})=s_j\) and \(Le(k_{1,j}')=Le(k_{2,j}')=Le(k_{3,j}')=s'_{j}\), \(1 \le j \le n\);
3. Each leaf in the \((3n-m+1)\)-thorn gadget rooted at node \(g_1\) is mapped to \(r_0\);
4. If the hth literal of \(C_j\) is \(x_i\) and this is the qth occurrence of \(x_i\) in the 3SAT instance, then each leaf of the \((n^2-1)\)-thorn gadget for the qth occurrence of \(x_i\) is mapped to the leaf with the same index in the \(n^2\)-thorn gadget for \(\tau _{h, j}\) and \(k''_{h, j}\) is mapped to the end tip, \(t_{h, j}\), of that \(n^2\)-thorn gadget.
5. If the hth literal of \(C_j\) is \(\overline{x}_i\) and this is the qth occurrence of \(\overline{x}_i\) in the 3SAT instance, then each leaf of the \((n^2-1)\)-thorn gadget for the qth occurrence of \(\overline{x}_i\) is mapped to the leaf with the same index in the \(n^2\)-thorn gadget for \(\tau _{h, j}\) and \(k''_{h, j}\) is mapped to the end tip, \(t_{h, j}\), of that \(n^2\)-thorn gadget.
Let the event costs be as follows: \(D=2Bn^2, L=1, C=0\). Finally, note that this reduction can be performed in polynomial time.
Proof of correctness
To prove the correctness of our reduction, we show that:
- If the max3sat(b) instance is satisfiable, the optimal cost of the constructed DLC instance is less than
$$\begin{aligned} b=(10B + 2)n^3 + 121 n^2 \end{aligned}$$
- For sufficiently large n, if at most \((1-\epsilon )n\) clauses of the max3sat(b) instance can be satisfied, the optimal cost is more than \((1+\alpha )b\), where
$$\begin{aligned} \alpha =\frac{\epsilon }{20B+4} \end{aligned}$$
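Together, these two claims give the decision procedure behind the hardness argument. The following is a minimal sketch, assuming that `approx_cost` is the cost returned by a putative polynomial-time \((1+\alpha )\)-approximation algorithm run on the constructed instance; the function names are hypothetical and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def gap_parameters(n, B, eps):
    """Return the threshold b and the gap alpha from the two claims."""
    b = (10 * B + 2) * n**3 + 121 * n**2
    alpha = Fraction(eps) / (20 * B + 4)
    return b, alpha


def looks_satisfiable(approx_cost, n, B, eps):
    """Classify an instance from the cost reported by the assumed approximation.

    Satisfiable: OPT < b, so approx_cost <= (1 + alpha) * OPT < (1 + alpha) * b.
    Otherwise:   approx_cost >= OPT > (1 + alpha) * b.
    """
    b, alpha = gap_parameters(n, B, eps)
    return approx_cost < (1 + alpha) * b
```

Comparing against \((1+\alpha )b\) rather than b is a deliberate choice: an approximate cost may exceed b even when the optimum does not.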
Satisfiable MAX3SAT(B) instances
We first consider a satisfiable instance of max3sat(b). We show how a satisfying valuation can be used to construct a solution to the DLC instance whose cost is less than b.
The species map \(\mathcal {M}\) maps all internal nodes of G to \(\rho _0\) except for \(g_1\) and its descendant \((3n-m+1)\)-thorn gadget which are mapped to \(r_0\); each leaf \(g \in L(G)\) is mapped to \(Le(g)\).
For each variable \(x_i\), we place one duplication in the corresponding variable gadget, on the edge \(e(\overline{\beta }_i)\) if \(x_i\) is assigned true and on the edge \(e(\beta _i)\) if \(x_i\) is assigned false. This ensures that \(y_i\) and \(\overline{y}_i\) are separated and that \(y'_i\) and \(\overline{y}'_i\) are separated, as required by part 1 of the leaf map. For each clause \(C_j\), identify any one literal that satisfies that clause. If the first literal in \(C_j\) satisfies the clause, place duplications on edges \(e(\lambda _{2, j})\) and \(e(\lambda _{3, j})\). If the second literal in \(C_j\) satisfies the clause, place duplications on edges \(e(\lambda _{1, j})\) and \(e(\lambda _{3, j})\). If the third literal in \(C_j\) satisfies the clause, place duplications on edges \(e(\lambda _{1, j})\) and \(e(\lambda _{2, j})\). This placement of two duplications per clause gadget satisfies the constraints implied by part 2 of the leaf map, which requires that each pair of \(k_{1,j}, k_{2,j}, k_{3,j}\) be separated and that each pair of \(k'_{1,j}, k'_{2,j}, k'_{3,j}\) be separated. Thus far, \(m+2n\) duplications have been placed. Finally, we place \(3n-m\) duplications on the terminal edges of the \((3n-m+1)\)-thorn gadget, since all \(3n-m+1\) of its leaves are mapped to \(r_0\) by part 3 of the leaf map and thus each pair of leaves must be separated. Note that parts 4 and 5 of the leaf map do not map multiple gene tree leaves to the same species tree leaf and thus require no additional duplication placements. The total number of duplications is thus \(m+2n+(3n-m)=5n\).
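The placement just described can be sketched in Python. The clause encoding (signed integers, with \(+i\) for \(x_i\) and \(-i\) for \(\overline{x}_i\)) and the edge labels are assumptions made for illustration; they are not notation from the construction.

```python
def place_duplications(num_vars, clauses, assignment):
    """List the edges receiving a duplication for a satisfying assignment.

    clauses: list of 3-tuples of nonzero ints (+i is x_i, -i is its negation).
    assignment: dict mapping i -> bool, assumed to satisfy every clause.
    """
    n, m = len(clauses), num_vars
    edges = []
    # One duplication per variable gadget: e(beta_bar_i) if true, e(beta_i) if false.
    for i in range(1, m + 1):
        edges.append(("beta_bar", i) if assignment[i] else ("beta", i))
    # Two duplications per clause gadget, on the lambda edges of the two
    # literals other than one chosen satisfying literal.
    for j, clause in enumerate(clauses, start=1):
        h_sat = next(h for h, lit in enumerate(clause, start=1)
                     if assignment[abs(lit)] == (lit > 0))
        edges.extend(("lambda", h, j) for h in (1, 2, 3) if h != h_sat)
    # 3n - m duplications on terminal edges of the (3n-m+1)-thorn gadget.
    edges.extend(("thorn_tip", t) for t in range(1, 3 * n - m + 1))
    return edges
```

On any satisfiable instance with \(m \le 3n\), the returned list has exactly \(5n\) entries, matching the count above.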
Next, we count the number of losses. We do this by first counting losses on the \(n^2\)-thorns of the species tree and then on the trunk of the species tree.
Each clause \(C_j\) has three \(n^2\)-thorns in the species tree, one branching from each of \(\tau _{1, j}\), \(\tau _{2, j}\), and \(\tau _{3, j}\). Without loss of generality, assume that clause \(C_j\) is satisfied by its first literal and thus duplications were placed on \(e(\lambda _{2, j})\) and \(e(\lambda _{3, j})\). Also, without loss of generality, assume that the first literal in \(C_j\) is \(x_i\) (the case for \(\overline{x}_i\) is analogous) and that this is the \(q\)th occurrence of \(x_i\) in the 3SAT instance. The duplication on \(e(\lambda _{2, j})\) implies that leaf \(k''_{2, j}\) is mapped to a different locus than all of the leaves of the \((n^2-1)\)-thorn for the \(q\)th occurrence of \(x_i\) in the variable gadget for \(x_i\). Since \(Le(k''_{2, j}) = t_{2, j}\) by part 4 of the leaf map, there is a loss event on each of the \(n^2\) edges terminating at the leaves of the \(n^2\)-thorn gadget for \(\tau _{2, j}\). Similarly, the duplication on edge \(e(\lambda _{3, j})\) incurs \(n^2\) losses in the \(n^2\)-thorn gadget for \(\tau _{3, j}\) for a total of \(2n^2\) losses for clause \(C_j\). Since \(C_j\) is satisfied by \(x_i\), we know that \(x_i =\) true and thus a duplication was placed on edge \(e(\overline{\beta }_i)\) in the variable gadget for \(x_i\). Therefore, there is no duplication placed between \(k''_{1, j}\) and the leaves of the \((n^{2}-1)\)-thorn for the \(q\)th occurrence of \(x_i\) and thus there are no losses incurred on the \(n^2\)-thorn for \(\tau _{1, j}\). Since there are n clauses and each contributes \(2n^2\) losses in the corresponding \(n^2\)-thorns, \(2n^3\) losses are incurred thus far.
We next consider the number of losses incurred on the trunk of the species tree. Since \(\mathcal {M}(g_1) = r_0\), none of the loci created by the \(3n-m\) duplications in the \((3n-m+1)\)-thorn required by part 3 of the leaf map induce loss events. There are \(1+2m+2n+3n\) nodes on the trunk and at most \(m+2n\) loci can be lost on each of the two edges emanating from each such node, since there are only \(m+2n\) other duplications.
Observing that \(m \le 3n\), the total number of losses can thus be bounded from above by
$$\begin{aligned} 2(m+2n)(1+2m+2n+3n)&\le 2\cdot 5n \cdot 12n <121n^2. \end{aligned}$$
Therefore, the total cost of this solution is bounded by
$$\begin{aligned} 5n\cdot 2Bn^2 + (2n^3+121n^2)\cdot 1 = (10B+2)n^3+121n^2 =b. \end{aligned}$$
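The arithmetic of this bound can be checked numerically. The sketch below simply evaluates the formulas from the text for concrete values of n, m, and B; the function names are illustrative assumptions.

```python
def satisfiable_cost_bound(n, m, B):
    """Upper bound on the cost of the solution built from a satisfying valuation."""
    D, L = 2 * B * n**2, 1            # event costs from the reduction (C = 0)
    duplications = 5 * n              # m + 2n + (3n - m)
    thorn_losses = 2 * n**3           # 2n^2 per clause, n clauses
    # At most m + 2n loci lost on each of the two edges below each trunk node.
    trunk_losses = 2 * (m + 2 * n) * (1 + 2 * m + 2 * n + 3 * n)
    return duplications * D + (thorn_losses + trunk_losses) * L


def threshold_b(n, B):
    """The parameter b = (10B + 2) n^3 + 121 n^2."""
    return (10 * B + 2) * n**3 + 121 * n**2
```

For every \(n \ge 1\) and \(1 \le m \le 3n\), `satisfiable_cost_bound(n, m, B)` stays strictly below `threshold_b(n, B)`, because the trunk term is at most \(120n^2\).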
At most \((1-\epsilon )\)-satisfiable MAX3SAT(B) instances
To complete the proof, we show that given an instance of max3sat(b) in which the fraction of satisfiable clauses is at most \((1-\epsilon )\), the optimal cost of the corresponding DLC instance, for sufficiently large n, is greater than:
$$\begin{aligned} (1+\alpha )b&= \left( 1+\frac{\epsilon }{20B+4} \right) \left( (10B+2)n^3+121n^2 \right) \\&= (10B+2)n^3 + \frac{\epsilon }{20B+4}(10B+2)n^3 + \left( 1+\frac{\epsilon }{20B+4} \right) 121n^2 \\&= (10B+2)n^3 + \frac{\epsilon }{2} n^3 + \left( 1+\frac{\epsilon }{20B+4} \right) 121n^2 \\&=\left( 10B+2+\frac{\epsilon }{2} \right) n^3+\left( 1+\frac{\epsilon }{20B+4} \right) 121n^2. \end{aligned}$$
Part 1 of the leaf map requires at least one duplication placement per variable gadget, part 2 of the leaf map requires at least two duplications per clause gadget, and part 3 of the leaf map requires \(3n-m\) duplications to be placed in the \((3n-m+1)\)-thorn gadget. Therefore, all valid duplication placements for this instance use at least \(m + 2n + (3n-m) = 5n\) duplications. We call a solution that uses exactly 5n duplications well-behaved.
A well-behaved solution must use exactly one duplication in each variable gadget. In the variable gadget for \(x_i\), this duplication must be placed on either the edge \(e(\beta _i)\) or the edge \(e(\overline{\beta }_i)\) in order to separate \(y_i\) from \(\overline{y}_i\) and \(y'_i\) from \(\overline{y}'_i\). We interpret a duplication on edge \(e(\beta _i)\) as setting variable \(x_i\) to false and a duplication on edge \(e(\overline{\beta }_i)\) as setting \(x_i\) to true. Thus, a well-behaved solution to the DLC Optimization Problem has a corresponding valuation of the variables in the 3SAT instance.
We now show that all optimal solutions to the DLC Optimization Problem are necessarily well-behaved. Consider a solution for our constructed DLC instance that is not well-behaved and thus comprises more than 5n duplications. A duplication placed outside of a variable, clause, or \((3n-m+1)\)-thorn gadget cannot satisfy any of the duplication requirements imposed by the leaf map and thus can be removed, reducing the number of duplications and not increasing the number of losses.
If a variable gadget for \(x_i\) contains more than one duplication, we may replace all duplications in that variable gadget with a single duplication on edge \(e(\beta _i) = (\alpha _i, \beta _i)\), which satisfies the duplication requirements of the leaf map and reduces the number of duplications by at least one. Introducing a new duplication may increase the number of losses. However, since each variable \(x_i\) appears in at most B clauses in the max3sat(b) instance, at most \(Bn^2\) new losses are introduced in the at most B \(n^2\)-thorn gadgets involved, plus O(n) new losses at the internal vertices on the trunk of the species tree, and the latter term is dominated by \(Bn^2\) for sufficiently large n. Thus, the total number of new losses incurred is less than \(2Bn^2\) for sufficiently large n and, since \(L=1\) and \(D=2Bn^2\), their cost is less than that of the saved duplication.
Similarly, if a clause gadget for \(C_j\) contains more than two duplications, we can replace them with two duplications on the edges \(e(\lambda _{1,j})\) and \(e(\lambda _{2,j})\). The saving of at least one duplication is larger than the cost of the additional losses.
We have established that an optimal solution to the constructed DLC instance is necessarily well-behaved. Next, observe that any species map must map \(\lambda '_{h, j}\), \(1 \le h \le 3\), \(1 \le j \le n\), to a node v on the trunk of the species tree such that \(v \le _T \tau _{h, j}\) since \(\lambda '_{h, j}\) has children \(k'_{h, j}\) and \(k''_{h, j}\) and \(Le(k'_{h, j}) = s'_j\) while \(Le(k''_{h, j}) = t_{h, j}\).
Consider an optimal solution for the DLC instance. Since this solution is well-behaved, it induces a valuation of the Boolean variables as described above. As noted earlier, if clause \(C_j\) is satisfied by this valuation then a total of \(2n^2\) losses are incurred in two of the three \(n^2\)-thorns for \(\tau _{1, j}\), \(\tau _{2, j}\), and \(\tau _{3, j}\). Conversely, if clause \(C_j\) is not satisfied by this valuation then a total of \(3n^2\) losses are incurred in all three of those \(n^2\)-thorns. To see this, let the \(h\)th literal, \(1 \le h \le 3\), of \(C_j\) be \(x_i\) (analogously, \(\overline{x}_i\)) and let this be the \(q\)th occurrence of this literal in the 3SAT instance. Since \(C_j\) is not satisfied, \(x_i =\) false (analogously, \(\overline{x}_i =\) false) and, therefore, there is a duplication placed on edge \(e(\beta _i)\) (analogously, \(e(\overline{\beta }_i)\)). It follows that the loci of the leaves of the \((n^{2}-1)\)-thorn for the \(q\)th occurrence of \(x_i\) are different from the locus of \(k''_{h, j}\), causing \(n^2\) losses in the \(n^2\)-thorn for \(\tau _{h, j}\) since, as noted above, the path from \(\mathcal {M}(\lambda '_{h, j})\) to \(\mathcal {M}(k''_{h, j}) = t_{h, j}\) passes through every internal node of this thorn gadget. Thus, if \(C_j\) is unsatisfied, its three \(n^2\)-thorns in the species tree contribute \(3n^2\) losses.
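The per-clause accounting can be illustrated with a short Python sketch that counts only the losses on the \(n^2\)-thorns of the species tree for the valuation induced by a well-behaved solution. The signed-integer clause encoding and the function name are assumptions for illustration, and losses on the trunk are deliberately ignored.

```python
def thorn_losses(clauses, assignment):
    """Losses on the n^2-thorns: 2n^2 per satisfied clause, 3n^2 per unsatisfied one.

    clauses: list of 3-tuples of nonzero ints (+i is x_i, -i is its negation).
    assignment: dict mapping i -> bool.
    """
    n = len(clauses)
    satisfied = sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
    return n**2 * (2 * satisfied + 3 * (n - satisfied))
```

Each unsatisfied clause therefore adds \(n^2\) losses on top of the \(2n^3\) baseline, which is the source of the gap.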
We have shown that every satisfied clause contributes \(2n^2\) losses and every unsatisfied clause contributes \(3n^2\) losses. Therefore, a solution with fewer than \(2n^3 + \epsilon n^3\) losses must have fewer than \(\epsilon n\) unsatisfied clauses. Since at least \(\epsilon n\) clauses are unsatisfied by assumption, the cost of a well-behaved solution, and thus of an optimal solution, is at least the following quantity, which exceeds \((1+\alpha )b\) for sufficiently large n:
$$\begin{aligned} 5n(2Bn^2) + 2n^3 + \epsilon n^3&= (10B+2+\epsilon )n^3 \\&> \left( 10B+2+\frac{\epsilon }{2} \right) n^3+\left( 1+\frac{\epsilon }{20B+4} \right) 121n^2\\&= (1+\alpha )b \end{aligned}$$
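One way to make "sufficiently large n" explicit for this last inequality (a worked check under the definitions of b and \(\alpha\) above, not a statement from the original proof) is to subtract the two sides:

$$\begin{aligned} (10B+2+\epsilon )n^3 - (1+\alpha )b&= \frac{\epsilon }{2} n^3 - (1+\alpha )121n^2 \\&> 0 \iff n > \frac{242(1+\alpha )}{\epsilon }. \end{aligned}$$

For example, with \(B = 13\) we have \(\alpha = \epsilon /264\), so the strict inequality holds whenever \(n > 242(1+\epsilon /264)/\epsilon\). Other steps of the proof impose their own lower bounds on n, which this check does not cover.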
\(\Box\)