The approximability of the String Barcoding problem

The String Barcoding (SBC) problem, introduced by Rash and Gusfield (RECOMB, 2002), consists in finding a minimum set of substrings that can be used to distinguish between all members of a set of given strings. In a computational biology context, the given strings represent a set of known viruses, while the substrings can be used as probes for a hybridization experiment via microarray. Ultimately, one aims at classifying new strings (unknown viruses) through the result of the hybridization experiment. In this paper we show that SBC is as hard to approximate as Set Cover. Furthermore, we show that the constrained version of SBC (with probes of bounded length) is also hard to approximate. These negative results are tight.


Background
The following setting was introduced by Rash and Gusfield in [1]. Given a set V of n strings v_1,...,v_n (representing the genomes of n known viruses) and an extra string s (representing a virus that belongs to V but has not yet been classified), we aim at recognizing s as one of the known viruses through a hybridization experiment. In the experiment, we use a set ∏ of k probes (DNA strings) and we are able to determine which of them are contained in s (as substrings) and which are not. The result of the experiment is therefore a binary k-vector (called a barcode in [1]), which can be seen as the signature of s with respect to the given probes. In order for the barcode to discriminate between all the viruses, it must be true that, for each pair of viruses v_i, v_j with 1 ≤ i < j ≤ n, there exists at least one π ∈ ∏ which is a substring of either v_i or v_j but not of both.
This amounts to saying that the barcodes of all the v_i must be distinct binary k-vectors. The cost of the hybridization experiment turns out to be proportional to k, and therefore the goal of the optimization problem, known as Minimum String Barcoding (SBC), is to find a feasible set ∏ of smallest possible cardinality. The problem has been popularized by Rash and Gusfield [1], who proposed an Integer Programming approach for its solution. In [2,3], DasGupta et al. describe a greedy algorithm for robust barcoding (i.e., where each pair of viruses must be distinguished by at least a given number l of probes), which scales well to whole-genome sequences. For real-life instances, this algorithm is more effective than alternative approaches [1,4] whose time complexity grows very quickly with the length of the input sequences.
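The feasibility condition above can be illustrated with a minimal sketch. The function names `barcode` and `is_feasible` and the toy strings are assumptions introduced here for illustration; they do not come from [1].

```python
def barcode(virus, probes):
    """Binary k-vector: entry i is 1 iff probes[i] is a substring of virus."""
    return tuple(int(p in virus) for p in probes)


def is_feasible(viruses, probes):
    """True iff the barcodes of all viruses are pairwise distinct."""
    codes = [barcode(v, probes) for v in viruses]
    return len(set(codes)) == len(codes)


# Toy data (hypothetical, not taken from [1]).
viruses = ["ACGT", "ACGG", "TTGA"]
probes = ["CGT", "TT"]

print([barcode(v, probes) for v in viruses])
print(is_feasible(viruses, probes))
```

Checking that the barcodes are pairwise distinct is equivalent to checking that every pair of viruses is separated by at least one probe, which is the formulation used in the text.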
In [1], Rash and Gusfield stated that a variant of SBC, in which the maximum length of each probe is bounded by a constant and the alphabet size is at least 3, is NP-hard. As for the unconstrained case, where no bound is given on the length of each probe, they left it as an open problem to determine whether this version of SBC is NP-complete. In this paper we prove that both constrained and unconstrained SBC are in fact NP-complete already for binary alphabets. We do so by linking the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. In this way, a sharp log n bound on the best achievable approximation ratio is established for both versions of SBC. It must be said here that essentially the same result has been obtained independently, and already published, by Berman et al. [5]. The inapproximability result in [5] actually holds for a very general family of Minimum Test Collection problems which includes unconstrained SBC as a special case. However, our inapproximability result for constrained SBC is not covered by the general framework proposed in [5]. Note that the very nature of the hybridization experiment requires that the probes not be too long, for technological and biological reasons (such as possible self-hybridization of the probes). Therefore, the bounded-length SBC problem is quite important in practice. In [5] the authors also obtain a (1 + log n)-approximation algorithm for the general Minimum Test Collection problem. Their result is the first improvement over the log n^2 = 2 log n approximation ratio that can essentially be achieved by a standard reduction of Minimum Test Collection to Set Cover followed by a run of the classical set covering greedy algorithm. Thanks to this positive result, all the bounds on the approximability ratios obtained either here or in [5] are tight also in terms of the multiplicative constant of the log n factor.
The (1 + log n)-approximation proposed in [5] is a greedy algorithm in which the choice of the test to be added at each step is driven by a suitable entropy function. The analysis of the algorithm, also given in [5], is an elegant and non-trivial reinterpretation of the celebrated proof by Lovász of the approximation ratio of the greedy algorithm for set cover.
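As an illustration of the standard reduction mentioned above (not of the entropy-based algorithm of [5]), the following sketch treats every unordered pair of elements as an item to be covered, lets each test cover the pairs it separates, and runs the classical greedy rule. The function name `greedy_test_collection` and the toy instance are assumptions made for this example.

```python
from itertools import combinations


def greedy_test_collection(elements, tests):
    """Greedy Set Cover on pairs; returns indices of the chosen tests.

    elements: list of hashable items; tests: list of sets of items.
    """
    # Universe of the Set Cover instance: all unordered pairs of elements.
    uncovered = set(combinations(range(len(elements)), 2))
    # For each test, the pairs on which it succeeds on exactly one element.
    separates = [
        {(i, j) for (i, j) in uncovered
         if (elements[i] in t) != (elements[j] in t)}
        for t in tests
    ]
    chosen = []
    while uncovered:
        # Pick the test separating the most still-unseparated pairs.
        best = max(range(len(tests)),
                   key=lambda q: len(separates[q] & uncovered))
        gain = separates[best] & uncovered
        if not gain:
            raise ValueError("the given tests do not form a testing set")
        chosen.append(best)
        uncovered -= gain
    return chosen


elements = ["a", "b", "c", "d"]
tests = [{"a", "b"}, {"a", "c"}, {"a"}]
chosen = greedy_test_collection(elements, tests)
```

Since the universe has n(n-1)/2 < n^2 pairs, the usual greedy guarantee translates into the roughly 2 log n ratio quoted in the text.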
The remainder of the paper is organized as follows. In the next section, we introduce the Minimum Test Collection problem (MTC), a known NP-complete problem (see, e.g., Garey and Johnson [6]) for which set-cover-like inapproximability results are known [7]. We also introduce a restricted version of MTC and show that the same inapproximability results hold for this restricted version as well. In the following section, we address the computational complexity of SBC and show that the approximation algorithm by Berman, DasGupta and Kao [5] delivers an essentially tight approximation ratio even for constrained SBC. More precisely, in the opening of that section we formally introduce the string barcoding problems studied and point out that every SBC instance (either constrained or unconstrained) can be formulated as an MTC instance, which directly implies set-cover-like approximability results for SBC. We also observe there that the constrained SBC problem, when parameterized by the maximum probe length and the alphabet size, is in FPT and, in particular, can be solved in linear time whenever these parameters are fixed (for a comprehensive treatment of FPT theory, see [8]). Next, we prove set-cover-like inapproximability results for SBC and for the maximum-length version of SBC via a common reduction from the restricted version of MTC introduced in the first section. (The NP-hardness of the maximum-length version of SBC had already been stated in [1], although without a proof.)

A starting problem: the Min Test Collection
In this section we introduce the Minimum Test Collection (MTC) problem, both in its general form and in a restricted version. We also report (and obtain) set-cover-like inapproximability results for MTC and its restricted version. Both the inapproximability of MTC and that of its restricted version will be used in later sections, when characterizing the approximability of the two variants of SBC.
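The MTC problem, as defined in [6], is the following problem.

MTC INSTANCE
A finite set D = {d_1,...,d_p} of elements and a set 𝒯 = {T_1,...,T_q} of tests, where each test T_i is a subset of D.

MTC PROBLEM
Find a minimum-size set 𝒯' ⊆ 𝒯 such that for any pair d, d' ∈ D there is a test T ∈ 𝒯' with |T ∩ {d, d'}| = 1 (i.e., the test fails on one element and succeeds on the other). A set 𝒯' that verifies this property is called a testing set of D; a testing set of minimum size is a minimum testing set of D.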
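To make the definition concrete, the following sketch checks the testing-set property and finds a minimum testing set by exhaustive search. It assumes that each test is given as the set of elements on which it succeeds; the names `is_testing_set` and `minimum_testing_set` and the toy data are hypothetical. The search is exponential in the number of tests and is meant only for very small instances.

```python
from itertools import combinations


def is_testing_set(elements, tests):
    """True iff every pair of elements is separated by at least one test."""
    return all(
        any((d in t) != (e in t) for t in tests)
        for d, e in combinations(elements, 2)
    )


def minimum_testing_set(elements, tests):
    """Smallest sub-collection of tests that is a testing set, or None."""
    for size in range(len(tests) + 1):
        for subset in combinations(tests, size):
            if is_testing_set(elements, subset):
                return list(subset)
    return None


# Toy instance: three diseases and three candidate tests.
D = ["flu", "cold", "measles"]
T = [{"flu", "cold"}, {"flu"}, {"measles"}]
best = minimum_testing_set(D, T)
```

A single test splits the elements into at most two classes, so at least ⌈log_2 p⌉ tests are needed to separate p elements; in the toy instance two tests are therefore necessary.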
The MTC problem appears in many contexts. For example, the elements of D may represent p diseases, and the tests T_i may be diagnostic tests that verify the presence or absence of q symptoms. The goal is to minimize the number of symptoms whose presence or absence must be verified in order to correctly diagnose the disease. In [6], Garey and Johnson proved that MTC is NP-complete by reducing 3-dimensional Matching (3DM), which is NP-complete [9], to it. In [7] it was also proven, by means of a reduction from Set Cover, that no fully polynomial-time approximation scheme exists for MTC unless P = NP. Later in this section we essentially employ this reduction. The same reduction was also reconsidered in [10], where it was shown that MTC is not approximable within (1 - ε) log p for any ε > 0. We now introduce a special type of MTC instances, which we call standard. In this version of the problem, some particular tests must always be part of the problem instance.
In order to define these particular instances, assume the elements in D are ordered as d_1,...,d_p and let D_j = {d_j,...,d_p} for j = 1,...,p. A set of tests 𝒯 is called suffix-closed if T ∩ D_j ∈ 𝒯 for each T ∈ 𝒯 and j = 1,...,p. A suffix-closed set of tests 𝒯 is called standard if D_j ∈ 𝒯 and {d_j} ∈ 𝒯 for each j = 1,...,p. An instance (D, 𝒯) of MTC is standard when 𝒯 is standard. In other words, a standard instance of MTC consists of a finite set D = {d_1,...,d_p} and a set of tests 𝒯 which can be written as [...] {S_1,...,S_q'}: a generic set of subsets of D; [...] Note that 𝒟, ℐ, 𝒜 and ℰ may have non-empty intersection. In other words, where we assume 𝒯 = {T_1,...,T_q} with q = p(q' + 2) and T_i = S_i for i = 1,2,...,q', then it might be the case that T_i = T_j for some i ≠ j.
We now prove the following result.

Theorem 1 Minimum Test Collection (MTC) cannot be approximated within (1 - ε) log p for any ε > 0, even when restricted to standard instances.
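The notion of standard instance used in Theorem 1 can be illustrated by a small sketch. It rests on an assumed reading of the definition: starting from a generic family S_1,...,S_q' of subsets of D, one adds every intersection S_i ∩ D_j, every suffix D_j and every singleton {d_j}. The function name `standard_tests` is hypothetical, and the sketch does not decide how the empty set should be treated when an intersection is empty.

```python
def standard_tests(elements, generic_tests):
    """Return (listed, distinct) tests built from a generic family.

    elements: the ordered list d_1, ..., d_p.
    generic_tests: the generic subsets S_1, ..., S_q'.
    listed keeps repetitions, distinct is the underlying set of tests.
    """
    p = len(elements)
    # Suffixes D_j = {d_j, ..., d_p}.
    suffixes = [frozenset(elements[j:]) for j in range(p)]
    # Singletons {d_j}.
    singletons = [frozenset([d]) for d in elements]
    # Intersections S_i ∩ D_j; for j = 1 this is S_i itself.
    intersections = [frozenset(s) & dj
                     for s in generic_tests for dj in suffixes]
    listed = intersections + suffixes + singletons
    return listed, set(listed)


elements = ["d1", "d2", "d3"]
listed, distinct = standard_tests(elements, [{"d1", "d3"}])
```

Counted with repetitions, the list has p·q' + 2p = p(q' + 2) entries, which agrees with the value of q quoted in the text; the number of distinct tests can be smaller because the families may overlap.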
We prove the above theorem by a reduction from the Set Cover (SC) problem, which is defined ([11]) as follows.

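SC INSTANCE
A finite set S = {s_1,...,s_m} and a collection 𝒞 = {C_1,...,C_n} of subsets of S.

SC PROBLEM
Find a minimum-size collection 𝒞' ⊆ 𝒞 such that every element in S belongs to at least one subset in 𝒞', i.e.

⋃_{C ∈ 𝒞'} C = S.    (1)

We say that any 𝒞' satisfying (1) covers S, and we call such a collection a set cover for S.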
It is well known that SC cannot be approximated within (1 - ε) log m for any ε > 0 (see [12]).
Let S = {s_1,...,s_m} and 𝒞 = {C_1,...,C_n} ⊆ 2^S be an arbitrary instance of SC. We show how to obtain a standard instance of MTC representing the given instance of SC.
First, let K := 2^k be the smallest power of 2 such that K ≥ m.
To each j ∈ {1, 2,..., K}, we associate a unique binary string b(j) of length k. Let R := {r_1,...,r_K} be a set of size K with R ∩ S = ∅. The set of elements D is defined as D = R ∪ S, with a particular order: [...]
The set of tests 𝒯 is constructed in the following way. First, for each i = 1,...,k, we call T_i the test containing all the r_j and s_j such that the bit in position i of the binary string b(j) is set to 1. Then [...]
The following two lemmas investigate the properties of the proposed reduction.
[...] obtain that X ⊆ 𝒞, and we let 𝒞' = X.
We now argue simply that, since 𝒞' ∪ {T_i | i = 1,...,k} is a testing set of D, 𝒞' is a set cover of S. Indeed, no pair in P is distinguished by a set T_i. Therefore, for each j = 1,...,m, the pair {r_j, s_j} is distinguished by some test C ∈ 𝒞'. Moreover, since r_j ∉ T for any T ∈ 𝒞' ⊆ 𝒞, it must be that s_j ∈ C. Therefore, each s_j is covered, and 𝒞' is a set cover of S. □
With Lemmas 1 and 2, we are now ready to prove Theorem 1.
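The bit tests T_i used in the reduction can be sketched as follows. The encoding of b(j) as the k-bit binary expansion of j - 1, the tuple labels ("r", j) and ("s", j), and the function name `bit_tests` are assumptions made for this illustration; any injective assignment of k-bit strings would serve equally well.

```python
def bit_tests(m):
    """Build the tests T_1, ..., T_k for a Set Cover ground set of size m.

    Returns (k, K, tests), where K = 2**k is the smallest power of 2
    with K >= m, and tests[i] contains r_j and s_j whenever bit i of
    b(j) equals 1.
    """
    if m < 2:
        raise ValueError("this sketch assumes m >= 2")
    k = (m - 1).bit_length()
    K = 1 << k

    def b(j):
        # Assumed encoding: k-bit binary expansion of j - 1.
        return format(j - 1, "0{}b".format(k))

    tests = []
    for i in range(k):
        t = {("r", j) for j in range(1, K + 1) if b(j)[i] == "1"}
        t |= {("s", j) for j in range(1, m + 1) if b(j)[i] == "1"}
        tests.append(t)
    return k, K, tests


k, K, tests = bit_tests(5)
```

By construction, r_j and s_j belong to exactly the same tests T_i, so these tests never separate a pair {r_j, s_j}, while elements with different indices are separated by the bit on which their strings differ. This is the property the argument above relies on.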

Proof of Theorem 1:
We first remark that SC is not approximable within (1 - ε) log m even when restricted to instances for which opt = ω(log m). Indeed, just consider duplicating a generic instance of SC into t := ⌊log^2 m⌋ = ω(log m) identical and disjoint copies to obtain a new instance (S*, 𝒞*) with |S*| = tm. Let opt denote the optimum value for the original instance (S, 𝒞) and opt* the optimum value for the instance (S*, 𝒞*). Then opt* = t·opt ≥ t = ω(log|S*|). Notice also that a solution to the instance (S*, 𝒞*) of size at most opt*(1 - ε) log|S*| could be immediately translated into a solution to the instance (S, 𝒞) of size at most opt (1 - ε) log|S*| = opt (1 - ε)(log m + log t). Here, ε > 0 and log t = o(log m), so such a solution would contradict the inapproximability results explicitly derived in [12]. In the analysis to follow we therefore assume that opt = ω(log m).
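The arithmetic behind the duplication argument can be spelled out as follows. This is a sketch under the assumption that the translation simply keeps the copy of (S, 𝒞) that receives the fewest chosen sets, so that its size is at most the average over the t copies.

```latex
\[
\frac{\mathrm{opt}^{*}\,(1-\varepsilon)\log|S^{*}|}{t}
  \;=\; \mathrm{opt}\,(1-\varepsilon)\,\log(tm)
  \;=\; \mathrm{opt}\,(1-\varepsilon)\Bigl(1+\frac{\log t}{\log m}\Bigr)\log m ,
\]
\[
\frac{\log t}{\log m} \;\le\; \frac{2\log\log m}{\log m} \;\longrightarrow\; 0
\quad (m\to\infty).
\]
```

Hence, for m large enough, the translated solution has size at most opt (1 - ε/2) log m, which is the kind of guarantee excluded by [12].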
Denote now by opt and opt' the optimal solution values for the original problem (SC) and the transformed problem (MTC) respectively, and by apx and apx' the values of the respective approximate solutions that we can produce in polynomial time. By Lemma 1, [...] Then, if we assume that we can obtain an approximate solution for the MTC problem, we can also guarantee that [...] Since the proof of Lemma 2 is constructive, we obtain that [...] Notice that p := |D| = K + m < 3m. Consequently, since we know that SC is not approximable within (1 - ε) log m for any ε > 0, we can conclude that MTC is not approximable within (1 - ε) log p for any ε > 0. □

The String Barcoding problems
The following is a formal definition of the String Barcoding problem (SBC):

SBC INSTANCE
An alphabet Σ and a set V = {v_1,...,v_n} of strings over Σ.

SBC PROBLEM
Find a minimum-size set ∏ of strings such that for any pair of strings v, v' ∈ V there is at least one string π ∈ ∏ such that π is a substring of v or v', but not of both. A set ∏ that verifies this property is called a testing set of V; a testing set of minimum size is a minimum testing set of V.
Rash and Gusfield state in [1] that it is unknown whether the basic String Barcoding problem is NP-hard, and they also state that a variant of SBC called Max-length String Barcoding (MLSBC) is NP-hard when the underlying alphabet contains at least three elements. In this variant, a constraint on the maximum length of the substrings in ∏ is specified as part of the input. More formally, MLSBC is the following problem:

MLSBC INSTANCE
An alphabet Σ, a set V = {v_1,...,v_n} of strings over Σ and an integer L.

MLSBC PROBLEM
Find a testing set ∏ of V such that the length of each string π ∈ ∏ is less than or equal to L, and ∏ has smallest possible cardinality among such testing sets.
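Every SBC or MLSBC instance can be viewed as an MTC instance whose elements are the strings in V and whose tests are induced by candidate probes. The sketch below makes this explicit, assuming that only substrings of the given strings need to be considered as probes (a probe occurring in no string separates nothing). The function name `candidate_probe_tests` and the parameter `max_len`, which plays the role of L, are hypothetical.

```python
def candidate_probe_tests(strings, max_len=None):
    """Map each candidate probe to the set of string indices containing it.

    With max_len=None all substrings are candidates (SBC); otherwise only
    substrings of length at most max_len are kept (MLSBC).
    """
    tests = {}
    for v in strings:
        n = len(v)
        for i in range(n):
            top = n if max_len is None else min(n, i + max_len)
            for j in range(i + 1, top + 1):
                probe = v[i:j]
                if probe not in tests:
                    # The test induced by a probe succeeds on the strings
                    # that contain it as a substring.
                    tests[probe] = frozenset(
                        idx for idx, w in enumerate(strings) if probe in w
                    )
    return tests


short = candidate_probe_tests(["ab", "bc"], max_len=1)
full = candidate_probe_tests(["ab", "bc"])
```

The number of candidate probes is at most quadratic in the total length of the input, so the resulting MTC instance has polynomial size; distinct probes that induce the same set of indices correspond to the same test.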
The main point of this paper is to link the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. Indeed, both SBC and MLSBC can be naturally regarded as instances of MTC, for which, in turn, a natural reduction to Set Cover is well known. In the next section we provide reductions for the reverse direction. These reductions will characterize the approximability of SBC and MLSBC from a computational complexity point of view. To better appreciate some aspects of these reductions, we make the following remark.

Inapproximability of SBC and MLSBC
In this subsection we prove the inapproximability of both SBC and MLSBC by means of a common reduction from the restricted form of MTC introduced in the section on the Minimum Test Collection problem.

Theorem 2
The String Barcoding (SBC) problem cannot be approximated within (1 - ε) log n for any ε > 0. This negative result holds already for binary alphabets.
Where Ω is a set of strings, ⊙_{σ∈Ω}(σ) denotes the string obtained as the concatenation of all the strings in Ω lined up in lexicographic order (as a matter of fact, for the purpose of our reduction to work, the strings in Ω could be concatenated in any order, but we prefer to refer to a specific order so that the instance generated through the proposed reduction is uniquely defined).
[...] V = {v_j | j = 1,...,p} such that each string v_j ∈ V contains all the strings in Σ^{2k-1} plus the signatures f ∈ F of those tests T ∈ 𝒯 that succeed on d_j (that is, such that d_j ∈ T). More formally, the codification of an element d_j ∈ D is the string [...] seen as a binary string. Notice that the role of X is to separate the substrings, and that a different number of X characters is used in each string v in order to uniquely identify it when dealing with one of its substrings which includes a whole block of X's. The MLSBC instance is the same as the SBC instance plus the bound L = 10k.
The number and size of the strings constructed above, and hence the above described transformation from an MTC instance to either an SBC instance or an MLSBC instance, is polynomial. With the next two lemmas we show that this is an objective-function preserving reduction from MTC to either SBC or MLSBC, whence Theorems 2 and 3 follow.
Proof: We want to show that, given a testing set ∏ for V, there exists a testing set 𝒯' ⊆ 𝒯 for D with |𝒯'| ≤ |∏|. We actually commit ourselves to showing that for every binary string π ∈ ∏ we can find a T_π ∈ 𝒯 such that, for each j = 1,...,p, the string π occurs as a substring of v_j if and only if d_j ∈ T_π. In following this plan of action, for each π ∈ ∏, we can clearly assume that π is a substring of some v_j ∈ V but not of all. Thus, if π contains a substring of the form 1 0^y 1 for some y > 1, then y is a multiple of 5, that is, y = 5t, and, actually, t = 2k + j with 1 ≤ j ≤ p, in which case we can take T_π := {d_j}. This works since v_j is the only string in V of which π is a substring. Similarly, in case the string π contains no symbol 1 except in the first (or except in the last) x ≤ 2 positions, and where t = ⌊(|π| - x)/5⌋ (here we are assuming that the symbol in position x is forced to be a 1 if x > 0), then t = 2k + j with 1 < j ≤ p, in which case we can take T_π := D_j. This works since v_i contains π as a substring if and only if i ≥ j. Furthermore, in case 00 is not a substring of π, and since π is a substring of some v_j ∈ V but not of all, then 10k - 8 ≤ |π| ≤ 10k + 2.
Actually, where π' is the longest substring of π which both begins and ends with 1, then 10k - 8 ≤ |π'| ≤ 10k, and π' is a substring of [...] for precisely one T ∈ 𝒯, and in this case T_π := T works. We are left with the case π = 0^a 1α1 0^b with α containing no 00 substring and where one of a or b may possibly be 0 but M := max{a, b} ≥ 2. Assume w.l.o.g. that a = M. Again, let t = ⌊M/5⌋. Clearly, we can assume t ≤ 2k + p. If t ≤ 2k + 1, then we can also assume that 1α1 0^b is a substring of [...] 0^b for precisely one T ∈ 𝒯; in this case T_π := T works since the set of those strings in V having π as a substring is precisely {v_j | d_j ∈ T}. We hence turn to consider t = 2k + j with 1 < j ≤ p. We can also assume that |α| ≤ 10k - 2. Let z be an indicator variable whose value is 1 if b ≠ 0 and 0 otherwise. If |1α1| + z < 5k - 3 then consider T_π := D_j, which works since the set of those strings in V having π as a substring is precisely {v_i | i ≥ j} = {v_i | d_i ∈ D_j}. (Actually, for the sake of precision, it can be observed that whenever |1α1| ≥ 10k - 5, the string 001α100 will be a substring of all v_i, or of none at all.)
If |1α1| + z ≥ 5k - 3 then 1α10 is a substring of [...] 0 for precisely one T ∈ 𝒯 and T_π := T ∩ D_j works since the set of those strings in V having π as a substring is precisely {v_i