The approximability of the String Barcoding problem

Lancia, Giuseppe; Rizzi, Romeo

doi:10.1186/1748-7188-1-12

Research
Open access
Published: 08 August 2006


Algorithms for Molecular Biology volume 1, Article number: 12 (2006)


Abstract

The String Barcoding (SBC) problem, introduced by Rash and Gusfield (RECOMB, 2002), consists of finding a minimum set of substrings that can be used to distinguish between all members of a set of given strings. In a computational biology context, the given strings represent a set of known viruses, while the substrings can be used as probes for a hybridization experiment via microarray. The ultimate aim is to classify new strings (unknown viruses) from the result of the hybridization experiment. In this paper we show that SBC is as hard to approximate as Set Cover. Furthermore, we show that the constrained version of SBC (with probes of bounded length) is also hard to approximate. These negative results are tight.

Background

The following setting was introduced by Rash and Gusfield in [1]: Given a set V of n strings v₁,...,v_n (representing the genomes of n known viruses), and an extra string s (representing a virus in V, but not yet classified), we aim at recognizing s as one of the known viruses through a hybridization experiment. In the experiment, we utilize a set ∏ of k probes (DNA strings) and we are able to determine which ones are contained in s (as substrings) and which are not. The result of the experiment is therefore a binary k-vector (called a barcode in [1]) which can be seen as the signature of s with respect to the given probes. In order for the barcode to be able to discriminate between all the viruses, it must be true that, for each pair of viruses v_i, v_j, with 1 ≤ i < j ≤ n, there exists at least one π ∈ ∏ which is a substring of either v_i or v_j but not of both. This amounts to saying that the barcodes of all the v_i must be distinct binary k-vectors. The cost of the hybridization experiment turns out to be proportional to k, and therefore the goal of the optimization problem, known as Minimum String Barcoding (SBC), is to find a feasible set ∏ of smallest possible cardinality. The problem has been popularized by Rash and Gusfield [1], who proposed an Integer Programming approach for its solution. In [2, 3], DasGupta et al. describe a greedy algorithm for robust barcoding (i.e., where each pair of viruses must be distinguished by at least a given number l of probes), which scales well to whole-genome sequences. For real-life instances, this algorithm is more effective than alternative approaches [1, 4] whose time complexity grows very quickly with the length of the input sequences.

In [1], Rash and Gusfield stated that a variant of SBC, in which the maximum length of each probe is bounded by a constant, and the alphabet size is at least 3, is NP-hard. As for the unconstrained case, where no bound is given on the length of each probe, they left open the problem of determining whether this version of SBC is NP-complete. In this paper we prove that both constrained and unconstrained SBC are in fact NP-complete already for binary alphabets. We do so by actually linking the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. This way, a sharp log n bound on the best achievable approximation ratio is established for both versions of SBC. It must here be said that essentially the same result has independently been obtained, and already published, by Berman et al. [5]. The inapproximability result in [5] actually holds for a very general family of Minimum Test Collection problems which includes unconstrained SBC as a special case. However, our inapproximability result for constrained SBC is not covered by the general framework proposed in [5]. Note that the very nature of the hybridization experiment dictates that the probes used cannot be too long, for technological and biological reasons (such as possible self-hybridization of the probes). Therefore, the bounded-length SBC problem is quite important in practice. In [5] the authors also obtain a (1 + log n)-approximation algorithm for the general Minimum Test Collection problem. Their result is the first improvement over the log n² = 2 log n approximation ratio that can essentially be achieved by a standard reduction of Minimum Test Collection to Set Cover followed by a run of the classical set covering greedy algorithm. Thanks to this positive result, all the bounds on the approximability ratios obtained either here or in [5] are tight also in terms of the multiplicative constant of the log n factor.
The (1 + log n)-approximation proposed in [5] is a greedy algorithm in which the choice of the test set to be added at each step is driven by a suitable entropy function. The analysis of the algorithm, also given in [5], is an elegant and non-trivial reinterpretation of the celebrated proof by Lovász of the approximation ratio of the greedy algorithm for set cover.
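The standard reduction to Set Cover mentioned above can be sketched in a few lines. This is only an illustration of the classical approach (not the entropy-driven algorithm of [5]): every unordered pair of elements becomes an item to be covered, a test covers the pairs it distinguishes, and the usual greedy rule picks the test covering the most uncovered pairs. The function name `greedy_test_collection` and the representation of tests as Python sets are assumptions made for this example.

```python
from itertools import combinations


def greedy_test_collection(elements, tests):
    """Greedy set cover over pairs of elements (hypothetical helper).

    elements: iterable of hashable ground elements.
    tests:    list of sets; a test succeeds on d exactly when d is in the set.
    Returns the indices of the chosen tests, in the order they were picked.
    """
    # Items to cover: all unordered pairs of distinct elements.
    uncovered = set(combinations(list(elements), 2))
    # For each test, the pairs it distinguishes (succeeds on exactly one).
    covers = [
        {(a, b) for (a, b) in uncovered if (a in t) != (b in t)}
        for t in tests
    ]
    chosen = []
    while uncovered:
        # Classical greedy rule: maximise the number of newly covered pairs.
        best = max(range(len(tests)), key=lambda i: len(covers[i] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            raise ValueError("some pair is distinguished by no test")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

Since the Set Cover instance has about p² items for p elements, the classical greedy analysis yields the 2 log p type of guarantee quoted in the text rather than the sharper (1 + log p) bound of [5].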

The remainder of the paper is organized as follows. In the next section, we introduce the Minimum Test Collection problem (MTC), a known NP-complete problem (see, e.g., Garey and Johnson [6]) for which set-cover-like inapproximability results are known [7]. We also introduce a restricted version of MTC and we show that the same inapproximability results hold for this restricted version as well. In the following section, we address the computational complexity of SBC and show that the approximation algorithm by Berman, DasGupta and Kao [5] delivers an essentially tight approximation ratio even for constrained SBC. More precisely, in the opening of the section we formally introduce the string barcoding problems studied and also point out that every SBC instance (either constrained or unconstrained) can be formulated as an MTC instance, which directly implies set-cover-like approximability results for SBC. We also observe there that the constrained SBC problem, when parameterized over the maximum probe length and the alphabet size, is in FPT and, in particular, it can be solved in linear time whenever these parameters are fixed (for a comprehensive treatment of FPT theory, see [8]). Next, we prove set-cover-like inapproximability results for SBC and for the maximum-length version of SBC via a common reduction from the restricted version of MTC introduced in the first section. (The NP-hardness of the maximum-length version of SBC had already been stated in [1], although without a proof.)

A starting problem: the Min Test Collection

In this section we introduce the Minimum Test Collection (MTC) problem, both in its general form and in a restricted version. We also report (and obtain) set-cover-like inapproximability results for MTC and its restricted version. Both the inapproximability of MTC and that of its restricted version will be used in later sections, when characterizing the approximability of the two variants of SBC.

The MTC problem, as defined in [6], is the following problem.

MTC INSTANCE

D = {d₁,...,d_p}: a set of (ground) elements.

$T$ = {T₁,...,T_q}: a set of subsets of D (representing tests that may succeed or fail on the elements; a test T succeeds on d if d ∈ T and fails on d otherwise).

MTC PROBLEM

Find a minimum-size set $T^{'}$ ⊆ $T$ such that for any pair of elements d, d' ∈ D there is at least one test T ∈ $T^{'}$ such that |{d, d'} ∩ T| = 1 (i.e., the test fails on one element and succeeds on the other). A set that verifies this property is called a testing set of D; $T^{'}$ is a minimum testing set of D.
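As a concrete reading of the definition, the following minimal sketch checks the testing-set property and finds a minimum testing set by exhaustive search. It is meant only for tiny instances (the search is exponential in the number of tests); the names `is_testing_set` and `minimum_testing_set` are hypothetical and not taken from the paper.

```python
from itertools import combinations


def is_testing_set(elements, tests):
    """True if every pair of distinct elements is split by some test,
    i.e. some test contains exactly one element of the pair."""
    return all(
        any((a in t) != (b in t) for t in tests)
        for a, b in combinations(list(elements), 2)
    )


def minimum_testing_set(elements, tests):
    """Exhaustive search by increasing size; returns None if no testing set exists."""
    tests = list(tests)
    for size in range(len(tests) + 1):
        for subset in combinations(tests, size):
            if is_testing_set(elements, subset):
                return list(subset)
    return None
```

For example, with D = {d1, d2, d3, d4} and tests {d1, d2}, {d1, d3}, {d1}, the first two tests already give the four elements the distinct success patterns 11, 10, 01 and 00, so the third test is not needed.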

The MTC problem appears in many contexts. For example, the elements may represent a set of p diseases, and the T_i are diagnostic tests that can verify the presence/absence of q symptoms. The goal is to minimize the number of symptoms whose presence/absence should be verified in order to correctly diagnose the disease. In [6], Garey and Johnson proved that MTC is NP-complete by reducing 3-dimensional Matching (3DM), which is NP-complete [9], to it. In [7] it was also proven by means of a reduction from Set Cover that no fully polynomial-time approximation scheme exists for MTC, unless P = NP. Later in this section we essentially employ this reduction. The same reduction had also been reconsidered in [10] where it was shown that MTC is not approximable within (1 - ε) log p for any ε > 0. We now introduce a special type of MTC instances, which we call standard. In this version of the problem, some particular tests must always be part of the problem instance.

In order to define these particular instances, assume the elements in D are ordered as d₁,...,d_p and let D_j = {d_j,...,d_p} for j = 1,...,p. A set of tests $T$ is called suffix-closed if D_j ∩ T ∈ $T$ for each T ∈ $T$ and j = 1,...,p. A suffix-closed set of tests $T$ is called standard if D_j ∈ $T$ and {d_j} ∈ $T$ for each j = 1,...,p. An instance (D, $T$) of MTC is standard when $T$ is standard. In other words, a standard instance of MTC consists of a finite set D = {d₁,...,d_p} and a set of tests $T$ which can be written as $T$ = $T$_D ∪ $T$_I ∪ $T$_A ∪ $T$_E, where

$T$_D = {S₁,...,S_{q'}}: a generic set of subsets of D;

$T$_I = {S_{q'+1},...,S_{q'+p}} = {{d_i} | i = 1,...,p};

$T$_A = {S_{q'+p+1},...,S_{q'+2p}} = {D_j | 1 ≤ j ≤ p};

$T$_E = {S_{q'+2p+1},...,S_{p(q'+2)}} = {S ∩ D_j | S ∈ $T$_D, 2 ≤ j ≤ p}.

Note that $T$_D, $T$_I, $T$_A and $T$_E may have non-empty intersection. In other words, if we write $T$ = {T₁,...,T_q} with q = p(q' + 2) and T_i = S_i for i = 1,2,...,q, then it might be the case that T_i = T_j with i ≠ j.
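The four families can be generated mechanically from an ordered ground set and a generic family of tests. The sketch below is one possible way to do so, with hypothetical function names; tests are stored as frozensets, so coinciding tests are merged automatically. The suffix-closure check skips empty intersections, which is an assumption of this example: the text does not say how the empty test (which distinguishes nothing) should be treated.

```python
def standard_closure(ordered_elements, generic_tests):
    """Return the union of the four families T_D, T_I, T_A, T_E as a set of frozensets."""
    d = list(ordered_elements)
    suffixes = [frozenset(d[j:]) for j in range(len(d))]       # D_1, ..., D_p
    t_d = {frozenset(s) for s in generic_tests}                # generic tests
    t_i = {frozenset([x]) for x in d}                          # singletons {d_i}
    t_a = set(suffixes)                                        # all suffixes D_j
    t_e = {s & suf for s in t_d for suf in suffixes[1:]}       # S ∩ D_j, j = 2..p
    return t_d | t_i | t_a | t_e


def is_suffix_closed(ordered_elements, tests):
    """Check that D_j ∩ T is again a test, ignoring empty intersections (assumption)."""
    d = list(ordered_elements)
    tests = {frozenset(t) for t in tests}
    for t in tests:
        for j in range(len(d)):
            cut = t & frozenset(d[j:])
            if cut and cut not in tests:
                return False
    return True
```

With D = (d1, d2, d3) and the single generic test {d1, d3}, the count with repetitions is p(q' + 2) = 9, but only 6 distinct tests remain, which illustrates the remark above that the families may overlap.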

We now prove the following result.

Theorem 1 Minimum Test Collection (MTC) cannot be approximated within (1 - ε) log p for any ε > 0 even when restricted to standard instances.

We prove the above theorem by a reduction from the Set Cover (SC) problem, which is defined ([11]) as follows.

SC INSTANCE

A finite set S = {s₁,...,s_m} and a collection $C$ = {C₁,...,C_n} ⊆ 2^S such that S = $\bigcup_{i = 1}^{n} C_{i}$.

SC PROBLEM

Find a minimum-size collection $C^{'}$ ⊆ $C$ such that every element in S belongs to at least one subset in $C^{'}$ , i.e.

$S = \bigcup_{C \in C^{'}} C. \qquad (1)$

We say that any $C^{'}$ satisfying (1) covers S, and we call such a set a set cover for S.

It is well known that SC cannot be approximated within (1 - ε) log m for any ε > 0 (see [12]).

Let S = {s₁,...,s_m} and $C$ = {C₁,...,C_n} ⊆ 2^S be an arbitrary instance of SC. We show how to obtain a standard instance of MTC representing the given instance of SC.

First, let K := 2^k be the smallest power of 2 such that K ≥ m. To each j ∈ {1, 2,..., K}, we associate a unique binary string b(j) of length k. Let R := {r₁,...,r_K} be a set of size K with R ∩ S = ∅. The set of elements D is defined as D = R ∪ S, with a particular order:

D = {r₁, s₁, r₂, s₂, ..., r_m, s_m, r_{m+1}, r_{m+2}, ..., r_K}

(i.e., D = {d₁, ..., d_p} with p = m + K). The set of tests $T$ is constructed in the following way. First, for each i = 1,...,k, we call T_i the test containing all the r_j and s_j such that the bit in position i of the binary string b(j) is set to 1. Then let $T$ = $T$_D ∪ $T$_I ∪ $T$_A ∪ $T$_E where

$T$_D = $C$ ∪ {T_i | i = 1,...,k},

$T$_I = {{d_i} | i = 1,...,p},

$T$_A = {D_j | 1 ≤ j ≤ p},

$T$_E = {T ∩ D_j | T ∈ $T$_D, 2 ≤ j ≤ p}.
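The construction above can be made concrete as follows. This is a minimal sketch under a few assumptions that are ours, not the paper's: elements of S are identified with the indices 1..m, the elements of D are encoded as tuples ("r", j) and ("s", j), and b(j) is taken to be the k-bit binary expansion of j - 1 (any injective choice would do). The helper names are hypothetical.

```python
from itertools import combinations


def reduce_set_cover_to_mtc(m, cover_sets):
    """Build the standard MTC instance from a Set Cover instance over {1..m}.

    Returns (D, bit_tests, cover_tests, all_tests), where all_tests lists
    T_D, T_I, T_A, T_E in that order, with repetitions kept.
    """
    k = (m - 1).bit_length()              # smallest k with 2**k >= m
    K = 2 ** k
    D = []                                # r_1, s_1, ..., r_m, s_m, r_{m+1}, ..., r_K
    for j in range(1, K + 1):
        D.append(("r", j))
        if j <= m:
            D.append(("s", j))

    def bit(j, i):
        # bit in position i (1-based) of b(j), with b(j) = binary expansion of j - 1
        return ((j - 1) >> (i - 1)) & 1

    bit_tests = [frozenset(d for d in D if bit(d[1], i)) for i in range(1, k + 1)]
    cover_tests = [frozenset(("s", x) for x in c) for c in cover_sets]
    suffixes = [frozenset(D[j:]) for j in range(len(D))]
    t_d = cover_tests + bit_tests
    t_i = [frozenset([d]) for d in D]
    t_e = [t & suf for t in t_d for suf in suffixes[1:]]
    return D, bit_tests, cover_tests, t_d + t_i + suffixes + t_e


def is_testing_set(elements, tests):
    """True if every pair of distinct elements is split by some test."""
    return all(
        any((a in t) != (b in t) for t in tests)
        for a, b in combinations(elements, 2)
    )
```

On the instance S = {1, 2, 3}, $C$ = {{1, 2}, {2, 3}, {3}} we get k = 2, K = 4 and p = 7. The bit tests alone separate elements with different indices but never s_i from r_i; adding the sets of a cover, for instance {1, 2} and {3}, gives a testing set, which is the behaviour the next lemma formalizes.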

The following two lemmas investigate the properties of the proposed reduction.

Lemma 1 If S has a set cover $C^{'}$ ⊆ $C$ of size h, then D has a testing set $T^{'}$ ⊆ $T$ of size at most h + k.

Proof: Let $C^{'}$ ⊆ $C$ be a set cover for S of size h. We claim that $T^{'}$ := $C^{'}$ ∪ {T_i | i = 1,...,k} is a testing set for D, which proves the lemma. Indeed, consider two elements s_i (or r_i) and s_j (or r_j). If i ≠ j then the binary strings associated to i and j differ in some position x, and hence T_x distinguishes between them. Otherwise, if i = j and the two elements still differ, then we are talking about s_i and r_i, for some i = 1,...,m. Notice that s_i is contained in at least one set C in $C^{'}$ since $C^{'}$ covers S. Moreover, r_i ∉ C since C ⊆ S. It follows that there exists some set in $C^{'}$, and hence in $T^{'}$, which distinguishes between s_i and r_i. □

Lemma 2 If D has a testing set $T^{'}$ ⊆ $T$ of size h, then S has a set cover $C^{'}$ ⊆ $C$ of size at most h.

Proof: Let $T^{'}$ ⊆ $T$ be a testing set of D of size h. We propose a polynomial-time algorithm to produce a set $C^{'}$ ⊆ $C$ with |$C^{'}$| ≤ |$T^{'}$| such that $C^{'}$ ∪ {T_i | i = 1,...,k} is also a testing set of D. At the end, we argue that such a $C^{'}$ must be a set cover of S.

Let X = $T^{'}$. Clearly, X ∪ {T_i | i = 1,...,k} distinguishes all the elements in D, and this invariant will be maintained throughout the algorithm. If X ⊆ $C$, then we just let $C^{'}$ = X, and stop. Otherwise, let T ∈ X \ $C$. Notice that all pairs of elements which are not distinguished by (X \ {T}) ∪ {T_i | i = 1,...,k} necessarily belong to the set P = {{s_i, r_i} | i = 1,...,m}. Our plan is hence to replace T by any set in $C$ which distinguishes all the pairs in P that are distinguished by T. It remains to show that such a set in $C$ always exists. Indeed, if T is a test D_j with j = 2i and j ≤ 2m, then the ordering we have imposed among the elements of D implies that T distinguishes only the pair {s_i, r_i} of P, so it can be replaced by any C ∈ $C$ with s_i ∈ C; if T is a test D_j with j odd or j > 2m, then T distinguishes no pair in P, so that T can be dropped from X without the need for any replacement. If T is a test of the form T_i ∩ D_j, then it again distinguishes at most one pair in P, and a similar reasoning holds. The same holds if T ∈ $T$_I, that is, T = {d} for some d ∈ D. Finally, if T is a test C ∩ D_j for some C ∈ $C$, then, clearly, it can be replaced with C. Hence, by substituting every test T ∈ X \ $C$ by tests in $C$ as shown, we obtain that X ⊆ $C$, and we let $C^{'}$ = X.

We now argue simply that, since $C^{'}$ ∪ {T_i | i = 1,...,k} is a testing set of D, $C^{'}$ is a set cover of S. Indeed, no pair in P is distinguished by a set T_i. Therefore, for each j = 1,...,m, the pair {r_j, s_j} is distinguished by some test $\bar{T}$ ∈ $C^{'}$. Moreover, since r_j ∉ T for any T ∈ $C^{'}$ ⊆ $C$, it must be that s_j ∈ $\bar{T}$. Therefore, each s_j is covered, and $C^{'}$ is a set cover of S. □

With Lemmas 1 and 2, we are now ready to prove Theorem 1.

Proof of Theorem 1: We first remark that SC is not approximable within (1 - ε) log m even when restricted to instances for which opt = ω(log m). Indeed, just consider duplicating a generic instance of SC into t := ⌈log² m⌉ = ω(log m) identical and disjoint copies to obtain a new instance (S*, $C$*) with |S*| = tm. Let opt denote the optimum value for the original instance (S, $C$) and opt* the optimum value for the instance (S*, $C$*). Then opt* = t opt ≥ t = ω(log|S*|). Notice also that a solution to the instance (S*, $C$*) of size at most opt*(1 - ε) log|S*| could be immediately translated into a solution to the instance (S, $C$) of size at most

$\frac{1}{t}\, \mathrm{opt}^{*} (1 - ε) \log |S^{*}| = \mathrm{opt}\, (1 - ε) \log |S^{*}| = (1 - ε)(\log m + \log t)\, \mathrm{opt}.$

Here, ε > 0 and log t = o(log m), in contrast with the inapproximability results explicitly derived in [12]. In the analysis to follow we therefore assume that opt = ω(log m).

Denote now by opt and opt' the optimal solution values for the original problem (SC) and the transformed problem (MTC) respectively, and by apx and apx' the values of the respective approximated solutions that we can produce in polynomial time. By Lemma 1,

opt' ≤ opt + k = opt + o(opt).

Then, if we assume that we can obtain an approximate solution

apx' ≤ f(|D|)opt'

for the MTC problem, we can also guarantee that

apx' ≤ f(|D|)(opt + o(opt)).

Since the proof of Lemma 2 is constructive, we obtain that

apx ≤ apx' ≤ f(|D|)(opt + o(opt)).

Notice that p := |D| = m + K < 3m, since K < 2m, so that log p = log m + O(1). Consequently, since we know that SC is not approximable within (1 - ε) log m for any ε > 0, we can conclude that MTC is not approximable within (1 - ε) log p for any ε > 0. □

The String Barcoding problems

The following is a formal definition of the String Barcoding problem (SBC):

SBC INSTANCE

An alphabet Σ (e.g., Σ = {A, C, G, T}) and a set V = {v₁,...,v_n} of strings over Σ (representing virus genomes).

SBC PROBLEM

Find a minimum-size set ∏ of strings such that for any pair of strings v, v' ∈ V there is at least one string π ∈ ∏ such that π is a substring of v or v', but not of both. A set that verifies this property is called a testing set of V; ∏ is a minimum testing set of V.
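The definition translates directly into code. In the minimal sketch below (function names are hypothetical), the barcode of a string is the 0/1 vector recording which probes occur in it as substrings, and a set of probes is a testing set exactly when all barcodes are distinct.

```python
def barcode(virus, probes):
    """Binary vector: entry i is 1 when probes[i] is a substring of virus."""
    return tuple(int(p in virus) for p in probes)


def is_string_testing_set(viruses, probes):
    """True if the given (distinct) strings all receive different barcodes."""
    codes = [barcode(v, probes) for v in viruses]
    return len(set(codes)) == len(codes)
```

For the toy set V = {ACGT, ACGG, TTGA, CCCC}, the probes T and CG yield the barcodes (1,1), (0,1), (1,0) and (0,0), so two probes suffice; the probes CG and GT do not form a testing set, because TTGA and CCCC both receive (0,0).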

Rash and Gusfield state in [1] that it is unknown whether the basic String Barcoding problem is NP-hard or not and they also state that a variant of SBC called Max-length String Barcoding (MLSBC) is NP-hard when the underlying alphabet contains at least three elements. In this variant, a constraint on the maximum length of the substrings in ∏ is specified as part of the input. More formally, MLSBC is the following problem:

MLSBC INSTANCE

An alphabet Σ, a set V = {v₁,...,v_n} of strings over Σ and a constant L.

MLSBC PROBLEM

Find a testing set ∏ of V such that the length of each string π ∈ ∏ is less than or equal to L, and ∏ has smallest possible cardinality among such testing sets.

The main point of this paper is to link the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. Indeed, both SBC and MLSBC can be naturally regarded as instances of MTC, for which, in turn, a natural reduction to Set Cover is well known. In the next section we provide reductions for the reverse direction. These reductions will characterize the approximability of SBC and MLSBC from a computational complexity point of view. To better appreciate some aspects of these reductions, we make the following remark.

Fact 1 MLSBC can be solved in linear time whenever L and |Σ| are bounded by a constant.

Proof: Indeed, the number of strings π which may possibly end up in the testing set ∏ is bounded by

$f(|Σ|, L) := \sum_{t = 1}^{L} |Σ|^{t} = \frac{|Σ|^{L + 1} - |Σ|}{|Σ| - 1},$

whence the number of possible solutions is bounded by 2^{f(|Σ|,L)}. Thus we have a constant number of possible solutions, and each can be checked in linear time. □
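The counting argument suggests a brute-force procedure, sketched below for illustration only. It enumerates every string of length 1 to L over the alphabet and then tries subsets of candidates by increasing size. As a shortcut that is our own addition and not part of the proof, probes with the same occurrence pattern across the input strings are merged, and probes occurring in none or in all of the strings are discarded. The function names are hypothetical.

```python
from itertools import combinations, product


def candidate_probes(alphabet, max_len):
    """All strings of length 1..max_len over the alphabet."""
    for t in range(1, max_len + 1):
        for letters in product(sorted(alphabet), repeat=t):
            yield "".join(letters)


def solve_mlsbc(alphabet, viruses, max_len):
    """Smallest set of probes of length <= max_len giving distinct barcodes, or None."""
    # Keep one representative probe per useful occurrence pattern.
    patterns = {}
    for probe in candidate_probes(alphabet, max_len):
        pattern = tuple(probe in v for v in viruses)
        if any(pattern) and not all(pattern):
            patterns.setdefault(pattern, probe)
    probes = list(patterns.values())
    for size in range(len(probes) + 1):
        for subset in combinations(probes, size):
            codes = {tuple(p in v for p in subset) for v in viruses}
            if len(codes) == len(viruses):
                return list(subset)
    return None
```

The number of candidates depends only on |Σ| and L (for a binary alphabet and L = 3 it is 2 + 4 + 8 = 14), so for fixed parameters the only part of the work that grows with the input is the substring tests.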

Inapproximability of SBC and MLSBC

In this subsection we prove the inapproximability of both SBC and MLSBC by means of a common reduction from the restricted form of MTC (standard instances) introduced in the section on the Minimum Test Collection problem above.

Theorem 2 The String Barcoding (SBC) problem cannot be approximated within (1 - ε) log n for any ε > 0. This negative result holds already for binary alphabets.

Theorem 3 The Max-length String Barcoding problem cannot be approximated within (1 - ε) log n for any ε > 0. This negative result holds already for binary alphabets.

Let D = {d₁,...,d_p} and $T$ = {T₁,...,T_q} = $T$_D ∪ $T$_I ∪ $T$_A ∪ $T$_E be a standard instance of MTC, with

$T$_D = {T₁,...,T_{q'}},

$T$_I = {T_{q'+1},...,T_{q'+p}},

$T$_A = {T_{q'+p+1},...,T_{q'+2p}},

$T$_E = {T_{q'+2p+1},...,T_{p(q'+2)}}.

For a set of strings Ω, ○_{σ∈Ω}(σ) denotes the string obtained as the concatenation of all the strings in Ω lined up in lexicographic order (as a matter of fact, for the purpose of our reduction to work, the strings in Ω could be concatenated in any order, but we prefer to refer to a specific order so that the instance generated through the proposed reduction is uniquely defined).

An instance of SBC (or of MLSBC) is obtained in the following way. First, let k = ⌈log₂ q⌉. Then, let Σ = {A, B} and Σ₊ = {A, B, X} (the dummy symbol X will be used as a separator, to divide the really interesting substrings, made only of As and Bs). We will often treat Σ and Σ₊ as alphabets, even if the intermediate symbols A, B, and X actually stand for binary strings according to the rules: A ↦ 10101, B ↦ 11011, and X ↦ 00000. Thanks to these rules, any given string in Σ* or Σ₊* ultimately represents a unique binary string in {0,1}*. Let Σ^l denote the set of all the strings of length l over the alphabet Σ. Finally, uniquely encode each different test T ∈ $T$ by a string f_T ∈ Σ^k (called the signature of T) and let F = {f_T | T ∈ $T$}; certainly this is possible since |Σ^k| = 2^k ≥ q = |$T$|. Now, the instance of SBC is completed by constructing the set of strings V = {v_j | j = 1,...,p} such that each string v_j ∈ V contains all the strings in Σ^{2k-1} plus the signatures f ∈ F of those tests T ∈ $T$ that succeed on d_j (that is, such that d_j ∈ T). More formally, the codification of an element d_j ∈ D is the string

$v_{j} = X^{2k + j} \; \bigcirc_{σ \in Σ^{2k - 1}} (σ X^{2k + j}) \; \bigcirc_{T \ni d_{j}} (f_{T} f_{T} X^{2k + j})$

seen as a binary string. Notice that the role of X is to separate the substrings, and that a different number of X characters is used in each string v in order to uniquely identify it when dealing with one of its substrings which includes a whole block of X's. The MLSBC instance is the same as the SBC instance plus the bound L = 10k.
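To make the codification tangible, the following sketch builds the strings v_j for a small family of tests and encodes them in binary. Several details are assumptions made only for this example: the signature of the i-th test is taken to be the i-th string of Σ^k in lexicographic order (any injective assignment works), k is forced to be at least 1 so that the degenerate case q = 1 still produces strings, and the tests passed in are assumed to be pairwise distinct. The function names are hypothetical.

```python
from itertools import product

CODE = {"A": "10101", "B": "11011", "X": "00000"}


def to_binary(symbols):
    """Expand a string over {A, B, X} into its binary form."""
    return "".join(CODE[c] for c in symbols)


def build_sbc_instance(elements, tests):
    """elements: ordered list d_1..d_p; tests: list of distinct sets.

    Returns (signatures, V): signatures[i] is f_T for tests[i] over {A, B},
    and V[j-1] is the binary string v_j.
    """
    q = len(tests)
    k = max(1, (q - 1).bit_length())      # k = ceil(log2 q); k >= 1 is our assumption
    words_k = ["".join(w) for w in product("AB", repeat=k)]
    signatures = words_k[:q]              # injective assignment of signatures
    fillers = ["".join(w) for w in product("AB", repeat=2 * k - 1)]  # lexicographic
    V = []
    for j, d in enumerate(elements, start=1):
        sep = "X" * (2 * k + j)           # separator whose length identifies v_j
        body = sep + "".join(w + sep for w in fillers)
        own = sorted(signatures[i] for i, t in enumerate(tests) if d in t)
        body += "".join(f + f + sep for f in own)
        V.append(to_binary(body))
    return signatures, V
```

In binary, a doubled signature f_T f_T has length 10k, contains no two consecutive zeros, and starts and ends with 1, whereas the filler blocks are only 5(2k - 1) bits long. This is why the doubled signature of T occurs in v_j exactly when d_j ∈ T, the property on which the next lemma relies.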

The number and size of the strings constructed above are polynomial, and hence so is the transformation described above from an MTC instance to either an SBC instance or an MLSBC instance. With the next two lemmas we show that this is an objective-function preserving reduction from MTC to either SBC or MLSBC, whence Theorems 2 and 3 follow.

Lemma 3 If D has a testing set $T^{'}$ ⊆ $T$ of size h, then V has a testing set ∏ of size at most h. Furthermore, |π| ≤ L for every π ∈ ∏.

Proof: Consider the set of strings ∏ = {f_T f_T | f_T is the signature of T ∈ $T^{'}$}. Clearly, |∏| ≤ |$T^{'}$| and we aim at showing that ∏ is a testing set for V. More precisely, we claim that the binary string f_T f_T is a substring of the binary string v_j if and only if d_j ∈ T. Indeed, when d_j ∈ T, it follows immediately from the construction of v_j that f_T f_T is a substring of v_j. As for the converse, when f_T f_T is a substring of v_j, then the shift of any of its occurrences within v_j is necessarily a multiple of 5, and hence f_T f_T is actually a substring of v_j also when f_T f_T and v_j are regarded as strings over Σ₊. It follows that d_j ∈ T. Notice moreover that |π| = 10k ≤ L for every π ∈ ∏. □

Lemma 4 If V has a testing set of size h, then D has a testing set $T^{'}$ ⊆ $T$ of size at most h.

Proof: We want to show that, given a testing set ∏ for V, there exists a testing set $T^{'}$ ⊆ $T$ for D with |$T^{'}$| ≤ |∏|. We actually commit ourselves to show that for every binary string π ∈ ∏ we can find a T_π ∈ $T$ such that, for each j = 1,...,p, the string π occurs as a substring of v_j if and only if d_j ∈ T_π. In following this plan of action, for each π ∈ ∏, we can clearly assume that π is a substring of some v_j ∈ V but not all. Thus, if π contains a substring of the form 10^y 1 for some y > 1, then y is a multiple of 5, that is, y = 5t, and, actually, t = 2k + j with 1 ≤ j ≤ p, in which case we can take T_π := {d_j}. This works since v_j is the only string in V of which π is a substring. Similarly, in case the string π contains no symbol 1 except in the first (or except in the last) x ≤ 2 positions, and where t = ⌈(|π| - x)/5⌉ (here we are assuming that the symbol in position x is forced to be a 1 if x > 0), then t = 2k + j with 1 < j ≤ p, in which case we can take T_π := D_j. This works since v_i contains π as a substring if and only if i ≥ j. Furthermore, in case 00 is not a substring of π, and since π is a substring of some v_j ∈ V but not all, then 10k - 8 ≤ |π| ≤ 10k + 2.

Actually, where π' is the longest substring of π which both begins and ends with 1, then 10k - 8 ≤ |π'| ≤ 10k, and π' is a substring of $f_{\tilde{T}} f_{\tilde{T}}$ for precisely one $\tilde{T}$ ∈ $T$; in this case T_π := $\tilde{T}$ works. We are left with the case π = 0^a 1α1 0^b with α containing no 00 substring and where one of a or b may possibly be 0 but M := max{a, b} ≥ 2. Assume w.l.o.g. that a = M. Again, let t = ⌈M/5⌉. Clearly, we can assume t ≤ 2k + p. If t ≤ 2k + 1, then we can also assume that 1α1 0^b is a substring of $f_{\tilde{T}} f_{\tilde{T}}$ 0^b for precisely one $\tilde{T}$ ∈ $T$; in this case T_π := $\tilde{T}$ works since the set of those strings in V having π as a substring is precisely {v_j | d_j ∈ $\tilde{T}$}. We hence turn to consider t = 2k + j with 1 < j ≤ p. We can also assume that |α| ≤ 10k - 2. Let z be an indicator variable whose value is 1 if b ≠ 0 and 0 otherwise. If |1α1| + z < 5k - 3 then consider T_π := D_j, which works since the set of those strings in V having π as a substring is precisely {v_i | i ≥ j} = {v_i | d_i ∈ D_j}. (Actually, for the sake of precision, it can be observed that whenever |1α1| ≥ 10k - 5, the string 00 1α1 00 will be a substring of all v_i, or none at all.) If |1α1| + z ≥ 5k - 3 then 1α1 0 is a substring of $f_{\tilde{T}} f_{\tilde{T}}$ 0 for precisely one $\tilde{T}$ ∈ $T$ and T_π := $\tilde{T}$ ∩ D_j works since the set of those strings in V having π as a substring is precisely {v_i | i ≥ j, d_i ∈ $\tilde{T}$} = {v_i | d_i ∈ D_j ∩ $\tilde{T}$}. □

References

1. Rash S, Gusfield D: String Barcoding: Uncovering Optimal Virus Signatures. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB). 2002, 254-261. ACM Press.
2. DasGupta B, Konwar KM, Mandoiu II, Shvartsman A: Highly scalable algorithms for robust string barcoding. Int J of Bioinf Res and Appls. 2005, 1 (2): 145-161.
3. DasGupta B, Konwar KM, Mandoiu II, Shvartsman A: DNA-BAR: distinguisher selection for DNA barcoding. Bioinf. 2005, 21 (16): 3424-3426. 10.1093/bioinformatics/bti547.
4. Borneman J, Chrobak M, Della Vedova G, Figueroa A, Jiang T: Probe selection algorithms with applications in the analysis of microbial communities. Bioinf. 2001, 17 (Suppl 1): 39-48.
5. Berman P, DasGupta B, Kao MY: Tight approximability results for test set problems in bioinformatics. J of Comp and Sys Sc. 2004, 71 (2): 145-162. 10.1016/j.jcss.2005.02.001. [Also in Proc. Workshop on Algorithm Theory, Lec Notes in Comp Sc, Springer, 3111:39-50, 2004].
6. Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979, San Francisco: W. H. Freeman and Co.
7. Moret BME, Shapiro HD: On minimizing a set of tests. SIAM J on Sc and Stat Comp. 1985, 6: 983-1003. 10.1137/0906067.
8. Downey RG, Fellows MR: Parameterized Complexity. 1998, Berlin: Springer-Verlag.
9. Karp RM: Reducibility among combinatorial problems. Compl and Comp Computations. 1972.
10. De Bontridder KMJ, Halldórsson BV, Halldórsson MM, Hurkens CAJ, Lenstra JK, Ravi R, Stougie L: Approximation algorithms for the test cover problem. Math Prog B. 2003, 1-3: 477-491. 10.1007/s10107-003-0414-6.
11. Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. 2001, Boston: MIT Press.
12. Feige U: A threshold of ln n for approximating set cover. J ACM. 1998, 45: 634-652. 10.1145/285055.285059.


Acknowledgements

We thank two anonymous referees for their careful reading of the paper. In particular, the first referee is acknowledged for pointing out to us the important reference [5], and the second referee for his detailed list of suggestions which greatly helped in improving the presentation. Part of this work was supported through MIUR grants P.R.I.N. and the F.I.R.B. project "Bioinformatica per la Genomica e la Proteomica".

Author information

Authors and Affiliations

Dipartimento di Matematica ed Informatica, Università di Udine, Via delle Scienze 206, Udine, Italy
Giuseppe Lancia & Romeo Rizzi


Corresponding author

Correspondence to Giuseppe Lancia.

Additional information

Authors' contributions

All authors equally contributed to this paper. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Lancia, G., Rizzi, R. The approximability of the String Barcoding problem. Algorithms Mol Biol 1, 12 (2006). https://doi.org/10.1186/1748-7188-1-12


Received: 16 May 2006
Accepted: 08 August 2006
Published: 08 August 2006
DOI: https://doi.org/10.1186/1748-7188-1-12
