
# The approximability of the String Barcoding problem

*Algorithms for Molecular Biology*
**volume 1**, Article number: 12 (2006)

## Abstract

The String Barcoding (SBC) problem, introduced by Rash and Gusfield (RECOMB, 2002), consists in finding a minimum set of substrings that can be used to distinguish between all members of a set of given strings. In a computational biology context, the given strings represent a set of known viruses, while the substrings can be used as probes for a hybridization experiment via microarray. Eventually, one aims at the classification of new strings (unknown viruses) through the result of the hybridization experiment. In this paper we show that SBC is as hard to approximate as Set Cover. Furthermore, we show that the constrained version of SBC (with probes of bounded length) is also hard to approximate. These negative results are tight.

## Background

The following setting was introduced by Rash and Gusfield in [1]: Given a set *V* of *n* strings *v*_{1},...,*v*_{n} (representing the genomes of *n* known viruses), and an extra string *s* (representing a virus in *V*, but not yet classified), we aim at recognizing *s* as one of the known viruses through a hybridization experiment. In the experiment, we utilize a set ∏ of *k* probes (DNA strings) and we are able to determine which ones are contained in *s* (as substrings) and which are not. The result of the experiment is therefore a binary *k*-vector (called a *barcode* in [1]) which can be seen as the signature of *s* with respect to the given probes. In order for the barcode to be able to discriminate between all the viruses, it must be true that, for each pair of viruses *v*_{i}, *v*_{j}, with 1 ≤ *i* < *j* ≤ *n*, there exists at least one *π* ∈ ∏ which is a substring of either *v*_{i} or *v*_{j} but not of both. This amounts to saying that the barcodes of all the *v*_{i}'s must be distinct binary *k*-vectors. The cost of the hybridization experiment turns out to be proportional to *k*, and therefore the goal of the optimization problem, known as Minimum String Barcoding (SBC), is to find a feasible set ∏ of smallest possible cardinality. The problem has been popularized by Rash and Gusfield [1], who proposed an Integer Programming approach for its solution. In [2, 3], DasGupta et al. describe a greedy algorithm for *robust* barcoding (i.e., where each pair of viruses must be distinguished by at least a given number *l* of probes), which scales well to whole-genome sequences. For real-life instances, this algorithm is more effective than alternative approaches [1, 4] whose time complexity grows very quickly with the length of the input sequences.
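The feasibility condition above can be illustrated with a minimal Python sketch. The function names `barcode` and `is_feasible`, as well as the toy genomes and probes, are illustrative assumptions and do not come from [1].

```python
def barcode(virus: str, probes: list[str]) -> tuple[int, ...]:
    """Binary k-vector: entry i is 1 iff probes[i] occurs in virus as a substring."""
    return tuple(int(p in virus) for p in probes)


def is_feasible(viruses: list[str], probes: list[str]) -> bool:
    """True iff all viruses receive pairwise distinct barcodes."""
    codes = [barcode(v, probes) for v in viruses]
    return len(set(codes)) == len(codes)


# Toy data, made up for illustration only.
viruses = ["ACGT", "ACGG", "TTGA", "CCAT"]
probes = ["AC", "GT", "CA"]
codes = {v: barcode(v, probes) for v in viruses}
```

On this toy input the three probes yield four distinct barcodes, whereas dropping the probe `CA` leaves `TTGA` and `CCAT` with the same all-zero barcode.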

In [1], Rash and Gusfield stated that a variant of SBC, in which the maximum length of each probe is bounded by a constant, and the alphabet size is at least 3, is NP-hard. As for the unconstrained case, where no bound is given on the length of each probe, they left as an open problem to determine whether this version of SBC is NP-complete or not. In this paper we prove that both constrained and unconstrained SBC are in fact NP-complete already for binary alphabets. We do so by actually linking the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. This way, a sharp log *n* bound on the best achievable approximation ratio is established for both versions of SBC. It must here be said that essentially the same result has independently been obtained, and already published, by Berman et al. [5]. The inapproximability result in [5] actually holds for a very general family of Minimum Test Collection problems which includes unconstrained SBC as a special case. However, our inapproximability result for constrained SBC is not covered by the general framework proposed in [5]. Note that the very nature of the hybridization experiment imposes that the used probes cannot be too long for technological and biological reasons (such as possible self-hybridization of the probes). Therefore, the bounded-length SBC problem is quite important in practice. In [5] the authors also obtain a (1 + log *n*)-approximation algorithm for the general Minimum Test Collection problem. Their result is the first improvement over the log *n*^{2} = 2 log *n* approximation ratio that can essentially be achieved by a standard reduction of Minimum Test Collection to Set Cover followed by a run of the classical set covering greedy algorithm. Thanks to this positive result, all the bounds on the approximability ratios obtained either here or in [5] are tight also in terms of the multiplicative constant of the log *n* factor. This (1 + log *n*)-approximation proposed in [5] is a greedy algorithm in which the choice of the test set to be added at each step is driven by a suitable entropy function. The analysis of the algorithm, also given in [5], is an elegant and non-trivial reinterpretation of the celebrated proof by Lovász of the approximation ratio of the greedy algorithm for set cover.
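The standard reduction mentioned above (the one with ratio roughly 2 log *n*, not the entropy-driven algorithm of [5]) can be sketched as follows: the universe to be covered is the set of pairs of elements, and each test "covers" the pairs it separates. The function name `greedy_test_collection` and the toy instance are assumptions made for illustration.

```python
from itertools import combinations


def greedy_test_collection(elements, tests):
    """Greedy Set Cover over the universe of pairs: repeatedly pick the test
    that separates the largest number of still-unseparated pairs.
    Returns the indices of the chosen tests."""
    uncovered = set(combinations(range(len(elements)), 2))
    separates = [
        {(i, j) for (i, j) in uncovered
         if (elements[i] in t) != (elements[j] in t)}
        for t in tests
    ]
    chosen = []
    while uncovered:
        best = max(range(len(tests)), key=lambda x: len(separates[x] & uncovered))
        gain = separates[best] & uncovered
        if not gain:
            raise ValueError("no testing set exists for this instance")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy instance, made up for illustration.
elements = ["a", "b", "c", "d"]
tests = [{"a", "b"}, {"a", "c"}, {"a"}, {"b"}]
chosen = greedy_test_collection(elements, tests)
```

On this instance the greedy rule selects the first two tests, which is optimal since at least ⌈log_{2} 4⌉ = 2 tests are needed to separate four elements.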

The remainder of the paper is organized as follows. In the next section, we introduce the Minimum Test Collection problem (MTC), a known NP-complete problem (see, e.g., Garey and Johnson [6]) for which set-cover-like inapproximability results are known [7]. We also introduce a restricted version of MTC and we show that the same inapproximability results hold for this restricted version as well. In the following section, we address the computational complexity of SBC and show that the approximation algorithm by Berman, DasGupta and Kao [5] delivers an essentially tight approximation ratio even for constrained SBC. More precisely, in the opening of the section we formally introduce the string barcoding problems studied and also point out that every SBC instance (either constrained or unconstrained) can be formulated as an MTC instance, which directly implies set-cover-like approximability results for SBC. We also observe here that the constrained SBC problem, when parameterized over the maximum probe length and the alphabet size, is in FPT and, in particular, it can be solved in linear time whenever these parameters are fixed (for a comprehensive treatment of FPT theory, see [8]). Next, we prove set-cover-like inapproximability results for SBC and for the maximum-length version of SBC via a common reduction from the restricted version of MTC introduced in the first section. (The NP-hardness of the maximum-length version of SBC had already been stated in [1], although without reporting the proof.)

## A starting problem: the Min Test Collection

In this section we introduce the Minimum Test Collection (MTC) problem, both in its general form and in a restricted version. We also report (and obtain) set-cover-like inapproximability results for MTC and its restricted version. Both the inapproximability of MTC and that of its restricted version will be used in later sections, when characterizing the approximability of the two variants of SBC.

The MTC problem, as defined in [6], is the following problem.

### MTC INSTANCE

*D* = {*d*_{1},...,*d*_{p}}: a set of (ground) elements.

$\mathcal{T}$ = {*T*_{1},...,*T*_{q}}: a set of subsets of *D*, representing tests that may *succeed* or *fail* on the elements. A test *T* succeeds on *d* if *d* ∈ *T* and fails on *d* otherwise.

### MTC PROBLEM

Find a minimum-size set ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ such that for any pair of elements *d*, *d'* ∈ *D* there is at least one test *T* ∈ ${\mathcal{T}}^{\prime}$ such that |{*d*, *d'*} ∩ *T*| = 1 (i.e., the test fails on one element and succeeds on the other). A set that verifies this property is called a *testing set* of *D*; ${\mathcal{T}}^{\prime}$ is a *minimum testing set* of *D*.
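As a concrete reading of this definition, the following Python sketch checks the testing-set property and finds a minimum testing set by exhaustive search. The names `is_testing_set` and `minimum_testing_set` are illustrative assumptions, and the search is exponential in the number of tests, so it is meant for toy instances only.

```python
from itertools import combinations


def is_testing_set(D, tests):
    """Every pair d, d' must satisfy |{d, d'} ∩ T| = 1 for some chosen test T."""
    return all(
        any((d in T) != (e in T) for T in tests)
        for d, e in combinations(D, 2)
    )


def minimum_testing_set(D, tests):
    """Exhaustive search by increasing cardinality; returns None if even the
    full collection fails to separate D."""
    for size in range(len(tests) + 1):
        for subset in combinations(tests, size):
            if is_testing_set(D, subset):
                return list(subset)
    return None


# Toy instance, made up for illustration.
D = [1, 2, 3, 4]
tests = [frozenset({1, 2}), frozenset({1, 3}), frozenset({1}), frozenset({4})]
best = minimum_testing_set(D, tests)
```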

The MTC problem appears in many contexts. For example, the elements may represent a set of *p* diseases, and the *T*_{i} are diagnostic tests that can verify the presence/absence of *q* symptoms. The goal is to minimize the number of symptoms whose presence/absence should be verified in order to correctly diagnose the disease. In [6], Garey and Johnson proved that MTC is NP-complete by reducing 3-dimensional Matching (3DM), which is NP-complete [9], to it. In [7] it was also proven by means of a reduction from Set Cover that no fully polynomial-time approximation scheme exists for MTC, unless P = NP. Later in this section we essentially employ this reduction. The same reduction had also been reconsidered in [10] where it was shown that MTC is not approximable within (1 - *ε*) log *p* for any *ε* > 0. We now introduce a special type of MTC instances, which we call *standard*. In this version of the problem, some particular tests must always be part of the problem instance.

In order to define these particular instances, assume the elements in *D* are ordered as *d*_{1},...,*d*_{p} and let *D*_{j} = {*d*_{j},...,*d*_{p}} for *j* = 1,...,*p*. A set of tests $\mathcal{T}$ is called *suffix-closed* if *D*_{j} ∩ *T* ∈ $\mathcal{T}$ for each *T* ∈ $\mathcal{T}$ and *j* = 1,...,*p*. A suffix-closed set of tests $\mathcal{T}$ is called *standard* if *D*_{j} ∈ $\mathcal{T}$ and {*d*_{j}} ∈ $\mathcal{T}$ for each *j* = 1,...,*p*. An instance (*D*, $\mathcal{T}$) of MTC is *standard* when $\mathcal{T}$ is standard. In other words, a standard instance of MTC consists of a finite set *D* = {*d*_{1},...,*d*_{p}} and a set of tests $\mathcal{T}$ which can be written as $\mathcal{T}$ = $\mathcal{T}$_{D} ∪ $\mathcal{T}$_{I} ∪ $\mathcal{T}$_{A} ∪ $\mathcal{T}$_{E}, where

$\mathcal{T}$_{D} = {*S*_{1},...,*S*_{q'}}: a generic set of subsets of *D*;

$\mathcal{T}$_{I} = {*S*_{q'+1},...,*S*_{q'+p}} = {{*d*_{i}} | *i* = 1,...,*p*};

$\mathcal{T}$_{A} = {*S*_{q'+p+1},...,*S*_{q'+2p}} = {*D*_{j} | 1 ≤ *j* ≤ *p*};

$\mathcal{T}$_{E} = {*S*_{q'+2p+1},...,*S*_{p(q'+2)}} = {*S* ∩ *D*_{j} | *S* ∈ $\mathcal{T}$_{D}, 2 ≤ *j* ≤ *p*}.

Note that $\mathcal{T}$_{D}, $\mathcal{T}$_{I}, $\mathcal{T}$_{A} and $\mathcal{T}$_{E} may have non-empty intersection. In other words, if we write $\mathcal{T}$ = {*T*_{1},...,*T*_{q}} with *q* = *p*(*q'* + 2) and *T*_{i} = *S*_{i} for *i* = 1,2,...,*q*, then it might be the case that *T*_{i} = *T*_{j} with *i* ≠ *j*.
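The four families can be generated mechanically from an ordered ground set and a generic collection of subsets. The sketch below is illustrative only: the function name `standardize` and the toy instance are assumptions. It also shows that the families may overlap, so the number of distinct tests can be smaller than *p*(*q'* + 2).

```python
def standardize(D, generic_tests):
    """Return the set of distinct tests in T_D ∪ T_I ∪ T_A ∪ T_E, where D is a
    list giving the order d_1, ..., d_p and generic_tests plays the role of T_D."""
    p = len(D)
    suffixes = [frozenset(D[j:]) for j in range(p)]             # D_1, ..., D_p
    T_D = {frozenset(S) for S in generic_tests}
    T_I = {frozenset([d]) for d in D}                           # singletons {d_i}
    T_A = set(suffixes)                                         # all suffixes D_j
    T_E = {S & suffixes[j] for S in T_D for j in range(1, p)}   # S ∩ D_j, j >= 2
    return T_D | T_I | T_A | T_E


# Toy instance, made up for illustration: p = 3 and q' = 1.
D = ["d1", "d2", "d3"]
std = standardize(D, [{"d1", "d3"}])
```

Here *p*(*q'* + 2) = 9, but only six distinct tests arise, because {*d*_{3}} is produced three times (as a singleton, as the suffix *D*_{3}, and as an intersection).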

We now prove the following result.

**Theorem 1** *Minimum Test Collection (MTC) cannot be approximated within* (1 - *ε*) log *p for any ε* > 0 *even when restricted to standard instances*.

We prove the above theorem by a reduction from the Set Cover (SC) problem, which is defined ([11]) as follows.

### SC INSTANCE

A finite set *S* = {*s*_{1},...,*s*_{m}} and a collection $\mathcal{C}$ = {*C*_{1},...,*C*_{n}} ⊆ 2^{S} such that *S* = $\bigcup_{i=1}^{n} C_{i}$.

### SC PROBLEM

Find a minimum-size collection ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$ such that every element in *S* belongs to at least one subset in ${\mathcal{C}}^{\prime}$, i.e.

$S=\bigcup_{C\in {\mathcal{C}}^{\prime}} C. \qquad (1)$

We say that any ${\mathcal{C}}^{\prime}$ satisfying (1) *covers S*, and we call such a set a *set cover* for *S*.

It is well known that SC cannot be approximated within (1 - *ε*) log *m* for any *ε* > 0 (see [12]).

Let *S* = {*s*_{1},...,*s*_{m}} and $\mathcal{C}$ = {*C*_{1},...,*C*_{n}} ⊆ 2^{S} be an arbitrary instance of SC. We show how to obtain a standard instance of MTC representing the given instance of SC.

First, let *K* := 2^{k} be the smallest power of 2 such that *K* ≥ *m*. To each *j* ∈ {1, 2,..., *K*}, we associate a unique binary string *b*(*j*) of length *k*. Let *R* := {*r*_{1},...,*r*_{K}} be a set of size *K* with *R* ∩ *S* = ∅. The set of elements *D* is defined as *D* = *R* ∪ *S*, with a particular order:

*D* = {*r*_{1}, *s*_{1}, *r*_{2}, *s*_{2}, ..., *r*_{m}, *s*_{m}, *r*_{m+1}, *r*_{m+2}, ..., *r*_{K}}

(i.e., *D* = {*d*_{1}, ..., *d*_{p}} with *p* = *m* + *K*). The set of tests $\mathcal{T}$ is constructed in the following way. First, for each *i* = 1,...,*k*, we call *T*_{i} the test containing all the *r*_{j} and *s*_{j} such that the bit in position *i* of the binary string *b*(*j*) is set to 1. Then let $\mathcal{T}$ = $\mathcal{T}$_{D} ∪ $\mathcal{T}$_{I} ∪ $\mathcal{T}$_{A} ∪ $\mathcal{T}$_{E} where

$\mathcal{T}$_{D} = $\mathcal{C}$ ∪ {*T*_{i} | *i* = 1,...,*k*},

$\mathcal{T}$_{I} = {{*d*_{i}} | *i* = 1,...,*p*},

$\mathcal{T}$_{A} = {*D*_{j} | 1 ≤ *j* ≤ *p*},

$\mathcal{T}$_{E} = {*T* ∩ *D*_{j} | *T* ∈ $\mathcal{T}$_{D}, 2 ≤ *j* ≤ *p*}.
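The construction can be sketched in Python as follows. The representation of elements as pairs `("r", j)` and `("s", j)`, the choice of *b*(*j*) as the binary expansion of *j* - 1, and the function names are assumptions made for illustration; any injective assignment of *k*-bit strings would serve equally well.

```python
from itertools import combinations


def reduce_sc_to_mtc(m, cover_sets):
    """cover_sets: list of sets of indices in 1..m.
    Returns (D, C, bit_tests, all_tests) for the standard MTC instance."""
    k = (m - 1).bit_length()            # K = 2**k is the least power of 2 >= m
    K = 2 ** k
    D = []
    for j in range(1, K + 1):           # order r1, s1, ..., rm, sm, r(m+1), ..., rK
        D.append(("r", j))
        if j <= m:
            D.append(("s", j))
    C = [frozenset(("s", j) for j in c) for c in cover_sets]
    # T_i holds every r_j and s_j whose label b(j) = binary(j - 1) has bit i set.
    bit_tests = [frozenset(d for d in D if ((d[1] - 1) >> i) & 1) for i in range(k)]
    T_D = set(C) | set(bit_tests)
    suffixes = [frozenset(D[j:]) for j in range(len(D))]
    T_I = {frozenset([d]) for d in D}
    T_E = {T & Dj for T in T_D for Dj in suffixes[1:]}
    return D, C, bit_tests, T_D | T_I | set(suffixes) | T_E


def is_testing_set(D, tests):
    return all(any((d in T) != (e in T) for T in tests)
               for d, e in combinations(D, 2))


# Toy Set Cover instance, made up for illustration: m = 3, so k = 2 and K = 4.
D, C, bit_tests, all_tests = reduce_sc_to_mtc(3, [{1, 2}, {2, 3}, {3}])
```

On this toy instance, the set cover {*C*_{1}, *C*_{2}} together with the two bit tests separates all seven elements, while *C*_{1} alone with the bit tests leaves the pair {*r*_{3}, *s*_{3}} unseparated.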

The following two lemmas investigate the properties of the proposed reduction.

**Lemma 1** *If S has a set cover* ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$ *of size h, then D has a testing set* ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ *of size at most h* + *k*.

*Proof:* Let ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$ be a set cover for *S* of size *h*. We claim that ${\mathcal{T}}^{\prime}$ := ${\mathcal{C}}^{\prime}$ ∪ {*T*_{i} | *i* = 1,...,*k*} is a testing set for *D*, which proves the lemma. Indeed, consider two elements *s*_{i} (or *r*_{i}) and *s*_{j} (or *r*_{j}). If *i* ≠ *j* then the binary strings associated to *i* and *j* differ in some position *x*, and hence *T*_{x} distinguishes between them. Otherwise, if *i* = *j* and the two elements still differ, then we are talking about *s*_{i} and *r*_{i}, for some *i* = 1,...,*m*. Notice that *s*_{i} is contained in at least one set *C* in ${\mathcal{C}}^{\prime}$ since ${\mathcal{C}}^{\prime}$ covers *S*. Moreover, *r*_{i} ∉ *C* since *C* ⊆ *S*. It follows that there exists some set in ${\mathcal{C}}^{\prime}$, and hence in ${\mathcal{T}}^{\prime}$, which distinguishes between *s*_{i} and *r*_{i}. □

**Lemma 2** *If D has a testing set* ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ *of size h, then S has a set cover* ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$ *of size at most h*.

*Proof:* Let ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ be a testing set of *D* of size *h*. We propose a polynomial-time algorithm to produce a set ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$ with |${\mathcal{C}}^{\prime}$| ≤ |${\mathcal{T}}^{\prime}$| such that ${\mathcal{C}}^{\prime}$ ∪ {*T*_{i} | *i* = 1,...,*k*} is also a testing set of *D*. At the end, we argue that such a ${\mathcal{C}}^{\prime}$ must be a set cover of *S*.

Let *X* = ${\mathcal{T}}^{\prime}$. Clearly, *X* ∪ {*T*_{i} | *i* = 1,...,*k*} distinguishes all the elements in *D*, and this invariant will be maintained throughout the algorithm. If *X* ⊆ $\mathcal{C}$, then we just let ${\mathcal{C}}^{\prime}$ = *X*, and stop. Otherwise, let *T* ∈ *X* \ $\mathcal{C}$. Notice that all pairs of elements which are not distinguished by (*X* \ {*T*}) ∪ {*T*_{i} | *i* = 1,...,*k*} necessarily belong to the set *P* = {{*s*_{i}, *r*_{i}} | *i* = 1,...,*m*}. Our plan is hence to replace *T* by any set in $\mathcal{C}$ which distinguishes all the pairs in *P* that are distinguished by *T*. It remains to show that such a set in $\mathcal{C}$ always exists. Indeed, if *T* is a test *D*_{j} with *j* = 2*i* and *j* ≤ 2*m*, then the ordering we have imposed among the elements of *D* implies that *T* distinguishes only the pair {*s*_{i}, *r*_{i}} of *P*, so it can be replaced by any *C* ∈ $\mathcal{C}$ with *s*_{i} ∈ *C*; if *T* is a test *D*_{j} with *j* odd or *j* > 2*m*, then *T* distinguishes no pair in *P*, so that *T* can be dropped from *X* without the need for any replacement. If *T* is a test of the form *T*_{i} ∩ *D*_{j}, then it again distinguishes at most one pair in *P*, and a similar reasoning holds. The same holds if *T* ∈ $\mathcal{T}$_{I}, that is, *T* = {*d*} for some *d* ∈ *D*. Finally, if *T* is a test *C* ∩ *D*_{j} for some *C* ∈ $\mathcal{C}$, then, clearly, it can be replaced with *C*. Hence, by substituting every test *T* ∈ *X* \ $\mathcal{C}$ by tests in $\mathcal{C}$ as shown, we obtain that *X* ⊆ $\mathcal{C}$, and we let ${\mathcal{C}}^{\prime}$ = *X*.

We now argue simply that, since ${\mathcal{C}}^{\prime}$ ∪ {*T*_{i} | *i* = 1,...,*k*} is a testing set of *D*, then ${\mathcal{C}}^{\prime}$ is a set cover of *S*. Indeed, no pair in *P* is distinguished by a set *T*_{i}. Therefore, for each *j* = 1,...,*m*, the pair {*r*_{j}, *s*_{j}} is distinguished by some test $\overline{T}$ ∈ ${\mathcal{C}}^{\prime}$. Moreover, since *r*_{j} ∉ *T* for any *T* ∈ ${\mathcal{C}}^{\prime}$ ⊆ $\mathcal{C}$, it must be that *s*_{j} ∈ $\overline{T}$. Therefore, each *s*_{j} is covered, and ${\mathcal{C}}^{\prime}$ is a set cover of *S*. □

With Lemmas 1 and 2, we are now ready to prove Theorem 1.

**Proof of Theorem 1:** We first remark that SC is not approximable within (1 - *ε*) log *m* even when restricted to instances for which *opt* = *ω*(log *m*). Indeed, just consider duplicating a generic instance of SC into *t* := ⌈log^{2} *m*⌉ = *ω*(log *m*) identical and disjoint copies to obtain a new instance (*S**, $\mathcal{C}$*) with |*S**| = *tm*. Let *opt* denote the optimum value for the original instance (*S*, $\mathcal{C}$) and *opt** the optimum value for the instance (*S**, $\mathcal{C}$*). Then *opt** = *t opt* ≥ *t* = *ω*(log|*S**|). Notice also that a solution to the instance (*S**, $\mathcal{C}$*) of size at most *opt**(1 - *ε*) log|*S**| could be immediately translated into a solution to the instance (*S*, $\mathcal{C}$)of size at most

$\frac{1}{t}\,opt^{*}(1-\epsilon )\log|S^{*}| = opt\,(1-\epsilon )\log|S^{*}| = (1-\epsilon )(\log m+\log t)\,opt.$

Here, *ε* > 0 and log *t* = *o*(log *m*), in contrast with the inapproximability results explicitly derived in [12]. In the analysis to follow we therefore assume that *opt* = *ω*(log *m*).

Denote now by *opt* and *opt'* the optimal solution values for the original problem (SC) and the transformed problem (MTC) respectively, and by *apx* and *apx'* the values of the respective approximated solutions that we can produce in polynomial time. By Lemma 1,

*opt'* ≤ *opt* + *k* = *opt* + *o*(*opt*).

Then, if we assume that we can obtain an approximate solution

*apx'* ≤ *f*(|*D*|)*opt'*

for the MTC problem, we can also guarantee that

*apx'* ≤ *f*(|*D*|)(*opt* + *o*(*opt*)).

Since the proof of Lemma 2 is constructive, we obtain that

*apx* ≤ *apx'* ≤ *f*(|*D*|)(*opt* + *o*(*opt*)).

Notice that *p* := |*D*| = *m* + *K* < 3*m*, since *K* < 2*m*; hence log *p* = log *m* + *O*(1). Consequently, since we know that SC is not approximable within (1 - *ε*) log *m* for any *ε* > 0, we can conclude that MTC is not approximable within (1 - *ε*) log *p* for any *ε* > 0. □
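To make the final step explicit, the following worked chain is a sketch under the assumption (made for contradiction) that MTC admits a polynomial-time approximation with *f*(*p*) = (1 - *ε*) log *p*. It uses only *p* = *m* + *K* with *K* < 2*m*, the bound *opt'* ≤ *opt* + *o*(*opt*) from Lemma 1, and *apx* ≤ *apx'* from Lemma 2; the symbol *ε'* is introduced here for illustration.

```latex
\begin{align*}
apx \;\le\; apx' &\le (1-\epsilon)\,\log p \cdot opt' \\
  &\le (1-\epsilon)\,\log(3m)\,\bigl(opt + o(opt)\bigr) \\
  &= (1-\epsilon)\Bigl(1+\tfrac{\log 3}{\log m}\Bigr)\bigl(1+o(1)\bigr)\,\log m \cdot opt \\
  &\le (1-\epsilon')\,\log m \cdot opt
  \qquad \text{for any fixed } 0<\epsilon'<\epsilon \text{ and all sufficiently large } m .
\end{align*}
```

The last line would give a (1 - *ε'*) log *m* approximation for SC, which is excluded by [12].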

## The String Barcoding problems

The following is a formal definition of the String Barcoding problem (SBC):

### SBC INSTANCE

An alphabet Σ (e.g., Σ = {*A*, *C*, *G*, *T*}) and a set *V* = {*v*_{1},...,*v*_{n}} of strings over Σ (representing virus genomes).

### SBC PROBLEM

Find a minimum-size set ∏ of strings such that for any pair of strings *v, v'* ∈ *V* there is at least one string *π* ∈ ∏ such that *π* is a substring of *v* or *v'*, but not of both. A set that verifies this property is called a *testing set* of *V*; ∏ is a *minimum testing set* of *V*.

Rash and Gusfield state in [1] that it is unknown whether the basic String Barcoding problem is NP-hard or not and they also state that a variant of SBC called Max-length String Barcoding (MLSBC) is NP-hard when the underlying alphabet contains at least three elements. In this variant, a constraint on the maximum length of the substrings in ∏ is specified in input. More formally, MLSBC is the following problem:

### MLSBC INSTANCE

An alphabet Σ, a set *V* = {*v*_{1},...,*v*_{n}} of strings over Σ and a constant *L*.

### MLSBC PROBLEM

Find a testing set ∏ of *V* such that the length of each string *π* ∈ ∏ is less than or equal to *L*, and ∏ has smallest possible cardinality among such testing sets.

The main point of this paper is to link the approximability of SBC (both constrained and unconstrained) to the approximability of the classical Set Cover problem. Indeed, both SBC and MLSBC can be naturally regarded as instances of MTC, for which, in turn, a natural reduction to Set Cover is well known. In the next section we provide reductions for the reverse direction. These reductions will characterize the approximability of SBC and MLSBC from a computational complexity point of view. To better appreciate some aspects of these reductions, we make the following remark.
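The formulation of an SBC or MLSBC instance as an MTC instance can be sketched as follows. Only substrings of the input strings need to be considered as probes, since a probe occurring in no string (or in every string) separates nothing. The function name `sbc_to_mtc` and its `max_len` parameter (standing for the bound *L*) are illustrative assumptions.

```python
def sbc_to_mtc(viruses, max_len=None):
    """Return a dict mapping each distinct occurrence pattern (the set of
    indices of the strings containing a probe) to one representative probe.
    The keys are the tests of the induced MTC instance over {0, ..., n-1}."""
    tests = {}
    for v in viruses:
        for i in range(len(v)):
            top = len(v) if max_len is None else min(len(v), i + max_len)
            for j in range(i + 1, top + 1):
                probe = v[i:j]
                hit = frozenset(idx for idx, w in enumerate(viruses) if probe in w)
                if len(hit) < len(viruses):      # skip probes found everywhere
                    tests.setdefault(hit, probe)
    return tests


# Toy instance, made up for illustration.
viruses = ["AB", "BA", "AA"]
unbounded = sbc_to_mtc(viruses)
bounded = sbc_to_mtc(viruses, max_len=1)
```

With probes of length at most 1, the only useful probe is `B`, which cannot separate `AB` from `BA`; without the length bound, four distinct tests are available.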

**Fact 1** *MLSBC can be solved in linear time whenever L and* |Σ| *are bounded by a constant*.

*Proof:* Indeed, the number of strings *π* which may possibly end up in the testing set ∏ is bounded by

$f(|\Sigma|,L):=\sum_{t=1}^{L}|\Sigma|^{t}=\frac{|\Sigma|^{L+1}-|\Sigma|}{|\Sigma|-1},$

whence the number of possible solutions is bounded by 2^{f(|Σ|,L)}. Thus we have a constant number of possible solutions, and each can be checked in linear time. □
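The argument can be mirrored by a brute-force Python sketch that enumerates the candidate pool and tries probe sets of increasing size. The names `candidate_probes` and `solve_mlsbc` are illustrative assumptions; the sketch relies on Python's built-in substring test and makes no attempt to achieve the linear-time check referred to in the proof.

```python
from itertools import combinations, product


def candidate_probes(alphabet, L):
    """All non-empty strings of length at most L over the alphabet."""
    return ["".join(t) for n in range(1, L + 1) for t in product(alphabet, repeat=n)]


def solve_mlsbc(viruses, alphabet, L):
    """Try probe sets by increasing cardinality. The pool has constant size
    whenever |alphabet| and L are constants. Returns None if no testing set exists."""
    pool = candidate_probes(alphabet, L)
    for size in range(len(pool) + 1):
        for probes in combinations(pool, size):
            codes = {tuple(p in v for p in probes) for v in viruses}
            if len(codes) == len(viruses):
                return list(probes)
    return None


# Toy instance, made up for illustration.
solution = solve_mlsbc(["AB", "BA", "AA"], "AB", 2)
```

For the binary alphabet and *L* = 2 the pool contains 2 + 4 = 6 candidate probes, and the search returns the two probes `B` and `AB`.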

#### Inapproximability of SBC and MLSBC

In this subsection we prove the inapproximability of both SBC and MLSBC by means of a common reduction from the restricted form of MTC introduced in the previous section.

**Theorem 2** *The String Barcoding (SBC) problem cannot be approximated within* (1 - *ε*) log *n for any ε* > 0. *This negative result holds already for binary alphabets*.

**Theorem 3** *The Max-length String Barcoding problem cannot be approximated within* (1 - *ε*) log *n for any ε* > 0. *This negative result holds already for binary alphabets*.

Let *D* = {*d*_{1},...,*d*_{p}} and $\mathcal{T}$ = {*T*_{1},...,*T*_{q}} = $\mathcal{T}$_{D} ∪ $\mathcal{T}$_{I} ∪ $\mathcal{T}$_{A} ∪ $\mathcal{T}$_{E} be a standard instance of MTC, with

$\mathcal{T}$_{D} = {*T*_{1},...,*T*_{q'}},

$\mathcal{T}$_{I} = {*T*_{q'+1},...,*T*_{q'+p}},

$\mathcal{T}$_{A} = {*T*_{q'+p+1},...,*T*_{q'+2p}},

$\mathcal{T}$_{E} = {*T*_{q'+2p+1},...,*T*_{p(q'+2)}}.

Where Ω is a set of strings, ◯*σ*∈Ω(*σ*) denotes the string obtained as the concatenation of all the strings in Ω lined up in lexicographic order (as a matter of fact, for the purpose of our reduction to work, the strings in Ω could be concatenated in any order, but we prefer to refer to a specific order so that the instance generated through the proposed reduction is uniquely defined).

An instance of SBC (or of MLSBC) is obtained in the following way. First, let *k* = ⌈log_{2} *q*⌉. Then, let Σ = {*A, B*} and Σ_{+} = {*A, B, X*} (the dummy symbol *X* will be used as a separator, to divide the really interesting substrings, made only of *A*s and *B*s). We will often treat Σ and Σ_{+} as alphabets, even if the *intermediate symbols* *A*, *B*, and *X* actually stand for binary strings according to the rules: *A* ↦ 10101, *B* ↦ 11011, and *X* ↦ 00000. Thanks to these rules, any given string in Σ* or $\Sigma_{+}^{*}$ ultimately represents a unique binary string in {0,1}*. Let Σ^{l} denote the set of all the strings of length *l* over the alphabet Σ. Finally, uniquely encode each different test *T* ∈ $\mathcal{T}$ by a string *f*_{T} ∈ Σ^{k} (called the *signature* of *T*) and let *F* = {*f*_{T} | *T* ∈ $\mathcal{T}$}; certainly this is possible since |Σ^{k}| = 2^{k} ≥ *q* = |$\mathcal{T}$|. Now, the instance of SBC is completed by constructing the set of strings *V* = {*v*_{j} | *j* = 1,...,*p*} such that each string *v*_{j} ∈ *V* contains all the strings in Σ^{2k-1} plus the signatures *f* ∈ *F* of those tests *T* ∈ $\mathcal{T}$ that succeed on *d*_{j} (that is, such that *d*_{j} ∈ *T*). More formally, the codification of an element *d*_{j} ∈ *D* is the string

$v_{j}=X^{2k+j}\;\bigcirc_{\sigma \in \Sigma^{2k-1}}\bigl(\sigma X^{2k+j}\bigr)\;\bigcirc_{T\ni d_{j}}\bigl(f_{T}f_{T}X^{2k+j}\bigr)$

seen as a binary string. Notice that the role of *X* is to separate the substrings, and that a different number of *X* characters is used in each string *v* in order to uniquely identify it when dealing with one of its substrings which includes a whole block of *X*s. The MLSBC instance is the same as the SBC instance plus the bound *L* = 10 *k*.
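The encoding can be sketched in Python as follows. The function names, the assignment of signatures in lexicographic order of test index, the guard `k >= 1`, and the toy instance (which is not a standard instance and serves only to exercise the encoding) are assumptions made for illustration.

```python
from itertools import product

CODE = {"A": "10101", "B": "11011", "X": "00000"}   # rules stated in the text


def to_binary(s):
    """Expand a string over {A, B, X} into its binary form."""
    return "".join(CODE[c] for c in s)


def build_strings(D, tests):
    """D: ordered list of elements; tests: list of sets over D.
    Returns (k, signatures, binary strings v_1, ..., v_p)."""
    q = len(tests)
    k = max(1, (q - 1).bit_length())                  # ceil(log2 q), at least 1
    words = ["".join(w) for w in product("AB", repeat=k)]
    sig = [words[i] for i in range(q)]                # distinct signatures f_T
    fillers = ["".join(w) for w in product("AB", repeat=2 * k - 1)]
    V = []
    for j, d in enumerate(D, start=1):
        sep = "X" * (2 * k + j)                       # separator X^(2k+j)
        body = sep + "".join(s + sep for s in fillers)
        body += "".join(sig[i] * 2 + sep for i, T in enumerate(tests) if d in T)
        V.append(to_binary(body))
    return k, sig, V


# Toy instance, made up for illustration.
D = ["d1", "d2", "d3"]
tests = [{"d1"}, {"d2", "d3"}, {"d1", "d2"}]
k, sig, V = build_strings(D, tests)
ok = all((to_binary(sig[i] * 2) in V[j]) == (D[j] in tests[i])
         for i in range(len(tests)) for j in range(len(D)))
```

On this toy instance, the doubled signature of each test, of binary length 10 *k*, occurs in the binary string of an element exactly when the test succeeds on that element, which is the property the reduction relies on.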

The number and size of the strings constructed above, and hence the above described transformation from an MTC instance to either an SBC instance or an MLSBC instance, is polynomial. With the next two lemmas we show that this is an objective-function preserving reduction from MTC to either SBC or MLSBC whence Theorems 2 and 3 follow.

**Lemma 3** *If D has a testing set* ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ *of size h, then V has a testing set* ∏ *of size at most h. Furthermore*, |*π*| ≤ *L for every π* ∈ ∏.

*Proof:* Consider the set of strings ∏ = {*f*_{T}*f*_{T} | *f*_{T} is the signature of *T* ∈ ${\mathcal{T}}^{\prime}$}. Clearly, |∏| ≤ |${\mathcal{T}}^{\prime}$| and we aim at showing that ∏ is a testing set for *V*. More precisely, we claim that the binary string *f*_{T}*f*_{T} is a substring of the binary string *v*_{j} if and only if *d*_{j} ∈ *T*. Indeed, when *d*_{j} ∈ *T*, it follows immediately from the construction of *v*_{j} that *f*_{T}*f*_{T} is a substring of *v*_{j}. As for the converse, when *f*_{T}*f*_{T} is a substring of *v*_{j}, then the shift of any of its occurrences within *v*_{j} is necessarily a multiple of 5, and hence *f*_{T}*f*_{T} is actually a substring of *v*_{j} also when *f*_{T}*f*_{T} and *v*_{j} are regarded as strings over Σ_{+}. It follows that *d*_{j} ∈ *T*. Notice moreover that |*π*| = 10 *k* ≤ *L* for every *π* ∈ ∏. □

**Lemma 4** *If V has a testing set of size h, then D has a testing set* ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ *of size at most h*.

*Proof:* We want to show that, given a testing set ∏ for *V*, there exists a testing set ${\mathcal{T}}^{\prime}$ ⊆ $\mathcal{T}$ for *D* with |${\mathcal{T}}^{\prime}$| ≤ |∏|. We actually commit ourselves to show that for every binary string *π* ∈ ∏ we can find a *T*_{π} ∈ $\mathcal{T}$ such that, for each *j* = 1,...,*p*, the string *π* occurs as a substring of *v*_{j} if and only if *d*_{j} ∈ *T*_{π}. In following this plan of action, for each *π* ∈ ∏, we can clearly assume that *π* is a substring of some *v*_{j} ∈ *V* but not all. Thus, if *π* contains a substring of the form 10^{y}1 for some *y* > 1, then *y* is a multiple of 5, that is, *y* = 5*t*, and, actually, *t* = 2*k* + *j* with 1 ≤ *j* ≤ *p*, in which case we can take *T*_{π} := {*d*_{j}}. This works since *v*_{j} is the only string in *V* of which *π* is a substring. Similarly, in case the string *π* contains no symbol 1 except in the first (or except in the last) *x* ≤ 2 positions, and where *t* = ⌈(|*π*| - *x*)/5⌉ (here we are assuming that the symbol in position *x* is forced to be a 1 if *x* > 0), then *t* = 2*k* + *j* with 1 < *j* ≤ *p*, in which case we can take *T*_{π} := *D*_{j}. This works since *v*_{i} contains *π* as a substring if and only if *i* ≥ *j*. Furthermore, in case 00 is not a substring of *π*, and since *π* is a substring of some *v*_{j} ∈ *V* but not all, then 10 *k* - 8 ≤ |*π*| ≤ 10 *k* + 2.

Actually, where *π'* is the longest substring of *π* which both begins and ends with 1, then 10 *k* - 8 ≤ |*π'*| ≤ 10 *k*, and *π'* is a substring of ${f}_{\tilde{T}}{f}_{\tilde{T}}$ for precisely one $\tilde{T}$ ∈ $\mathcal{T}$, and in this case *T*_{π} := $\tilde{T}$ works. We are left with the case *π* = 0^{a}1*α*10^{b} with *α* containing no 00 substring and where one of *a* or *b* may possibly be 0 but *M* := max{*a*, *b*} ≥ 2. Assume w.l.o.g. that *a* = *M*. Again, let *t* = ⌈*M*/5⌉. Clearly, we can assume *t* ≤ 2*k* + *p*. If *t* ≤ 2*k* + 1, then we can also assume that 1*α*10^{b} is a substring of ${f}_{\tilde{T}}{f}_{\tilde{T}}$0^{b} for precisely one $\tilde{T}$ ∈ $\mathcal{T}$; in this case *T*_{π} := $\tilde{T}$ works since the set of those strings in *V* having *π* as a substring is precisely {*v*_{j} | *d*_{j} ∈ $\tilde{T}$}. We hence turn to consider *t* = 2*k* + *j* with 1 < *j* ≤ *p*. We can also assume that |*α*| ≤ 10 *k* - 2. Let *z* be an indicator variable whose value is 1 if *b* ≠ 0 and 0 otherwise. If |1*α*1| + *z* < 5 *k* - 3 then consider *T*_{π} := *D*_{j}, which works since the set of those strings in *V* having *π* as a substring is precisely {*v*_{i} | *i* ≥ *j*} = {*v*_{i} | *d*_{i} ∈ *D*_{j}}. (Actually, for the sake of precision, it can be observed that whenever |1*α*1| ≥ 10 *k* - 5, the string 001*α*100 will be a substring of all *v*_{i}, or none at all.) If |1*α*1| + *z* ≥ 5 *k* - 3 then 1*α*10 is a substring of ${f}_{\tilde{T}}{f}_{\tilde{T}}$0 for precisely one $\tilde{T}$ ∈ $\mathcal{T}$ and *T*_{π} := $\tilde{T}$ ∩ *D*_{j} works since the set of those strings in *V* having *π* as a substring is precisely {*v*_{i} | *i* ≥ *j*, *d*_{i} ∈ $\tilde{T}$} = {*v*_{i} | *d*_{i} ∈ *D*_{j} ∩ $\tilde{T}$}. □

## References

1. Rash S, Gusfield D: String Barcoding: Uncovering Optimal Virus Signatures. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB). 2002, 254-261. ACM press

2. DasGupta B, Konwar KM, Mandoiu II, Shvartsman A: Highly scalable algorithms for robust string barcoding. Int J of Bioinf Res and Appls. 2005, 1 (2): 145-161.

3. DasGupta B, Konwar KM, Mandoiu II, Shvartsman A: DNA-BAR: distinguisher selection for DNA barcoding. Bioinf. 2005, 21 (16): 3424-3426. 10.1093/bioinformatics/bti547.

4. Borneman J, Chrobak M, Della Vedova G, Figueroa A, Jiang T: Probe selection algorithms with applications in the analysis of microbial communities. Bioinf. 2001, 17 (Suppl 1): 39-48.

5. Berman P, DasGupta B, Kao MY: Tight approximability results for test set problems in bioinformatics. J of Comp and Sys Sc. 2004, 71 (2): 145-162. 10.1016/j.jcss.2005.02.001. [Also in *Proc. Workshop on Algorithm Theory*, Lec Notes in Comp Sc, Springer, 3111:39–50, 2004]

6. Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979, San Francisco: W. H. Freeman and Co

7. Moret BME, Shapiro HD: On minimizing a set of tests. SIAM J on Sc and Stat Comp. 1985, 6: 983-1003. 10.1137/0906067.

8. Downey RG, Fellows MR: Parameterized Complexity. 1998, Berlin: Springer-Verlag

9. Karp RM: Reducibility among combinatorial problems. Compl and Comp Computations. 1972

10. De Bontridder KMJ, Halldórsson BV, Halldórsson MM, Hurkens CAJ, Lenstra JK, Ravi R, Stougie L: Approximation algorithms for the test cover problem. Math Prog B. 2003, 1–3: 477-491. 10.1007/s10107-003-0414-6.

11. Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. 2001, Boston: MIT press

12. Feige U: A threshold of ln *n* for approximating set cover. J ACM. 1998, 45: 634-652. 10.1145/285055.285059.

## Acknowledgements

We thank two anonymous referees for their careful reading of the paper. In particular, the first referee is acknowledged for pointing out to us the important reference [5], and the second referee for his detailed list of suggestions which greatly helped in improving the presentation. Part of this work was supported through MIUR grants P.R.I.N. and the F.I.R.B. project "Bioinformatica per la Genomica e la Proteomica".

## Additional information

### Authors' contributions

All authors equally contributed to this paper. All authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Lancia, G., Rizzi, R. The approximability of the String Barcoding problem. *Algorithms Mol Biol* **1**, 12 (2006). https://doi.org/10.1186/1748-7188-1-12

### Keywords

- Greedy Algorithm
- Approximation Ratio
- Hybridization Experiment
- Binary String
- Restricted Version