Fast algorithms for approximate circular string matching
 Carl Barton^{1},
 Costas S Iliopoulos^{1, 2, 3} and
 Solon P Pissis^{1}
https://doi.org/10.1186/1748-7188-9-9
© Barton et al.; licensee BioMed Central Ltd. 2014
Received: 19 September 2013
Accepted: 17 March 2014
Published: 22 March 2014
Abstract
Background
Circular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area.
Results
In this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time $\mathcal{O}(n)$. Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time $\mathcal{O}(n)$ for moderate values of k, that is $k=\mathcal{O}(m/\log_{\sigma} m)$. We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach.
Conclusions
We present two fast average-case algorithms for approximate circular string matching with k-mismatches, and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.
Keywords
Approximate circular string matching; Circular pattern matching; Algorithms on strings
Background
Circular sequences appear in a number of biological contexts. This type of structure occurs in the DNA of viruses [1, 2], bacteria [3], eukaryotic cells [4], and archaea [5]. In [6], it was noted that, due to this, algorithms on circular strings may be important in the analysis of organisms with such structure. Circular strings have previously been studied in the context of sequence alignment. In [7], basic algorithms for pairwise and multiple circular sequence alignment were presented. These results were later improved in [8], where an additional preprocessing stage was added to speed up the execution time of the algorithm. In [9], the authors also presented efficient algorithms for finding the optimal alignment and consensus sequence of circular sequences under the Hamming distance metric.
In order to provide an overview of our results and algorithms, we begin with a few definitions, generally following [10]. We think of a string x of length n as an array x[0..n−1], where every x[i], 0≤i<n, is a letter drawn from some fixed alphabet Σ of size σ=|Σ|. The empty string of length 0 is denoted by ε. A string x is a factor of a string y if there exist two strings u and v, such that y=uxv. Let the strings x, y, u, and v be such that y=uxv. If u=ε, then x is a prefix of y. If v=ε, then x is a suffix of y.
Let x be a non-empty string of length n and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y. Every occurrence of x can be characterised by a position in y. Thus we say that x occurs at the starting position i in y when y[i..i+n−1]=x. The Hamming distance between strings x and y, both of length n, is the number of positions i, 0≤i<n, such that x[i]≠y[i]. Given a non-negative integer k, we write x≡_{k}y if the Hamming distance between x and y is at most k.
A circular string of length n can be viewed as a traditional linear string which has the left- and right-most symbols wrapped around and stuck together in some way. Under this notion, the same circular string can be seen as n different linear strings, which would all be considered equivalent. Given a string x of length n, we denote by x^{i}=x[i..n−1]x[0..i−1], 0<i<n, the i-th rotation of x and x^{0}=x. Consider, for instance, the string x=x^{0}=abababbc; this string has the following rotations: x^{1}=bababbca, x^{2}=ababbcab, x^{3}=babbcaba, x^{4}=abbcabab, x^{5}=bbcababa, x^{6}=bcababab, x^{7}=cabababb.
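The definition of a rotation can be checked directly with a few lines of Python. This is only an illustrative sketch; the helper name `rotations` is ours and is not part of the library described in this article.

```python
def rotations(x):
    """Return [x^0, x^1, ..., x^(n-1)], where x^i = x[i..n-1] x[0..i-1]."""
    return [x[i:] + x[:i] for i in range(len(x))]


if __name__ == "__main__":
    # Prints the eight rotations of the example string from the text.
    for i, r in enumerate(rotations("abababbc")):
        print(f"x^{i} = {r}")
```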
Here we consider the problem of finding occurrences of a pattern string x of length m with circular structure in a text string t of length n with linear structure. For instance, the DNA sequence of many viruses has circular structure, so if a biologist wishes to find occurrences of a particular virus in a carrier's DNA sequence, which may not be circular, they must consider how to locate all positions in t where at least one rotation of x occurs. This is the problem of circular string matching.
The problem of exact circular string matching has been considered in [11], where an $\mathcal{O}(n)$-time algorithm was presented. A naïve solution with quadratic complexity consists in applying a classical algorithm for searching a finite set of strings after having built the trie of rotations of x. The approach presented in [11] consists in preprocessing x by constructing a suffix automaton of the string xx, by noting that every rotation of x is a factor of xx. Then, by feeding t into the automaton, the lengths of the longest factors of xx occurring in t can be found by the links followed in the automaton in time $\mathcal{O}(n)$. In [12], the authors presented an optimal average-case algorithm for exact circular string matching, by also showing that the average-case lower bound for single string matching of $\Omega(n\log_{\sigma} m/m)$ also holds for circular string matching. Very recently, in [13], the authors presented two fast average-case algorithms based on word-level parallelism. The first algorithm requires average-case time $\mathcal{O}(n\log_{\sigma} m/w)$, where w is the number of bits in the computer word. The second one is based on a mixture of word-level parallelism and q-grams. The authors showed that with the addition of q-grams, and by setting $q=\mathcal{O}(\log_{\sigma} m)$, an optimal average-case time of $\mathcal{O}(n\log_{\sigma} m/m)$ is achieved. Indexing circular patterns [14] and variations of approximate circular string matching under the edit distance model [15], both based on the construction of a suffix tree, have also been considered.
In this article, we consider the following problems.
Problem 1 (Exact Circular String Matching).
Given a pattern x of length m and a text t of length n>m, find all factors u of t such that u=x^{i}, 0≤i<m.
Problem 2 (Approximate Circular String Matching with k-Mismatches).
Given a pattern x of length m, a text t of length n>m, and an integer threshold k<m, find all factors u of t such that u≡_{k}x^{i}, 0≤i<m.
The aforementioned algorithms for the exact case exhibit the following disadvantages: first, they cannot be applied in a biological context since both single nucleotide polymorphisms and errors introduced by wet-lab sequencing platforms might have occurred in the sequences; second, it is not clear whether they could easily be adapted to deal with the approximate case. Similar to the exact case [12], it can be shown that the average-case lower bound for single approximate string matching of $\Omega(n(k+\log_{\sigma} m)/m)$ [16] also holds for approximate circular string matching with k-mismatches under the Hamming distance model. To the best of our knowledge, no optimal average-case algorithm exists for this problem. Therefore, to achieve optimality, one could use the optimal average-case algorithm for multiple approximate string matching, presented in [17], for matching the r=m rotations of x. It requires, on average, time $\mathcal{O}(n(k+\log_{\sigma} rm)/m)$, but only if $k/m<1/2-\mathcal{O}(1/\sqrt{\sigma})$, $r=\mathcal{O}(\min(n^{1/3}/m^{2},\sigma^{o(m)}))$, and $\mathcal{O}(m^{4}r^{2}\sigma^{\mathcal{O}(1)})$ space is available, which is impractical for large m: e.g. the genome of the smallest known viruses replicating autonomously in eukaryotic cells is around 1.8 KB long. The authors do, however, propose ways to reduce the required space using various space–time trade-off techniques.
Our Contribution. We present a new suboptimal average-case algorithm for exact circular string matching requiring time $\mathcal{O}(n)$. Although suboptimal, this algorithm can be easily extended to tackle the approximate case efficiently. Based on our solution for the exact case, we present two new fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time $\mathcal{O}(n)$ for moderate values of k, that is $k=\mathcal{O}(m/\log_{\sigma} m)$. The first algorithm requires space $\mathcal{O}(n)$ and the second one $\mathcal{O}(m)$. We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.
Properties of the partitioning technique
In this section, we give a brief outline of the partitioning technique in general, and then show some properties of the version of the technique we use for our algorithms. The partitioning technique, introduced in [18], and in some sense earlier in [19], speeds up string-matching algorithms by filtering out candidate positions that could never give a solution. An important point to note about this technique is that it reduces the search space but does not, by design, verify potential occurrences. To create a string-matching algorithm, filtering must be combined with some verification technique. The idea behind the partitioning technique was initially proposed for approximate string matching, but here we show that it can also be used for exact circular string matching.
The idea behind the partitioning technique is to partition the given pattern in such a way that at least one of the fragments must occur exactly in any valid approximate occurrence of the pattern. It is then possible to search for these fragments exactly to give a set of candidate occurrences of the pattern. It is then left to the verification portion of the algorithm to check if these are valid approximate occurrences of the pattern. It has been experimentally shown that this approach yields very good practical performance on large-scale datasets [20], even if it is not theoretically optimal.
For an efficient solution to exact circular string matching, we cannot simply apply well-known exact string-matching algorithms, as we must also take into account the rotations of the pattern. We can, however, make use of the partitioning technique and, by choosing an appropriate number of fragments, ensure that at least one fragment must occur in any valid exact occurrence of a rotation. Lemma 1 together with the following fact provide this number.
Fact 1. Any rotation of x=x[0..m−1] is a factor of x^{′}=x[0..m−1]x[0..m−2]; and any factor of length m of x^{′} is a rotation of x.
Lemma 1. If we partition x^{′}=x[0..m−1]x[0..m−2] in 4 fragments of length ⌊(2m−1)/4⌋ and ⌈(2m−1)/4⌉, at least one of the 4 fragments is a factor of any factor of length m of x^{′}.
Proof.
Lemma 2. Let x and y=y_{0}y_{1}…y_{k} be two strings, both of length n, such that y_{0},y_{1},…,y_{k} are k+1≤n non-empty strings and x≡_{k}y. Then there exists at least one string y_{i}, 0≤i≤k, starting at position j of y, 0≤j<n, occurring at the starting position j of x.
Proof. Immediate from the pigeonhole principle: if n items are put into m<n pigeonholes, then at least one pigeonhole must contain more than one item. ■
Based on Lemma 2, we take a similar approach to the one described by Lemma 1, to obtain the sufficient number of fragments in the case of approximate circular string matching with k-mismatches.
Lemma 3. If we partition x^{′}=x[0..m−1]x[0..m−2] in 2k+4 fragments of length ⌊(2m−1)/(2k+4)⌋ and ⌈(2m−1)/(2k+4)⌉, at least k+1 of the 2k+4 fragments are factors of any factor of length m of x^{′}.
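The counting claims of Lemmas 1 and 3 can be explored numerically. The sketch below is not from the article: the helper names are hypothetical, and it assumes one particular way of distributing the longer fragments, namely fragment boundaries at ⌊i(2m−1)/(2k+4)⌋, which spreads them evenly along x^{′}. The text does not say how the authors place them, and the check was only reasoned about for this placement.

```python
def fragments(length, parts):
    """Split positions 0..length-1 into `parts` consecutive fragments.

    Boundaries are floor(i * length / parts), so every fragment has length
    floor(length / parts) or ceil(length / parts).  This even spreading is
    an assumption of this sketch.  Returns (start, length) pairs.
    """
    bounds = [i * length // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1] - bounds[i]) for i in range(parts)]


def min_fragments_inside_window(m, parts):
    """Smallest number of fragments fully contained in a length-m factor of x'.

    Only positions matter here, so x' is represented by its length 2m-1.
    """
    frs = fragments(2 * m - 1, parts)
    best = parts
    for s in range(m):  # the m windows x'[s .. s+m-1]
        inside = sum(1 for p, l in frs if s <= p and p + l <= s + m)
        best = min(best, inside)
    return best


if __name__ == "__main__":
    m = 7
    print(fragments(2 * m - 1, 4))              # partition used for the exact case
    print(min_fragments_inside_window(m, 4))    # Lemma 1 asks for at least 1
    k = 1
    print(min_fragments_inside_window(m, 2 * k + 4))  # Lemma 3 asks for at least k+1
```

The check is restricted to 2m−1 ≥ 2k+4 so that no fragment is empty.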
Exact circular string matching via filtering
In this section, we present ECSMF, a new suboptimal average-case algorithm for exact circular string matching via filtering. It is based on the partitioning technique and a series of practical and well-established data structures such as the suffix array (for more details see [21]).
Longest common extension
First, we describe how to compute the longest common extension, denoted by lce, of two suffixes of a string in constant time (for more details see [22]). lce queries are an important part of the algorithms presented later on.
Let SA denote the array of positions of the sorted suffixes of string x of length n, i.e. for all 1≤r<n, we have x[SA[r−1]..n−1]<x[SA[r]..n−1]. The inverse iSA of the array SA is defined by iSA[SA[r]]=r, for all 0≤r<n. Let lcp(r,s) denote the length of the longest common prefix of the strings x[SA[r]..n−1] and x[SA[s]..n−1], for all 0≤r,s<n, and 0 otherwise. Let LCP denote the array defined by LCP[r]=lcp(r−1,r), for all 1≤r<n, and LCP[0]=0. We perform the following linear-time and linear-space preprocessing: compute the arrays SA, iSA, and LCP of x, and preprocess LCP for range minimum queries, denoted by RMQ_{LCP}. The lce of the two suffixes starting at positions p and q, with iSA[p]<iSA[q], is then obtained as LCE(x,p,q)=LCP[RMQ_{LCP}(iSA[p]+1,iSA[q])].
Example
We have LCE(x,1,2)=LCP[ RMQ_{LCP}(iSA[ 2]+1,iSA[ 1])]=LCP[ RMQ_{LCP}(6,8)]=1, implying that the lce of bbababba and bababba is 1.
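The lce machinery can be sketched in Python as follows. This is an illustration under stated assumptions rather than the article's implementation: the suffix array is built naively by sorting, the LCP array uses the well-known scheme of Kasai et al., and the range minimum structure is a sparse table (O(n log n) space, returning the minimum value rather than its index) instead of a linear-space RMQ. The class name `LCE` and the string `abbababba` are our choices; that string is consistent with the two suffixes and the ranks quoted in the example above.

```python
def suffix_array(x):
    # Naive construction by sorting suffixes; a stand-in for a linear-time one.
    return sorted(range(len(x)), key=lambda i: x[i:])


def lcp_array(x, sa, isa):
    # lcp[r] = longest common prefix of the suffixes of rank r-1 and r; lcp[0] = 0.
    n, h = len(x), 0
    lcp = [0] * n
    for i in range(n):
        r = isa[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and x[i + h] == x[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


class LCE:
    def __init__(self, x):
        self.n = len(x)
        self.sa = suffix_array(x)
        self.isa = [0] * self.n
        for r, p in enumerate(self.sa):
            self.isa[p] = r
        self.lcp = lcp_array(x, self.sa, self.isa)
        # Sparse table: table[j][i] = min(lcp[i .. i + 2^j - 1]).
        self.table = [self.lcp[:]]
        j = 1
        while (1 << j) <= self.n:
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append([min(prev[i], prev[i + half])
                               for i in range(self.n - (1 << j) + 1)])
            j += 1

    def query(self, p, q):
        """Longest common extension of the suffixes starting at p and q."""
        if p == q:
            return self.n - p
        lo, hi = sorted((self.isa[p], self.isa[q]))
        lo += 1  # minimum over lcp[lo..hi]
        j = (hi - lo + 1).bit_length() - 1
        return min(self.table[j][lo], self.table[j][hi - (1 << j) + 1])


if __name__ == "__main__":
    print(LCE("abbababba").query(1, 2))
```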
Algorithm ECSMF
1. Construct the string x^{′}=x[0..m−1]x[0..m−2] of length 2m−1. By Fact 1, any rotation of x is a factor of x^{′}.
2. Partition the pattern x^{′} in 4 fragments of length ⌊(2m−1)/4⌋ and ⌈(2m−1)/4⌉. By Lemma 1, at least one of the 4 fragments is a factor of any rotation of x.
3. Match the 4 fragments against the text t using an Aho-Corasick automaton [25]. Let $\mathcal{L}$ be a list of size Occ of tuples, where $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$ is a 3-tuple such that $0\le p_{x'}<2m-1$ is the position where the fragment occurs in x^{′}, ℓ is the length of the corresponding fragment, and 0≤p_{t}<n is the position where the fragment occurs in t.
4. Compute SA, iSA, LCP, and RMQ_{LCP} of T=x^{′}t. Compute SA, iSA, LCP, and RMQ_{LCP} of T_{r}=rev(tx^{′}), that is the reverse string of tx^{′}.
5. For each tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$, we try to extend to the right via computing
$\mathcal{E}_{r}\leftarrow\mathsf{LCE}(T,p_{x'}+\ell,2m-1+p_{t}+\ell);$
in other words, we compute the length $\mathcal{E}_{r}$ of the longest common prefix of $x'[p_{x'}+\ell\,..\,2m-1]$ and t[p_{t}+ℓ..n−1], both being suffixes of T. Similarly, we try to extend to the left via computing $\mathcal{E}_{l}$ using lce queries on the suffixes of T_{r}.
6. For each $\mathcal{E}_{l},\mathcal{E}_{r}$ computed for tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$, we report all the valid starting positions in t by first checking if the total length $\mathcal{E}_{l}+\ell+\mathcal{E}_{r}\ge m$; that is, the length of the full extension of the fragment is greater than or equal to m, matching at least one rotation of x. If that is the case, then we report positions
$\max\{p_{t}-\mathcal{E}_{l},\,p_{t}+\ell-m\},\dots,\min\{p_{t}+\ell-m+\mathcal{E}_{r},\,p_{t}\}.$
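The six steps above can be sketched compactly in Python. This is a simplified illustration, not the library code: it swaps the Aho-Corasick automaton for repeated `str.find` calls and the constant-time lce queries for direct letter comparisons, so it shows the filtering logic but not the stated running time. The function names and the sample text `t` are our assumptions; the pattern GGGTCTA is the one whose rotation x^{4}=CTAGGGT appears in the example.

```python
def fragments(length, parts):
    # Even floor/ceil split of 0..length-1 into (start, length) pairs (an assumption).
    bounds = [i * length // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1] - bounds[i]) for i in range(parts)]


def ecsmf_sketch(x, t):
    """Starting positions in t of exact occurrences of any rotation of x."""
    m = len(x)
    xp = x + x[:-1]                                   # step 1: x' of length 2m-1
    hits = set()
    for p_x, ell in fragments(len(xp), 4):            # step 2: 4 fragments
        if ell == 0:
            continue
        frag = xp[p_x:p_x + ell]
        p_t = t.find(frag)                            # step 3 (stand-in for Aho-Corasick)
        while p_t != -1:
            e_r = 0                                   # step 5: extend right by comparison
            while (p_x + ell + e_r < len(xp) and p_t + ell + e_r < len(t)
                   and xp[p_x + ell + e_r] == t[p_t + ell + e_r]):
                e_r += 1
            e_l = 0                                   # step 5: extend left by comparison
            while (p_x - e_l - 1 >= 0 and p_t - e_l - 1 >= 0
                   and xp[p_x - e_l - 1] == t[p_t - e_l - 1]):
                e_l += 1
            if e_l + ell + e_r >= m:                  # step 6: report valid starts
                lo = max(p_t - e_l, p_t + ell - m)
                hi = min(p_t + ell - m + e_r, p_t)
                hits.update(range(lo, hi + 1))
            p_t = t.find(frag, p_t + 1)
    return sorted(hits)


if __name__ == "__main__":
    t = "GATACGATACCTAGGGTGATAGAATAG"   # sample text chosen for illustration
    print(ecsmf_sketch("GGGTCTA", t))   # prints [10]
```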
Example 2.
that is, x^{4}=CTAGGGT occurs at starting position 10 in t.
Theorem 1. Given a pattern x of length m drawn from alphabet Σ, σ=|Σ|, and a text t of length n>m drawn from Σ, algorithm ECSMF requires average-case time $\mathcal{O}(n)$ to solve Problem 1.
Proof.
Approximate circular string matching with k-mismatches via filtering
In this section, based on the ideas presented in algorithm ECSMF, we present algorithms ACSMF and ACSMF-Simple, two new fast average-case algorithms for approximate circular string matching with k-mismatches via filtering.
Algorithm ACSMF
1. Construct the string x^{′}=x[0..m−1]x[0..m−2] of length 2m−1. By Fact 1, any rotation of x is a factor of x^{′}.
2. Partition the pattern x^{′} in 2k+4 fragments of length ⌊(2m−1)/(2k+4)⌋ and ⌈(2m−1)/(2k+4)⌉. By Lemma 3, at least k+1 of the 2k+4 fragments are factors of any rotation of x.
3. Match the 2k+4 fragments against the text t using an Aho-Corasick automaton [25]. Let $\mathcal{L}$ be a list of size Occ of tuples, where $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$ is a 3-tuple such that $0\le p_{x'}<2m-1$ is the position where the fragment occurs in x^{′}, ℓ is the length of the corresponding fragment, and 0≤p_{t}<n is the position where the fragment occurs in t.
4. Compute SA, iSA, LCP, and RMQ_{LCP} of T=x^{′}t. Compute SA, iSA, LCP, and RMQ_{LCP} of T_{r}=rev(tx^{′}), that is the reverse string of tx^{′}.
5. For each tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$, we try to extend k+1 times to the right via computing
$\mathcal{E}_{r}^{0}\leftarrow\mathsf{LCE}(T,p_{x'}+\ell,2m-1+p_{t}+\ell)+1$
$\mathcal{E}_{r}^{1}\leftarrow\mathsf{LCE}(T,p_{x'}+\ell+\mathcal{E}_{r}^{0},2m-1+p_{t}+\ell+\mathcal{E}_{r}^{0})+1$
$\dots$
$\mathcal{E}_{r}^{k-1}\leftarrow\mathsf{LCE}(T,p_{x'}+\ell+\mathcal{E}_{r}^{k-2},2m-1+p_{t}+\ell+\mathcal{E}_{r}^{k-2})+1$
$\mathcal{E}_{r}^{k}\leftarrow\mathsf{LCE}(T,p_{x'}+\ell+\mathcal{E}_{r}^{k-1},2m-1+p_{t}+\ell+\mathcal{E}_{r}^{k-1});$
in other words, we compute the length $\mathcal{E}_{r}^{k}$ of the longest common prefix of $x'[p_{x'}+\ell\,..\,2m-1]$ and t[p_{t}+ℓ..n−1], both being suffixes of T, with k mismatches. Similarly, we try to extend to the left k+1 times via computing $\mathcal{E}_{l}^{k}$ using lce queries on the suffixes of T_{r}.
6. For each tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$ we try to extend, we also maintain an array M of size 2m−1, initialised with zeros, where we mark the position of the i-th left and right mismatch, 1≤i≤k, by setting
$\mathsf{M}[p_{x'}-\mathcal{E}_{l}^{i-1}-1]\leftarrow 1\quad\text{and}\quad\mathsf{M}[p_{x'}+\ell+\mathcal{E}_{r}^{i-1}]\leftarrow 1.$
7. For each $\mathcal{E}_{l}^{k},\mathcal{E}_{r}^{k},\mathsf{M}$ computed for tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$, we report all the valid starting positions in t by first checking if the total length $\mathcal{E}_{l}^{k}+\ell+\mathcal{E}_{r}^{k}\ge m$; that is, the length of the full extension of the fragment is greater than or equal to m. If that is the case, then we count the total number of mismatches of the occurrences at starting positions
$\max\{p_{t}-\mathcal{E}_{l}^{k},\,p_{t}+\ell-m\},\dots,\min\{p_{t}+\ell-m+\mathcal{E}_{r}^{k},\,p_{t}\},$
by first summing up the mismatches for the leftmost starting position
$\mu_{j}\leftarrow\mathsf{M}[p_{x'}-\mathcal{E}_{l}^{k}]+\dots+\mathsf{M}[p_{x'}-\mathcal{E}_{l}^{k}+m-1],$
where $j=\max\{p_{t}-\mathcal{E}_{l}^{k},\,p_{t}+\ell-m\}$.
For each subsequent position j+1, we subtract the value of the leftmost element of M computed for μ_{j} and add the value of the next element to compute μ_{j+1}. In case μ_{j}≤k, we report position j.
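Two ingredients of the steps above lend themselves to a short illustration: the extension with at most k mismatches by repeated lce queries, and the sliding count over M. The sketch below is ours and makes two simplifying assumptions: `lce` compares letters directly (ACSMF would answer it in constant time from the precomputed index), and the extension length is accumulated across jumps, which is one reading of the recurrences in step 5.

```python
def lce(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (direct comparison)."""
    h = 0
    while i + h < len(a) and j + h < len(b) and a[i + h] == b[j + h]:
        h += 1
    return h


def extend_right(xp, t, i, j, k):
    """Align xp[i:] with t[j:], stopping right before the (k+1)-th mismatch.

    Returns (length, mismatch_offsets); at most k+1 lce queries are issued.
    """
    length, mism = 0, []
    while True:
        length += lce(xp, i + length, t, j + length)
        at_end = i + length >= len(xp) or j + length >= len(t)
        if at_end or len(mism) == k:
            return length, mism
        mism.append(length)   # offset of the mismatch just reached
        length += 1           # jump over it


def starts_within_k(M, lo, hi, m, k):
    """Window starts i in lo..hi whose window M[i..i+m-1] holds at most k ones."""
    out = []
    mu = sum(M[lo:lo + m])              # mismatches of the leftmost window
    for i in range(lo, hi + 1):
        if mu <= k:
            out.append(i)
        if i < hi:
            mu += M[i + m] - M[i]       # slide the window by one position
    return out


if __name__ == "__main__":
    print(extend_right("abcdef", "abXdeY", 0, 0, 1))
    print(starts_within_k([0, 1, 0, 0, 1, 0, 0], 0, 3, 4, 1))
```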
Example
that is, x ^{4} =CTAGGGT and x ^{5} =TAGGGTC occur at starting position 10 in t with no mismatch and at starting position 11 in t with 1 mismatch, respectively.
Theorem 2. Given a pattern x of length m drawn from alphabet Σ, σ=|Σ|, a text t of length n>m drawn from Σ, and an integer threshold k<m, algorithm ACSMF requires average-case time $\mathcal{O}\left(\left(1+\frac{km}{\sigma^{\frac{2m-1}{2k+4}}}\right)n\right)$ and space $\mathcal{O}(n)$ to solve Problem 2.
Proof.
Constructing and partitioning the string x^{′} from x can trivially be done in time $\mathcal{O}(m)$ (Steps 1-2). Building the Aho-Corasick automaton of the 2k+4 fragments requires time $\mathcal{O}(m)$; and the search time is $\mathcal{O}(n+\mathit{Occ})$ (Step 3) [25]. The preprocessing step for the lce queries on the suffixes of T and T_{r} can be done in time and space $\mathcal{O}(n)$ (Step 4); see Section 3. Computing $\mathcal{E}_{l}^{k}$ and $\mathcal{E}_{r}^{k}$ for each occurrence of a fragment requires time $\mathcal{O}(k\,\mathit{Occ})$ (Step 5); see Section 3. Maintaining array M is of no extra cost (Step 6). For each extended occurrence of a fragment, we report $\mathcal{O}(m)$ valid starting positions, thus $\mathcal{O}(m\,\mathit{Occ})$ in total (Step 7). Since the expected number Occ of occurrences of the 2k+4 fragments is $(2k+4)n/\sigma^{(2m-1)/(2k+4)}=\mathcal{O}\left(\frac{kn}{\sigma^{\frac{2m-1}{2k+4}}}\right)$, algorithm ACSMF requires average-case time $\mathcal{O}\left(\left(1+\frac{km}{\sigma^{\frac{2m-1}{2k+4}}}\right)n\right)$ and space $\mathcal{O}(n)$. ■
Corollary 1. Given a pattern x of length m drawn from alphabet Σ, σ=|Σ|, a text t of length n>m drawn from Σ, and an integer threshold $k=\mathcal{O}(m/\log_{\sigma} m)$, algorithm ACSMF requires average-case time $\mathcal{O}(n)$.
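To see how a threshold of this order makes the second term of Theorem 2 vanish, the following worked bound gives one sufficient condition. The explicit constant is our own choice for illustration and is not taken from the article; it assumes m>1 and k<m.

$$k\le\frac{2m-1}{4\log_{\sigma} m}-2\;\Longrightarrow\;\frac{2m-1}{2k+4}\ge 2\log_{\sigma} m\;\Longrightarrow\;\sigma^{\frac{2m-1}{2k+4}}\ge m^{2}>km\;\Longrightarrow\;\frac{km}{\sigma^{\frac{2m-1}{2k+4}}}<1.$$

Under this condition the bound of Theorem 2 becomes $\mathcal{O}((1+1)n)=\mathcal{O}(n)$.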
Algorithm ACSMF-Simple
Algorithm ACSMF-Simple is very similar to Algorithm ACSMF. The only differences are: (i) Algorithm ACSMF-Simple does not perform Step 4 of Algorithm ACSMF; and (ii) for each tuple $\langle p_{x'},\ell,p_{t}\rangle\in\mathcal{L}$, Step 5 of Algorithm ACSMF is performed without the use of the precomputed indexes. In other words, we compute $\mathcal{E}_{r}^{k}$ and $\mathcal{E}_{l}^{k}$ by simply performing letter comparisons and counting the number of mismatches that occur. The extension stops right before the (k+1)-th mismatch.
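Put together, the variant described above might look as follows in Python. This is a hedged sketch and not the library's C code: fragment search uses `str.find` in place of the Aho-Corasick automaton, the even placement of fragment boundaries is our assumption, the set `bad` plays the role of array M, and all function names are hypothetical. It is meant for inputs with 2m−1 ≥ 2k+4, so that no fragment is empty.

```python
def fragments(length, parts):
    # Even floor/ceil split of 0..length-1 into (start, length) pairs (an assumption).
    bounds = [i * length // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1] - bounds[i]) for i in range(parts)]


def extend(xp, t, i, j, step, k):
    """Compare xp[i], t[j] letter by letter in direction `step` (+1 or -1).

    Stops right before the (k+1)-th mismatch or at a string boundary and
    returns (length covered, positions in xp of the mismatches seen).
    """
    length, mism = 0, []
    while 0 <= i < len(xp) and 0 <= j < len(t):
        if xp[i] != t[j]:
            if len(mism) == k:
                break
            mism.append(i)
        length += 1
        i += step
        j += step
    return length, mism


def acsmf_simple_sketch(x, t, k):
    """Starts in t of factors within Hamming distance k of some rotation of x."""
    m = len(x)
    xp = x + x[:-1]
    hits = set()
    for p_x, ell in fragments(len(xp), 2 * k + 4):
        if ell == 0:
            continue
        frag = xp[p_x:p_x + ell]
        p_t = t.find(frag)
        while p_t != -1:
            e_r, right = extend(xp, t, p_x + ell, p_t + ell, +1, k)
            e_l, left = extend(xp, t, p_x - 1, p_t - 1, -1, k)
            if e_l + ell + e_r >= m:
                bad = set(left) | set(right)          # marked positions of M
                lo = max(p_x - e_l, p_x + ell - m)    # window starts, in x' coordinates
                hi = min(p_x + ell - m + e_r, p_x)
                mu = sum(1 for q in bad if lo <= q < lo + m)
                for i in range(lo, hi + 1):
                    if mu <= k:
                        hits.add(i + p_t - p_x)       # translate to a position in t
                    mu += (i + m in bad) - (i in bad) # slide the window by one
            p_t = t.find(frag, p_t + 1)
    return sorted(hits)


if __name__ == "__main__":
    t = "GATACGATACCTAGGGTGATAGAATAG"   # sample text chosen for illustration
    print(acsmf_simple_sketch("GGGTCTA", t, 1))
```

With k=0 the sketch reduces to the exact case with 4 fragments.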
Fact 2. The expected number of letter comparisons required for each extension in algorithm ACSMF-Simple is less than 3.
which as n→∞ approaches r/(1−r)^{2}<2 for all r. Thus S, the expected number of matching positions, is less than 2, and hence the expected number of letter comparisons required for each extension in algorithm ACSMF-Simple is less than 3. ■
Theorem 3. Given a pattern x of length m drawn from alphabet Σ, σ=|Σ|, a text t of length n>m drawn from Σ, and an integer threshold k<m, algorithm ACSMF-Simple requires average-case time $\mathcal{O}\left(\left(1+\frac{km}{\sigma^{\frac{2m-1}{2k+4}}}\right)n\right)$ and space $\mathcal{O}(m)$ to solve Problem 2.
Proof.
By Fact 2, computing $\mathcal{E}_{l}^{k}$ and $\mathcal{E}_{r}^{k}$ for each occurrence of a fragment requires time $\mathcal{O}(k\,\mathit{Occ})$. Therefore algorithm ACSMF-Simple requires average-case time $\mathcal{O}\left(\left(1+\frac{km}{\sigma^{\frac{2m-1}{2k+4}}}\right)n\right)$. The required space is reduced to $\mathcal{O}(m)$ since Step 4 of Algorithm ACSMF is not performed. ■
Corollary 2. Given a pattern x of length m drawn from alphabet Σ, σ=|Σ|, a text t of length n>m drawn from Σ, and an integer threshold $k=\mathcal{O}(m/\log_{\sigma} m)$, algorithm ACSMF-Simple requires average-case time $\mathcal{O}(n)$.
In practical cases, algorithm ACSMF-Simple should be preferred over algorithm ACSMF as (i) it has lower memory requirements (see Theorem 3); and (ii) it avoids the construction of a series of data structures (see Section 3 in this regard).
Edit distance model
Algorithm ACSMF-Simple could be easily extended for approximate circular string matching under the edit distance model (for a definition, see [10]). Since each single-letter edit operation can change at most one of the 2k+4 fragments of x^{′}, any set of at most k edit operations leaves at least one of the fragments untouched. In other words, Lemma 2 holds under the edit distance model as well [27]. An area of length $\mathcal{O}(m)$ surrounding each potential occurrence found in the filtration phase (Steps 1-3 of algorithm ACSMF) is then searched using the standard dynamic-programming algorithm in time $\mathcal{O}(m^{2})$ [28] and space $\mathcal{O}(m)$ [29]. Since the expected number Occ of occurrences of the 2k+4 fragments is $\mathcal{O}\left(\frac{kn}{\sigma^{\frac{2m-1}{2k+4}}}\right)$, the average-case time complexity becomes $\mathcal{O}\left(\left(1+\frac{km^{2}}{\sigma^{\frac{2m-1}{2k+4}}}\right)n\right)$ and the space complexity remains $\mathcal{O}(m)$. When $k=\mathcal{O}(m/\log_{\sigma} m)$, the average-case time complexity is $\mathcal{O}(n)$.
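The verification step under the edit distance model can be illustrated with the classical column-by-column dynamic programme for approximate matching, which keeps a single column of m+1 values. The sketch below handles one linear pattern p against a region of the text; how the article's implementation accounts for the rotations inside the region is not described here, so that part is left out. The function name `approx_ends` is hypothetical.

```python
def approx_ends(p, region, k):
    """End positions e in `region` such that some factor of `region` ending
    at e is within edit distance k of p.  Uses O(len(p)) space."""
    m = len(p)
    col = list(range(m + 1))        # column for the empty text prefix: col[i] = i
    ends = []
    for e, c in enumerate(region):
        diag = col[0]               # value diagonally up-left of the cell being filled
        col[0] = 0                  # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (p[i - 1] != c),  # match or substitution
                         old + 1,                 # extra letter in the text
                         col[i - 1] + 1)          # letter of p left unmatched
            diag = old
        if col[m] <= k:
            ends.append(e)
    return ends


if __name__ == "__main__":
    print(approx_ends("abc", "xxabcxx", 0))
```

For a region of length $\mathcal{O}(m)$ this takes time $\mathcal{O}(m^{2})$, in line with the bound quoted above.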
Experimental results
We implemented algorithms ACSMF and ACSMF-Simple as library functions to perform approximate circular string matching with k-mismatches. The functions were implemented in the C programming language and developed under the GNU/Linux operating system. They take as input arguments the pattern x of length m, the text t of length n, and the integer threshold k<m; and then return the list of starting positions of the occurrences of the rotations of x in t with k-mismatches as output. The library implementation is distributed under the GNU General Public License (GPL), and it is available at http://www.inf.kcl.ac.uk/research/projects/asmf/, which is set up for maintaining the source code and the man-page documentation. The experiments were conducted on a desktop PC using one core of an Intel i7 2600 CPU at 3.4 GHz under GNU/Linux.
Approximate circular string matching is a rather undeveloped area. To the best of our knowledge, there does not exist an optimal (average- or worst-case) algorithm for approximate circular string matching with k-mismatches. Therefore, keeping in mind that we wish to evaluate the efficiency of our algorithms in practical terms, we compared their performance to the respective performance of the C implementation^{a} of the optimal average-case algorithm for multiple approximate string matching, presented in [17], for matching the r=m rotations of x. We denote this algorithm by FredNava.
Elapsed-time and speed-up comparisons of FredNava, ACSMF, and ACSMF-Simple for n=1MB
m  k  FredNava time (s)  ACSMF time (s)  ACSMF-Simple time (s)  Speed-up of ACSMF-Simple over FredNava  Speed-up of ACSMF-Simple over ACSMF 
100  5  1.63  0.40  0.06  27  7 
200  5  6.77  0.40  0.05  135  8 
300  5  16.84  0.41  0.05  337  8 
400  5  31.99  0.41  0.05  640  8 
500  5  53.26  0.41  0.05  1065  8 
600  5  81.35  0.41  0.05  1627  8 
700  5  116.24  0.41  0.05  2325  8 
800  5  158.73  0.41  0.06  2645  7 
900  5  206.43  0.42  0.06  3440  7 
1000  5  264.84  0.41  0.06  4414  7 
100  10  1.65  0.43  0.05  33  9 
200  10  6.94  0.40  0.05  139  8 
300  10  16.55  0.41  0.05  331  8 
400  10  31.70  0.40  0.05  634  8 
500  10  53.11  0.41  0.05  1062  8 
600  10  81.04  0.40  0.05  1620  8 
700  10  116.25  0.41  0.06  1937  7 
800  10  158.1  0.41  0.06  2635  7 
900  10  207.33  0.41  0.05  4146  8 
1000  10  264.11  0.41  0.05  5282  8 
100  15  1.65  0.42  0.06  28  7 
200  15  6.91  0.41  0.06  115  7 
300  15  16.45  0.41  0.06  274  7 
400  15  31.48  0.41  0.05  630  8 
500  15  52.55  0.41  0.05  1051  8 
600  15  80.46  0.41  0.05  1069  8 
700  15  115.86  0.41  0.06  1931  7 
800  15  157.81  0.41  0.06  2630  7 
900  15  206.56  0.42  0.06  3443  7 
1000  15  262.16  0.42  0.06  4369  7 
Elapsed-time and speed-up comparisons of ACSMF and ACSMF-Simple for n=10MB
m  k  ACSMF time (s)  ACSMF-Simple time (s)  Speed-up of ACSMF-Simple over ACSMF 
10000  100  6.54  0.67  10 
11000  100  6.69  0.70  10 
12000  100  6.57  0.72  9 
13000  100  6.64  0.74  9 
14000  100  6.58  0.75  9 
10000  300  6.54  0.69  9 
11000  300  6.67  0.69  10 
12000  300  6.64  0.68  10 
13000  300  6.71  0.71  9 
14000  300  6.63  0.72  9 
10000  500  6.74  0.66  10 
11000  500  6.58  0.67  10 
12000  500  6.69  0.66  10 
13000  500  6.66  0.67  10 
14000  500  6.71  0.68  10 
Elapsed-time and speed-up comparisons of ACSMF and ACSMF-Simple for n=50MB
m  k  ACSMF time (s)  ACSMF-Simple time (s)  Speed-up of ACSMF-Simple over ACSMF 
50000  500  45.71  4.33  11 
51000  500  45.81  4.35  11 
52000  500  45.73  4.37  10 
53000  500  44.99  4.40  10 
54000  500  45.05  4.40  10 
50000  700  45.00  4.26  11 
51000  700  44.79  4.18  11 
52000  700  44.96  4.36  10 
53000  700  44.83  4.32  10 
54000  700  45.00  4.32  10 
50000  900  46.79  4.32  11 
51000  900  44.89  4.28  10 
52000  900  45.06  4.33  10 
53000  900  45.14  4.35  10 
54000  900  44.81  4.12  11 
Conclusions
In this article, we presented new average-case algorithms for exact and approximate circular string matching. Algorithm ECSMF for exact circular string matching requires average-case time $\mathcal{O}(n)$; and Algorithms ACSMF and ACSMF-Simple for approximate circular string matching with k-mismatches require time $\mathcal{O}(n)$ for moderate values of k, that is $k=\mathcal{O}(m/\log_{\sigma} m)$. We showed how the same results can be easily obtained under the edit distance model. The presented algorithms were also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach.
For future work, we will explore the possibility of optimising our algorithms and the corresponding library implementation for the approximate case by using lossless filters for eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate occurrence, such as [31] for the Hamming distance model or [32] for the edit distance model. In addition, we will try to improve our algorithms for the approximate case in order to achieve average-case optimality.
Endnote
^{a} Personal communication with author.
Declarations
Acknowledgements
The publication costs for this article were funded by the Open Access funding scheme of King’s College London. CB is supported by an EPSRC grant (Doctoral Training Grant #EP/J500252/1). The authors would like to warmly thank the “Reviewer #1” and the “Reviewer #2” whose meticulous comments were beyond the call of duty.
References
1. Weil R, Vinograd J: The cyclic helix and cyclic coil forms of polyoma viral DNA. Proc Natl Acad Sci. 1963, 50 (4): 730-738.
2. Dulbecco R, Vogt M: Evidence for a ring structure of polyoma virus DNA. Proc Natl Acad Sci. 1963, 50 (2): 236-243.
3. Thanbichler M, Wang SC, Shapiro L: The bacterial nucleoid: A highly organized and dynamic structure. J Cell Biochem. 2005, 96 (3): 506-521. http://dx.doi.org/10.1002/jcb.20519
4. Lipps G: Plasmids: Current Research and Future Trends. 2008, Norfolk, UK: Caister Academic Press.
5. Allers T, Mevarech M: Archaeal genetics — the third way. Nat Rev Genet. 2005, 6: 58-73.
6. Gusfield D: Algorithms on Strings, Trees and Sequences. 1997, New York, NY, USA: Cambridge University Press.
7. Mosig A, Hofacker IL, Stadler PF, Zell A: Comparative analysis of cyclic sequences: viroids and other small circular RNAs. German Conference on Bioinformatics, Volume 83 of LNI. Edited by: Huson DH, Kohlbacher O, Lupas AN, Nieselt K. 2006, 93-102. GI.
8. Fernandes F, Pereira L, Freitas A: CSA: An efficient algorithm to improve circular DNA multiple alignment. BMC Bioinformatics. 2009, 10: 1-13.
9. Lee T, Na JC, Park H, Park K, Sim JS: Finding optimal alignment and consensus of circular strings. Proceedings of the 21st annual Conference on Combinatorial Pattern Matching. 2010, 310-322. CPM’10, Berlin, Heidelberg: Springer-Verlag.
10. Crochemore M, Hancart C, Lecroq T: Algorithms on Strings. 2007, New York, NY, USA: Cambridge University Press.
11. Applied Combinatorics on Words. Edited by: Lothaire M. 2005, New York, NY, USA: Cambridge University Press.
12. Fredriksson K, Grabowski S: Average-optimal string matching. J Discrete Algorithms. 2009, 7 (4): 579-594. 10.1016/j.jda.2008.09.001.
13. Chen KH, Huang GS, Lee RCT: Bit-parallel algorithms for exact circular string matching. Comput J. 2013, doi:10.1093/comjnl/bxt023.
14. Iliopoulos CS, Rahman MS: Indexing circular patterns. Proceedings of the 2nd International Conference on Algorithms and Computation. 2008, 46-57. WALCOM’08, Berlin, Heidelberg: Springer-Verlag.
15. Lin J, Adjeroh D: All-against-all circular pattern matching. Comput J. 2012, 55 (7): 897-906. 10.1093/comjnl/bxr126.
16. Chang WI, Marr TG: Approximate string matching and local similarity. Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching. 1994, 259-273. CPM ’94, London, UK: Springer-Verlag.
17. Fredriksson K, Navarro G: Average-optimal single and multiple approximate string matching. J Exp Algorithmics. 2004, 9. http://dl.acm.org/citation.cfm?id=1041513
18. Wu S, Manber U: Fast text searching: allowing errors. Commun ACM. 1992, 35 (10): 83-91. 10.1145/135239.135244.
19. Rivest R: Partial-match retrieval algorithms. SIAM J Comput. 1976, 5: 19-50. 10.1137/0205003.
20. Frousios K, Iliopoulos CS, Mouchard L, Pissis SP, Tischler G: REAL: an efficient REad ALigner for next generation sequencing reads. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. 2010, 154-159. BCB ’10, USA: ACM.
21. Nong G, Zhang S, Chan WH: Linear suffix array construction by almost pure induced-sorting. Proceedings of the 2009 Data Compression Conference. 2009, 193-202. DCC ’09, Washington, DC, USA: IEEE Computer Society.
22. Ilie L, Navarro G, Tinta L: The longest common extension problem revisited and applications to approximate string searching. J Discrete Algorithms. 2010, 8 (4): 418-428. 10.1016/j.jda.2010.08.004.
23. Fischer J: Inducing the LCP-Array. Algorithms and Data Structures, Volume 6844 of Lecture Notes in Computer Science. Edited by: Dehne F, Iacono J, Sack JR. 2011, 374-385. Berlin Heidelberg: Springer.
24. Fischer J, Heun V: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J Comput. 2011, 40 (2): 465-492. 10.1137/090779759.
25. Dori S, Landau GM: Construction of Aho Corasick automaton in linear time for integer alphabets. Inf Process Lett. 2006, 98 (2): 66-72. 10.1016/j.ipl.2005.11.019.
26. Hall HS, Knight SR: Higher Algebra. 1950, London, UK: MacMillan.
27. Baeza-Yates RA, Perleberg CH: Fast and practical approximate string matching. Inform Process Lett. 1996, 59: 21-27. 10.1016/0020-0190(96)00083-X. http://www.sciencedirect.com/science/article/pii/002001909600083X
28. Wagner RA, Fischer MJ: The string-to-string correction problem. J ACM. 1974, 21: 168-173. 10.1145/321796.321811.
29. Hirschberg DS: A linear space algorithm for computing maximal common subsequences. Commun ACM. 1975, 18 (6): 341-343. 10.1145/360825.360861.
30. Pizza & Chili. 2013, http://pizzachili.dcc.uchile.cl/
31. Peterlongo P, Pisanti N, Boyer F, do Lago AP, Sagot MF: Lossless filter for multiple repetitions with Hamming distance. J Discrete Algorithms. 2008, 6 (3): 497-509. 10.1016/j.jda.2007.03.003.
32. Peterlongo P, Sacomoto GAT, do Lago AP, Pisanti N, Sagot MF: Lossless filter for multiple repeats with bounded edit distance. Algorithms Molecular Biol. 2009, 4. http://www.almob.org/content/pdf/1748718843.pdf
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.