Best hits of 11110110111: modelfree selection and parameterfree sensitivity calculation of spaced seeds
 Laurent Noé^{1}Email authorView ORCID ID profile
DOI: 10.1186/s1301501700921
© The Author(s) 2017
Received: 19 September 2016
Accepted: 30 January 2017
Published: 14 February 2017
Abstract
Background
Spaced seeds, also named gapped qgrams, gapped kmers, spaced qgrams, have been proven to be more sensitive than contiguous seeds (contiguous qgrams, contiguous kmers) in nucleic and aminoacid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignmentfree related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be timeconsuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameterfree calculation of the sensitivity.
Results
We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameterfree study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semiring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univlille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
Keywords
Spaced seeds Dominant seeds Bernoulli Hit Integration Heaviside Dirac Counting semiring Polynomial form DFABackground
Optimized spaced seeds, or best gapped qgrams, have independently been proposed in PatternHunter [3] and by Burkhardt and Karkkainen [4]. The primary objective was either to improve the sensitivity of the heuristic but efficient hit and extend BLASTlike strategy (without using the neighborhood word principle ^{1}), or to increase the selectivity for lossless filters on alignments of size \(\ell\) under a given Hamming distance of k.
Several extensions of the spaced seed model have then been proposed on the two aforementioned problems: vector seeds [5], one gapped qgrams [6] or indel seeds [7, 8], neighbor seeds [9, 10], transition seeds [11–15], multiple seeds [16–19], adaptive seeds [20] and related work on the associated indexes [21–26], just to mention a few.
Unfortunately, spaced seeds are known to produce hard problems, both on the seed sensitivity computation [27] or the lossless computation [28], and moreover on the seed design [29]. But the choice of the right seed pattern has a significant impact on genomic sequence comparison [3, 12, 16, 20, 30–38], on oligonucleotide design [39–44], as well as on amino acid sequence comparison [45–53]; this has led to several effective methods to (possibly greedily) select spaced seeds [54–61] with elaborated alignment models and their associated algorithms [62–70].
Another less frequently mentioned problem is that the seed design is mostly performed on a fixed and already fully parameterized alignment model (for example, a Bernoulli model where the probability of a match p is set to 0.7). There is not so much choice for the optimal seed, when, for example, the scoring system is changed, and thus the expected distribution of alignments.
We note that several recent works mention the use of spaced seeds in alignmentfree methods [71–73] with applications in phylogenetic distance estimation [74], metagenomic classification [75, 76], just to cite a few.
Finally, we also noticed that several recent studies use the overlap complexity [54, 56, 57, 77–79] which is closely linked to the variance of the number of spacedword matches [80] and is known to provide an upper/lower bound for the expectation of the length preceding the first seed hit [27, 66, 81]. We mention here that a similar parameterfree approach could also be applied for the variance induced selection of seeds, but an interesting question remains in that case: to find a dominance equivalent criterion associated with the selection of candidate seeds.
The paper is organized as follows. We start with an introduction to the spaced seed model and its associated sensitivity or lossless aspect, and show how semirings on DFA can help determining such features. Section “Semirings and number of alignments” restricts the description to counting semirings that are applied on a specific DFA to perform an efficient dynamic programming algorithm on a set of counters. This is a prerequisite for the two next sections that present respectively continuous models and discrete models. Section “Continuous models” is divided into two parts : the first one outlines the polynomial form of the sensitivity proposed by [1] to compute the sensitivity on the Bernoulli model together with the associated dominance principle, whereas the second one extends this polynomial form to the Hit Integration model of [2], and explains why the dominance principle remains valid. Section “Discrete models” describes two new Dirac and Heaviside models, and shows how lossless seeds can be integrated into them. Then, we report our experimental analysis on all the aforementioned models, display and explain several optimal seed Pareto plots for the restricted case of one single seed, and links to a wide range of compiled results for multiple seeds. The last section brings the discussion to the asymptotic problem, and to several finite extensions.
Spaced seeds and seed sensitivity
We suppose here that strings are indexed starting from position number 1. For a given string u, we will use the following notation: u[i] gives the ith symbol of u, u is the length of u, and \(u_a\) is the number of symbol letters a that u contains.
Nucleotide sequence alignments without indels can be represented as a succession of match or mismatch symbols, and thus represented as a string x over a binary alphabet \(\{\texttt {0},\texttt {1}\}\).
A spaced seed can be represented as a string \(\pi\) over a binary alphabet \(\{\text {0},\text {1}\}\) but with a different meaning for each of the two symbols: \(\text {1}\) indicates a position on the seed \(\pi\) where a single match must occur in the alignment x (it is thus called a must match symbol), whereas \(\text {0}\) indicates a position where a single match or a single mismatch is allowed (it is thus called a don’tcare symbol).
The weight of a seed \(\pi\) (denoted by w or \(w_\pi\)) is defined as the number of must match symbols (\(w_\pi = \pi _1\)): the weight is frequently set constant or with a minimal value, because it is related to the selectivity of the seed. The span or length of a seed \(\pi\) (denoted by \(s_\pi\)) is its full length (\(s_\pi = \pi \)). We will also frequently use \(\ell\) for the length of the alignment (\(\ell =x\)).
Naturally, the shape of the seed, i.e. possible placement of a set of don’tcare symbols between any consecutive pair of the w must match symbols, plays a significant role and must be carefully controlled. Requiring at least one hit for a seed, on an alignment x, is the most common (but not unique) way to select a good seed.
 a.
When considering that any alignment x is of given length \(\ell\), and each symbol is generated by a Bernoulli model (so there is no restriction on the number of match or mismatch symbols an alignment must contain, but with some configurations more probable than others), the problem is to select a good seed (respectively the best seed) as the one that has a high probability (respectively the best probability) to hit at least once.
 b.
When considering that any alignment x is of given length \(\ell\), and contains at most k mismatch symbols, a classical requirement for a good seed is to guarantee that all the possible alignments, obtained by any placements of k mismatch symbols on the \(\ell\) alignment symbols, will all be detected by at least one seed hit each: when this distinctive feature occurs, the seed is considered lossless or \((\ell ,k)\) lossless.
The two problems can be solved by first considering the language recognized by the seed \(\pi\), in this context the at least one hit regular language, and its associated DFA. As an illustration, Fig. 1 displays the at least one hit DFA for the spaced seed \(\text {1101}\): this automaton recognizes the associated regular language \(\{\texttt {0},\texttt {1}\}^{*} ( \texttt {1101}  \texttt {1111}) \{\texttt {0},\texttt {1}\}^{*}\), or less formally, any binary alignment sequence x that has at least one occurrence of \(\texttt {1101}\) or \(\texttt {1111}\) as a factor.
 a.
Either, the probability to reach any of the automaton states.
 b.
Otherwise, the minimal number of mismatch symbols 0 that have been crossed to reach any state.
Another example, considering now the lossless property (b) for the spaced seed \(\pi = \text {1101}\): we can show that this seed is lossless for one single mismatch, when \(\ell \ge 6\) (but computational details are left to the reader, after a remark on tropical semirings in the next paragraph): the seed is thus \((\ell =6,k=1)\)lossless ; however, this seed is not \((\ell =5,k=1)\)lossless, since reading the consistent sequence \(\texttt {10111}\) leads to a nonfinal state.
 a.
Either probability semirings: \((E = \mathbb {R}_{0 \le r \le 1},\; \oplus = +,\; \otimes = \;\times \;,\; 0_{\oplus ,\epsilon _\otimes } = 0,\; 1_{\otimes } = 1)\) ; the final state(s) of the DFA give(s) the probability of having at least one hit after \(\ell\) steps of the DP algorithm,
 b.
Otherwise tropical semirings: \((E = \mathbb {R}_{\ge 0},\; \oplus = min,\; \otimes = +\;\; 0_{\oplus ,\epsilon _\otimes } = \infty ,\; 1_{\otimes } = 0)\). The seed is \((\ell ,k)\) lossless iff all the nonfinal states of the DFA have a minimal number of mismatches that is strictly greater than k, after \(\ell\) steps of the DP algorithm.^{2}
Semirings and number of alignments
Counting semirings [84] are adapted for this task: when applied on the right language and its right automaton, they can report the number of alignments \(c_{\pi ,m}\) that are at the same time detected by the seed \(\pi\) while having m matches out of \(\ell\) alignment symbols. The main idea that enables the computation of these \(c_{\pi ,m}\) counting coefficients (illustrated on Fig. 2 as the intersection product) is first to intersect the language recognized by the seed \(\pi\) (the at least one hit language of \(\pi\)) with the classes of alignments that have exactly m matches: the automaton associated with all of these classes of alignments with m matches has a very simple linear form with \(\ell +1\) states, where several distinct final states are defined according to all the possible values of \(m \in [0\ldots\,\ell ]\). Finally, since the intersection of two regular languages is regular [Theorem 4.8 of the timeless 85], it can thus be represented by a conventional DFA, while keeping the feature of having several distinct final states.
As an illustration, Fig. 2 displays the at least one hit DFA for the spaced seed \(\text {101}\) (on the top), the linear \(\text {1}\)counting DFA (on the vertical left part) to isolate alignments with exactly m matches, and finally their intersection product, that represent the intersecting language as a new DFA (itself obtained by crossing synchronously the two previous DFAs). Note that each of the final states \(p_m \times q_5\) (for \(m < \ell\)) of the resulting DFA is reached by alignment sequences with exactly m matches that are also detected by the seed \(\text {101}\) (unless for the last state \(p_l \times q_5\), where \(\ge \ell\) matches may have been detected).
Then, starting with the empty word (counted once from the initial state \(p_0 \times q_1\)), it is then possible to count the number of words of size one (two words 0 and 1 on a binary alphabet) by following transitions from the initial state to \(p_0 \times q_1\) and \(p_1 \times q_2\), respectively; from the (two) states already reached, it is then possible to count words of size two (four words on a binary alphabet), and so on, while keeping, for each DFA state \(p_m \times q_j\) and on each step i, a single count record, which represents the size of the subset of the partition of the \(2^i\) words that reach \(p_m \times q_j\).
Note that, for a seed \(\pi\) of weight \(w_\pi\) and span \(s_\pi\) (thus with \(s_\pi w_\pi\) don’tcare symbols), the at least one hit automaton size is in \(\mathcal {O}(w_\pi 2^{s_\pi w_\pi })\), so the intersection with the classes of alignments that have m matches out of \(\ell\) leads to a full size in \(\mathcal {O}(\ell w_\pi 2^{s_\pi w_\pi })\): the computational complexity of the algorithm can thus be estimated in \(\mathcal {O}(\ell ^2 w_\pi 2^{s_\pi w_\pi })\) in time and \(\mathcal {O}(\ell w_\pi 2^{s_\pi w_\pi })\) in space. As shown by [1], it can be processed incrementally for all the alignment lengths up to \(\ell\), with the only restriction that the numbers of alignments per state (\(\le 2^\ell\)) fit inside an integer word (64 or 128bits).
We first mention that a breadthfirst construction of the intersection product can be used to limit the depth of the reached states to \(\ell\). We have already noticed that several authors have performed equivalent tasks with a matrix for the full automaton [86], or with a vector for each automaton state [1], probably because contiguous memory performance is better. An advantage of such lazy automaton product evaluation may be that, besides the fact that it is a generic automaton product, we avoid sparse datastructures combined with many nonreachable states (for example, \(p_{\ell 1} \times q_1\) and \(p_{\ell } \times q_1\) will never be reached on any sequences of size \(\ell > 2\): since two mismatches are needed to reach them, then \(p_m\) must always have its associated number of matches \(m \le \ell 2\)).
We finally mention that a similar method was used in [87] to compute correlation coefficients between the seed number of hits or the seed coverage, and the true alignment Hamming distance.^{3}
In the following sections, we will use the (mmatches counting) \(c_{\pi ,m}\) coefficients to compute, either probabilities on continuous models, or frequencies on discrete models.
Continuous models
Bernoulli polynomial form and dominance between seeds
Mark and Benson [1] also include in their paper an elegant and simple partial order named dominance between seeds: suppose that two spaced seeds \(\pi _a\) and \(\pi _b\) have to be compared according to their respective \(c_{\pi _a,m}\) and \(c_{\pi _b,m}\) coefficients: now, assume that, \(\forall m \in [0\ldots\,\ell ] \quad c_{\pi _a,m} \ge c_{\pi _b,m}\) (with at least a single difference on at least one of the coefficients), then we can conclude that \(\pi _a\) dominates \(\pi _b\), and thus that \(\pi _b\) can be discarded from the possible set of optimal seeds. Indeed, the sensitivity, defined by the formula (1) as a sum of same positive terms \(p^m (1p)^{\ell m}\) , each term being respectively multiplied by a seeddependent positive coefficient \(c_{\pi ,m}\), guarantee that the sensitivity of \(\pi _b\) will never be better than the sensitivity of \(\pi _a\), whatever parameter \(p \in [0,1]\) is chosen.
As an illustration, Table 1 lists the \(c_{\pi ,m}\) coefficients of two single seeds, the contiguous seed (11111111111), and the Patternhunter I spaced seed (111010010100110111), for the alignment length \(\ell =64\). Note that comparing only the pairs of coefficients \(c_{\mathtt{11111111111},m}\) and \(c_{\mathtt{111010010100110111},m}\) does not help in choosing/discarding any of the two seeds by the dominance principle, since \(c_{\mathtt{11111111111},m} > c_{\mathtt{111010010100110111},m}\) when \(m \le 18\), or \(c_{\mathtt{11111111111},m} \le c_{\mathtt{111010010100110111},m}\) otherwise (with a strict inequality when \(m \le 59\)). Actually, both seeds are included in the set of the dominant seeds of weight \(w=11\) found on alignments of length \(\ell =64\), as mentioned by [1], and verified in our experiments.
Surprisingly, according to the experiments of [1], very few single seeds are overall dominant in the class of seeds of same weight w and fixed or restricted span s (e.g. \(s \le 2\times w\)) : this dominance criterion was thus used as a filter for the preselection of optimal seeds. In the section “Experiments” , we show that the dominance selection also scales reasonably well for selecting multiple seeds candidates.
Hit Integration and its associated polynomial form
Hit Integration (HI) for a given seed \(\pi\) was proposed by [2] as \(\frac{\int _{p_a}^{p_b} Pr_\pi (p,\ell ) \, dp}{p_bp_a}\) for a given interval \([p_a,p_b]\) (with \(0 \le p_a < p_b \le 1\)), where \(Pr_\pi (p,\ell )\) is the probability for the seed \(\pi\) to hit an alignment of length \(\ell\) generated by a Bernoulli model of parameter p, as mentioned at the beginning of the previous part.
An illustration of the full probability mass function for the Hit Integration compared with the Bernoulli and the Heaviside distributions (the latter is defined in the next section) is given in Fig. 3 for alignments of length \(\ell =64\).

We propose a dynamic programming algorithm that is strictly equivalent to the one previously proposed for the the Bernoulli model : in fact, both modeldependent algorithms can even pool their most timeconsuming part. Moreover, the automaton used by the dynamic programming algorithm can be previously minimized: this reduction is greatly appreciated when multiple seeds are processed.

We propose a parameterfree approach for the \(p_a\) or \(p_b\) parameters: it is therefore possible to compute, on any interval, how far a seed is optimal; moreover, we will show that the dominance criterion can be applied as a preprocessing step.
First, for any constant integers u and v, since the integral of the polynomial part \(\int _{p_a}^{p_b} p^u (1p)^{v} \, dp = \Big [ p^{u+1} \sum _{k=0}^{v} {v \atopwithdelims ()k} \frac{(p)^k}{u+k+1} \Big ]_{p_a}^{p_b}\) can be easily computed (as a larger degree polynomial), the integral of the right part of the formula (2) can be precomputed independently of the counting coefficients \(c_{\pi ,m}\), and thus independently of the seed \(\pi\). Thus, only \(c_{\pi ,m}\) coefficients characterize the seed \(\pi\) for both the Bernoulli model and the Hit Integration model.
Moreover, we can see that, for \(0 \le p_a < p_b \le 1\) and for all \(m \in [0\ldots\,\ell ]\), the integral \(\int _{p_a}^{p_b} p^m (1p)^{\ell m} \, dp\) of the right part of the formula (2) is always positive. Therefore, the dominance between seeds also can be directly applied on the \(c_{\pi ,m}\) coefficients to select dominant seeds before computing the Hit Integration (for any range \([p_a,p_b]\)) by applying the formula (2), thereby saving computation time for the optimal set of seeds.
Discrete models and lossless seeds
 1
\(Dirac_\pi (m,\ell ) = \frac{c_{m,\pi }}{{\ell \atopwithdelims ()m}}\), to give the ratio between the number of alignments detected by the seed \(\pi\) over all the alignments of length \(\ell\) with exactly m matches,
 2
\(Heaviside_\pi (m_a,m_b,\ell ) = \frac{\sum \limits _{m=m_a}^{m_b} Dirac_\pi (m,\ell )}{m_b  m_a + 1}\), to give the average ratio, over any number of matches m between \(m_a\) and \(m_b\) (out of \(\ell\)) of the previously defined Dirac model. The Heaviside full distribution has already been illustrated in Fig. 3, together with the Hit Integration distribution with similar parameters.
In addition, the Dirac and Heaviside functions are based on rational number computations/comparisons: they are thus one or two orders of magnitude faster and lighter to compute and store, compared to the polynomial forms given by the continuous models of the previous section.
Finally, an interesting feature of the \(Dirac_\pi (m,\ell )\), also true for the specific \(Heaviside_\pi (m,\ell ,\ell )\), is that, when the number of match symbols m is large enough, one seed \(\pi\) (or sometime several seeds) can meet the equality \(c_{\pi ,m'} = {\ell \atopwithdelims ()m'}\) for all \(m' \ge m\). Such seeds are thus lossless since they can detect all the alignments of length \(\ell\) with at least m matches (or with at most \(\ell m\) mismatches), and obviously the best lossless ones are retained in the set of dominant seeds, when the equality \(c_{\pi ,m} = {\ell \atopwithdelims ()m}\) occurs. As a side consequence, the best lossless seeds are also in the set of dominant seeds and will be reported in the experiments.
Note that, to keep a symmetric notation with the \(\int _{p_a}^{p_b}\,\) Hit Integration, and also have the same range for the domain of definition (\(0 \le p_a < p_b \le 1\)), we will use the “frequency” notation \(\sum _{f_a}^{f_b}\,\) Heaviside to designate \(Heaviside(\lfloor \ell \times f_a \rfloor ,\lfloor \ell \times f_b \rfloor ,\ell )\). We will also rescale the Dirac function on the Bernoulli’s domain of definition, by using the frequency f (\(0 \le f \le 1\)) to designate \(Dirac(\lfloor \ell \times f \rfloor ,\ell )\).
Experiments
Single spaced seeds (\(n =1\)) and multiple codesigned spaced seeds (\(n \in [2\ldots\,4]\)) of weight \(w \in [3\ldots\,16]\) and span s at most \(2 \times w\) have been considered. Note that, for single seeds of large weight (\(w \ge 15\)), or for multiple seed, the full enumeration is respectively burdensome or intractable, so we prefer to apply the hillclimbing algorithm of Iedera [88]: selected dominant spaced seeds are thus locally dominant, since it would be computationally unfeasible to guarantee their overall dominance. All the spaced seeds are evaluated on alignments of length \(\ell \in [2 \times w\ldots\,64]\).
 1
Selecting the set of dominant seeds is the first stage: it provides a reduced set of candidate seeds. Note that the dominant selection can be applicable without prior knowledge of the sensitivity criterion being used, provided that this sensitivity criterion is established on i.i.d sequence alignments (this last requirement is true for the Bernoulli, the Hit Integration, the Dirac, and the Heaviside models).
 2Comparing each of the seeds from the set of dominant seeds with a sensitivity criterion is the second stage: it usually depends on at least one parameter (for example, for the Bernoulli model: the probability p to generate a match) which has different consequences on continuous and discrete models:

For the Bernoulli and the Hit Integration continuous models, this implies comparing pparametrized or \([p_a,p_b]\)parametrized polynomials: we follow the idea proposed in [1] for the Bernoulli model and also apply it on the Hit Integration model where we compute the \(\int _0^x\) HI and the \(\int _x^1\) HI respectively. Let us concentrate on the Bernoulli model with a (single) free parameter p: For two dominant seeds \(\pi _a\) and \(\pi _b\) and a given length \(\ell\), we compute their respective polynomials \(Pr_{\pi _a}(p,\ell )\) and \(Pr_{\pi _b}(p,\ell )\) and their difference \(Pr_{\pi _a  \pi _b}(p,\ell ) = Pr_{\pi _a}(p,\ell )  Pr_{\pi _b}(p,\ell )\) (an example of its associated coefficients is illustrated on the third column of Table 1), from which zeros in the range \(p \in [0,1]\) are numerically extracted using solvers from maple or maxima. Using the pintervals between these zeros, it is then possible to determine whether \(Pr_{\pi _a  \pi _b}(p,\ell )\) is positive or negative, and thus which of the two seeds \(\pi _a\) or \(\pi _b\) is better according to p. Finally, the Pareto envelope (optimal seeds) can be extracted from the initial set of dominant seeds.

For the Dirac and the Heaviside discrete models, this implies comparing, instead of realvalued polynomials, integer numbers for the Dirac model (and respectively rational numbers for the Heaviside model), which is an easier and lighter process. The Pareto envelope can then be easily extracted from these discrete models to select the optimal seeds from the set of dominant seeds. We have also extracted the lossless part for the Dirac and the \(\sum _x^1\) Heaviside criteria.

Maximum size of the set of dominant seeds
w  

n  3  4  5  6  7  8  9  10  11  12  13  14  15  16 
1  2  7  8  13  15  26  23  32  40  45  46  48  74  84 
(64)  (64)  (62)  (64)  (64)  (61)  (60)  (62)  (64)  (63)  (64)  (59)  (64)  (64)  
2  5  12  35  41  52  99  128  197  231  207  350  320  439  376 
(64)  (63)  (63)  (61)  (64)  (64)  (60)  (62)  (61)  (59)  (63)  (64)  (64)  (41)  
3  6  26  85  84  204  320  391  485  854  932  1103  1449  1508  1812 
(60)  (64)  (64)  (62)  (64)  (60)  (56)  (56)  (62)  (64)  (64)  (41)  (64)  (63)  
4  7  29  124  190  254  535  811  1041  1450  1908  1775  2364  3125  3359 
(64)  (64)  (64)  (64)  (64)  (59)  (64)  (58)  (63)  (64)  (62)  (39)  (63)  (37) 
Sensitivity comparison of different programs
w  p  SpEED  AcoSeed  FastHC  MuteHC  Rasbhari  Current sensitivity (\(\delta\)) 

10  0.75  90.9098  90.9513  90.7312  92.6812  90.9614  90.8753 (1.8059%) 
0.80  97.8337  97.8521  97.7625  98.3836  97.8554  97.8203 (0.5633%)  
0.85  99.7569  99.7614  99.7431  99.8356  99.7618  99.7568 (0.0788%)  
11  0.75  83.3793  83.4728  83.3068  83.4127  83.4679  83.4297 (0.0431%) 
0.80  94.9861  95.037  94.9453  95.0194  95.0386  95.0127 (0.0259%)  
0.85  99.2431  99.2478  99.2250  99.2486  99.2506  99.2452 (0.0054%)  
12  0.80  90.5750  90.6328  90.4735  90.5820  90.6648  90.5571 (0.1077%) 
0.85  98.1589  98.1766  98.1199  98.1670  98.1824  98.1591 (0.0233%)  
0.90  99.8821  99.8853  99.8771  99.8836  99.8864  99.8840 (0.0024%)  
16  0.85  84.8212  84.9829  84.6558  84.8764  84.969  84.9668 (0.0161%) 
0.90  97.4321  97.4712  97.3556  97.4460  97.5035  97.4730 (0.0305%)  
0.95  99.9388  99.9419  99.9347  99.9424  99.9441  99.9414 (0.0027%) 
Note that we did not use any Overlap Complexity/Covariance heuristic optimisation here (to stay in a generic framework), and simply apply the very simple hillclimbing algorithm of Iedera. We also mention that our seeds are not definitely the best ones, but since they are published, their sensitivity can be checked using other software, as mandala [63], SpEED [56], or rasbhari [80] ([43, 57] did the same with the seeds obtained with the SpEED software).
Finally, to show a typical output of this generalized parameterfree approach, optimal single (\(n=1\)) seeds of weight \(w=11\) have been plotted according to the main parameter of each model (horizontal axis) and the length \(\ell\) of the alignment (vertical axis) in Figs. 5 and 6. On discrete models, a pink mark represents the lossless border: seeds on the right of this border are by essence lossless for the set of parameters. On the right margin of the discrete models, we indicate the fraction of the minimum number of matches m over the alignment length \(\ell\) to be lossless.
We provide the scripts and the whole set of single and multiple seeds, in http://bioinfo.cristal.univlille.fr/yass/iedera_dominance in the hope this will be useful to alignment software and spaced seeds alignmentfree metagenomic classifiers.
Discussion
In this paper, we have presented a generalization of the usage of dominant seeds, first on the Hit integration model with a parameterfree approach, and also on two new discrete models (named Dirac and Heaviside) that are related to lossless seeds. In this parameterfree context, we show that all these models can be computed with help of a method for counting alignments of particular classes, themselves represented by regular languages, and a counting semiring to perform an efficient set size computation.
We open the discussion with the complementary asymptotic problem, before going to finite but multivariate model extensions.
Complementary asymptotic problem
So far, we only have considered a set of finite alignment lengths \(\ell\) to design seeds. But limiting the length is far from satisfactory, so the next problem deserves consideration too: the asymptotic hit probability of seeds [63, 89–91].
As an example, for \(p = 0.7\) and for the Patternhunter I spaced seed, we have (with help of a Maple script) \(\{\lambda ,\beta \}_\mathtt{111010010100110111} = \{0.98731,0.22667\}\), that can be compared with the contiguous seed of same weight \(\{\lambda ,\beta \}_\mathtt{11111111111} = \{0.99364,0.44784\}\). [63] have proven that, in the class of seeds with the same weight, contiguous seeds have the largest value \(\lambda\) and thus are the asymptotic worstcase in terms of hit probability, a trait shared with the uniformly spaced seeds of same weight (e.g. 101010101010101010101 or 1001001001001001001001001001001).
Comparing seeds asymptotically can thus be done easily by comparing their respective \(\lambda\) eigenvalue, or their \(\beta\) when \(\lambda\) equality occurs, but it seems to be computationally possible ^{9} only if p is set numerically before the analysis.
Moreover dominant seeds’ extracted from this paper on a limited alignment length \(\ell\) (here \(\ell \le 64\)) would not always be optimal for any \(\ell\): such seeds can, however, be justified as “good” candidates for seeds of restricted span (e.g. \(s \le 2\times w\)), but definitely not the optimal ones, unless dominance is computed on a wider range of alignment length \(\ell\) values.
For example, the best (smallest) \(\lambda\) for any dominant seed of weight \(w=11\) and span at most \(2 \times w\), on alignments of length \(\ell \le 64\) is 0.98714 for the seed 1110010100110010111. Surprisingly, even if this seed reaches the smallest \(\lambda\) out of its dominant class, it never occurred in the optimal seeds, in any of our experiments. Moreover, we have checked that another seed 1110010100100100010111 has an even smaller \(\lambda = 0.98669\): this last seed was not dominant for \(\ell \le 64\), but would be in the class of seeds of span at most \(2 \times w\) if larger values of \(\ell\) were selected.
Finally, a parameterfree analysis implying both p and \(\ell\) seems difficult to apply for large seeds. It is interesting to notice that several of our preliminary experiments suggest that, asymptotically, and only^{10} for a restricted set of seeds (e.g. of weight \(w=11\) and span at most \(2 \times w\)), one seed is optimal whatever the value of p. This remains to be confirmed experimentally and theoretically because it might be possible that special cases exist, where at least two (or even more) seeds share the p partition.
Models and multivariate analysis
As far as i.i.d sequences are considered, the full framework of [1], including the dominant seed selection, can be applied on any extended spaced seed model (such as transition constrained seeds, vector seeds, indel seeds,...). However, additional freeparameters (such as the transition/transversion rate, the indel/mismatch rate, ...) lead to an increase in the number of alignment classes (for example, alignments of length \(\ell\), with i indels, v transversion errors, t transitions errors, and remaining m matches, such that \(\ell =i+v+t+m\)) that have to be considered by the dominance selection. Moreover, it involves a much more complex multivariate polynomial analysis, if more than one parameter is, at this point, left free.
In a more general way, if i.i.d sequences are ignored, and dominant seed selection thus abandoned in its original form, one could mix several numericallyfixed models: for example, mixing a given HMM representing coding sequences, with a numericallyfixed Bernoulli model. The idea is here to use a free probability parameter to create a balance between the two models: either initially before generating the alignment, to choose each of the two models; or along the alignment generation process, to switch between each of the two models. Seeds designed could thus be twohanded for analyzing both coding and noncoding genomic sequences at the same time, but with an additional control parameter that helps to change the known percentage of such genomic sequences. To compute the sensitivity in this model, a simple idea is to apply a polynomial semiring (with at least one parameterfree variable: here the one used to create the balance) on the automaton, and perform, not a numeric, but a symbolic computation.
Finally, as a logical consequence of the two previous remarks, we mention that any HMM with one (or possibly several) free probability parameter(s) could always be analysed with a (multivariate) polynomial semiring, increasing thus the scope of the method to applications that depend on Finite State Machines : such parameterfree preprocessing can, at some point, be applied; moreover if several equivalence classes are established in term of probability, it may be possible to use equivalent dominance method to filter out candidates when comparing several elements.
The opposite is equivalent to say that at least one string of length \(\ell\) with \(\le k\) mismatches is not hit by the seed; in other words, that the seed is not \((\ell ,k)\)lossless. Note that k does not need to be initially set: it can be estimated using this requirement, even after the DP run.
Technical details at http://bioinfo.cristal.univlille.fr/yass/iedera_coverage/index_additional.html.
This side result is not discussed in [2], probably because they were more interested by the seed rank and not necessary the “optimal seed”, which they sometime called “dominant”.
As already mentioned by [2], but for the nonparametrized \(\int _0^1\) and \(\int _{\frac{1}{2}}^1\) Hit Integration model.
To give a quick and intuitive example, we consider an extreme case : an alignment of fixed length \(\ell\) without any mismatch symbol. Any seed \(\pi\) of weight \(w_\pi \le \ell\) and span \(s_\pi \le \ell\) obviously detects this alignment, whatever its shape is, so \(Dirac_\pi (m=\ell ,\ell )\) and \(Heaviside_\pi (m_a=\ell ,m_b=\ell ,\ell )\) reach their maximal sensitivity of 1. For a given weight w, the restriction of all these seeds to dominant seeds implies that many are lost when dominance selection is applied to keep the best representatives.
Declarations
Acknowledgments and funding
Donald E. K. Martin provided substantive comments on an earlier version of this manuscript. The author would like to thank the second reviewer for his/her thorough review which significantly contributed to improving the quality of the paper. The publication costs were covered by the French Institute for Research in Computer Science and Automation (inria).
Competing interests
The author declare that he has no competing interests.
Availability of data and materials
All data and source code are freely available and may be downloaded from: http://bioinfo.cristal.univlille.fr/yass/iedera_dominance/
Consent for publication
Not applicable. The manuscript does not contain any data from any individual person.
Ethical approval
The manuscript does not report new studies involving any animal or human data or tissue.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Mak DYF, Benson G. All hits all the time: parameter free calculation of seed sensitivity. Bioinformatics. 2009;25(3):302–8.View ArticlePubMedGoogle Scholar
 Chung WH, Park SB. Hit integration for identifying optimal spaced seeds. BMC Bioinform. 2010;11(1):S37.View ArticleGoogle Scholar
 Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002;18(3):440–5.View ArticlePubMedGoogle Scholar
 Burkhardt S, Kärkkäinen J. Better filtering with gapped \(q\)grams. Fund Inform. 2002;56(1—2):51–70.Google Scholar
 Brejová B, Brown DG, Vinař T. Vector seeds: an extension to spaced seeds. J Comput Syst Sci. 2005;70(3):364–80.View ArticleGoogle Scholar
 Burkhardt S, Kärkkäinen J. Onegapped $$q$$ q gram filters for Levenshtein distance. Proceedings of the 13th symposium on combinatorial pattern matching (CPM), vol 2373, Lecture Notes in Computer Science Fukuoka (Japan). Berlin: Springer; 2002. p. 225–34.Google Scholar
 Mak DYF, Gelfand Y, Benson G. Indel seeds for homology search. Bioinformatics. 2006;22(14):341–9.View ArticleGoogle Scholar
 Chen K, Zhu Q, Yang F, Tang D. An efficient way of finding good indel seeds for local homology search. Chin Sci Bull. 2009;54(20):3837–42.View ArticleGoogle Scholar
 Csűrös M, Ma B. Rapid homology search with neighbor seeds. Algorithmica. 2007;48(2):187–202.View ArticleGoogle Scholar
 Ilie L, Ilie S. Fast computation of neighbor seeds. Bioinformatics. 2009;25(6):822–3.View ArticlePubMedGoogle Scholar
 Chen W, Sung WK. On half gapped seed. Genome Inform. 2003;14:176–85.PubMedGoogle Scholar
 Noé L, Kucherov G. Improved hit criteria for DNA local alignment. BMC Bioinform. 2004;5:149.View ArticleGoogle Scholar
 Yang J, Zhang L. Run probabilities of seedlike patterns and identifying good transition seeds. J Comput Biol. 2008;5(10):1295–313.View ArticleGoogle Scholar
 Zhou L, Stanton J, Florea L. Universal seeds for cDNAtogenome comparison. BMC Bioinform. 2008;9:36.View ArticleGoogle Scholar
 Frith MC, Noé L. Improved search heuristics find 20 000 new alignments between human and mouse genomes. Nucleic Acids Res. 2014;42(7):59.View ArticleGoogle Scholar
 Li M, Ma B, Kisman D, Tromp J. PatternHunter II: highly sensitive and fast homology search. J Bioinform Comput Biol. 2004;2(3):417–39.View ArticlePubMedGoogle Scholar
 Sun Y, Buhler J. Designing multiple simultaneous seeds for DNA similarity search. J Comput Biol. 2005;12(6):847–61.View ArticlePubMedGoogle Scholar
 Kucherov G, Noé L, Roytberg MA. Multiseed lossless filtration. IEEE/ACM Trans Comput Biol Bioinform. 2005;2(1):51–61.View ArticlePubMedGoogle Scholar
 FarachColton M, Landau GM, Cenk Sahinalp S, Tsur D. Optimal spaced seeds for faster approximate string matching. J Comput Syst Sci. 2007;73(7):1035–44.View ArticleGoogle Scholar
 Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21(3):487–93.View ArticlePubMedPubMed CentralGoogle Scholar
 Peterlongo P, Pisanti N, Boyer F, Sagot MF. Lossless filter for finding long multiple approximate repetitions using a new data structure, the bifactor array. In: Consens M, Navarro G, editor. Proceedings of the 12th international conference, on string processing and information retrieval (SPIRE). Lecture Notes in Computer Science, vol 3772. Buenos Aires; 2005. p. 179–190.Google Scholar
 Crochemore M, Tischler G. The gapped suffix array: a new index structure for fast approximate matching. In: Chavez E, Lonardi S, editors. Proceedings of the 17th international—symposium on string processing and information retrieval (SPIRE), vol. 6393., Lecture notes in computer scienceLos Cabos: Springer; 2010. p. 359–64.View ArticleGoogle Scholar
 Onodera T, Shibuya T. An index structure for spaced seed search. In: Asano T, Nakano SI, Okamoto Y, Watanabe O, editors. Proceedings of the 22nd international symposium on algorithms and computation (ISAAC), vol. 7074., Lecture notes in computer scienceYokohama (Japan): Springer; 2011. p. 764–72.Google Scholar
 Gagie T, Manzini G, Valenzuela D. Compressed spaced suffix arrays. In: Proceedings of the 2nd international conference on algorithms for big data (ICABD). CEURWS, vol 1146. Palermo; 2014. p. 37–45.Google Scholar
 Shrestha AMS, Frith MC, Horton P. A bioinformatician’s guide to the forefront of suffix array construction algorithms. Brief Bioinform. 2014;15(2):138–54.View ArticlePubMedPubMed CentralGoogle Scholar
 Birol I, Chu J, Mohamadi H, Jackman SD, Raghavan K, Vandervalk BP, Raymond A, Warren RL. Spaced seed data structures for de novo assembly. Int J Genom. 2015;2015:196591.Google Scholar
 Keich U, Li M, Ma B, Tromp J. On spaced seeds for similarity search. Discret Appl Math. 2004;138(3):253–63.View ArticleGoogle Scholar
 Nicolas F, Rivals É. Hardness of optimal spaced seed design. J Comput Syst Sci. 2008;74(5):831–49.View ArticleGoogle Scholar
 Ma B, Yao H. Seed optimization for i.i.d. similarities is no easier than optimal Golomb ruler design. Inf Process Lett. 2009;109(19):1120–4.View ArticleGoogle Scholar
 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Humanmouse alignments with BLASTZ. Genome Res. 2003;13:103–7.View ArticlePubMedPubMed CentralGoogle Scholar
 Darling AE, Treangen TJ, Zhang L, Kuiken C, Messeguer X, Perna NT. Procrastination leads to efficient filtration for local multiple alignment. Proceedings of the 6th international workshop on algorithms in bioinformatics (WABI), vol 4175. Lecture notes in bioinformatics. Zürich: Springer; 2006. p. 126–37.Google Scholar
 Harris RS. Improved pairwise alignment of genomic dna. Ph.d. thesis, The Pennsylvania State University; 2007Google Scholar
 Lin H, Zhang Z, Zhang MQ, Ma B, Li M. ZOOM! Zillions Of Oligos Mapped. Bioinformatics. 2008;24(21):2431–7.View ArticlePubMedPubMed CentralGoogle Scholar
 Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. SHRiMP: accurate mapping of short colorspace reads. PLoS Comp Biol. 2009;5(5):1000386.View ArticleGoogle Scholar
 Chen Y, Souaiaia T, Chen T. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics. 2009;25(19):2514–21.View ArticlePubMedPubMed CentralGoogle Scholar
 Giladi E, Healy J, Myers G, Hart C, Kapranov P, Lipson D, Roels S, Thayer E, Letovsky S. Error tolerant indexing and alignment of short reads with covering template families. J Comput Biol. 2010;17(10):1397–411.View ArticlePubMedGoogle Scholar
 David M, Dzamba M, Lister D, Ilie L, Brudno M. SHRiMP2: Sensitive yet practical short read mapping. Bioinformatics. 2011;27(7):1011–2.View ArticlePubMedGoogle Scholar
 Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.View ArticlePubMedPubMed CentralGoogle Scholar
 Preparata FP, Oliver JS. DNA sequencing by hybridization using semidegenerate bases. J Comput Biol. 2005;11(4):753–65.View ArticleGoogle Scholar
 Tsur D. Optimal probing patterns for sequencing by hybridization. Proceedings of the 6th international workshop on algorithms in bioinformatics (WABI), vol 4175. Lecture notes in bioinformatics. Zürich: Springer; 2006. p. 366–75.Google Scholar
 Feng S, Tillier ERM. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics. 2007;23(10):1195–202.View ArticlePubMedGoogle Scholar
 Chung WH, Park SB. An empirical study of choosing efficient discriminative seeds for oligonucleotide design. BMC Genom. 2009;10(Suppl 3):3.View ArticleGoogle Scholar
 Ilie L, Ilie S, Khoshraftar S, Mansouri Bigvand A. Seeds for effective oligonucleotide design. BMC Genom. 2011;12:280.View ArticleGoogle Scholar
 Ilie L, Mohamadi H, Brian Golding G, Smyth WF. BOND: Basic Oligo Nucleotide Design. BMC Bioinform. 2013;14:69.View ArticleGoogle Scholar
 Kisman D, Li M, Ma B, Li W. tPatternhunter: gapped, fast and sensitive translated homology search. Bioinformatics. 2005;21(4):542–4.View ArticlePubMedGoogle Scholar
 Brown DG. Optimizing multiple seeds for protein homology search. IEEE/ACM Trans Comput Biol Bioinform. 2005;2(1):23–38.View ArticleGoogle Scholar
 Roytberg MA, Gambin A, Noé L, Lasota S, Furletova E, Szczurek E, Kucherov G. On subset seeds for protein alignment. IEEE/ACM Trans Comput Biol Bioinform. 2009;6(3):483–94.View ArticlePubMedGoogle Scholar
 Nguyen VH, Lavenier D. PLAST: parallel local alignment search tool for database comparison. BMC Bioinform. 2009;10:329.View ArticleGoogle Scholar
 Startek M, Lasota S, Sykulski M, Bułak A, Noé L, Kucherov G, Gambin A. Efficient alternatives to PSIBLAST. Bull Pol Acad Sci Tech Sci. 2012;60(3):495–505.Google Scholar
 Li W, Ma B, Zhang K. Optimizing spaced kmer neighbors for efficient filtration in protein similarity search. IEEE/ACM Trans Comput Biol Bioinform. 2014;11(2):398–406.View ArticlePubMedGoogle Scholar
 Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2014;12:59–60.View ArticlePubMedGoogle Scholar
 Somervuo P, Holm L. SANSparallel: interactive homology search against Uniprot. Nucleic Acids Res. 2015;43(W1):24–9.View ArticleGoogle Scholar
 Petrov I, Brillet S, Drezen E, Quiniou S, Antin L, Durand P, Lavenier D. KLAST: fast and sensitive software to compare large genomic databanks on cloud. In: Proceedings world congress in computer science, computer engineering, and applied computing (WORLDCOMP). Las Vegas; 2015. p. 85–90.Google Scholar
 Yang IH, Wang SH, Chen YH, Huang PH, Ye L, Huang X, Chao KM. Efficient methods for generating optimal single and multiple spaced seeds. In: Proceedings of the IEEE 4th symposium on bioinformatics and bioengineering (BIBE). Taichung: IEEE Computer Society Press; 2004. p. 411–16.Google Scholar
 Ilie L, Ilie S. Multiple spaced seeds for homology search. Bioinformatics. 2007;23(22):2969–77.View ArticlePubMedGoogle Scholar
 Ilie L, Ilie S. SpEED: fast computation of sensitive spaced seeds. Bioinformatics. 2011;27(17):2433–4.View ArticlePubMedGoogle Scholar
 Ilie S. Efficient computation of spaced seeds. BMC Res Notes. 2012;5:123.View ArticlePubMedPubMed CentralGoogle Scholar
 Egidi L, Manzini G. Better spaced seeds using quadratic residues. J Comput Syst Sci. 2013;79(7):1144–55.View ArticleGoogle Scholar
 Egidi L, Manzini G. Design and analysis of periodic multiple seeds. Theor Comput Sci. 2014;522:62–76.View ArticleGoogle Scholar
 Egidi L, Manzini G. Spaced seeds design using perfect rulers. Fund Inform. 2014;131(2):187–203.Google Scholar
 Egidi L, Manzini G. Multiple seeds sensitivity using a single seed with threshold. J Bioinform Comput Biol. 2015;13(4):1550011.View ArticlePubMedGoogle Scholar
 Brejová B, Brown DG, Vinař T. Optimal spaced seeds for homologous coding regions. J Bioinform Comput Biol. 2004;1(4):595–610.View ArticlePubMedGoogle Scholar
 Buhler J, Keich U, Sun Y. Designing seeds for similarity search in genomic DNA. J Comput Syst Sci. 2005;70(3):342–63.View ArticleGoogle Scholar
 Preparata FP, Zhang L, Choi KP. Quick, practical selection of effective seeds for homology search. J Comput Biol. 2005;12(9):1137–52.View ArticlePubMedGoogle Scholar
 Kucherov G, Noé L, Roytberg MA. A unifying framework for seed sensitivity and its application to subset seeds. J Bioinform Comput Biol. 2006;4(2):553–69.View ArticlePubMedPubMed CentralGoogle Scholar
 Zhang L. Superiority of spaced seeds for homology search. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(3):496–505.View ArticlePubMedGoogle Scholar
 Kong Y. Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. J Comput Biol. 2007;14(2):238–54.View ArticlePubMedGoogle Scholar
 Noé L, Gîrdea M, Kucherov G. Designing efficient spaced seeds for SOLiD read mapping. Adv Bioinform. 2010;2010:708501.View ArticleGoogle Scholar
 Marschall T, Herms I, Kaltenbach HM, Rahmann S. Probabilistic arithmetic automata and their applications. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(6):1737–50.View ArticlePubMedGoogle Scholar
 Martin DEK, Noé L. Faster exact distributions of pattern statistics through sequential elimination of states. Ann Inst Stat Math. 2017;69:1–18.View ArticleGoogle Scholar
 Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B. Spaced words and kmacs: fast alignmentfree sequence comparison based on inexact word matches. Nucleic Acids Res. 2014;42(W1):7–11.View ArticleGoogle Scholar
 Leimeister CA, Boden M, Horwege S, Lindner S, et al., Morgenstern B. Fast alignmentfree sequence comparison using spacedword frequencies. Bioinformatics. 2014;30(14):1991–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Ghandi M, MohammadNoori M, Beer MA. Robust kmer frequency estimation using gapped kmers. J Math Biol. 2014;69(2):469–500.View ArticlePubMedGoogle Scholar
 Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spacedword matches. Algorithms Mol Biol. 2015;10:5.View ArticlePubMedPubMed CentralGoogle Scholar
 Břinda K, Sykulski M, Kucherov G. Spaced seeds improve kmer based metagenomic classification. Bioinformatics. 2015;31(22):3584–92.View ArticlePubMedGoogle Scholar
 Ounit R, Lonardi S. Higher classification sensitivity of short metagenomic reads with CLARKS. Bioinformatics. 2016.Google Scholar
 Duc DD, Dinh HQ, Dang TH, Laukens K, Hoang XH. AcoSeeD: an ant colony optimization for finding optimal spaced seeds in biological sequence search. Proceedings of the 8th international conference on swarm intelligence (ANTS), vol 7461. Lecture notes in computer science. Brussels: Springer; 2012. p. 204–11.Google Scholar
 Do PT, TranThi CG. An improvement of the overlap complexity in the spaced seed searching problem between genomic DNAs. In: Proceedings of the 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS). Ho Chi Minh City; 2015. p. 271–76.Google Scholar
 Gheraibia Y, Moussaoui A, Djenouri Y, Kabir S, Yin PY, Mazouzi S. Penguin search optimisation algorithm for finding optimal spaced seeds. Int J Softw Sci Comput Intell. 2015;7(2):85–99.View ArticleGoogle Scholar
 Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B. rasbhari: optimizing spaced seeds for database searching, read mapping and alignmentfree sequence comparison. PLoS Comput Biol. 2016;12(10):1005107.View ArticleGoogle Scholar
 Choi KP, Zeng F, Zhang L. Good spaced seeds for homology search. Bioinformatics. 2004;20(7):1053–9.View ArticlePubMedGoogle Scholar
 Allauzen C, Riley M, Schalkwyk J, Skut W, Mohri M. OpenFst: a general and efficient weighted finitestate transducer library. In: Holub J, Zdarek J, editors. Proceedings of the 12th international conference on implementation and application of automata (CIAA), vol. 4783., Lecture notes in computer sciencePrague: Springer; 2007. p. 11–23.View ArticleGoogle Scholar
 Mohri M. Weighted automata algorithms. In: Handbook of weighted automata. Berlin: Springer; 2009. p. 213–54.Google Scholar
 Huang L. Dynamic programming algorithms in semiring and hypergraph frameworks. Technical report, University of Pennsylvania, Philadelphia, USA; 2006.Google Scholar
 Hopcroft JE, Motwani R, Ullman JD. Introduction to automata theory languages and computation. 3rd ed. New York: Pearson; 2007.Google Scholar
 Aston JAD, Martin DEK. Distributions associated with general runs and patterns in hidden Markov models. Ann Appl Stat. 2007;1(2):585–611.View ArticleGoogle Scholar
 Noé L, Martin DEK. A coverage criterion for spaced seeds and its applications to support vector machine string kernels and kmer distances. J Comput Biol. 2014;21(12):947–63.View ArticlePubMedPubMed CentralGoogle Scholar
 Kucherov G, Noé L, Roytberg MA. Iedera subset seed design tool. http://bioinfo.lifl.fr/yass/iedera.php; 2016.
 Ma B, Li M. On the complexity of spaced seeds. J Comput Syst Sci. 2007;73(7):1024–34.View ArticleGoogle Scholar
 Li M, Ma B, Zhang L. Superiority and complexity of the spaced seeds. In: Proceedings of the 17th symposium on discrete algorithms (SODA). Miami: ACM Press; 2006. p. 444–53.Google Scholar
 Nicodème P, Salvy B, Flajolet P. Motif statistics. Theor Comput Sci. 2002;287(2):593–617.View ArticleGoogle Scholar
 Myers G. 1. What’s behind blast. Models and algorithms for genome evolution, vol 19. Computational biology. Berlin: Springer; 2013. p. 3–15.Google Scholar