Analysis of pattern overlaps and exact computation of P-values of pattern occurrence numbers: the case of Hidden Markov Models
Algorithms for Molecular Biology, volume 9, Article number: 25 (2014)
Abstract
Background
Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability of finding at least S occurrences of words from a pattern $\mathcal{H}$ in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of the same length.
Results
We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from $\mathcal{H}$; the edges are associated with the prefix and suffix relations between overlaps. An originality of our data structure is that the pattern $\mathcal{H}$ need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The reduced size of the SufPref data structure leads to significant improvements in space and time complexity compared to existing algorithms. The algorithm SufPref was implemented as a C++ program; it can be used both as a Web server and as a stand-alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern, or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare the sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two, and HMMs. The experiments show that the methods have approximately the same performance.
Background
The recognition of functionally significant fragments in biological sequences is a key issue in computational biology. Many functionally significant fragments are characterized by a set of specific words, called a pattern and denoted $\mathcal{H}$ below. These patterns may represent different biological objects, such as transcription factor binding sites [1-3], polyadenylation signals [4], protein domains, etc. The functional fragment recognition problem can be solved by finding sequences in which the words from a given pattern are overrepresented. Defining a meaningful significance criterion for this overrepresentation is a delicate goal that, in turn, requires a clarification of the probability model. A common criterion is the so-called P-value, computed as the probability that a random sequence of length N contains at least S occurrences of a pattern. There are many methods for P-value computation designed for Bernoulli or Markov models. However, Hidden Markov models (HMM) were considered in only a few papers [5,6], despite these models being widely used in bioinformatics. This is the motivation to develop methods for P-value calculation with respect to HMMs.
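As a concrete (if hopelessly inefficient) reference point, the P-value of this definition can be computed for a tiny Bernoulli model by enumerating every text of length N. The pattern {AC}, the two-letter alphabet and the probabilities below are illustrative choices, not taken from the paper; the sketch only serves to make the definition unambiguous and to motivate the efficient algorithms discussed next.

```python
from itertools import product

def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern words in text."""
    return sum(1 for i in range(len(text))
               for H in pattern if text.startswith(H, i))

def pvalue_bruteforce(pattern, N, S, probs):
    """P(at least S occurrences in a random text of length N), Bernoulli model.

    Exponential in N: enumerates all |V|^N texts. A reference sketch only.
    """
    total = 0.0
    for letters in product(probs, repeat=N):
        if count_occurrences("".join(letters), pattern) >= S:
            p = 1.0
            for a in letters:
                p *= probs[a]
            total += p
    return total

# Uniform Bernoulli model over a hypothetical two-letter alphabet {A, C}.
p = pvalue_bruteforce({"AC"}, N=4, S=1, probs={"A": 0.5, "C": 0.5})
print(p)  # 11/16 = 0.6875: only the 5 texts of shape C...CA...A avoid "AC"
```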
Existing methods for P-value calculation can be divided into several groups; reviews of the methods can be found in [7-10]. Studies on word probabilities started as early as the eighties with the seed paper [11], which introduced basic word combinatorics and derived inductive equations for a single word and a uniform Bernoulli model. Works in the same vein, reviewed in [12], followed for several words, multiple occurrences and extended probability models. The time complexity is proportional to the text length N and the desired number of occurrences S: computations are carried out by induction for n ranging over 1,…,N and, for a given n, by induction on the number of occurrences. Although these “mathematics-driven” approaches allow for mathematical formula derivation, the actual computation suffers from a combinatorial explosion when $|\mathcal{H}|$ or the Markov order increases.
Later on, a first group of methods [13-17] systematically formalized these inductions by introducing bivariate generating functions whose coefficients are the P-values to be computed. Expectations and variances for the numbers of occurrences of the different words in the pattern can be expressed explicitly in terms of these generating functions [14,15,18]. Moreover, coefficients may be computed from the analytical expression, when it is available, or through a suitable manipulation of a functional equation, where the theoretical time complexity reduces to $O(S\log N)$. Nevertheless, computing the generating function, or the functional equation, requires solving a system of linear equations or, equivalently, computing the determinant of a matrix of polynomials of size $O(|\mathcal{H}|)$. This takes $O(|\mathcal{H}|^{3})$ operations, which is the main drawback of this approach.
A second group of methods computes asymptotics. They rely on convergence results to the normal law proved by [19] or [20]. An approximated P-value is derived, based on Gaussian approximations [21] or Poisson approximations [22-25]. Nevertheless, this approximation is not suitable for exceptional words, when the observed number of occurrences S significantly differs from the expected number; this was shown experimentally by [26] and theoretically by [27]. Large deviation principles are used in [28,29] with much better precision. Nevertheless, no computable formulas are available for large sets.
A third group of methods revisits recursive P-value computation, with an O(S×N) time complexity. They avoid combinatorial explosion by a suitable use of appropriate data structures, tightly related to word overlap properties: the loss in time dependency on N or S is compensated by a gain in data structure size. A significant part of the algorithms in this group are based on traversals of a specific graph. The graph may not be defined explicitly [30]. It can be based on the graph corresponding to the finite automaton recognizing the given pattern, see the algorithms AhoPro [31], SPatt [25,32] and Regexpcount [17]. MotifRank [33], designed for first-order Markov models, makes use of suffix sets. In [25,32,34], a Markov chain embedding technique was suggested: counting occurrences of regular patterns in random strings produced by Markov chains reduces to problems regarding the behavior of a first-order homogeneous Markov chain on the state space of a suitable deterministic finite automaton (DFA) [35,36]. In a recent paper [6], a probabilistic arithmetic automaton for computing P-values for a HMM was proposed, and two algorithms were suggested. The first one has a time complexity O(Q^2×N×S×Ω×V) and a space complexity O(Q×S×Ω), where Q is the number of states of the HMM, Ω is the number of states of the automaton recognizing the given pattern, and V is the alphabet size. The second algorithm has a time complexity O(Q^3×log(N)×S^2×Ω^3) and a space complexity O(Q^2×S×Ω^2); it uses the “divide and conquer” technique. The drawback is the lack of control on the number of states Ω when $|\mathcal{H}|$ increases. Finally, despite these great efforts, existing methods perform poorly for rather large patterns. Besides, most of the proposed algorithms are not implemented, or are implemented only for Bernoulli models or Markov models of small orders.
The present paper provides an algorithm supporting the HMM probability model. It assumes that all words have the same length m and that a HMM with Q states is given. It is a generalization of the algorithm SufPref designed in [37] for Bernoulli models and Markov models of order K. It relies on recurrent equations based on an overlap graph, whose vertices are associated with the overlaps of words from $\mathcal{H}$, and whose edges correspond to the prefix and suffix relations between overlaps. The time complexity is $O(Q^{2}\times N\times S\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$ and the space complexity is $O(Q^{2}\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|)+Q\times S\times m\times |\mathit{OV}(\mathcal{H})|+m\times |\mathcal{H}|)$, where $|\mathit{OV}(\mathcal{H})|$ is the number of overlaps between the words of the pattern $\mathcal{H}$.
In the case of a Markov model of order K, where K ≤ m, the time and space bounds above reduce to $O(N\times S\times (K\times \mathrm{V}^{K+1}+|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$ and $O(S\times K\times \mathrm{V}^{K+1}+S\times m\times |\mathit{OV}(\mathcal{H})|+m\times |\mathcal{H}|)$, respectively. The algorithm SufPref is implemented as a Web server, see http://server2.lpm.org.ru/bio/online/sf/, and as a stand-alone program for Windows and Linux. The program is available on request from the authors.
The paper is organized as follows. Basic notions on word overlaps are introduced, leading to the overlap graph, the main data structure used below. Then the Hidden Markov model definition is recalled. The main text sets are defined and equations for their probabilities are derived. The next section describes the algorithm SufPref that computes these equations using the overlap graph as its main data structure. Then the space and time complexities are analyzed and our algorithm is compared with other methods [3,24,31,38,39]. Finally, the use of P-values for TFBS prediction is discussed.
Overlap words
Our approach strongly relies on overlaps of words from a given pattern. In this section we provide the necessary definitions for these overlaps, following the notations of [37]. The main difference is in the definition of the overlap graph, see definition 3. By the definition from [37], the overlap graph has additional nodes (leaves) that correspond to the words of the pattern $\mathcal{H}$. In the present paper the overlap graph has deep edges instead of these nodes. This modification does not affect the upper bounds on time and space complexity, but in practice it gives significant improvements.
Definition 1.
Given a pattern $\mathcal{H}$ over an alphabet V, a word w is an overlap (an overlap word) for $\mathcal{H}$ if there exist words H and F in $\mathcal{H}$ such that w is a proper suffix of H and w is a proper prefix of F. The set of overlaps of the pattern $\mathcal{H}$ is denoted $\mathit{OV}(\mathcal{H})$.
Example 1.
Let $\mathcal{H}$ be the set $\mathcal{H}=\{\text{ACAGCTA, ACATATA, CTTTCGC, TACCACA}\}$.
The overlap set is $\mathit{OV}(\mathcal{H})=\{\epsilon ,\text{A, ACA, C, TA}\}$.
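Definition 1 translates directly into a few lines of code. The sketch below is a naive set intersection, not the paper's data structure; it recovers the overlap set of this example (the empty word is obtained as the suffix/prefix of length zero):

```python
def overlaps(pattern):
    """All proper suffixes of words in `pattern` that are also proper
    prefixes of words in `pattern` (Definition 1); includes the empty word."""
    # i = len(H) yields the empty suffix; i >= 1 keeps the suffixes proper.
    suffixes = {H[i:] for H in pattern for i in range(1, len(H) + 1)}
    # j = 0 yields the empty prefix; j < len(F) keeps the prefixes proper.
    prefixes = {F[:j] for F in pattern for j in range(len(F))}
    return suffixes & prefixes

H = {"ACAGCTA", "ACATATA", "CTTTCGC", "TACCACA"}
print(sorted(overlaps(H)))  # ['', 'A', 'ACA', 'C', 'TA']
```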
Notation.
Below we will use the following notations: 1) w, for an overlap from $\mathit{OV}(\mathcal{H})$; 2) H, for a word from the pattern $\mathcal{H}$; 3) v, for a word from $\mathit{OV}(\mathcal{H})\cup \mathcal{H}$.
Notation.
For an overlap w in $\mathit{OV}(\mathcal{H})$, one denotes by $\mathcal{H}(w)$ the set of words from $\mathcal{H}$ that have the suffix w,
with the convention $\mathcal{H}(\epsilon )=\mathcal{H}$.
Notation.
$x^{\prime }\sqsubseteq x$ ($x^{\prime }\sqsubset x$) means that $x^{\prime }$ is a suffix (proper suffix) of x; $x^{\prime }\preccurlyeq x$ ($x^{\prime }\prec x$) means that $x^{\prime }$ is a prefix (proper prefix) of x. The elements of $\mathit{OV}(\mathcal{H})$ that are proper prefixes (respectively suffixes) of a given word are totally ordered. The empty string is the minimal element. The maximal elements are crucial for our algorithms and data structures.
Definition 2.
Given a word v in $(\mathit{OV}(\mathcal{H})\cup \mathcal{H})\setminus \{\epsilon \}$, one denotes by $\mathit{lpred}(v)$ the maximal element of $\mathit{OV}(\mathcal{H})$ that is a proper prefix of v, and by $\mathit{rpred}(v)$ the maximal element of $\mathit{OV}(\mathcal{H})$ that is a proper suffix of v.
Two words H and F from the pattern $\mathcal{H}$ are called equivalent if they satisfy $\mathit{lpred}(\mathrm{H})=\mathit{lpred}(\mathrm{F})$ and $\mathit{rpred}(\mathrm{H})=\mathit{rpred}(\mathrm{F})$.
Notation.
Given two words x and w in $\mathit{OV}(\mathcal{H})$, let ${\mathrm{H}}^{\ast }(x,w)$ denote the equivalence class consisting of all words $\mathrm{H}\in \mathcal{H}$ such that $\mathit{lpred}(\mathrm{H})=x$ and $\mathit{rpred}(\mathrm{H})=w$.
One notes, for a word H in H^{∗}(x,w),
Let $\mathcal{P}(\mathcal{H})$ denote the set of all equivalence classes on $\mathcal{H}$.
Example 2.
Consider the pattern $\mathcal{H}$ from the previous example.

1.
For the overlap $\text{ACA}\in \mathit{OV}(\mathcal{H})$, $\mathit{lpred}(\text{ACA})=\text{A}$, because A is the maximal prefix of ACA that is an overlap. Analogously, $\mathit{rpred}(\text{ACA})=\text{A}$.

2.
The words ACAGCTA and ACATATA from the pattern $\mathcal{H}$ are equivalent because
$$\mathit{lpred}(\text{ACAGCTA})=\mathit{lpred}(\text{ACATATA})=\text{ACA}\phantom{\rule{1em}{0ex}}\text{and}$$$$\mathit{rpred}(\text{ACAGCTA})=\mathit{rpred}(\text{ACATATA})=\text{TA}.$$
These words are in the class ${\mathcal{\mathscr{H}}}^{\ast}\left(\text{ACA, TA}\right)$. The partition $\mathcal{P}\left(\mathcal{\mathscr{H}}\right)$ consists of three classes: ${\mathcal{\mathscr{H}}}^{\ast}\left(\text{ACA, TA}\right)=\left\{\text{ACAGCTA, ACATATA}\right\}$, ${\mathcal{\mathscr{H}}}^{\ast}\left(\text{C, C}\right)=\left\{\text{CTTTCGC}\right\}$ and ${\mathcal{\mathscr{H}}}^{\ast}\left(\text{TA, ACA}\right)=\left\{\text{TACCACA}\right\}$.
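The partition $\mathcal{P}(\mathcal{H})$ of this example can be reproduced by a direct implementation of Definition 2. The helper names lpred and rpred mirror the text; everything else is an illustrative sketch, not the paper's data structure:

```python
from collections import defaultdict

H = {"ACAGCTA", "ACATATA", "CTTTCGC", "TACCACA"}
OV = {"", "A", "ACA", "C", "TA"}  # overlap set from Example 1

def lpred(v):
    """Longest overlap that is a proper prefix of v (Definition 2)."""
    return max((x for x in OV if v.startswith(x) and x != v), key=len)

def rpred(v):
    """Longest overlap that is a proper suffix of v."""
    return max((x for x in OV if v.endswith(x) and x != v), key=len)

# Group pattern words by the pair (lpred, rpred): the classes H*(x, w).
classes = defaultdict(set)
for word in H:
    classes[(lpred(word), rpred(word))].add(word)

for (x, w), cls in sorted(classes.items()):
    print(f"H*({x or 'eps'}, {w or 'eps'}) = {sorted(cls)}")
```

Running this prints the three classes of Example 2: H*(ACA, TA) = {ACAGCTA, ACATATA}, H*(C, C) = {CTTTCGC} and H*(TA, ACA) = {TACCACA}.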
Order relations are commonly associated with oriented graphs.
Definition 3.
The overlap graph of a given pattern $\mathcal{H}$ is an oriented graph whose set of nodes is $\mathit{OV}(\mathcal{H})$ and whose set of edges, $E(\mathcal{H})$, contains left, right and deep edges, defined as follows:

A left edge links node x to node w iff $x=\mathit{lpred}(w)$;

A right edge links node x to node w iff $x=\mathit{rpred}(w)$;

A deep edge links node x to node w iff there exists a nonempty class ${\mathrm{H}}^{\ast }(x,w)$ in $\mathcal{P}(\mathcal{H})$.
It is denoted OvGraph.
The root is the node corresponding to the empty word.
Definition 4.
An overlap $w\in \mathit{OV}(\mathcal{H})$ is called a left deep node, respectively a right deep node, if there exists a word $\mathrm{H}\in \mathcal{H}$ such that $w=\mathit{lpred}(\mathrm{H})$, respectively $w=\mathit{rpred}(\mathrm{H})$. The sets of all left and right deep nodes are denoted by $\mathit{DLOV}(\mathcal{H})$ and $\mathit{DROV}(\mathcal{H})$, respectively.
Notation.
For a right deep node $r\in \mathit{DROV}(\mathcal{H})$, one denotes by $\tilde{\mathcal{H}}(r)$ the set of words $\mathrm{H}\in \mathcal{H}$ such that $\mathit{rpred}(\mathrm{H})=r$.
Below we will use r to denote a right deep node.
Definition 5.
Let v be in $(\mathit{OV}(\mathcal{H})\cup \mathcal{H})\setminus \{\epsilon \}$.
The set of nonempty prefixes of v (including v) that belong to $\mathit{OV}(\mathcal{H})$ is denoted by $\mathit{OverlapPrefix}(v)$. For any prefix x in $\mathit{OverlapPrefix}(v)$, let $\mathit{Back}(x,v)$ denote the suffix of v that satisfies the equation $v=x\cdot \mathit{Back}(x,v)$.
Let $\mathit{Back}(v)$ denote $\mathit{Back}(\mathit{lpred}(v),v)$.
Also, for ${\mathrm{H}}^{\ast }(w,r)\in \mathcal{P}(\mathcal{H})$, we denote $\mathit{Back}({\mathrm{H}}^{\ast }(w,r))=\{\mathit{Back}(\mathrm{H})\mid \mathrm{H}\in {\mathrm{H}}^{\ast }(w,r)\}$.
Remark.
One can ascribe to each deep edge (w,r) the class ${\mathrm{H}}^{\ast }(w,r)$, and to each left edge $(\mathit{lpred}(w),w)$ the word label $\mathit{Back}(w)$.
Example 3.
The overlap graph for the pattern $\mathcal{\mathscr{H}}=\left\{\text{ACAGCTA, ACATATA, CTTTCGC, TACCACA}\right\}$ is shown in Figure 1. The nodes of the graph correspond to the overlaps from the set $\mathit{\text{OV}}\left(\mathcal{\mathscr{H}}\right)=\{\epsilon ,\text{A, ACA, C, TA}\}$. The index numbers of nodes are the index numbers of overlaps in the prefix order. The graph has four left edges (shown by straight lines), five right edges (shown by dashed lines) and three deep edges (shown by double lines).
Text sets
The computation of Pvalues will be done by induction on the text length n (n=1,…,N), and, for each given n, by induction on the number of occurrences s (s=1,…,S).
It relies on formulas introduced in [37], which in turn were based on ideas from [12,13]. In [37] we gave formulas for P-value computation under Bernoulli and Markov models. In the present paper we introduce equations on text sets that underlie these formulas; from these equations one can derive formulas for P-value computation under different probability models. These equations also take into account the improvements in the overlap graph structure, see section “Overlap words”.
Definition 6.
Let $\mathcal{H}$ be a pattern. For integers n and s≥1, let B(n,s) denote the set of texts of length n that contain at least s occurrences of words from $\mathcal{H}$; these sets are called B-sets.
By convention, $B(n,0)=\mathrm{V}^{n}$.
Definition 7.
Given a right deep node $r\in \mathit{DROV}(\mathcal{H})$, one defines, for s=1,…,S,S+1, the set E(n,s,r) of texts of length n that contain at least s occurrences of $\mathcal{H}$ and end with a word $\mathrm{H}\in \tilde{\mathcal{H}}(r)$, i.e. with a word H such that $\mathit{rpred}(\mathrm{H})=r$.
These sets are called E-sets.
Definition 8.
Let $w\in \mathit{OV}(\mathcal{H})$; one defines, for s = 1,…,S, the set R(n,s,w) of texts of length n that contain exactly s occurrences of $\mathcal{H}$, the sth occurrence being a word $\mathrm{H}\in \mathcal{H}(w)$ that ends the text.
These sets are called R-sets.
Remark.
We remark that
Note that if t∈E(n,s,r) then t ends with a word H from $\tilde{\mathcal{H}}(r)$, where $r=\mathit{rpred}(\mathrm{H})$. In contrast, if t∈R(n,s,w) then t ends with a word H from $\mathcal{H}(w)$, i.e. w is a suffix of H.
Example 4.
Consider the pattern $\mathcal{H}=\{\text{ACAGCTA},\text{ACATATA},\text{CTTTCGC},\text{TACCACA}\}$ from example 1, and consider the text t _{1} = CTTTCGCCGAATCACAGCTA. The text is of length 20, contains exactly 2 occurrences of $\mathcal{H}$ (namely CTTTCGC and the final ACAGCTA) and ends with ACAGCTA. Obviously, $\mathit{rpred}(\text{ACAGCTA})=\text{TA}$. Thus t _{1} is in B(20,1), B(20,2), E(20,1,TA), E(20,2,TA), R(20,2,TA), R(20,2,A) and R(20,2,ε).
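These membership claims can be checked mechanically. The script below assumes the set descriptions given above (B(n,s): at least s occurrences; E(n,s,r): additionally ends with a word H with rpred(H)=r; R(n,s,w): exactly s occurrences, the sth ending the text with a word having suffix w); it is a sketch of those definitions, not of the paper's algorithm:

```python
H = {"ACAGCTA", "ACATATA", "CTTTCGC", "TACCACA"}
t1 = "CTTTCGCCGAATCACAGCTA"

def occ_ends(text, pattern):
    """1-based end positions of all pattern occurrences in text."""
    return sorted(i + len(w) for i in range(len(text))
                  for w in pattern if text.startswith(w, i))

ends = occ_ends(t1, H)
assert len(t1) == 20 and ends == [7, 20]   # exactly two occurrences

# t1 in B(20,1) and B(20,2): at least 1 and at least 2 occurrences.
assert len(ends) >= 2

# t1 in E(20,s,TA) for s = 1, 2: it ends with ACAGCTA, and rpred(ACAGCTA) = TA.
last = next(w for w in H if t1.endswith(w))
assert last == "ACAGCTA"

# t1 in R(20,2,w) for w in {TA, A, eps}: exactly 2 occurrences, the 2nd
# ends the text, and w is a suffix of the final word ACAGCTA.
assert len(ends) == 2 and ends[-1] == len(t1)
assert all(last.endswith(w) for w in ("TA", "A", ""))
print("memberships of Example 4 confirmed")
```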
Example 5.
Consider the pattern $\mathcal{H}$ from the previous examples and the set E(20,2,TA). A text t from E(20,2,TA) is of length 20, has at least 2 occurrences of $\mathcal{H}$ and ends with a word H from $\mathcal{H}$ such that $\mathit{rpred}(\mathrm{H})=\text{TA}$. Obviously, H is ACAGCTA or ACATATA. The words ACAGCTA and ACATATA are from the same class ${\mathrm{H}}^{\ast }(\text{ACA},\text{TA})$. For example, the texts t _{1}=CTTTCGCCGAATCACAGCTA, t _{2}=CTTTCGCGGTACCACATATA, t _{3}=TACCACATATACCACAGCTA, t _{4}=ACGTTTCCATACCACAGCTA, t _{5}=ACTAAGACAGCTACATATA are in E(20,2,TA).
Definition 9.
Given a right deep node $r\in \mathit{DROV}(\mathcal{H})$, one defines, for s=1,…,S, the set RE(n,s,r) of texts of length n that contain exactly s occurrences of $\mathcal{H}$ and end with a word $\mathrm{H}\in \tilde{\mathcal{H}}(r)$.
Remark that
$$\mathit{RE}(n,s,r)=E(n,s,r)\setminus E(n,s+1,r).\phantom{\rule{2em}{0ex}}(5)$$
Example 6.
Consider a pattern ${\mathcal{H}}_{1}$ that extends the pattern $\mathcal{H}$ from the previous examples with a word beginning with ATA. Obviously, $\mathit{OV}({\mathcal{H}}_{1})=\{\epsilon ,\text{A, ACA, ATA, C, TA}\}$. Consider the texts t _{1}=CTTTCGCCGAATCACAGCTA and t _{5}=ACTAAGACAGCTACATATA. The texts t _{1} and t _{5} belong to R(20,2,TA) because they: 1) have length 20; 2) contain exactly two occurrences of ${\mathcal{H}}_{1}$; and 3) end with words from ${\mathcal{H}}_{1}(\text{TA})$, i.e. words with the suffix TA. Also, the text t _{1} is in RE(20,2,TA) because it ends with ACAGCTA, and $\mathit{rpred}(\text{ACAGCTA})=\text{TA}$. In contrast, t _{5} is not in RE(20,2,TA) because it ends with ACATATA and, with respect to ${\mathcal{H}}_{1}$, $\mathit{rpred}(\text{ACATATA})=\text{ATA}$.
The following proposition gives the inductive relations allowing effective computation of probabilities of Rsets.
Proposition 1.
Let $w\in \mathit{OV}(\mathcal{H})$. If w is a right deep node, i.e. $w=\mathit{rpred}(\mathrm{H})$ for a word $\mathrm{H}\in \mathcal{H}$, then
$$R(n,s,w)=\mathit{RE}(n,s,w)\;\cup \bigcup _{v\in \mathit{OV}(\mathcal{H}):\,\mathit{rpred}(v)=w}R(n,s,v);\phantom{\rule{2em}{0ex}}(6)$$
otherwise,
$$R(n,s,w)=\bigcup _{v\in \mathit{OV}(\mathcal{H}):\,\mathit{rpred}(v)=w}R(n,s,v).\phantom{\rule{2em}{0ex}}(7)$$
The proof follows from the definition of Rsets.
Example 7.
Let us illustrate proposition 1 with the data from example 6. As we have seen before, $t_{1},t_{5}\in R(20,2,\text{TA})$. Further, $t_{1}\in \mathit{RE}(20,2,\text{TA})$, and $t_{5}\in R(20,2,\text{ATA})$, where $\text{TA}=\mathit{rpred}(\text{ATA})$. Also note that $R(20,2,\text{ATA})=\mathit{RE}(20,2,\text{ATA})$.
Remark.
For given n and s, we have to compute the probabilities of the sets R(n,s,w) for all $w\in \mathit{OV}(\mathcal{H})$. Equations (6) and (7) allow us to do this by a recursive traversal of OvGraph from the deep nodes to the root along the right edges. The calculation starts from the probabilities of RE-sets, found according to equation (5).
Below we introduce D-sets and give equations relating D-sets, R-sets and E-sets, leading to recursive equations for the probabilities of E-sets. The D-sets defined below consist of texts of length n containing at least s occurrences of the pattern $\mathcal{H}$ and ending with a given nonempty overlap word w that has a common part with the sth occurrence of the pattern.
Definition 10.
Let $w\in \mathit{OV}(\mathcal{H})$, w≠ε, k≥1. The set D(k,s,w) consists of the texts of length k that end with w, contain at least s occurrences of $\mathcal{H}$, and in which the sth occurrence of $\mathcal{H}$ intersects the ending suffix w. These sets are called D-sets.
By definition, D(k,s,ε)=∅.
Notation.
Below we will use the following notations: 1) $\mathit{len}(x)$, for the length of a word x; 2) |M|, for the number of words in a set of words M.
Notation.
For an overlap $w\in \mathit{OV}(\mathcal{H})$ and any integer n, one denotes
$$k(n,w)=n-m+\mathit{len}(w),$$
where m is the length of words from $\mathcal{H}$.
Example 8.
Let n=20 and s=2. Consider the pattern $\mathcal{H}=\{\text{ACAGCTA, ACATATA, CTTTCGC, TACCACA}\}$ and the texts t _{4}=ACGTTTCCATACCACAGCTA, t _{5}=ACTAAGACAGCTACATATA from example 5. In both cases, the first occurrence of $\mathcal{H}$ intersects the ending occurrence of $\mathcal{H}$. The texts end with words from the class ${\mathrm{H}}^{\ast }(\text{ACA},\text{TA})=\{\text{ACAGCTA, ACATATA}\}$.
Consider the overlap w=ACA. Then k(n,w)=16. Consider the prefixes t _{4}[ 1,16]=ACGTTTCCATACCACA and t _{5}[ 1,16]=ACTAAGACAGCTACA of the texts. For these prefixes we have: (1) their length is 16; (2) they end with ACA; (3) they have at least s−1=1 occurrence of $\mathcal{H}$; and (4) the first occurrence of $\mathcal{H}$ intersects the suffix ACA. Thus the prefixes t _{4}[ 1,16] and t _{5}[ 1,16] are in D(16,1,ACA). Further, t _{4} and t _{5} are in $D(16,1,\text{ACA})\cdot \mathit{Back}({\mathrm{H}}^{\ast }(\text{ACA},\text{TA}))$, where $\mathit{Back}({\mathrm{H}}^{\ast }(\text{ACA},\text{TA}))=\{\text{GCTA},\text{TATA}\}$. Note that t _{5}[ 1,14] also belongs to D(14,1,A).
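The reasoning of this example can be verified programmatically for t_4. The check below assumes the description of D-sets given above (length k(n,w), suffix w, at least s occurrences, the sth of which intersects the suffix); it is an illustrative sketch, not part of the algorithm:

```python
H = {"ACAGCTA", "ACATATA", "CTTTCGC", "TACCACA"}
m = 7  # common length of the pattern words

def occurrences(text):
    """(start, end) positions, 0-based half-open, of pattern occurrences."""
    return sorted((i, i + m) for i in range(len(text))
                  for w in H if text.startswith(w, i))

t4 = "ACGTTTCCATACCACAGCTA"
n, s, w = 20, 1, "ACA"
k = n - m + len(w)                  # k(n, w) = 16
prefix = t4[:k]

occ = occurrences(prefix)
assert prefix.endswith(w)           # the prefix ends with the overlap ACA
assert len(occ) >= s                # at least s = 1 occurrence (TACCACA)
assert occ[s - 1][1] > k - len(w)   # the s-th occurrence reaches into ACA
print(prefix, "is in D(16, 1, ACA)")
```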
The next propositions describe the relation between Dsets and Rsets.
Proposition 2.
Let $w\in \mathit{OV}(\mathcal{H})$, w≠ε. Then
$$D(k(n,w),s,w)=\bigcup _{x\in \mathit{OverlapPrefix}(w)}R(k(n,x),s,x)\cdot \mathit{Back}(x,w).$$
Proof: [see Additional file 1].
Informally speaking, x is the common part of the sth occurrence of the pattern $\mathcal{H}$ in a text t∈D(k(n,w),s,w) and the suffix w of t. Remark that, according to definition 5: (1) ε is not in $\mathit{OverlapPrefix}(w)$; (2) w is in $\mathit{OverlapPrefix}(w)$.
Proposition 3.
Let $w\in \mathit{OV}(\mathcal{H})\setminus \{\epsilon \}$, n≥m, s≥1. Then
$$D(n,s,w)=D(n-\mathit{len}(\mathit{Back}(w)),s,\mathit{lpred}(w))\cdot \mathit{Back}(w)\;\cup \;R(n,s,w).$$
Proof: Follows from the proposition 2 [see Additional file 1].
Corollary 1.
If l p r e d(w)=ε then D(n,s,w)=R(n,s,w).
One observes that, whenever n<m and s≥1, B(n,s)=∅, and, for all $w\in \mathit{OV}(\mathcal{H})$ and $r\in \mathit{DROV}(\mathcal{H})$, R(n,s,w) = E(n,s,r) =∅.
Now we are ready to formulate the main theorem of the section. The theorem gives recursive equations for B-sets and E-sets. The main equations (13)–(15) are based on the following observation. The set E(n,s+1,r), s≥1, can be divided into two disjoint sets: F(n,s+1,r) and C(n,s+1,r). The set F(n,s+1,r) consists of those texts in which the sth occurrence of the pattern does not intersect the ending occurrence of $\mathcal{H}$. The set C(n,s+1,r) consists of those texts t from E(n,s+1,r) in which the sth occurrence of $\mathcal{H}$ in t intersects the ending occurrence.
Theorem 1.
Let n≥m, s≥1 and $r\in \mathit{\text{DROV}}\left(\mathcal{\mathscr{H}}\right)$, i.e. r is a right deep node.

1.
Sets B(n,s) and E(n,s,r) satisfy the following equations:
$$B(n,s)=B(n-1,s)\cdot \mathrm{V}\;\cup \;R(n,s,\epsilon )\phantom{\rule{2em}{0ex}}(11)$$
$$E(n,1,r)={\mathrm{V}}^{n-m}\cdot \tilde{\mathcal{H}}(r)\phantom{\rule{2em}{0ex}}(12)$$
$$F(n,s+1,r)=B(n-m,s)\cdot \tilde{\mathcal{H}}(r)\phantom{\rule{2em}{0ex}}(13)$$
$$C(n,s+1,r)=\bigcup _{w:\,(w,r)\text{ is a deep edge}}D(k(n,w),s,w)\cdot \mathit{Back}({\mathrm{H}}^{\ast }(w,r))\phantom{\rule{2em}{0ex}}(14)$$
$$E(n,s+1,r)=F(n,s+1,r)\cup C(n,s+1,r)\phantom{\rule{2em}{0ex}}(15)$$
Note that (w,r) is a deep edge iff ${\mathrm{H}}^{\ast }(w,r)\in \mathcal{P}(\mathcal{H})$, see definition 3.

2.
Unions (11), (14) and (15) are disjoint, i.e.
$$B(n-1,s)\cdot \mathrm{V}\;\cap \;R(n,s,\epsilon )=\varnothing ;$$
if (w,r)≠(v,x) then
$$D(k(n,w),s,w)\cdot \mathit{Back}({\mathrm{H}}^{\ast }(w,r))\;\cap \;D(k(n,v),s,v)\cdot \mathit{Back}({\mathrm{H}}^{\ast }(v,x))=\varnothing ;$$
and $F(n,s+1,r)\cap C(n,s+1,r)=\varnothing $.
Example 9.
The statements (13)–(15) can be illustrated with the data from examples 5 and 8. Let n=20, s=1, r=TA. Then (15) can be rewritten as
$$E(20,2,\text{TA})=F(20,2,\text{TA})\cup C(20,2,\text{TA}).$$
Consider the texts t _{1},…,t _{5} from the example 5.
In each of the texts t _{1},t _{2},t _{3} the ending occurrence of the pattern does not intersect the first occurrence. Therefore these texts are in F(20,2,TA). Note that the ending occurrence ACATATA of the pattern in t _{2} intersects the second occurrence but not the first. Consider the prefixes of t _{1}, t _{2} and t _{3} of length n−m=20−7=13: t _{1}[ 1,13]=CTTTCGCCGAATC, t _{2}[ 1,13]=CTTTCGCGGTACC and t _{3}[ 1,13]=TACCACATATACC. The prefixes contain at least one occurrence of $\mathcal{H}$, i.e. they are in B(13,1). Thus ${t}_{1},{t}_{2},{t}_{3}\in B(13,1)\cdot \tilde{\mathcal{H}}(\text{TA})$, in agreement with statement (13) of the theorem. Obviously, $F(20,2,\text{TA})=B(13,1)\cdot \tilde{\mathcal{H}}(\text{TA})$.
In contrast, in each of the texts t _{4} and t _{5} the last occurrence of the pattern intersects the first occurrence. Therefore the texts t _{4},t _{5}∈C(20,2,TA). According to example 8, the texts t _{4} and t _{5} are in $D(16,1,\text{ACA})\cdot \mathit{Back}({\mathrm{H}}^{\ast }(\text{ACA},\text{TA}))$, which illustrates statement (14) of the theorem.
Note that there is only one overlap w such that ${\mathcal{H}}^{\ast }(w,\text{TA})\ne \varnothing $, namely w=ACA. Thus
$$C(20,2,\text{TA})=D(16,1,\text{ACA})\cdot \mathit{Back}({\mathrm{H}}^{\ast }(\text{ACA},\text{TA})).$$
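Equation (11) and its disjointness can be sanity-checked by exhaustive enumeration on a toy instance. The single-word pattern {AA} over the alphabet {A,B} is a made-up example, and R(n,s,ε) is implemented as "exactly s occurrences, the sth ending the text", as described above:

```python
from itertools import product

V = "AB"

def occ_ends(t):
    """1-based end positions of occurrences of the pattern {AA} in t."""
    return [i + 2 for i in range(len(t)) if t.startswith("AA", i)]

def texts(n):
    return ("".join(p) for p in product(V, repeat=n))

def B(n, s):
    """B(n, s): texts of length n with at least s occurrences."""
    return {t for t in texts(n) if len(occ_ends(t)) >= s}

def R_eps(n, s):
    """R(n, s, eps): exactly s occurrences, the s-th ending the text."""
    return {t for t in texts(n)
            if len(occ_ends(t)) == s and occ_ends(t)[-1] == n}

left = {t + a for t in B(2, 1) for a in V}   # B(n-1, s) . V
right = R_eps(3, 1)
assert left | right == B(3, 1)               # equation (11)
assert not (left & right)                    # and the union is disjoint
print(sorted(B(3, 1)))  # ['AAA', 'AAB', 'BAA']
```

Here B(2,1)·V = {AAA, AAB} covers the texts whose prefix of length 2 already contains an occurrence, while R(3,1,ε) = {BAA} contributes the text whose first occurrence ends exactly at position 3.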
Proof:

1.
Consider statement (11). A text t is in B(n,s) if and only if either its prefix of length n−1 contains at least s occurrences of $\mathcal{H}$, or an sth occurrence H of $\mathcal{H}$ ends at position n. In the first case, t is in B(n−1,s)·V. In the second case, t belongs to R(n,s,ε). The two cases are mutually exclusive; therefore B(n,s) is a disjoint union and (11) is proved.

2.
The statement (12) directly follows from the definition of E(n,1,r).

3.
Consider the statement (13).

(a)
First, we prove that $F(n,s+1,r)\subseteq B(n-m,s)\cdot \tilde{\mathcal{H}}(r)$. When a text t is in F(n,s+1,r), it ends with a word $\mathrm{H}\in \mathcal{H}$ such that $r=\mathit{rpred}(\mathrm{H})$, i.e. $\mathrm{H}\in \tilde{\mathcal{H}}(r)$. The last occurrence H of the pattern does not intersect the sth occurrence in the text t. Thus the prefix of t of length n−m contains at least s occurrences of $\mathcal{H}$, i.e. it is in B(n−m,s), where m is the length of pattern words. Therefore t is in $B(n-m,s)\cdot \tilde{\mathcal{H}}(r)$.

(b)
Obviously, if $t\in B(nm,s)\xb7\stackrel{~}{\mathcal{\mathscr{H}}}\left(r\right)$ then

t has the length n;

t contains at least s+1 occurrences of the pattern $\mathcal{H}$;

the sth occurrence of $\mathcal{H}$ lies in the prefix of t of length n−m, i.e. it does not intersect the last occurrence;

t ends with $\mathrm{H}\in \stackrel{~}{\mathcal{\mathscr{H}}}\left(r\right)$.

Therefore t∈F(n,s+1,r).

4.
Consider the statement (14). Let Y denote the right side of equation (14).

(a)
We prove that C(n,s+1,r)⊆Y. If a text t is in C(n,s+1,r) then it ends with a word $\mathrm{H}\in \tilde{\mathcal{H}}(r)$. The last occurrence H intersects the sth occurrence of the pattern in the text t. Let H_{1} be the sth occurrence of $\mathcal{H}$ in t, and let x be the overlap between H_{1} and H in t. Obviously, $x\in \mathit{OverlapPrefix}(w)$, where $w=\mathit{lpred}(\mathrm{H})$, see definition 5 of $\mathit{OverlapPrefix}(w)$. The prefix of t of length k(n,x), where $k(n,x)=n-m+\mathit{len}(x)$, contains exactly s occurrences of $\mathcal{H}$ and ends with H_{1}, where ${\mathrm{H}}_{1}\in \mathcal{H}(x)$. By the definition of R-sets, the prefix is in R(k(n,x),s,x). Therefore $t\in R(k(n,x),s,x)\cdot \mathit{Back}(x,\mathrm{H})$. Observing that
$$\mathit{\text{Back}}(x,\mathrm{H})=\mathit{\text{Back}}(x,w)\xb7\mathit{\text{Back}}\left(\mathrm{H}\right)$$

we obtain
$$t\in R(k(n,x),s,x)\cdot \mathit{Back}(x,w)\cdot \mathit{Back}(\mathrm{H}).$$
Note that $k(n,x)=k(n,w)-\mathit{len}(\mathit{Back}(x,w))$, where $\mathit{len}(\mathit{Back}(x,w))=\mathit{len}(w)-\mathit{len}(x)$.
According to proposition 2,
$$R(k(n,x),s,x)\cdot \mathit{Back}(x,w)\subseteq D(k(n,w),s,w).$$
Thus
$$t\in D(k(n,w),s,w)\cdot \mathit{Back}(\mathrm{H}).$$
Note that if H∈H^{∗}(w,r) then $\mathit{Back}(\mathrm{H})\in \mathit{Back}({\mathrm{H}}^{\ast }(w,r))$. Therefore,
$$t\in D(k(n,w),s,w)\cdot \mathit{Back}({\mathrm{H}}^{\ast }(w,r)).$$
This yields that t∈Y.

(b)
Proof that Y⊆C(n,s+1,r). Let t∈Y, i.e. $t\in D(k(n,w),s,w)\cdot \mathit{Back}({\mathrm{H}}^{\ast }(w,r))$. By the definition of D-sets,
t has the length n;
t contains at least s+1 occurrences of the pattern $\mathcal{H}$;
the sth occurrence intersects the (s+1)th occurrence of $\mathcal{H}$;
t ends with $\mathrm{H}\in \tilde{\mathcal{H}}(r)$.
Thus t∈C(n,s+1,r).

5.
The statement (15) follows from the definitions of F and Csets.
Notation.
Given two integers n and s, and a class H^{∗}(w,r), one introduces
Obviously,
Remark.
The unions in equations (11), (14), (15) and (17) are disjoint. Therefore the probability of the set in the left part of an equation is the sum of probabilities of sets in the right side.
Probability models
We suppose that the probability distribution is described by a Hidden Markov Model (HMM). In this section, we recall some basic notions about HMMs and introduce the needed notations. In fact, it is shown in [6] that our definition is equivalent to the classical definition of HMM [40].
Definition 11.
A HMM G is a triple G=〈Q,q _{0},π〉, where Q is the set of states, q _{0}∈Q is the initial state, and π is a function Q×V×Q→ [ 0,1] such that $\pi (\tilde{q},a,q)$ is the probability, being in state $\tilde{q}$, to generate the symbol a and move to the state q. For any state $\tilde{q}$ in Q, the function π meets the condition
$$\sum _{a\in \mathrm{V}}\sum _{q\in Q}\pi (\tilde{q},a,q)=1.$$
A HMM G is called deterministic if for any $(\tilde{q},a)$ in Q×V there is only one state q such that $\pi (\tilde{q},a,q)>0$. In this case the function π can be described with two functions:

1.
a transition function ϕ:Q×V→Q;

2.
a probability function ρ:Q×V→ [ 0,1].
Namely, $\varphi (\tilde{q},a)$ is equal to the unique state q such that $\pi (\tilde{q},a,q)>0$, and $\rho (\tilde{q},a)=\pi (\tilde{q},a,q)$.
A HMM G=〈Q,q _{0},π〉 can be represented as a graph where Q is the set of vertices. Each edge is assigned a label a∈V and a probability p∈(0,1]. There exists an edge from $\tilde{q}$ to q with the label a and probability p iff $\pi (\tilde{q},a,q)>0$ and $p=\pi (\tilde{q},a,q)$. The graph is called the traversal graph of the HMM G.
Definition 12.
Let h be a path in the traversal graph of the HMM G. The label of h is the concatenation of the labels of the edges that constitute h. The probability $\mathit{Prob}(h)$ of a path h is the product of the probabilities of the edges that constitute h.
Definition 13.
The probability $\mathit{Prob}(t)$ of a word t with respect to the HMM G is the sum of the probabilities of all paths that start in the initial state q _{0} and have the label t.
Let q and q~ belong to Q and t be a word. By definition, the probability P r o b(q~,t,q) to move from the state q~ to the state q with the emitted word t is the sum of probabilities of all paths starting in the state q~, ending in the state q and having the word label t.
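Definition 13 can be evaluated without enumerating paths explicitly, by a forward recursion over the prefixes of t. Below is a minimal Python sketch, assuming π is stored as a dictionary keyed by triples (state, symbol, state); the data layout and names are hypothetical, not taken from the SufPref implementation:

```python
def word_probability(pi, states, q0, text):
    """Prob(t): total probability of all paths labeled `text` starting at q0.

    pi: dict mapping (q_from, symbol, q_to) -> transition/emission probability.
    A forward recursion: alpha[q] is the probability of reaching state q
    after emitting the prefix of `text` read so far.
    """
    alpha = {q: 0.0 for q in states}
    alpha[q0] = 1.0
    for a in text:
        nxt = {q: 0.0 for q in states}
        for q_from in states:
            p_from = alpha[q_from]
            if p_from == 0.0:
                continue  # unreachable state, nothing to propagate
            for q_to in states:
                nxt[q_to] += p_from * pi.get((q_from, a, q_to), 0.0)
        alpha = nxt
    return sum(alpha.values())
```

Starting the recursion from a state q~ instead of q _{0} and reading off alpha[q] at the end yields Prob(q~,t,q).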
To describe effective algorithms related to HMMs, we need the notion of reachability.
Definition 14.
Given a state q~ and a word t, we define
Given a state q and a string t, we define
A state q is called t-reachable from a state q~ if and only if Prob(q~,t,q)>0.
Definition 15.
For a given word t, AllState(t) is the set of states that are reachable from the initial state q _{0} by at least one text with suffix t. For a set of words M,
Remark.
Definition 16.
Let w be an overlap word. We denote by PriorState(w,q) the set of states q~∈AllState(lpred(w)) such that q is Back(w)-reachable from q~, i.e.
Analogously, for each deep edge (w,r) and its associated class H^{∗}(w,r), we define
HMM and probabilistic automata
The definition of an HMM is very close to the definition of a probabilistic automaton (PA) [41,42]. The main difference lies in the interpretation of the behavior of a model. For an HMM, one considers a label as a symbol emitted by the HMM; for automata, one imagines an automaton that processes a given word letter by letter. Another difference, connected with the previous one, is that PAs are typically used to describe word sets; thus, for a given PA, a subset of accepting states is defined. HMMs are mainly used to describe probability models and thus have no accepting states.
In applications, one often uses a probabilistic automaton built as a Cartesian product of a deterministic automaton accepting a given set of words and an HMM describing the word probabilities, see e.g. [6,43]. A similar construction is used below. In fact, we describe generalized probabilistic automata (GPA). As opposed to PAs, the edges in the graph that represents our automaton are labeled with words rather than with letters; thus it can be called a generalized probabilistic automaton, analogously to the definition of a generalized HMM [44].
An originality of SUFPREF is that the words from the pattern, or classes, that represent terminal states in classical automata need not be explicitly represented. Nevertheless, each class is uniquely associated with one deep edge.
Probabilities equations for HMM
In the sections above, the main text sets and the corresponding equations were described. The equations can be used to compute the probabilities of the text sets under an arbitrary probability model. Here we give formulas that compute the probabilities for an HMM. The formulas are based on the following observations. First, all unions in the text equations are disjoint. Second, each item of a set union is either a set with an already known probability or a concatenation of such sets. In the latter case the probability Prob(q _{1},L _{1}·L _{2},q _{2}) can be computed by the formula
$$\mathit{Prob}(q_{1},L_{1}\cdot L_{2},q_{2})=\sum_{q\in Q}\mathit{Prob}(q_{1},L_{1},q)\cdot \mathit{Prob}(q,L_{2},q_{2}),$$
where Prob(q ^{′},L,q) is the probability that, being in the state q ^{′}, the chain moves to the state q while emitting a word t from the set L.
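The concatenation formula above is a matrix product over the state set Q. A small sketch with probabilities stored as dictionaries keyed by (start state, end state); the representation is hypothetical:

```python
def concat_probability(prob_L1, prob_L2, states):
    """Prob(q1, L1·L2, q2) = sum over intermediate states q of
    Prob(q1, L1, q) * Prob(q, L2, q2) -- a matrix product over Q.

    prob_L1, prob_L2: dicts mapping (q_start, q_end) -> probability.
    Missing entries are treated as probability 0.
    """
    return {
        (q1, q2): sum(prob_L1.get((q1, qm), 0.0) * prob_L2.get((qm, q2), 0.0)
                      for qm in states)
        for q1 in states for q2 in states
    }
```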
Let n,s be integers, $w\in \mathit{OV}(\mathcal{H})$, $r\in \mathit{DROV}(\mathcal{H})$ and q∈Q. Then

1.
From (11) follows
$$\mathit{Prob}(B(n,s),q)=\sum_{\tilde{q}\in Q}\mathit{Prob}(B(n-1,s),\tilde{q})\cdot \pi(\tilde{q},q)+\mathit{Prob}(R(n,s,\varepsilon),q), \quad (21)$$
where $\pi(\tilde{q},q)=\sum_{a\in V}\pi(\tilde{q},a,q)$;

2.
From (12) follows
$$\mathit{Prob}(E(n,1,r),q)=\sum_{\tilde{q}\in \mathit{StartState}(\tilde{\mathcal{H}}(r),q)}\mathit{Prob}(V^{n-m},\tilde{q})\times \mathit{Prob}(\tilde{q},\tilde{\mathcal{H}}(r),q); \quad (22)$$

3.
From (13)–(15) and (17) follows
$$\mathit{Prob}(F(n,s+1,r),q)=\sum_{\tilde{q}\in \mathit{StartState}(\tilde{\mathcal{H}}(r),q)}\mathit{Prob}(B(n-m,s),\tilde{q})\times \mathit{Prob}(\tilde{q},\tilde{\mathcal{H}}(r),q); \quad (23)$$
$$\mathit{Prob}(C'(n,s+1,w,r),q)=\sum_{\tilde{q}\in \mathit{PriorState}(H^{*}(w,r),q)}\mathit{Prob}(D(k(n,w),s,w),\tilde{q})\times \mathit{Prob}(\tilde{q},\mathit{Back}(H^{*}(w,r)),q); \quad (24)$$
$$\mathit{Prob}(E(n,s+1,r),q)=\mathit{Prob}(F(n,s+1,r),q)+\sum_{w:\,(w,r)\ \text{is a deep edge}}\mathit{Prob}(C'(n,s+1,w,r),q); \quad (25)$$

4.
Let lpred(w)≠ε. Then from (10) follows
$$\mathit{Prob}(D(k(n,w),s,w),q)=\sum_{\tilde{q}\in \mathit{PriorCloseState}(w,q)}\mathit{Prob}(D(k(n,\mathit{lpred}(w)),s,\mathit{lpred}(w)),\tilde{q})\times \mathit{Prob}(\tilde{q},\mathit{Back}(w),q)+\mathit{Prob}(R(k(n,w),s,w),q); \quad (26)$$
If lpred(w)=ε, then
$$\mathit{Prob}(D(n,s,w),q)=\mathit{Prob}(R(n,s,w),q);$$

5.
From (5) follows
$$\mathit{Prob}(\mathit{RE}(n,s,r),q)=\mathit{Prob}(E(n,s,r),q)-\mathit{Prob}(E(n,s+1,r),q); \quad (27)$$

6.
Let w be a right deep node. Then from (6) follows
$$\mathit{Prob}(R(n,s,w),q)=\mathit{Prob}(\mathit{RE}(n,s,w),q)+\sum_{x\in \mathit{OV}(\mathcal{H}):\,w=\mathit{rpred}(x)}\mathit{Prob}(R(n,s,x),q); \quad (28)$$
Otherwise, from (7) follows
Algorithms
General description
Our goal is to compute Prob(B(N,S)), that is, the probability of finding at least S occurrences of a pattern in a random text of length N, given an HMM G=〈Q,q _{0},π〉. The algorithm SUFPREF (see Algorithm 1) computes the probability by induction on the text length n, where m≤n≤N, and, for a given n, by induction on the number of occurrences s, where 1≤s≤S.
The computation within the main loop is based on equations (21)–(29), related to B-sets, C-sets, F-sets, E-sets, D-sets, RE-sets and R-sets.
The computation related to texts of length n will be referred to as the n-th stage of the algorithm's work. The main computation within the n-th stage is performed by a depth-first traversal of OvGraph following left and deep edges. During the depth-first traversal, for each visited node $w\in \mathit{OV}(\mathcal{H})$, the algorithm computes the probabilities of RE-sets and auxiliary probabilities of D-, F- and C-sets by induction on the number of occurrences s=1,…,S. Within the traversal we store the probabilities of D-sets related to the nodes on the path from the root of OvGraph to the current node w, i.e. the nodes x from OverlapPrefix(w), in temporary arrays TempDProb(x,q) of size S; the size of the data related to a node x on the path is O(Q×S), see subsection "Main loop" below. Then the auxiliary information stored in the nodes of OvGraph, namely the probabilities of R-sets, is updated by a bottom-up traversal of OvGraph using the right edges.
The computation based on the inductive equations relies on a generic procedure analogous to the forward algorithm for HMMs [40], see also [5].
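The path-local storage of D-set data during the depth-first traversal can be sketched as follows; the graph and the visit callback are hypothetical stand-ins for OvGraph and the per-node computation, and only the memory-management pattern is modeled:

```python
def depth_first(graph, node, path_data, visit):
    """Depth-first traversal along left edges.

    The data computed for a node (the TempDProb analogue) is kept only while
    the node lies on the current root-to-node path, so at most one array per
    path node is alive at any moment (O(m) arrays for word length m).
    """
    path_data[node] = visit(node, path_data)   # store data for this node
    for child in graph.get(node, []):          # descend along left edges
        depth_first(graph, child, path_data, visit)
    del path_data[node]                        # release data when leaving
```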
Preprocessing and data structures
At the preprocessing stage we initialize the global data structures of the algorithm, i.e. the OvGraph, including auxiliary structures assigned to its nodes, and some other structures that are described at the end of this subsection.
Overlap graph. The graph OvGraph is built from the Aho-Corasick trie $T_{\mathcal{H}}$ for the pattern [45]. The nodes belonging to OvGraph correspond to the overlaps and therefore can be easily identified using the suffix links of the Aho-Corasick trie, see [37] and [Additional file 2] for details of the procedure. The nodes of OvGraph are assigned additional data (constant data and data to be updated at each stage n=m+1,…,N). All these data are initialized at the preprocessing stage, see below.
Constant transition probabilities related to nodes of the overlap graph. During the computation, the algorithm SUFPREF uses some probabilities that are constant and can be precomputed and stored.
For each node w and all states q in AllState(w) and q~ in PriorState(w,q), we store the "left transition probability" Prob(q~,Back(w),q), see Definitions 15 and 16. The left transition probabilities are used for the computation of D-set probabilities, see (26);
Given a right deep node r, the "word probabilities" $\mathit{Prob}(\tilde{q},\tilde{\mathcal{H}}(r),q)$ are memorized for states q in AllState(r) and q~ in Q. They are used to compute the probabilities of the F-sets, see (23);
Given a right deep node r, we store, for each class H^{∗}(w,r), the "deep transition probabilities" Prob(q~,Back(H^{∗}(w,r)),q), where q ranges over AllState(H^{∗}(w,r)) and q~ ranges over PriorState(H^{∗}(w,r),q). These probabilities are needed for the computation of C-set probabilities, see (24).
The sets of states AllState(w) and PriorState(w,q), the left and deep transition probabilities and the word probabilities are computed in a depth-first traversal along the left edges of OvGraph [see Additional file 2].
Updatable probabilities related to nodes of the overlap graph. At the beginning of the n-th stage, for each pair 〈w,q〉, where $w\in \mathit{OV}(\mathcal{H})$ and q∈AllState(w), we store an (m−len(w))×S matrix RProbs(w,q), where
l∈[k(n,w),n−1]; s=1,…,S; i=l mod (m−len(w)). The probabilities were computed at the previous stages, and the values in the matrices are updated at the end of the n-th stage.
At the preprocessing stage, we compute the probabilities for n=1,…,m; s=1,…,S and q∈A l l S t a t e(w) according to the formulas:
if n<m or (n=m and s>1),
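The row index i=l mod (m−len(w)) makes each matrix RProbs(w,q) a ring buffer over the last m−len(w) text lengths; a small illustration (the class name and layout are hypothetical):

```python
class RProbRing:
    """Rows for text lengths l are reused cyclically: only the last
    `depth` = m - len(w) lengths are needed at any stage."""

    def __init__(self, depth, s_max):
        self.depth = depth                          # m - len(w)
        self.rows = [[0.0] * s_max for _ in range(depth)]

    def row(self, l):
        # The row for length l aliases the row for l - depth, whose values
        # are no longer needed and get overwritten at the current stage.
        return self.rows[l % self.depth]
```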
The global data unrelated to the overlap graph. Besides the data related to nodes of OvGraph, we store the following data.
Transition probabilities. For each q~,q∈Q we store the constant probability
At the beginning of the n-th stage, the following values, computed at the previous stages, are stored:
For each q∈Q, the updatable probabilities Prob(V ^{n−m−1},q). They are used for the computation of Prob(E(n,1,r),q) by formula (22);
For each s=1,…,S and q∈Q, the updatable B-set probabilities Prob(B(n−m−1,s),q). At the preprocessing stage, we compute the probabilities for n=1,…,m, s=1,…,S and q∈Q according to the formulas:
Main loop
The aim of the n-th stage (main loop, see lines 2–13 of the algorithm SUFPREF, Algorithm 1) is to compute for all s=1,…,S the values
Prob(B(n−m,s),q), n>2m;
Prob(R(n,s,w),q) for all $w\in \mathit{OV}(\mathcal{H})$, q∈AllState(w).
To compute the probabilities Prob(R(n,s,w),q), the algorithm uses, for each pair 〈w,q〉 with $w\in \mathit{OV}(\mathcal{H})$ and q∈AllState(w), a local array TempRProb(w,q) of size S. Initially, TempRProb(w,q)[s]=0 for each s.
The value n is not changed within the main loop. The body of the loop consists of three parts.
Within part 2.1, for all s=1,…,S, the values Prob(B(n−m,s),q) are computed according to formula (21); the values Prob(B(n−m−1,s),q) and Prob(R(n−m,s,ε),q) were computed and stored at the previous stages.
The aim of part 2.2 (procedure COMPUTEREPROB, see Algorithm 2) is to compute the values Prob(RE(n,s,r),q) for all $r\in \mathit{DROV}(\mathcal{H})$, q∈AllState(r) and s=1,…,S.
The computation is performed by a recursive depth-first traversal of OvGraph along the left edges; it is based on formulas (22)–(27). When a node w is visited, this corresponds to the call COMPUTEREPROB(n,w). First, COMPUTEREPROB computes Prob(E(n,1,w),q) by formula (22) and puts the values into TempRProb(w,q)[1].
Then by induction on s=1,…,S the procedure computes the following probabilities.
Within part B (lines 8–14), for all states q∈AllState(w), the procedure computes Prob(D(k(n,w),s,w),q) by formula (26). The computation by formula (26) needs the value Prob(D(k(n,lpred(w)),s,lpred(w)),q~); this value is stored in the array TempDProb(w,q), see subsection "General description".
Now consider part C of Algorithm 2 (lines 15–26). Although the calculation of the probabilities of R-sets and RE-sets is based on formulas (25) and (27), we avoid explicit usage of E-sets in our calculations. From (25) and (27) we have (here s>1)
For s=1 we have
The value Prob(E(n,1,r),q) was computed and stored in TempRProb(w,q)[1] in part A of the procedure. During the computation we accumulate the needed probabilities in the arrays TempRProb(w,q), see section C of Algorithm 2, lines 15–26. When visiting a left deep node w, for each r such that there is a deep edge (w,r), and for each q∈AllState(r), we first calculate the value Prob(C ^{′}(n,s+1,w,r),q) using (24). Then we add to the current value of TempRProb(w,q)[s] the value Prob(C ^{′}(n,s,w,r),q)−Prob(C ^{′}(n,s+1,w,r),q) (if s>1) or subtract the value Prob(C ^{′}(n,2,w,r),q) (if s=1).
In section D of Algorithm 2 (lines 27–36) we analogously take into account the probabilities of F-sets.
In part 2.3 of the algorithm SUFPREF (procedure COMPUTERPROB, see Algorithm 3), the values Prob(R(n,s,w),q) are computed according to formulas (28) and (29).
The computation is done by a recursive bottom-up traversal of OvGraph along the right edges. The procedure also records the computed probabilities Prob(R(n,s,w),q) into the corresponding cells of the matrix RProbs(w,q) and initializes the elements of TempRProb(w,q) to zero.
Remark.
The above traversals are implemented with a recursive procedure initially called at the root of OvGraph (the node corresponding to ε), see lines 11 and 12 of the algorithm SUFPREF (Algorithm 1).
Postprocessing
At the postprocessing step of the algorithm (see Algorithm 1, lines 14–19), the P-value Prob(B(N,S)) is obtained by summation over the states of Q:
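The final summation can be sketched as follows, with the per-state values Prob(B(N,S),q) assumed to be stored in a dictionary (names hypothetical):

```python
def p_value(b_final, states):
    """P-value Prob(B(N,S)): the probability of a text of length N containing
    at least S pattern occurrences, summed over the per-state values
    Prob(B(N,S),q)."""
    return sum(b_final.get(q, 0.0) for q in states)
```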
Discussion
Space complexity. The stored data consist of the input data, temporary data used at the preprocessing step, the main data structure OvGraph and the working data unrelated to OvGraph. A detailed description of all of these data is given in the section "Preprocessing and data structures". The space complexity is mainly determined by the memory needed for the data related to OvGraph and the temporary data used at the preprocessing step. We first briefly consider the data unrelated to the overlap graph; then we consider the OvGraph data. The input data consist of the text length N, the number of occurrences S, a representation of an HMM and a pattern. The data related to the pattern representation are included in the data related to OvGraph nodes and will be considered below. The storage size for an HMM is O(|Q|^{2}×|V|). Thus the input data size is O(|Q|^{2}×|V|).
At the preprocessing stage the algorithm uses a temporary structure, the Aho-Corasick trie, to build OvGraph, and temporary data structures to store the intermediate probabilities Prob(q~,Back(w),q) for each $w\in \mathit{OV}(\mathcal{H})$, and the probabilities Prob(q~,Back(H),q) and Prob(q~,H,q) for each $H\in \mathcal{H}$, where q~,q∈Q. The memory needed for the Aho-Corasick trie is $O(m\times |\mathcal{H}|)$, where m is the pattern length. The memory needed to store the intermediate probabilities is $O(|Q|^{2}\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$. The temporary data structures used by subalgorithms in the preprocessing stage are released after they run. Thus, the total memory used during this stage is $O(|Q|^{2}\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|)+m\times |\mathcal{H}|)$.
The working data unrelated to OvGraph consist of the B-set probabilities Prob(B(n−m−1,s),q) and the probabilities Prob(V ^{n−m−1},q), q∈Q. These data need O(Q×S) and O(Q) memory, respectively. Within the main loop we use local arrays with D-set probabilities (the number of these arrays is at most m×Q, see the remark below) and the arrays TempRProb(w,q) (for all $w\in \mathit{OV}(\mathcal{H})$, q∈AllState(w)). These arrays are of size S. Therefore the memory necessary to store all of the arrays is $O(Q\times S\times m+Q\times S\times |\mathit{OV}(\mathcal{H})|)$.
As we will see, all this memory, except for the memory needed to store the Aho-Corasick trie, does not increase the space complexity of the algorithm.
Remark.
During the processing of a node w in the main loop, one stores arrays with D-set probabilities for all left predecessors of w, i.e. for all x∈OverlapPrefix(w). The number of left predecessors is bounded by the number of all prefixes of w, that is, by len(w), where len(w)≤m. Thus the number of arrays with D-set probabilities used by the algorithm during the main loop is at most m×Q.
Consider now the data related to OvGraph. The OvGraph structure is determined by the pattern. The number of nodes and the number of left and right edges is $O(|\mathit{OV}(\mathcal{H})|)$, which is upper bounded by $m\times |\mathcal{H}|$. However, usually $|\mathit{OV}(\mathcal{H})|\le |\mathcal{H}|$, see Table 1. The number of deep edges is equal to the number of classes, $|\mathcal{P}(\mathcal{H})|$, which is upper bounded by $|\mathcal{H}|$. Then the storage size for OvGraph is $O(|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|)$. The data assigned to a node of OvGraph consist of constant data and updatable data. The constant data consist of the left transition probabilities assigned to the nodes of OvGraph, the deep transition probabilities assigned to the deep edges and the word probabilities assigned to the right deep nodes. The updatable data are the probabilities of R-sets assigned to all nodes. More precisely, the left transition probabilities Prob(q~,Back(w),q) are stored in the memory associated with a node w; the deep transition probabilities Prob(q~,Back(H^{∗}(w,r)),q) are stored in the memory associated with a deep edge (w,r); the word probabilities $\mathit{Prob}(\tilde{q},\tilde{\mathcal{H}}(r),q)$ are stored in the memory associated with a right deep node r. As a whole, it gives
To store the R-set probabilities one needs $O(S\times Q\times m\times |\mathit{OV}(\mathcal{H})|)$ memory. Thus the size of the memory needed to store the global data related to OvGraph is
Finally, the overall space complexity of the algorithm is
Observe that the storage of classes in deep nodes saves $O(S\times Q\times m\times |\mathcal{P}(\mathcal{H})|)$ memory for R-sets.
Remark.
The parameter $|\mathit{OV}(\mathcal{H})|$ appears in the bounds on the space and time complexities. It is upper bounded by $m\times |\mathcal{H}|$. Assume that a pattern consists of random words of length m generated according to the uniform Bernoulli model. It was shown that in this case $|\mathit{OV}(\mathcal{H})|\approx |\mathcal{H}|$, see [46] and the supplementary materials, file "Comparison_with_AhoPro.xls". But for the majority of patterns described by Position-Specific Scoring Matrices and cutoffs considered in the present paper, $|\mathit{OV}(\mathcal{H})|\le 0.1\times |\mathcal{H}|$, see Table 1 in this paper and [Additional file 3].
Time complexity. The algorithm SUFPREF (see Algorithm 1) consists of three parts: preprocessing, main loop and postprocessing. The time complexity of the preprocessing part is mainly determined by the construction of the Aho-Corasick trie and OvGraph, their traversals and the computation of the intermediate probabilities. The complexity is $O(|Q|^{2}\times m\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$. Some details are given in [Additional file 2]. The time complexity of the postprocessing part (see lines 14–19) is O(m×|Q|^{2}).
The time complexity of the algorithm SUFPREF is mainly determined by the main loop (see lines 2–13), i.e. by the total runtime of the computation of parts 2.1, 2.2 and 2.3 for n=m+1,…,N. Within the part 2.1 (lines 3–10), computing probabilities P r o b(B(n−m,s),q) for all s=1,…,S and q∈Q requires O(S×Q^{2}) operations.
Consider part 2.2 (procedure COMPUTEREPROB, see Algorithm 2). The procedure performs the computations by a depth-first traversal of OvGraph over all $w\in \mathit{OV}(\mathcal{H})$. For given n and w, the computation consists of four parts: A, B, C and D. If w is a right deep node then in part A (lines 1–6) one computes Prob(E(n,1,w),q) for all q∈AllState(w); over all nodes this requires $O(|Q|^{2}\times |\mathit{OV}(\mathcal{H})|)$ operations.
Parts B, C and D run for S values of s. To execute parts B, C and D (lines 8–14, 15–26 and 27–36, respectively) over all nodes of OvGraph one needs $O(S\times |Q|^{2}\times |\mathit{OV}(\mathcal{H})|)$, $O(S\times |Q|^{2}\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$ and $O(S\times |Q|^{2}\times |\mathit{OV}(\mathcal{H})|)$ operations, respectively.
As a whole, $O(S\times |Q|^{2}\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$ operations are needed to execute COMPUTEREPROB.
Analogously, the computation of part 2.3 (procedure COMPUTERPROB, see Algorithm 3) needs $O(Q\times S\times |\mathit{OV}(\mathcal{H})|)$ operations. Therefore, the time complexity of the algorithm SUFPREF for a general HMM is
Time and space asymptotics. In the previous subsection we gave upper bounds on the space and time complexities of the algorithm SUFPREF. All bounds are given in big-O notation. For example, the time complexity bounds have the form $N\times S\times \lambda (G)\times \mu (\mathcal{H})$, where N is the text length, S is the number of occurrences, λ(G) is a factor depending on the HMM G and $\mu (\mathcal{H})$ is a factor depending on the pattern. The space complexity estimate is analogous, except for the absence of the factor N, see subsection "Space complexity" for details.
In the case of a general HMM, λ(G)=k×Q^{2}, where Q is the number of states of the HMM G; the value of k depends on features of the HMM.
We have performed computer experiments to get a better understanding of the asymptotic behavior of the time and space complexity. Let N _{ Trans } be the number of states to which the HMM can move in one step from a given state. This parameter describes the "density" of an HMM; the smaller N _{ Trans }, the smaller the complexities of the algorithm. The factor λ(G) was studied as a function of N _{ Trans } and the number of states N _{ States } in the used HMMs. We have performed 96=4×24 series of experiments, with 100 experiments in each series. In all series we used the following input data:
the pattern is defined by a PSSM for the transcription factor FOXA2 from the database HOCOMOCO [47] and cutoff 5.89, which corresponds to roughly 0.03% of all words of length 12;
the number of occurrences is 10;
the text length is 1000.
Thus, a series differs from the others only in the HMMs used. Each series is determined by the number N _{ States } of states in the HMMs and the number N _{ Trans }, see above. The value N _{ States } ranges from 2 to 25; therefore 24 values of N _{ States } were considered. For each number of states, four values of N _{ Trans } were used, namely 1; 2; 0.25·N _{ States }; and N _{ States }. Given the values N _{ States } and N _{ Trans }, we created 100 HMMs by the following randomized procedure. For each state q~, we first randomly chose N _{ Trans } states q∈Q such that there exists a transition from q~ to q. In our models, if there exists a transition from q~ to q then π(q~,a,q)>0 for all a∈V. We then assigned to each triple 〈q~,a,q〉, where a∈V, a random positive value π(q~,a,q) from [0,1], and normalized the values to make the needed sums of probabilities equal to 1. For each series we report the average runtime and the used space. The results of the experiments are presented in Figure 2 and Figure 3. The experiments show that for N _{ Trans }=0.25·N _{ States } and N _{ Trans }=N _{ States } the time and space do not differ much. This is because most states from Q are reachable for nodes of the overlap graph. In contrast, for N _{ Trans }=1 (the models are deterministic) the runtime and the used space are significantly smaller than in the cases considered above. The case N _{ Trans }=2 is an intermediate case. Note that Markov models are deterministic and correspond to the case N _{ Trans }=1.
In the cases N _{ Trans }=2; 0.25·N _{ States } and N _{ States }, the proportionality k×Q^{2} is reached. The smaller N _{ Trans }, the smaller k. When N _{ Trans }=1, the function λ(G) behaves approximately linearly.
Analogous experiments for patterns described by other PSSMs and cutoffs show the same results. The results are given in [Additional file 4].
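The randomized construction of the test HMMs described above can be sketched as follows; this is a sketch under the stated normalization, and the function name and data layout are hypothetical:

```python
import random

def random_hmm(n_states, n_trans, alphabet):
    """Build a random pi: each state gets n_trans successor states; every
    triple (q, a, q') on a chosen transition gets a random positive weight,
    and the weights of a state are normalized so that
    sum over (a, q') of pi(q, a, q') == 1."""
    pi = {}
    for q in range(n_states):
        successors = random.sample(range(n_states), n_trans)
        weights = {(a, qn): random.uniform(1e-6, 1.0)
                   for qn in successors for a in alphabet}
        total = sum(weights.values())
        for (a, qn), w in weights.items():
            pi[(q, a, qn)] = w / total
    return pi
```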
Let us now consider in detail the complexity of the algorithm for Bernoulli and Markov models.
Bernoulli models. In a Bernoulli model, the set Q contains only one state. Therefore the formulas for the space and time complexities turn into $O(m\times S\times |\mathit{OV}(\mathcal{H})|+m\times |\mathcal{H}|)$ and $O(N\times S\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$. Note (see the algorithm SUFPREF, Algorithm 1) that the time and space complexity of the algorithm does not depend on the symbol probabilities given by a Bernoulli model [see Additional file 5].
Markov models. Further refinements. The complexity results are presented as (possibly rough) upper bounds. In particular, the Q^{2} factor arises from the representation of the transition probabilities. It actually stands for the sum of the cardinalities of the sets PriorState(w,q) over nodes $w\in \mathit{OV}(\mathcal{H})$ and q∈AllState(w).
In practical cases, this number may be significantly smaller than Q^{2}. In particular, this is the case for Markov models, which can be treated as a special case of Hidden Markov Models. Let K denote the order of the Markov model. For an overlap node w such that len(w)≥K, the set AllState(w) consists of only one state. We use the technique of "reachable states", see section "Probability models", to take this property into account. It does not decrease the upper bounds in the case of a general HMM but leads to a significant improvement of the software for Markov models. At the same time, combining this technique with other improvements of the algorithm, see [37], allows one to obtain better complexity bounds for the Markov case. Namely, a space complexity of $O(S\times m\times (K\times |V|^{K+1}+|\mathit{OV}(\mathcal{H})|)+m\times |\mathcal{H}|)$ and a time complexity of $O(N\times S\times (K\times |V|^{K+1}+|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$ are achievable. The details of the optimized algorithm for the Markov case achieving these bounds will be presented in a separate paper.
Comparison with existing algorithms. The algorithm SUFPREF implements P-value computation for HMMs and achieves the theoretical complexity of the best algorithms for P-value computation. Notably, the complexities of SUFPREF are consistent with those of algorithms based on finite automata. Our optimization of the data structure improves the constant factor. A comparison of the number of nodes of the OvGraph and the number of states of a minimal automaton for a given pattern is given in [37]. It was observed in [46] that the average number of overlaps (nodes of the OvGraph) for random patterns generated according to Bernoulli models is proportional to the number of words in the pattern and is independent of the length of the words.
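The size gap between the two data structures can be illustrated with a toy pattern. The sketch below uses a simplified definition of an overlap (a non-empty proper word that is simultaneously a suffix of some pattern word and a prefix of some, possibly different, pattern word) and counts Aho-Corasick trie nodes as distinct prefixes; both helpers are ours:

```python
def overlaps(pattern):
    """OV(H) under a simplified definition: non-empty proper words that are
    both a suffix of some pattern word and a prefix of some (possibly
    different) pattern word.  All words are assumed to share one length."""
    m = len(next(iter(pattern)))
    suffixes = {w[i:] for w in pattern for i in range(1, m)}
    prefixes = {w[:i] for w in pattern for i in range(1, m)}
    return suffixes & prefixes

def trie_nodes(pattern):
    """Size of the Aho-Corasick trie: the number of distinct prefixes of
    pattern words, including the empty prefix (the root)."""
    return len({w[:i] for w in pattern for i in range(len(w) + 1)})

H = {"atta", "ttac"}
print(len(overlaps(H)), trie_nodes(H))  # → 2 9
```

Even on this tiny example the overlap set ({"a", "tta"}) is far smaller than the trie, and the gap grows with pattern size.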
For the Bernoulli and first-order Markov model cases, we compared the program SUFPREF with the program AHOPRO [31]. AHOPRO supports only Bernoulli and first-order Markov models.
The P-values were computed with the following input parameters: (1) alphabet V: {A, C, G, T}; (2) Bernoulli probabilities of letters: {0.25, 0.25, 0.25, 0.25}; the Markov model is described by a 4×4 matrix where all elements are 0.25; (3) text length: 1000; (4) minimal number of occurrences: 10; and (5) two types of patterns: patterns containing words of lengths 12 and 14 randomly generated according to a uniform probability model, and patterns of lengths 12 and 14 defined by a Position-Specific Scoring Matrix (PSSM or PWM) and different cutoffs. A pattern represented by a PSSM and a cutoff consists of all words whose score according to the PSSM is greater than the cutoff. The cutoffs were precalculated so that the numbers of words matching the PSSM with a given cutoff equal fractions of all 12-mers (14-mers) ranging from 0.00001 to 0.003. These fractions correspond to the fractions of words in patterns used for transcription factor binding site (TFBS) prediction.
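The construction of a PSSM-based pattern described above can be sketched as follows (the matrix is a toy 3-column example with made-up scores, not a HOCOMOCO PSSM, and the function name is ours):

```python
from itertools import product

def pssm_pattern(pssm, cutoff, alphabet="ACGT"):
    """All words whose additive PSSM score exceeds the cutoff.
    pssm is a list of {letter: score} columns, one per position."""
    words = []
    for letters in product(alphabet, repeat=len(pssm)):
        score = sum(col[c] for col, c in zip(pssm, letters))
        if score > cutoff:
            words.append("".join(letters))
    return words

# Toy 3-column matrix with made-up scores (not a real HOCOMOCO PSSM)
pssm = [{"A": 2, "C": 0, "G": 0, "T": 0},
        {"A": 0, "C": 2, "G": 1, "T": 0},
        {"A": 0, "C": 0, "G": 0, "T": 2}]
print(pssm_pattern(pssm, 4.5))  # → ['ACT', 'AGT']
```

Lowering the cutoff enlarges the resulting word set, which is how the precalculated cutoffs control the fraction of matching k-mers.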
The experiments were performed using a quad-core Intel Core i5 system running at 2.67 GHz (only one core used) with 8 GB RAM and a dual-disk striped swap partition. Both programs AHOPRO and SUFPREF were compiled using the GCC 4.5 tool chain for the 64-bit Linux target. To measure the running time and the maximum memory size during the programs’ runs we called POSIX’s “getrusage()” function twice, before and after processing, to measure the data size excluding the program code itself. We slightly modified the source code of AHOPRO to call this function before and after the main program execution.
We consider the PSSM matrices from the database HOCOMOCO [47] describing binding sites of lengths 12 and 14 in the human genome for the transcription factors FOXA2 and E2F1; the matrices are given in [Additional file 6]. Observe that the P-values computed for both probability models coincide when the other parameters are identical.
The results of the experiments for PSSM-based patterns of length 12 are presented in Tables 1 and 2. The results for the other patterns are given in [Additional file 7]. Table 1 provides details on the pattern structures; N _{ AC } denotes the number of nodes of the Aho-Corasick trie (the size of the automaton used by AHOPRO). Table 2 provides space and runtime results. The running time is given in seconds and the memory size in megabytes.
It turns out that in all cases our program is several times faster than AHOPRO; in the majority of cases it is more than ten times faster for Bernoulli models and more than five times faster for Markov models. It also outperforms AHOPRO in space.
Remark.
The advantages of SUFPREF are crucial for patterns of large size. For example, consider the pattern described by the PSSM corresponding to binding sites of length 16 for the factor ANDR with cutoff 4.64, where the pattern contains about 0.001 of all 16-mers (4,270,349 words). For this pattern, the running time and space of SUFPREF are 12.71 seconds and 691.58 megabytes, whereas the running time and space of AHOPRO are 351.59 seconds and 1868.18 megabytes.
Remark.
For a Bernoulli model the time complexities of AHOPRO and SUFPREF are $O(N\times S\times V\times N_{\mathit{AC}})$ and $O(N\times S\times (|\mathit{OV}(\mathcal{H})|+|\mathcal{H}|))$, respectively. Note that ${N}_{\mathit{AC}}\ge |\mathit{OV}(\mathcal{H})|+|\mathcal{H}|$.
Usage of P-values for TFBS prediction. The majority of methods for TFBS prediction first search for genome regions with a high number of occurrences of a pattern corresponding to the TFBS of interest. Then the candidate regions are chosen according to a proper criterion of statistical significance [48,49]. We compared the predictive abilities of methods using criteria based on P-values for different probability models and a method using a criterion based on the number of occurrences. The experiments were performed with the human transcription factor FOXA2. We considered several patterns based on the PSSM of length 12 from the database HOCOMOCO [47] and different cutoffs. The best results were obtained for the cutoff 5.89; about 0.0003 of all words of length 12 match the PSSM with this cutoff. The pattern discussed below consists of all words having a score exceeding the cutoff and their reverse-complemented words.
We considered a test set of 1800 genome regions of lengths from 200 to 400 b.p.; the set consists of 900 “positive” regions and 900 “negative” ones. The positive regions were taken from the database ENCODE [50]. We chose the top 900 regions related to the human transcription factor FOXA2 having lengths from 200 to 400 b.p., ranked by their quality (Signal value). The length distribution of the regions is almost uniform; all the regions belong to the Top 1000 of the FOXA2-related regions according to their Signal value. The negative regions presumably do not bind FOXA2. They were taken from random places of the first chromosome of the human genome; by construction, the lengths of the negative regions are uniformly distributed from 200 to 400 b.p. For each region (positive or negative) we computed 5 variants of P-values related to different probability models. The other parameters of the computation were chosen as follows.
 1. Text length N is the length of the region.
 2. Number of pattern occurrences S is the number of occurrences of the pattern found in the region.
 3. Let MinScore be the minimal PSSM score among the scores of the pattern words found in the region. The pattern ${\mathcal{H}}^{\prime}$ used within the P-value calculation corresponds to the FOXA2 PSSM and the cutoff MinScore.
The P-values were calculated w.r.t. five probability models (for each model its short notation is given in parentheses): Bernoulli (Bernoulli), Markov models of orders 1 (Markov1) and 2 (Markov2), and HMMs with 3 states (HMM3) and 4 states (HMM4). The parameters of the models were estimated on the adjacent fragments of length 4000 b.p. taken from both sides of the considered region. To estimate the parameters of the Bernoulli and Markov models we used the maximum likelihood method; for HMMs we used the Baum-Welch algorithm, see [40].
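For Markov models, the maximum likelihood estimation on flanking fragments amounts to counting (K+1)-mers, as in this sketch (the function name and interface are ours; Baum-Welch for HMMs is considerably more involved, see [40]):

```python
from collections import Counter

def estimate_markov(text, K=1, alphabet="ACGT"):
    """Maximum-likelihood order-K Markov model: P(c | context) is the number
    of times context+c occurs in the text divided by the number of times
    the context is followed by any letter."""
    contexts = Counter(text[i:i + K] for i in range(len(text) - K))
    extended = Counter(text[i:i + K + 1] for i in range(len(text) - K))
    return {(ctx, c): extended[ctx + c] / contexts[ctx]
            for ctx in contexts for c in alphabet}

p = estimate_markov("aabab", K=1, alphabet="ab")
print(p[("a", "b")])  # → 0.6666666666666666
```

In the experiments the counts would be taken over the two 4000 b.p. fragments adjacent to the region; in practice pseudocounts are often added to avoid zero probabilities.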
The main results are given in Table 3 and Figure 4; the details of the experiments are given in [Additional file 8]. The table shows the sensitivity and specificity of recognition for various thresholds and probability models. The thresholds for the P-value based methods were chosen to obtain approximately the same sensitivity as the method based on the number of occurrences with the corresponding minimal number of occurrences. One can see (see also Figure 4) that all P-value methods have approximately the same quality and outperform the method based on the number of occurrences.
Remark.
The Signal value of ChIP-Seq data reflects the amount of bound protein. Therefore the Signal values of the considered ENCODE regions correlate better with the number of pattern occurrences than with P-values, see Table 4. However, the methods for TFBS prediction based on P-values show significantly better predictive abilities.
Conclusions
This work presents an approach to computing the P-value of multiple pattern occurrences within a randomly generated text of a given length. The approach provides significant space and time improvements compared to the existing software, which is crucially important for applications. The improvements are achieved through the use of an overlap graph: taking into account the overlaps between the pattern words allows one to decrease the necessary space and time. The number of nodes of an Aho-Corasick trie, a structure that is extensively used in the automaton approach, is much larger than the number of overlaps.
Another advantage is that, unlike existing algorithms and programs, our approach allows us to deal with Hidden Markov Models, the most general class of popular probabilistic models. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states. A further reduction to the reachable vertices leads to an extra improvement in time and space complexity. Although Bernoulli and Markov models can be treated as special HMMs, it is worth implementing specialized and optimized versions of the software for these models. Indeed, paper [37] can be viewed as a meta-version of SUFPREF. The peculiarities of the implementation for Markov models of higher orders will be presented in a separate paper.
The implementation of the algorithm SUFPREF was compared with the program AHOPRO for a Bernoulli model and a first-order Markov model. The comparison shows that, in the majority of cases, our algorithm is more than ten times faster than AHOPRO for the Bernoulli model and more than five times faster for the Markov model. The greatest advantage of SUFPREF is its reduced space requirements: it outperforms AHOPRO in space, and can therefore be run with patterns with a greater number of words and a greater word length.
Availability and requirements
The algorithm SUFPREF was implemented as a C++ program and was compiled for Unix and Windows. The program is provided both as a web server and as a stand-alone program with a command-line interface. It is available at http://server2.lpm.org.ru/bio/online/sf/. Implementation details are provided at http://server2.lpm.org.ru/static/downloads/SufPrefHMM/Website.pdf.
The algorithm SufPref supports P-value computation taking into account pattern occurrences on both strands of genome fragments. To do this, the algorithm adds to the pattern the reverse complements of its words. After this procedure, the pattern size is at most doubled.
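The double-strand augmentation described above can be sketched as follows (a minimal illustration; `COMP` and the function name are ours):

```python
# Translation table for DNA complementation
COMP = str.maketrans("ACGT", "TGCA")

def add_reverse_complements(pattern):
    """Extend a pattern with the reverse complement of each word.
    Reverse-complement palindromes add nothing, so the size at most doubles."""
    return pattern | {w.translate(COMP)[::-1] for w in pattern}

# "ACGT" is its own reverse complement, so only "TTTT" is added
print(sorted(add_reverse_complements({"ACGT", "AAAA"})))  # → ['AAAA', 'ACGT', 'TTTT']
```

The union with the original set is what guarantees the at-most-twofold growth: each word contributes at most one new word, and palindromic words contribute none.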
Additional files
Abbreviations
HMM: Hidden Markov model
PSSM: Position-specific scoring matrix (synonym of PWM)
PWM: Position weight matrix
TFBS: Transcription factor binding sites
References
 1.
Qian Z, Lu L, Qi L, Li Y: An efficient method for statistical significance calculation of transcription factor binding sites. Bioinformation. 2007, 2 (5): 169-174. 10.6026/97320630002169.
 2.
Berman B, Pfeiffer B, Laverty T, Salzberg S, Rubin G, Eisen M, Celniker S: Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 2004, 5 (9): R61. 10.1186/gb-2004-5-9-r61.
 3.
Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005, 21 (13): 2933-2942. 10.1093/bioinformatics/bti473. [http://bioinformatics.oxfordjournals.org/content/21/13/2933.short]
 4.
Helden JV, Olmo M, Perez-Ortin J: Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res. 2000, 28 (4): 1000-1010. 10.1093/nar/28.4.1000.
 5.
Roytberg MA: Computation of the probabilities of families of biological sequences. Biophysics. 2009, 54 (5): 569-573. 10.1134/S0006350909050029.
 6.
Marschal T, Herms I, Kaltenbach H, Rahmann S: Probabilistic arithmetic automata and their applications. IEEE/ACM Trans Comput Biol Bioinformatics. 2012, 9 (6): 1737-1750. 10.1109/TCBB.2012.109.
 7.
Reinert G, Schbath S: Probabilistic and statistical properties of words: an overview. J Comput Biol. 2000, 7 (1–2): 1-46. 10.1089/10665270050081360.
 8.
Tompa M, Li N, Bailey T, Church G, De Moor B, Eskin E, Favorov A, Frith M, Fu Y, Kent J, Makeev V, Mironov A, Noble W, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: An assessment of computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.
 9.
Nuel G: Numerical solutions for patterns statistics on Markov chains. Stat Appl Genet Mol Biol. 2006, 5: 26.
 10.
Lladser M, Betterton MD, Knight R: Multiple pattern matching: a Markov chain approach. J Math Biol. 2008, 56 (1–2): 51-92.
 11.
Guibas L, Odlyzko A: String overlaps, pattern matching and nontransitive games. J Comb Theory, Series A. 1981, 30: 183-208. 10.1016/0097-3165(81)90005-4.
 12.
Szpankowski W: Average case analysis of algorithms on sequences. 2001, John Wiley and Sons, New York.
 13.
Régnier M: A unified approach to word occurrences probabilities. Discrete Appl Math. 2000, 104: 259-280. 10.1016/S0166-218X(00)00195-5. [Special issue on Computational Biology; preliminary version at RECOMB’98]
 14.
Régnier M, Szpankowski W: On pattern frequency occurrences in a Markovian sequence. Algorithmica. 1997, 22 (4): 631-649. 10.1007/PL00009244. [Preliminary draft at ISIT’97]
 15.
Régnier M, Denise A: Rare events and conditional events on random strings. Discrete Math Theor Comput Sci. 2004, 6 (2): 191-214.
 16.
Nicodème P: Motif statistics. Theor Comput Sci. 2004, 287: 593-617. 10.1016/S0304-3975(01)00264-X.
 17.
Nicodème P: Regexpcount, a symbolic package for counting problems on regular expressions and words. Fundamenta Informaticae. 2003, 56 (1–2): 71-88.
 18.
Régnier M, Lifanov A, Makeev V: Three variations on word counting. In Proceedings German Conference on Bioinformatics. Heidelberg; 2000:75–82.
 19.
Prum B, Rodolphe F, Turckheim E: Finding words with unexpected frequencies in DNA sequences. J R Stat Soc B. 1995, 11: 190-192.
 20.
Bender EA, Kochman F: The distribution of subword counts is usually normal. Eur J Comb. 1993, 14 (4): 265-275. 10.1006/eujc.1993.1030.
 21.
Cowan R: Expected frequencies of DNA patterns using Whittle’s formula. J Appl Prob. 1991, 28: 886-892. 10.2307/3214691.
 22.
Godbole AP: Poisson approximations for runs and patterns of rare events. Adv Appl Prob. 1991, 23: 851-865. 10.2307/1427680.
 23.
Geske MX, Godbole AP, Schaffner AA, Skrolnick AM, Wallstrom GL: Compound Poisson approximations for word patterns under Markovian hypotheses. J Appl Prob. 1995, 32: 877-892. 10.2307/3215201.
 24.
Reinert G, Schbath S: Compound Poisson approximation for occurrences of multiple words in Markov chains. J Comput Biol. 1998, 5 (2): 223-253. 10.1089/cmb.1998.5.223.
 25.
Nuel G: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J Appl Prob. 2008, 45: 226-243. 10.1239/jap/1208358964.
 26.
Mariño-Ramírez L, Spouge J, Kanga G, Landsman D: Statistical analysis of overrepresented words in human promoter sequences. Nucleic Acids Res. 2004, 32 (3): 949-958. 10.1093/nar/gkh246. [http://0www.ncbi.nlm.nih.gov.iiiserver.ualr.edu/pubmed/14963262]
 27.
Régnier M, Vandenbogaert M: Comparison of statistical significance criteria. J Bioinformatics Comput Biol. 2006, 4 (2): 537-551. 10.1142/S0219720006002028.
 28.
Régnier M, Bourdon J: Large deviation properties for patterns. J Discrete Algorithms. 2014, 24: 2-11. 10.1016/j.jda.2013.09.004.
 29.
Nuel G: LD-SPatt: large deviations statistics for patterns on Markov chains. J Comp Biol. 2004, 11 (6): 1023-1033. 10.1089/cmb.2004.11.1023.
 30.
Hertzberg L, Zuk O, Getz G, Domany E: Finding motifs in promoter regions. J Comput Biol. 2005, 12 (3): 314-330. 10.1089/cmb.2005.12.314.
 31.
Boeva V, Clément J, Régnier M, Roytberg M, Makeev V: Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol. 2007, 2: 13. [http://www.almob.org/content/2/1/13]
 32.
Nuel G: Effective p-value computations using finite Markov chain imbedding (FMCI): application to local score and to pattern statistics. Algorithms Mol Biol. 2006, 1: 5. [http://www.almob.org/content/1/1/5]
 33.
Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang M: Computing exact p-values for DNA motifs. Bioinformatics. 2006, 23: 531-537. 10.1093/bioinformatics/btl662.
 34.
Fu J, Lou W: Distribution theory of runs and patterns and its applications: a finite Markov chain imbedding approach. 2003, World Scientific, Singapore.
 35.
Crochemore M, Stefanov V: Waiting time and complexity for matching patterns with automata. Inform Process Lett. 2003, 87 (3): 119-125. 10.1016/S0020-0190(03)00271-0.
 36.
Ribeca P, Raineri E: Faster exact Markovian probability functions for motif occurrences: a DFA-only approach. Bioinformatics. 2008, 24 (24): 2839-2848. 10.1093/bioinformatics/btn525.
 37.
Régnier M, Kirakossian Z, Furletova E, Roytberg MA: A word counting graph. London Algorithmics 2008: Theory and Practice (Texts in Algorithmics). Edited by: Chan J, Daykin JW, Rahman MS. 2009, London College Publications, London. [http://hal.inria.fr/inria00437147/en/]
 38.
Karlin S, Burge C, Campbell A: Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992, 20 (6): 1363-1370. 10.1093/nar/20.6.1363.
 39.
Nicodème P, Salvy B, Flajolet P: Motif statistics. Theor Comput Sci. 2002, 287 (2): 593-618. 10.1016/S0304-3975(01)00264-X. [Preliminary version at ESA’99]
 40.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998, Cambridge University Press, Cambridge.
 41.
Rabin M: Probabilistic automata. Inform Control. 1963, 6: 230-245. 10.1016/S0019-9958(63)90290-0.
 42.
Salomaa A: Theory of automata. 1969, Pergamon Press, Oxford.
 43.
Kucherov G, Noé L, Roytberg M: A unifying framework for seed sensitivity and its application to subset seeds. J Bioinformatics Comput Biol. 2006, 4 (2): 553-569. 10.1142/S0219720006001977.
 44.
Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.
 45.
Aho A, Corasick M: Efficient string matching. CACM. 1975, 18 (6): 333-340. 10.1145/360825.360855.
 46.
Regnier M, Furletova E, Roytberg MA: An average number of suffix-prefixes. In Proceedings of the International Moscow Conference on Computational Molecular Biology. Moscow, Russia; 2009:313–314.
 47.
Kulakovskiy I, Medvedeva YA, Shaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ: HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013, 41: D195-D202. 10.1093/nar/gks1089.
 48.
Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.
 49.
Kulakovskiy IV, Makeev VJ: DNA sequence motif: a jack of all trades for ChIP-Seq data. Adv Protein Chem Struct Biol. 2013, 91: 135-171. 10.1016/B978-0-12-411637-5.00005-6.
 50.
Bernstein BE, Birney E, Dunham I, Green E, Gunter C, Snyder C, ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489 (7414): 57-74. 10.1038/nature11247.
Acknowledgements
This work was supported by the Inria associated team MIGEC, the French-Russian grant Carnage and grant 14-01-93106 from RFBR. Evgenia Furletova, Mikhail Roytberg and Victor Yakovlev were supported by grants 08-01-92496-NCNIL-a, 09-04-01053-a, 12-04-00944-a, 14-01-93106-NCNIL-a from RFBR and contract 07.514.11.4004 within the Russian Federation research program 20111.4514008009. Evgenia Furletova was also supported by grant 14-04-32220-mol_a from RFBR.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The idea of the work and the basic ideas of the algorithms belong to M Régnier. She also took part in the development of the algorithms. M Roytberg took part in the development of the algorithms, especially with respect to the HMM. He also supervised the programming. EF took part in the development of the algorithms especially with respect to details of implementation and has written the vast majority of the code. VY consulted EF during the programming, took part in the program testing and computer experiments and implemented the Website. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 3: Number of overlaps between pattern words. Description of data: The file contains information about numbers of overlaps between words of the patterns defined by PSSMs from HOCOMOCO [47]. The detailed description of the data is given in sheet “INFO” of the file. (XLS 72 KB)
Keywords
 P-value
 Pattern occurrences
 PSSM (PWM)
 Hidden Markov model