Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models

Background Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. Results We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from . The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. Electronic supplementary material The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users.


Background
The recognition of functionally significant fragments in biological sequences is a key issue in computational biology. Many functionally significant fragments are characterized by a set of specific words that is called a pattern and denoted H below. These patterns may represent different biological objects, such as transcription factor binding sites [1][2][3], polyadenylation signals [4], protein domains, etc. The functional fragments recognition problem can be solved by finding sequences in which the words from a given pattern are overrepresented. Defining a meaningful significance criteria for this overrepresentation is a delicate goal, that, in turn, requires a clarification of the probability model. A current criteria is the socalled P-value computed as the probability that a random sequence of length N contains at least S occurrences of a pattern. There are many methods for P-value computation designed for Bernoulli or Markov models. However, Hidden Markov models (HMM) were considered in only a few papers [5,6] despite the models being widely used in bioinformatics. This is a motivation to develop methods for P-value calculation with respect to HMMs.
Existing methods for P-value calculation can be divided into several groups and reviews of the methods can be found in [7][8][9][10]. Studies on word probabilities started as early as the eighties with the seed paper [11] that introduced basic word combinatorics and derived inductive equations for a single word and a uniform Bernoulli model. Some works in the same vein, reviewed in [12] followed for several words, multi-occurrences and extended probability models. The time complexity is proportional to the text length N and the desired number of occurrences S: computations are carried out by induction for n ranging over 1, . . . , N and, for a given n, by induction on the number of occurrences. Although these "mathematics-driven" approaches allow for mathematical formula derivation, the actual computation suffers from a combinatorial explosion when |H| or Markov order increase.
Later on, a first group of methods [13][14][15][16][17] formalized systematically these inductions by the introduction of bivariate generating functions. Coefficients are the Pvalues to be computed. Expectations and variances for the number of occurrences of the different words in pattern H can be expressed explicitly in terms of these generating functions [14,15,18]. Moreover, coefficients may be computed from the analytical expression, when it is available, or through a suitable manipulation of a functional equation, where the theoretical time complexity reduces to S log N. Nevertheless, computing the generating function, or the functional equation, requires the computation of a system of linear equations or, equivalently, the determinant of a matrix of polynomials of size O(|H|). It takes O(|H| 3 ) operations and it is the main drawback of this approach.
A second group of methods computes asymptotics. They rely on convergence results to the normal law proved by [19] or [20]. An approximated P-value is derived, based on Gaussian approximations [21] or Poisson approximations [22][23][24][25]. Nevertheless, this approximation is not suitable for exceptional words, when the observed number of occurrences S significantly differs from the expected number. This was proved experimentally by [26] and theoretically [27]. Large deviation principles are used in [28,29] with a much better precision. Nevertheless, no computable formula are available for large sets.
A third group of methods revisits recursive P-value computation, with a O(S×N) time complexity. They avoid combinatorial explosion by a suitable use of appropriate data structures, tightly related to word overlap properties. Therefore, loss in time dependency to N or S is compensated by a gain on data structure size. A significant part of algorithms in this group are based on traversals of a specific graph. The graph may not be defined explicitly [30]. It can be based on the graph corresponding to the finite automaton recognizing the given pattern, see algorithms AHOPRO [31], SPATT [25,32] and REGEXPCOUNT [17]. MOTIFRANK [33] that is designed for first order Markov models makes use of suffix sets. In [25,32,34], a Markov chain embedding technique was suggested. Counting occurrences of regular patterns in random strings produced by Markov chains reduces to problems regarding the behavior of a first-order homogeneous Markov chain in the state space of a suitable deterministic finite automaton (DFA) [35,36]. In a recent paper [6], a probabilistic arithmetic automaton for computing P-values for a HMM was proposed. In this paper two algorithms were suggested. The first one has a time complexity O(|Q| 2 × N × S × | | × |V|) and a space complexity O(|Q| × S × | |), where |Q| is the number of states of the HMM, | | is the number of states of the automaton recognizing the given pattern, |V| is the alphabet size. The second algorithm has a time complexity O(|Q| 3 × log(N) × S 2 × | | 3 ) and a space complexity O(|Q| 2 × S × | | 2 ). This algorithm uses the "divide and conquer" technique. The drawback is the lack of control on the number of states | | when |H| increases. Finally, despite these great efforts, existing methods perform badly for rather big patterns. Besides this, most of the proposed algorithms are not implemented or implemented only for Bernoulli models or Markov models of small orders.
The present paper provides an algorithm supporting the HMM probability model. It assumes that all words have the same length m and that a HMM with |Q| states is given. It is a generalization of algorithm SUFPREF designed in [37] for Bernoulli models and Markov models of order K . It relies on recurrent equations based on an overlap graph, whose vertices are associated with the overlaps of words from H, and edges correspond to the prefix and suffix relations between overlaps.
respectively. Algorithm SUFPREF is implemented as a Web-server, see http://server2.lpm.org.ru/bio/online/sf/, and a stand-alone program for Windows and Linux. The program is available by request from the authors. The paper is organized as follows. Basic notions on word overlaps are introduced, that lead to an overlap graph that is the main data structure to be used. Then, one recalls the Hidden Markov models definition. Main text sets are defined and equations for their probabilities are derived. The next section describes the algorithm SUFPREF that computes these equations using the overlap graph as a main data structure. Then, the space and time complexities are analyzed and our algorithm is compared with other methods [3,24,31,38,39]. Finally, usage of P-values for TFBS prediction is discussed.

Overlap words
Our approach strongly relies on overlaps of words from a given pattern. In this section we provide necessary definitions for these overlaps, following the notations of [37]. The main deference is in definition of overlap graph, see definition 3. By definition from [37], overlap graph has additional nodes (leaves) that correspond to the words from the pattern H. In the present paper overlap graph has deep edges instead of the nodes. This modification is not affect on upper bounds of time and space complexity. But in practice it gives significant improvements.

Definition 1. Given a pattern
means that x is a prefix (proper prefix) of x. The elements of OV (H) that are proper prefixes (respectively suffixes) of a given word are totally ordered. The empty string is the minimal element. The maximal elements are crucial for our algorithms and data structures.  Order relations are commonly associated to oriented graphs. Definition 3. The overlap graph of a given pattern H is an oriented graph where the set of nodes is OV (H) and the set of edges, E(H), contains the left, right and deep edges, that are defined as follows:

Definition 2. Given a word v in OV
• A left edge links node x to node w iff x = lpred(w); • A right edge links node x to node w iff x = rpred(w); • A deep edge links node x to node w iff there exists a non-empty class H * (x, w) in P(H).
It is denoted OvGraph.
The root is the node corresponding to the empty word.
Below we will use r for notation of a right deep node.
Remark. One can ascribe to each deep edge (w, r) the class H * (w, r) and to each left edge (lpred(w), w) a word label Back(w).

Text sets
The computation of P-values will be done by induction on the text length n (n = 1, . . . , N), and, for each given n, by induction on the number of occurrences s (s = 1, . . . , S).
It relies on formulas introduced in [37], that in turn was based on the ideas from [12,13]. In [37] we give formulas for P-values computation for Bernoulli and Markov models. In the present paper we introduce equations on texts sets that underlie these formulas. Using these equations one can derive formulas for P-value computation for different probabilities models. Also these equations take into account improvements in the overlap graph structure, see section "Overlap words".
These sets are called E-sets.
The proof follows from the definition of R-sets.
Remark. For given n and s, we have to compute the probabilities of sets R(n, s, w) for all w ∈ OV (H). The equations (6) and (7) allow us to do this by recursive traversal of OV (H) from the leaves (deep nodes) of OvGraph to the root according to the right edges. The calculation starts from probabilities of RE-sets found according to the equation (5).
Below we introduce D-sets and give the equations for Dsets, R-sets and E-sets leading to recursive equations for E-sets probabilities. The D-sets defined below consist of texts of length n containing at least s occurrences of the pattern H, ending with a given non-empty overlap word w that has a common part with the s-th occurrence of the pattern H.
Notation. Below we will use the following notations: 1) len(x), for the length of a word x; 2) |M|, for the number of words in a set of words M.
Notation. For a prefix w ∈ OV (H) and any integer n, one denotes where m is the length of words from H. The next propositions describe the relation between D-sets and R-sets.
Proof: [see Additional file 1]. Informally speaking, x is the common part of the s-th occurrence of the pattern H in the text t ∈ D(k(n, w), s, w) and the suffix w of the text t. Remark that according to the Proof: Follows from the proposition 2 [see Additional file 1].

Corollary 1. If lpred(w) = then D(n, s, w) = R(n, s, w).
One observes that, whenever n < m , B(n, s) = ∅, and for all w ∈ OV (H) and r ∈ DROV (H), R(n, s, w) = E(n, s, r) = ∅. Now we are ready to formulate the main theorem of the section. The theorem gives recursive equations for B-sets and E-sets. The main equations (13)- (15) are based on the following observation. The set E(n, s + 1, r), s ≥ 1, can be divided in two disjoint sets: F(n, s + 1, r) and C(n, s + 1, r). The set F(n, s + 1, r) consists of such words that s-th occurrence of the pattern H does not intersect the ending occurrence of H. And C(n, s + 1, r) consists of those texts t from E(n, s + 1, r) that s-th occurrence of H in t intersects the ending occurrence of H. Theorem 1. Let n ≥ m, s ≥ 1 and r ∈ DROV (H), i.e. r is a right deep node.
In each of the texts t 1 , t 2 , t 3 the ending occurrence of the pattern does not intersect the first occurrence. Therefore the texts are in F(20, 2, TA). Note, that the ending occurrence ACATATA of the pattern in t 2 intersects the second occurrence but not the first. Consider the prefixes of t 1 , t 2  In contrast, in each of the texts t 4 and t 5 the last occurrence of the pattern intersects the first occurrence. Therefore the texts t 4 , t 5 ∈ C(20, 2, TA). According to the example 7, the texts t 4 , t 5 are in D(16, 1, ACA) · Back(H * (ACA,TA)), that illustrates the statement (14) of the theorem.

Proof:
1. Consider statement (11). A text t is in B(n, s) if and only if either its prefix of length n − 1 contains at least s occurrences of H or a s-th occurrence H from H ends at position n. In the first case, t is in In the second case, text t belongs to R(n, s, ). The two cases are mutually exclusive; therefore B(n, s) is a disjoint union and (11) is proved. 2. The statement (12) directly follows from the definition of E(n, 1, r).

Consider the statement (13).
(a) First, we prove that The last occurrence H of the pattern does not intersect the s-th occurrence in the text t.
Thus the prefix of t of length n − m contains at least s occurrences of H, i.e. it is in • t has the length n; • t contains at least s + 1 occurrences of the pattern H; • s-th occurrence of H lies on the prefix of t of length n − m, i. e. it does not intersect the last occurrence; Therefore t ∈ F(n, s + 1, r). (14). Let Y denote the right side of equation (14).

Consider the statement
According to the proposition 2, This r)). By the definition of D-sets, • t has the length n; • t contains at least s + 1 occurrences of the pattern; • s-th occurrence intersects (s + 1)-th occurrence of H; • t ends with H ∈ H(r).

The statement (15) follows from the definitions of F
and C-sets.
Notation. Given two integers n and s, and a class H * (w, r), one introduces Obviously, Remark. The unions in equations (11), (14), (15) and (17) are disjoint. Therefore the probability of the set in the left part of an equation is the sum of probabilities of sets in the right side.

Probability models
We suppose that the probability distribution is described by a Hidden Markov Model (HMM). In this section, we recall some basic notions about HMMs and introduce the needed notations. In fact, it is shown in [6] that our definition is equivalent to the classical definition of HMM [40].
where Q is the set of states, q 0 ∈ Q is an initial state, and π is a function: Q × V × Q → [0, 1] such that π(q, a, q) is the probability, being in stateq, to generate symbol a and traverse to state q. For any stateq in Q, the function π meets the condition: A HMM G is called deterministic if for any (q, a) in Q×V there is only one state q such that π(q, a, q) > 0. In this case the function π can be described with two functions: Namely, φ(q, a) is equal to the unique state q such that π(q, a, q) > 0 and ρ(q, a) is π(q, a, q).
A HMM G = Q, q 0 , π can be represented as a graph where Q is the set of vertices. Each edge is assigned with a label a ∈ V and with a probability p ∈ (0; 1]. There exists an edge fromq to q with the label a and probability p iff π(q, a, q) > 0 and p = π(q, a, q). The graph is called the traversal graph of HMM G.

Definition 12. Let h be a path in the traversal graph of the HMM G. The label of h is the concatenation of the labels of edges that constitute the path h. The probability Prob(h) of a path h is the product of the probabilities of the edges that constitute the path h.
Definition 13. The probability Prob(t) of a word t with respect to the HMM G is the sum of probabilities of all paths that start in the initial state q 0 and have the label t.
Let q andq belong to Q and t be a word. By definition, the probability Prob(q, t, q) to move from the stateq to the state q with the emitted word t is the sum of probabilities of all paths starting in the stateq, ending in the state q and having the word label t.
To describe effective algorithms related to HMMs, we need the notion of reachability.

Definition 14. Given a stateq and a word t, we define
Given a state q and a string t, we define A state q is called t-reachable from a stateq if and only if Prob(q, t, q) > 0.
Definition 15. For a given word t, AllState(t) is the set of states that are reachable from initial state q 0 by at least one text with suffix t. For a set of words M, Remark.
Definition 16. Let w be an overlap word. We denote by Analogously, for each deep edge (w, r) and its associated class H * (w, r), one notes

HMM and probabilistic automata
The definition of HMM is very close to the definition of probabilistic automaton PA, [41,42]. The main difference lies in the interpretation of the behavior of a model. For a HMM, one considers a label as a symbol emitted by the HMM; for automata, one imagines an automaton that processes a given word letter by letter. Another difference connected with the previous one is that PAs are typically used to describe word sets; thus, for a given PA, the subset of accepting states is defined. HMMs are mainly used to describe probability models and thus have no accepting states.
In applications, one often uses a probabilistic automata built as a Cartesian product of a deterministic automaton accepting a given set of words and a HMM describing the word probabilities, see e.g. [6,43]. A similar construction is used below. In fact, we describe generalized probabilistic automata, GPA. As opposed to PAs, the edges in a graph that represents our automaton are labeled with words rather than with letters, and thus it can be named a generalized probabilistic automaton, analogously to the definition of generalized HMM [44].
An originality of SUFPREF is that words from pattern H, or classes, that represent terminal states in classical automata need not be explicitly represented. Nevertheless, each class is uniquely associated to one deep edge.

Probabilities equations for HMM
In the section above the main text sets and corresponding equations were described. One can apply the equations to compute probabilities of the text sets for arbitrary probability models. Here we give formulas to compute the probabilities for an HMM. The formulas are based on the following observations. First, all unions in the text equations are disjoint. Second, an item of a set union is a set with already known probability or concatenation of such sets. In the latter case the probability Prob(q 1 , L 1 · L 2 , q 2 ) can be computed by the formula where Prob(q , L, q) is a probability that, being in the state q , the chain will go to the state q emitting a word t from the set L.

General description
Our goal is to compute Prob(B(N, S)), that is the probability to find at least S occurrences of a pattern H in a random text of length N, given a HMM G = Q, q 0 , π . The algo-rithm SUFPREF, see Algorithm 1, computes the probability by induction on a text length n, where m ≤ n ≤ N, and, for a given n, by induction on a number of occurrences s, where 1 ≤ s ≤ S. The computation within the main loop is based on equations (21)- (29), related to B-sets, C-sets, F-sets, Esets, D-sets, RE-sets and R-sets.
The computation related to texts of length n will be referred to as n-th stage of the algorithm's work. The main computation within the n-th stage is performed by depthfirst traversal of OvGraph following left and deep edges. During the depth-first traversal for each visited node w ∈ OV (H), the algorithm computes the probabilities of RE-sets and auxiliary probabilities of D, F and C-sets by induction on number of occurrences s = 1, . . . , S. Within the traversal we store the probabilities of D-sets related to the nodes on the path from the root of OvGraph to a current node w, i.e. the nodes x from OverlapPrefix(w), in the temporary arrays TempDProb(x, q) of the size S; the size of the data related to a node x on the path is O(|Q| × S), see sub-section "Main loop" below. Then update of auxiliary information stored in nodes of OvGraph, namely, probabilities of R-sets, is performed by a bottom-up traversal of OvGraph using right edges.
Computation on the inductive equations relies on a generic procedure, analogous to the forward algorithm for HMM [40], see also [5].

Preprocessing and data structures
On the preprocessing stage we initialize the global data structures of the algorithm, i.e. the OvGraph, including auxiliary structures assigned to its nodes and some other structures that are described at the end of this subsection.

Overlap graph
The graph OvGraph is built from the Aho-Corasick trie T H for the set H [45]. The nodes belonging to the OvGraph correspond to the overlaps and therefore can be easily revealed using suffix links of the Aho-Corasick trie, see [37] and [Additional file 2], for details of the procedure. The nodes of OvGraph are assigned with additional data (constant data and data to be updated at each stage n = m + 1, . . . , N). All these data are initialized at the preprocessing stage, see below.
Constant transition probabilities related to nodes of overlap graph During the computation, algorithm SUF-PREF uses some probabilities that are constant and can be precomputed and stored.
• For each node w and all states q in AllState(w) andq in PriorState(w, q), we store the "left transition probability" Prob(q, Back(w), q), see definitions 15 and 16. The left transition probabilities are used for the computation of D-sets probabilities, see (26); • Given a right deep node r, the "word probabilities" Prob(q, H(r), q) are memorized for states q in AllState(r) andq in Q. They are used to compute probabilities of the F-sets, see (23); • Given a right deep node r, we store, for each class H * (w, r), the "deep transition probabilities" Prob(q, Back(H  *  (w, r)), q) where q ranges over AllState (H  *  (w, r)) andq ranges over PriorState (H  *  (w, r), q). The probabilities are needed for the computation of C-sets probabilities, see (24). The global data unrelated to overlap graph Besides the data related to nodes of OvGraph we store the following data.
• Transition probabilities. For eachq, q ∈ Q we store the constant probability The aim of the part 2.2 (procedure COMPUT-EREPROB, see Algorithm 2) is to compute the values Prob(RE(n, s, r), q) for all r ∈ DROV (H), q ∈ AllState(r) and s = 1, . . . , S.
The computation is performed using the recursive depth-first traversal of OvGraph along the left edges; it is based on the formulas (22)- (27). Let a node w is visited, it corresponds to the call of COMPUTEREPROB(n, w). Firstly, COMPUTEREPROB computes Prob (E(n, 1, w), q) by the formula (22) and puts the values to TermRProb(w, q) [1].
Then by induction on s = 1, . . . , S the procedure computes the following probabilities.
Within the part B, see lines 8-14, for all states q ∈ AllState(w), the procedure computes Prob (D(k(n, w), s, w), q) by the formula (26). To make the computation by the formula (26) one needs the value Prob(D(k(n, lpred(w)), s, lpred(w)),q); the value is stored in the array TempDProb(w, q), see sub-section "General description". Now consider the part C of Algorithm 2, see lines 15-26. Although the calculation of probabilities of R-sets and RE-sets is based on the formulas (25) and (27) we avoid explicit usage of E-sets in our calculations. From (25) and (27) we have (here s > 1) Pr ob (RE(n, s, r), q) = Pr ob (E(n, s, r), q) − Pr ob(E(n, s + 1, r), q) = = Pr ob (F(n, s, r), q) + + w:(w,r) is a deep edge Pr ob (C (n, s, w, r), q) − − Pr ob (F(n, s + 1, r)

For s = 1 we have
Prob (RE(n, 1, r), q) = Prob (E(n, 1, r) The value Prob(E(n, 1, r), q) was computed and stored in TempRProb(w, q) [1] at the part A of the procedure. During the computation we accumulate the needed probabilities in the arrays TempRProb(w, q), see section C of the algorithm 2, lines 15-26. Visiting a left deep node w, for each r such that there is a deep edge (w, r), and for each q ∈ AllState(r), we firstly calculate the value Prob(C (n, s + 1, w, r), q) using (24). Then add to the current value of TempRProb(w, q)[s] the value Prob (C (n, s, w, r), q) − Prob(C (n, s + 1, w, r), q) (if s > 1) or substract the value Prob(C (n, 2, w, r), q) (if s = 1).
In section D of the Algorithm 2, see lines 27-36 we analogously take into account the probabilities of F-sets.
The computation is done by a recursive bottom-up traversal of OvGraph along the right edges. Also the procedure records the computed Prob (R(n, s, w), q) probabilities to the corresponding cells of the matrix RProb(w, q) and initializes elements of TempRProb(w, q) by zeros.

Remark.
The above traversals are implemented with a recursive procedure initially called at the root (node corresponding to ) of OvGraph, see lines 11, 12 of the algorithm SUFPREF (Algorithm 1).

Discussion
Space complexity The data stored consist of input data, temporary data used at the preprocessing step, the main data structure OvGraph and the working data unrelated to the OvGraph. The detailed description of all of the data is given in the section "Preprocessing and data structures". The space complexity is mainly determined by the memory needed for the data related to the OvGraph and temporary data used at the preprocessing step. We first briefly consider the data unrelated to the overlap graph; then we consider OvGraph data. The input data consist of the text length N, the number of occurrences S, a representation of an HMM and a pattern H. The data related to the pattern representation are included in the data related to OvGraph nodes and will be considered below. Storage size for an HMM is O(|Q| 2 × |V|). Thus the input data size is O(|Q| 2 × |V|).
At the preprocessing stage the algorithm uses a temporary structure, the Aho-Corasick trie, to build the OvGraph and temporary data structures to store intermediate probabilities Prob   E(n, 1, w), q) 1 if w is right deep node then 2 foreach q ∈ AllState(w) do 3 Compute Prob (E(n, 1, w), q) using (22); 4 Put Prob (E(n, 1, w), q) to TempRProb(w, q) [1] The temporary data structures used by sub-algorithms in the preprocessing stage are released after their running. Thus, the total memory used during this stage is O(|Q| 2 × (|OV (H)| + |H|) + m × |H|).
The working data unrelated to OvGraph consist of Bsets probabilities Prob(B(n − m − 1, s), q) and probabilities Prob(V n−m−1 , q), q ∈ Q. These data need O(|Q| × S) and O(|Q|) memory, respectively. Within the main loop we use local arrays with D-sets probabilities (the number of these arrays is at most m × |Q|, see remark below) and arrays TempRProb(w, q) (for all w ∈ OV (H), q ∈ AllState(w)).
These arrays are of size S. Therefore the necessary memory to store all of the arrays is O (|Q|×S×m+|Q|×S×|OV (H)|).
As we will see, all this memory, except for the memory needed to store Aho-Corasick trie, does not increase the space complexity of the algorithm.  Table 1. The number of deep edges is equal to the number of classes, |P(H)|, that is upper bounded by |H|. Then the storage size for OvGraph is O(|OV (H)| + |H|). The data assigned to a node of OvGraph consist of constant data and updatable data. The constant data consist of left transition probabilities assigned to the nodes of the OvGraph, deep transition probabilities assigned to the deep edges and word probabilities assigned to the right deep nodes. The updatable data are probabilities of R-sets assigned to all nodes. More precisely, left transition probabilities Prob(q, Back(w), q) are stored in the memory associated with a node w; deep transition probabilities Prob(q, Back(H * (w, r)), q) are stored in the memory associated with deep edge (w, r); word probabilities Prob(q, H(r), q) are stored in the memory associated with a right deep node r. As a whole, it gives The number x in "PSSM(12,x)" denotes the cut-off. The P-values are given w.r.t. the text length and probability models described in the text of the paper. To store R-sets probabilities one needs O(S × |Q| × m × |OV (H)|) memory. Thus the size of memory needed to store global data related to OvGraph is
Remark. The parameter |OV (H)| belongs to the bounds of space and time complexities. It is upper bounded by m × |H|. Assume that a pattern consists of random words of length m generated according to the uniform Bernoulli model. It was shown that in such case |OV (H)| ≈ |H|, see [46] and supplementary materials, file "Comparison_with_AhoPro.xls". But for a majority of patterns described by Position-Specific Scoring Matrices and cut-offs that were considered in the present paper, |OV (H)| ≤ 0.1 × |H|, see Table 1  The estimation of space complexity is analogous except of absence of factor N, see sub-section "Space complexity" for details.
In the case of a general HMM λ(G) = k × |Q| 2 , here |Q| is the number of states of the HMM G; the value of k depends on features of the HMM.
We have performed computer experiments to get a better understanding of the asymptotic behavior of time and space complexity. Let N Trans be the number of states where the HMM can transit in one step from a given state. This parameter describes the "density" of an HMM; the smaller N Trans , the smaller the complexities of the algorithm. The factor λ(G) was studied as a function of N Trans and the number of states N States in used HMMs. We have performed 96 = 4 × 24 series of experiments, 100 experiments in each series. In all series we have used following input data: • the pattern is defined by a PSSM for transcription factor FOXA2 from the database HOCOMOCO [47] and cut-off 5.89 that corresponds to roughly 0.03% of all words of length 12; • number of occurrences is 10; • text length is 1000.
Thus, a series differs from the others only with the used HMMs. Each series is determined by the number N State of states in the HMMs, and the number N Trans , see above. The value N State ranges from 2 to 25, therefore 24 values of N State were considered. For each number of states four values of N Trans were used, namely, 1; 2; 0.25 · N State and N State . Given values N State and N Trans , we have created 100 HMMs by the following randomized procedure. For each stateq, we firstly have randomly chosen N Trans states q ∈ Q such that there exists a transition fromq to q. In our models if there exists transition fromq to q then π(q, a, q) > 0 for all a ∈ V. Then we assign to each triple q, a, q , where a ∈ V, a random positive value π(q, a, q) from [ 0, 1], and then normalize the values to make the needed sums of probabilities equal to 1. For each series we report average run-time and used space. The results of the experiments are presented in Figure 2 and Figure 3.  Markov models. Further refinements Complexity results are presented with (possibly rough) upper bounds. In particular, the |Q| 2 factor arises from transition probabilities representation. It actually stands for the sum of  the cardinalities of PriorState(w, q) sets in a given node w ∈ OV (H), q ∈ AllState(w).
In practical cases, this number may be significantly smaller than |Q| 2 . In particular, this is the case for Markov models that can be treated as a special case of Hidden Markov Models. Let K denote the order of Markov model. For an overlap node w, such that len(w) ≥ K , the set AllState(w) consists of only one state. We use the technique of "reachable states", see section "Probability models" to take into account this issue. It does not decrease the upper bounds in the case of a general HMM but leads to a significant improvement of the software for Markov models. At the same time, combining the technique with other improvements of the algorithm, see [37], allows one to obtain better complexity bounds for the Markov case. Namely, O(S×m×(K ×|V| K +1 +|OV (H)|)+m×|H|) space complexity and O(N × S × (K × |V| K +1 +|OV (H)|+|H|)) time complexities are achievable. The details of the optimized algorithm for the Markov case achieving the above bounds will be presented in a separate paper.
Comparison with the existing algorithms The algorithm SUFPREF implements a P-value computation for HMM and achieves the theoretical complexity of the best algorithms for P-value computation. Notably, the complexities of SUFPREF are consistent with the complexities of algorithms based on finite automata. Our optimization of the data structure provides an improvement for the constant factor. A comparison of the number of nodes of OvGraph and the number of states of a minimal automaton for a given pattern is given in paper [37]. It was observed in [46] that an average number of overlaps (nodes of OvGraph) for random patterns generated according to Bernoulli models is proportional to the number of words in the patterns and is independent of the length of the words.
For Bernoulli and first order Markov model cases, we have compared program SUFPREF with the implementation of program AHOPRO [31]. The program AHOPRO admits only Bernoulli and first order Markov models.
The P-values were computed with the following input parameters: (1)  A pattern presented by PSSM and cut-off consists of all words whose score according to PSSM is greater than the cut-off. The cut-offs were precalculated such that the numbers of words matching the PSSM and a cut-off are equal to the fractions of all 12 (14)-mers in range from 0.00001 to 0.003. The fractions correspond to the fractions of words in patterns using for transcription factor binding sites (TFBS) prediction.
The experiments were performed using a quad-core Intel Core i5 system running at 2.67 GHz (only one core used) with 8 GB RAM and dual-disk stripped swap partition. Both programs AHOPRO and SUFPREF were compiled using the GCC 4.5 tool chain for the 64-bit Linux target. To measure running time and maximum sizes of memory during the program's runs we used POSIX's "getrusage()" function twice: before and after processing to measure data size excluding program code itself. We have slightly modified the source code of AHOPRO to call this function before and after main program execution.
We consider the matrices PSSM from the database HOCOMOCO [47] describing binding sites of lengths 12 and 14 in human genome for transcription factors FOXA2 and E2F1; the matrices are given in [Additional file 6]. Observe that the P-values computed for both probability models are the same, when the other parameters are identical.
The results of the experiments for PSSM-based patterns of length 12 are presented in Tables 1 and 2. The results for other patterns are given in [Additional file 7]. Table 1 provides details on the patterns structures; N AC denotes the number of nodes of the Aho-Corasick trie (the size of automaton used by AHOPRO). Table 2 provides space and run-time results. The running time is given in seconds and the memory size in megabytes.
It turns out that in all cases our program is several times faster than AHOPRO. And for a majority of cases, it is faster than AHOPRO by more than ten and five times for Bernoulli and Markov models respectively. Also it outperforms AHOPRO in space.
Remark. The advantages of SUFPREF are crucial for patterns of big sizes. For example, consider the pattern described by PSSM corresponding to binding sites of lengths 16 for factor ANDR with cut-off 4.64, where pattern contain about 0.001 of all 16-mers (4270349 words). For this pattern, the run time and space of SUFPREF's work are 12.71 seconds and 691.58 megabytes. But the run

Usage of P-values for TFBS prediction
The majority of methods for TFBS prediction firstly search for genome regions with high number of occurrences of a pattern corresponding to needed TFBS. Then the candidate regions have to be chosen following proper criterion of statistical significance [48,49]. We have compared predictive abilities of methods using criteria based on P-values for different probability models and a method using criterion based on a number of occurrences. The experiments were performed with human transcription factor FOXA2. We have considered several patterns based on the PSSM of length 12 from the database HOCOMOCO [47] and different cut-offs. The best results were obtained for the cut-off 5.89; about 0.0003 of all words of length 12 match the PSSM with this cut-off. The pattern H that is discussed below consists of all words having score exceeding the cut-off and their reverse-complemented words.
We have considered the test set of 1800 genome regions of length from 200 to 400; the set consists of 900 "positive" regions and 900 "negative" ones. The positive regions were taken from the database ENCODE [50]. We have chosen top 900 regions related to human transcription factor FOXA2 having length from 200 to 400 b.p. in accordance with their quality (Signal value). The length distribution of regions is almost uniform; all the regions belong to Top 1000 of the FOXA2-related regions according to their Signal value. The negative regions presumably do not bind FOXA2. They were taken from random places of the first chromosome of human genome, the length of negative regions by construction are uniformly distributed from 200 to 400 b.p. For each region (positive or negative) we have computed 5 variants of P-values related to different probability models. The other parameters of computation were chosen as follows.  Table 3. Blue squares correspond to the method based on the number of occurrences. The ROC-curves for P-value based methods are almost coincide. (HMM4). The parameters of the models were estimated on the adjacent fragments of length 4000 b.p. taken from both sides of the considered region. To estimate parameters of Bernoulli and Markov models we have used maximal likelihood method; for HMMs we have used Baum-Welch algorithm, see [40].
The main results are given in Table 3 and Figure 4; the details of the experiments are given in [Additional file 8]. The Table shows sensitivity and specificity of recognition for various thresholds and probability models. The thresholds for P-value based methods were chosen to obtain approximately the same sensitivity as the method based on number of occurrences with corresponding minimal number of occurrences. One can see (see also Figure 4) that all P-value methods have approximately the same quality and outperform the method based on number of occurrences.
Remark. The signal value of ChIP-Seq data reflects the amount of binded proteins. Therefore the signal values of considered ENCODE regions show better correlation with number of pattern occurrences, than with P-values, see Table 4. However, the methods for TFBS prediction based on P-values show significantly better predictive abilities.

Conclusions
This work presents an approach to compute the P-value of multiple pattern occurrence within a randomly generated text of a given length. The approach provides significant space and time improvements compared to the existing software that is crucially important for applications. The improvements are achieved due to the use of an overlap graph: taking into account overlaps between the pattern words allows one to decrease necessary space and time. The number of nodes of a Aho-Corasick trie, a structure that is extensively used in automaton approach, is much larger than the number of overlaps.
Another advantage is that, unlike existing algorithms and programs, it allows us to deal with Hidden Markov Models, the most general class of popular probabilistic models. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states. A further reduction to the reachable vertices leads to extra improvement of time and space complexity. Despite the fact that Bernoulli and Markov models can be treated as special HMMs, it is worth implementing specialized and optimized versions of software for these models. Indeed, paper [37] can be viewed as a meta version of SUFPREF. The peculiarity of the implementation of Markov models of higher orders will be presented in a separate paper.
The implementation of the algorithm SUFPREF was compared with the program AHOPRO for a Bernoulli model and a first order Markov model. The comparison shows that, for a majority of cases, our algorithm is faster than AHOPRO in more than ten times for the Bernoulli model and in more than five times for the Markov model. The greatest advantage of SUFPREF is to decrease the needs in space. It outperforms AHOPRO in space. Therefore it can be run with patterns with a greater number of words and a greater length.
complexities of the algorithm SUFPREF for HMMs. The detailed description of the data is given in sheet "INFO" of the file.