 Research
 Open Access
Efficient and accurate P-value computation for Position Weight Matrices
 Hélène Touzet^{1, 2} and
 Jean-Stéphane Varré^{1, 2}
https://doi.org/10.1186/1748-7188-2-15
© Touzet and Varré; licensee BioMed Central Ltd. 2007
 Received: 06 July 2007
 Accepted: 11 December 2007
 Published: 11 December 2007
Abstract
Background
Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.
Results
The contribution of this paper is twofold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software tool called TFM-PVALUE, which is freely available.
Conclusion
We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
Keywords
 Transcription Factor Binding Site
 Background Model
 Score Distribution
 Score Threshold
 Probability Generating Function
Background
A key problem in the understanding of gene regulation is the identification of transcription factor binding sites. Transcription factor binding sites are often modeled by Position Weight Matrices (PWMs for short), also known as Position Specific Scoring Matrices (PSSMs for short), or simply matrices. Examples are to be found in the Jaspar [1] or Transfac [2] databases. Such matrices are used within global bioinformatics strategies that help to elucidate regulation mechanisms: comparative genomics, identification of over-represented motifs, identification of correlations between binding sites, and so on. Similar matrix-based models also serve to represent splice sites in messenger RNAs [3] or signatures in amino acid sequences [4].
Formally, a matrix M of length m over a finite alphabet Σ is a function from [1..m] × Σ to ℝ that assigns a coefficient M(i, x) to each position i and each character symbol x. Given a word u of Σ^{m}, the score of u for M is the sum of the coefficients selected by the characters of u: $\mathrm{Score}(u,M)=\sum_{i=1}^{m}M(i,u_{i})$, where u_{i} denotes the character symbol at position i in u.
Searching for occurrences of a matrix in a sequence requires choosing an appropriate score threshold to decide whether a position is relevant or not. Let α be such a score. We say that the matrix M has an occurrence in the sequence S at position i if Score(S_{i} ... S_{i+m−1}, M) ≥ α. The problem of efficiently finding occurrences of a matrix in a text has recently attracted a lot of interest [5–7]. Here we address the problem of computing the score threshold α. To determine such a score threshold, the standard method is to use a P-value function, which gives the statistical significance of an occurrence according to its score. The P-value P-value(M, α) is the probability that the background model can achieve a score equal to or greater than α. In other words, the P-value is the proportion of strings (with respect to the background model) whose score is greater than or equal to the threshold α for M. In [8], the authors introduce a generic approach to P-value computation for non-parametric models. In the context of matrices, the computation can be carried out using probability generating functions or dynamic programming [9–12]. In both cases, the time complexity is proportional to the product of the length of the matrix and the number of possible different scores. If the matrix has non-negative integer coefficient values, then the number of possible different scores is bounded by $\sum_{i=1}^{m}\max\{M(i,x)\mid x\in \Sigma\}$. It follows that known algorithms are pseudo-polynomial. In real life, matrices actually have real coefficient values, such as log-ratio matrices or entropy matrices. In this context, the number of different scores that the matrix can achieve is significantly larger.
Error with round matrices. We report the percentage of Jaspar matrices for which the P-value computed from a round matrix leads to a different number of words than the P-value computed with the original matrix. The granularity of the rounding ranges from 10^{-2} to 10^{-6}, and the P-value is 10^{-3} for a multinomial background model.
Granularity  10^{-2}  10^{-3}  10^{-4}  10^{-5}  10^{-6}
% matrices with error  76  55  30  15  7
In this paper, we study the theoretical complexity of the P-value problem and prove that it is intrinsically difficult: it is NP-hard. We then introduce a novel algorithm that achieves a significant speedup over existing algorithms when some error is allowed, as other methods do. This algorithm is also capable of solving the P-value problem without error within a reasonable amount of time.
Complexity of the P-value problem
We begin by formally introducing the P-value problem. We actually define two complementary problems, depending on what is given and what is searched for. In both cases, we are given a finite alphabet Σ, a matrix M of length m and a probability distribution on Σ^{m}. We say that s in ℝ is an accessible score if there exists a word u in Σ^{m} such that Score(u, M) = s.
P-value problem – from score to P-value: Given a score value α, find the probability of the set {u ∈ Σ^{m} | Score(u, M) ≥ α}. This probability is denoted P-value(M, α).
Threshold problem – from P-value to score: Given a P-value P (0 ≤ P ≤ 1), find the highest accessible score α such that P-value(M, α) ≥ P. We write Threshold(M, P) for α.
As we will see later on in this paper, these two problems are closely related. We show here that neither of them admits a polynomial algorithm, unless P = NP. For that, we first define the decision problem ACCESSIBLE SCORE as follows.
Instance: a finite alphabet Σ, a matrix M of length m whose coefficients are natural numbers, a natural number t
Question: does there exist a string u of Σ^{m} such that Score(u, M) = t?
Theorem 1 ACCESSIBLE SCORE is NP-hard.
The proof of Theorem 1 is by reduction from the SUBSET SUM problem, which is a pseudo-polynomial NP-complete problem [14].
Instance: a set of positive integers A = {a_{0},...,a_{ n }} and a positive integer s
Question: does there exist a subset A' of A such that the sum of the elements of A' equals exactly s?
Lemma 1 There exists a polynomial reduction from the SUBSET SUM problem to the ACCESSIBLE SCORE problem.
Proof. Let A = {a_{0},...,a_{n}} be a set of positive integers, and let s be the target integer. We define the matrix M of length n + 1 on the two-letter alphabet Σ = {x, y} as follows: M(i, x) = a_{i} and M(i, y) = 0 for each i, 0 ≤ i ≤ n. The set A has 2^{n+1} different subsets, so we can define a bijection φ from the set of subsets of A onto Σ^{n+1}: for each subset A', the word φ(A') is such that the i-th letter is x if and only if a_{i} ∈ A', and y otherwise. It is easy to see that Score(φ(A'), M) = s if, and only if, ∑_{a∈A'} a = s.
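The construction used in this proof can be illustrated with a minimal Python sketch. The function names (`subset_sum_to_matrix`, `score`, `accessible_score`) and the representation of a matrix as a list of per-column dictionaries are assumptions made for illustration only. The search is a plain exhaustive enumeration, so it is exponential in the matrix length, as expected for an NP-hard problem.

```python
from itertools import product

def subset_sum_to_matrix(a):
    """Matrix of Lemma 1: one column per integer, letter 'x' scores a_i, letter 'y' scores 0."""
    return [{"x": a_i, "y": 0} for a_i in a]

def score(word, M):
    """Score(u, M): sum of the coefficients selected by the letters of the word."""
    return sum(col[c] for col, c in zip(M, word))

def accessible_score(M, t):
    """Brute-force ACCESSIBLE SCORE: return a witness word with score t, or None."""
    alphabet = sorted(M[0])
    for word in product(alphabet, repeat=len(M)):
        if score(word, M) == t:
            return "".join(word)
    return None
```

A witness word returned by `accessible_score` reads directly as a subset: the positions carrying the letter x are the selected integers.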
It remains to prove that the ACCESSIBLE SCORE problem polynomially reduces to instances of the From score to P-value and From P-value to score problems. We are now given a finite alphabet Σ, a matrix M of length m, and a score value t.
Reduction to the From score to P-value problem
We assume that every word of Σ^{m} has nonzero probability. Under this hypothesis, and since the coefficients of M are natural numbers, the ACCESSIBLE SCORE problem admits a solution if, and only if, P-value(M, t) ≠ P-value(M, t + 1).
Reduction to the From P-value to score problem
We assume that the background model is a uniform multinomial model. In this context, all words of length m have the same probability $\frac{1}{|\Sigma|^{m}}$, and all P-values are of the form $\frac{k}{|\Sigma|^{m}}$. Solving the ACCESSIBLE SCORE problem amounts to deciding whether there exists an integer k, 0 ≤ k ≤ |Σ|^{m}, such that Threshold(M, $\frac{k}{|\Sigma|^{m}}$) = t. The existence of such a k can be decided with iterative computations of From P-value to score for different values of k. This search can be performed within O(log_{2}(|Σ|^{m})) steps, that is, O(m log_{2} |Σ|) steps, using binary search, because k decreases monotonically in t and there are at most |Σ|^{m} different values for k.
Algorithms for the P-value problems
From now on, we assume that the positions in the sequence are independently distributed. We denote by p(x) the background probability associated with the letter x of the alphabet Σ. By extension, we write p(u) for the probability of the word u = u_{1} ... u_{m}: p(u) = p(u_{1}) × ⋯ × p(u_{m}).
Definition of the score distribution
The computation of the P-value is done through the computation of the score distribution. This concept is at the core of the large majority of existing algorithms [9–11, 15]. Given a matrix M of length m and a score α, we define Q(M, α) as the probability that the background model can achieve a score equal to α. In other words, Q(M, α) is the probability of the set {u ∈ Σ^{m} | Score(u, M) = α}. If s is not an accessible score, then Q(M, s) = 0.
The computation of Q is easily performed by dynamic programming. For that purpose, we need some preliminary notation. Given two integers i, j satisfying 0 ≤ i, j ≤ m, M[i..j] denotes the submatrix of M obtained by selecting only columns from i to j for all character symbols. M[i..j] is called a slice of M. By convention, if i > j, then M[i..j] is an empty matrix. The score distribution is then obtained column by column with the following recurrence, referred to as Formula 1: $Q(M[1..0],s)=1$ if $s=0$ and $0$ otherwise, and $Q(M[1..i],s)=\sum_{x\in \Sigma}Q(M[1..i-1],s-M(i,x))\times p(x)$ for 1 ≤ i ≤ m. The P-value is obtained from Q by summation over the accessible scores: $P\text{-value}(M,\alpha)=\sum_{s\ge \alpha}Q(M,s)$.
Conversely, given P, Threshold(M, P) is computed from Q by scanning the accessible scores in decreasing order, starting from the greatest one, until the required P-value is reached.
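The dynamic programming computation of Q, and the way the P-value and the threshold are read from it, can be sketched as follows. This is an illustrative Python sketch, not the TFM-PVALUE implementation: the function names, the dictionary-based matrix representation and the use of exact fractions for probabilities are all assumptions. It assumes that every letter has a nonzero background probability and that 0 ≤ P ≤ 1.

```python
from fractions import Fraction

def score_distribution(M, p):
    """Q as a dict {score: probability}, built one column at a time."""
    Q = {0: Fraction(1)}                      # empty slice: score 0 with probability 1
    for col in M:
        nxt = {}
        for s, q in Q.items():
            for x, w in col.items():
                nxt[s + w] = nxt.get(s + w, 0) + q * p[x]
        Q = nxt
    return Q

def pvalue(Q, alpha):
    """Probability of reaching a score greater than or equal to alpha."""
    return sum(q for s, q in Q.items() if s >= alpha)

def threshold(Q, P):
    """Highest accessible score whose P-value is at least P."""
    total = 0
    for s in sorted(Q, reverse=True):
        total += Q[s]
        if total >= P:
            return s
    return min(Q)                             # only reached if P exceeds 1
```

The number of entries in `Q` is the number of accessible scores, which is what makes this approach pseudo-polynomial for integer matrices and potentially very costly for real-valued ones.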
Computing the score distribution for a range of scores
Formula 1 does not explicitly state which score ranges should be taken into account in intermediate steps of the calculation of Q. To this end, we introduce the best score and the worst score of a matrix slice.
Definition 1 (Best score and worst score) Let M be a matrix. The best score and the worst score of the slice M[i..j] are $BS(M[i..j])=\sum_{k=i}^{j}\max\{M(k,x)\mid x\in \Sigma\}$ and $WS(M[i..j])=\sum_{k=i}^{j}\min\{M(k,x)\mid x\in \Sigma\}$. Both are equal to 0 for an empty slice.
The notion of best scores is already present in [16], where it is used to speed up the search for occurrences of a matrix in a text. It gives rise to lookahead scoring. Best scores make it possible to stop the calculation of Score(u, M) early, as soon as it is guaranteed that the score threshold cannot be achieved, because we know the maximal remaining score. It has been exploited in [5, 6] in the same context. Here we adapt it to the score distribution problem. Let α and β be two scores such that α ≤ β. If one wants to compute the score distribution Q for the range [α, β], then given an intermediate score s and a matrix position i, we say that Q(M[1..i], s) is useful if there exists a word v of length m − i such that α ≤ s + Score(v, M[i+1..m]) ≤ β. Lemma 2 characterizes useful intermediate scores.
Lemma 2 Let M be a matrix of length m, and let α and β be two score bounds defining a score range for which we want to compute the score distribution Q. Q(M[1..i], s) is useful if, and only if,
α − BS(M[i+1..m]) ≤ s ≤ β − WS(M[i+1..m])
Proof. This is a straightforward consequence of Definition 1.
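Lemma 2 translates into a simple pruning rule inside the dynamic programming loop. The sketch below is a hypothetical Python illustration (names and data layout are assumptions, and it is not claimed to reproduce Algorithm SCOREDISTRIBUTION line by line): after each column, only the intermediate scores that satisfy the bounds of Lemma 2 are kept.

```python
from fractions import Fraction

def best_worst_suffix(M):
    """BS[i] and WS[i]: best and worst score of the columns i, i+1, ... (0-based)."""
    m = len(M)
    BS, WS = [0] * (m + 1), [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        BS[i] = BS[i + 1] + max(M[i].values())
        WS[i] = WS[i + 1] + min(M[i].values())
    return BS, WS

def score_distribution_range(M, p, alpha, beta):
    """Q restricted to final scores in [alpha, beta], keeping only useful entries."""
    BS, WS = best_worst_suffix(M)
    Q = {0: Fraction(1)}
    for i, col in enumerate(M):
        # Lemma 2 bounds for a prefix that ends with column i (0-based)
        lo, hi = alpha - BS[i + 1], beta - WS[i + 1]
        nxt = {}
        for s, q in Q.items():
            for x, w in col.items():
                t = s + w
                if lo <= t <= hi:
                    nxt[t] = nxt.get(t, 0) + q * p[x]
        Q = nxt
    return Q
```

Entries discarded by the test `lo <= t <= hi` cannot contribute to any final score in the requested range, so the values returned for that range are the same as those of the unrestricted computation.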
If one only wants to calculate the P-value of a given score without knowing the score distribution, Algorithm SCOREDISTRIBUTION can be further improved. We introduce a complementary optimization that leads to a significant speedup. The idea is that for good words, we can anticipate that the final score will be above the given threshold without calculating it.
Definition 2 (Good words) Let M be a matrix of length m and let α be a score threshold. A word u of length i, with i ≤ m, is good for α if the following two conditions are fulfilled:
 1. Score(u, M[1..i]) ≥ α − WS(M[i+1..m])
 2. Score(u_{1} ... u_{i−1}, M[1..i−1]) < α − WS(M[i..m])
Lemma 3 Let u be a good word for α. Then for all v in uΣ^{m−|u|}, we have Score(v, M) ≥ α.
Lemma 4 Let u be a string of Σ^{m} such that Score(u, M) ≥ α. Then there exists a unique prefix v of u such that v is good for α.
Proof. We first remark that if Score(u, M) ≥ α, then Score(u, M) ≥ α − WS(M[m+1..m]). So there exists at least one prefix of u satisfying the first condition of Definition 2: u itself. Now, consider a prefix v of length i such that Score(v, M[1..i]) ≥ α − WS(M[i+1..m]). Then for each letter x of Σ, we have Score(vx, M[1..i+1]) ≥ α − WS(M[i+2..m]). This comes from the fact that M(i+1, x) ≥ WS(M[i+1..m]) − WS(M[i+2..m]). This property implies that if a prefix v of u satisfies the first condition of Definition 2, then all longer prefixes also do. According to the second condition of Definition 2, it follows that only the shortest prefix v such that Score(v, M[1..i]) ≥ α − WS(M[i+1..m]) is a good word.
where p(u) denotes the probability of the string u in the background model. By definition of Q, we can deduce the expected result from Formula 3.
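The role of good words can be sketched in Python as follows. This is an illustration under assumptions (the function name `fast_pvalue` and the data layout are hypothetical, and the sketch is not claimed to match Algorithm FASTPVALUE in every detail): as soon as a prefix is guaranteed to reach α whatever its extension, its probability is added to the result and the prefix is dropped from the table.

```python
from fractions import Fraction

def fast_pvalue(M, p, alpha):
    """P-value(M, alpha), adding each prefix's probability as soon as it is guaranteed to reach alpha."""
    m = len(M)
    BS, WS = [0] * (m + 1), [0] * (m + 1)     # best / worst score of the remaining columns
    for i in range(m - 1, -1, -1):
        BS[i] = BS[i + 1] + max(M[i].values())
        WS[i] = WS[i + 1] + min(M[i].values())
    P = Fraction(0)
    Q = {0: Fraction(1)}                      # prefixes that are not yet good
    for i, col in enumerate(M):
        nxt = {}
        for s, q in Q.items():
            for x, w in col.items():
                t = s + w
                if t >= alpha - WS[i + 1]:    # good prefix: every extension reaches alpha
                    P += q * p[x]
                elif t >= alpha - BS[i + 1]:  # alpha is still reachable
                    nxt[t] = nxt.get(t, 0) + q * p[x]
        Q = nxt
    return P
```

Because good prefixes leave the table early, the table only holds scores for which the outcome is still undecided, which is where the speedup comes from.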
Permuting columns of the matrix
Algorithms 1 and 2 can also be used in combination with permuted lookahead scoring [16]. The matrix M can be transformed by permuting columns without modifying the overall score distribution. This is possible because the columns of the matrix are supposed to be independent. We show that it is also relevant for P-value calculation.
Lemma 6 Let M and N be two matrices of length m such that there exists a permutation π on {1,..., m} satisfying, for each letter x of Σ, M(i, x) = N(π_{i}, x). Then for any α, Q(M, α) = Q(N, α).
Proof. Let u be a word of Σ^{m} and let $v=u_{\pi_{1}}\ldots u_{\pi_{m}}$. By construction of N, we have Score(u, M) = Score(v, N). Since the background model is multinomial, we have p(u) = p(v). This completes the proof.
The question is how to permute the columns of a given matrix to improve the performance of the algorithms. In [6], it is suggested to sort columns by decreasing information content. We refine this rule of thumb and propose to minimize the total size of all score ranges involved in the dynamic programming decomposition for Q in Algorithm SCOREDISTRIBUTION. For each i, 1 ≤ i ≤ m, define δ_{i} as δ_{i} = BS(M[i..i]) − WS(M[i..i]).
Lemma 7 Let M be a matrix such that δ_{1} ≥ ... ≥ δ_{m}. Then M minimizes the total size of all score ranges amongst all matrices that can be obtained by permutation of M.
Proof. By Lemma 2, the score range considered at position i has size (β − α) + δ_{i+1} + ... + δ_{m}, so the total size of all score ranges is m(β − α) + $\sum_{i=2}^{m}(i-1)\delta_{i}$. Since a permutation of the matrix induces a permutation of the sequence δ_{1},..., δ_{m}, the value $\sum_{i=2}^{m}(i-1)\delta_{i}$ is minimal when δ_{1} ≥ δ_{2} ≥ ... ≥ δ_{m}.
In the remainder of this paper, we shall always assume that the matrix M has been permuted so that it fulfills the condition on (δ_{i})_{1≤i≤m} of Lemma 7. This is simply a preprocessing of the matrix that does not affect the course of the algorithms.
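This preprocessing amounts to sorting the columns by decreasing δ. The following Python sketch, with hypothetical function names and a dictionary-per-column matrix layout chosen for illustration, also exposes the permutation-dependent part of the total range size so that the ordering can be checked against other permutations on small examples.

```python
from itertools import permutations

def delta(col):
    """Width of a column: best score minus worst score."""
    return max(col.values()) - min(col.values())

def total_range_size(M):
    """Sum of (i - 1) * delta_i for 1-based i: the part of the cost that depends on the column order."""
    return sum(i * delta(col) for i, col in enumerate(M))

def permute_columns(M):
    """Reorder the columns so that delta_1 >= delta_2 >= ... >= delta_m."""
    return sorted(M, key=delta, reverse=True)
```

Columns with a wide spread of coefficients are placed first, where their width is counted the fewest times in the sum.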
Efficient algorithms for computing the P-value without error
We now come to the presentation of two exact algorithms, which are the main algorithms of this paper. In Algorithms SCOREDISTRIBUTION and FASTPVALUE, the number of accessible scores plays an essential role in the time and space complexity. As mentioned in the Background section, this number can be as large as |Σ|^{m}. In practice, it strongly depends on the involved matrix and on the way the score distribution is approximated by round matrices. The choice of the precision is critical. Algorithms SCOREDISTRIBUTION and FASTPVALUE have to compromise between accuracy, with faithful approximation, and efficiency, with rough approximation.
To overcome this problem, we propose to define successive discretized score distributions with growing accuracy. The key idea is to take advantage of the shape of the score distribution Q, and to use small granularity values only in the portions of the distribution where it is required. This is a kind of selective zooming process. Discretized score distributions are built from round matrices.
Definition 3 (Round matrix) Let M be a matrix and let ε be a positive real number, called the granularity. The round matrix M_{ε} is defined by $M_{\epsilon}(i,x)=\epsilon \left\lfloor \frac{M(i,x)}{\epsilon}\right\rfloor$, and the maximal error E associated with it is $E=\sum_{i=1}^{m}\max\{M(i,x)-M_{\epsilon}(i,x)\mid x\in \Sigma\}$.
Lemma 8 Let M be a matrix, ε the granularity, and E the associated maximal error. For each word u of Σ^{m}, we have 0 ≤ Score(u, M) − Score(u, M_{ε}) ≤ E.
Proof. This is a straightforward consequence of Definition 3 for M_{ε} and E.
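A round matrix and its maximal error are easy to compute. The Python sketch below assumes that coefficients are rounded down to a multiple of ε, which is the construction that makes the bound of Lemma 8 one-sided. The names `round_matrix` and `score` are hypothetical, and exact fractions are used to keep the example free of floating-point artifacts.

```python
from fractions import Fraction
from math import floor

def round_matrix(M, eps):
    """Return (M_eps, E): coefficients rounded down to multiples of eps, and the maximal error."""
    M_eps = [{x: eps * floor(w / eps) for x, w in col.items()} for col in M]
    E = sum(max(col[x] - r[x] for x in col) for col, r in zip(M, M_eps))
    return M_eps, E

def score(word, M):
    """Score(u, M): sum of the coefficients selected by the letters of the word."""
    return sum(col[c] for col, c in zip(M, word))
```

For any word u, the difference `score(u, M) - score(u, M_eps)` then lies between 0 and `E`, which can be verified by enumeration on small matrices.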
Lemma 9 Let M, N and N' be three matrices of length m, E, E' be two non-negative real numbers, and α, β be two scores such that α ≤ β, satisfying the following hypotheses:
(i) for each word u in Σ^{m}, Score(u, N) ≤ Score(u, M) ≤ Score(u, N) + E,
(ii) for each word u in Σ^{m}, Score(u, N') ≤ Score(u, N) ≤ Score(u, M) ≤ Score(u, N') + E',
(iii) P-value(N, α − E) = P-value(N, α),
(iv) P-value(N', β − E') = P-value(N', β).
Then ∑_{α≤t<β} Q(N, t) = ∑_{α≤t<β} Q(M, t).
Proof. It is enough to show that, for each word u of Σ^{m}, α ≤ Score(u, N) < β if, and only if, α ≤ Score(u, M) < β. This follows from the four implications below.
- If α ≤ Score(u, N), then α ≤ Score(u, M): This is a consequence of Score(u, N) ≤ Score(u, M) in (i).
- If α ≤ Score(u, M), then α ≤ Score(u, N): By hypothesis (i) on E, α ≤ Score(u, M) implies α − E ≤ Score(u, N). Since P-value(N, α − E) = P-value(N, α) with (iii), it follows that α ≤ Score(u, N).
- If Score(u, N) < β, then Score(u, M) < β: By hypothesis (ii), Score(u, N) < β implies that Score(u, N') < β. According to (iv), this ensures that Score(u, N') < β − E', which with (ii) guarantees Score(u, M) < β.
- If Score(u, M) < β, then Score(u, N) < β: This is a consequence of Score(u, N) ≤ Score(u, M) in (i).
What does this statement tell us? It provides a sufficient condition for the score distribution Q computed with a round matrix to be valid for the initial matrix M. Assume that you can observe two plateaux ending respectively at α and β in the score distribution of M_{ε}. Then the approximation of the total probability for the score range [α, β[ obtained with the round matrix is indeed the exact probability. In other words, there is no need to use smaller granularity values in this region to improve the result.
From score to P-value
The correctness of the algorithm follows from the next two lemmas. The first lemma establishes that the loop invariants hold.
Lemma 10 Throughout Algorithm 3, the variables β and P satisfy the invariant relation P = P-value(M, β).
Proof. This is a consequence of invariant 1 and invariant 2 in Algorithm 3. Both invariants hold for the initial conditions P = 0 and β = BS(M) + 1, since P-value(M, BS(M) + 1) = 0. Regarding N', choose N' = M_{ε}.
There are two cases to consider for invariant 1.
 - If s does not exist, P and β remain unchanged, so we still have P = P-value(M, β). Regarding invariant 2, if there exists such a matrix N' at the former step for M_{kε}, then it is still suitable for M_{ε}.
 - If s exists, invariant 1 implies that P is updated to P-value(M, β) + ∑_{s≤t<β} Q(M_{ε}, t).
According to Lemma 9 and invariant 2, we have ∑_{s≤t<β} Q(M_{ε}, t) = ∑_{s≤t<β} Q(M, t). Hence P = P-value(M, s). Since β is updated to s, it follows that P = P-value(M, β). Regarding invariant 2, take N' = M_{ε}.
The second lemma shows that when the stop condition is met, the final value of the variable P is indeed the expected result P-value(M, α).
Lemma 11 At the end of Algorithm 3, P = P-value(M, α).
Proof. When s = α − E, then β = α. According to Lemma 10, it implies P = P-value(M_{ε}, α). Since the stop condition implies that P-value(M_{ε}, α − E) = P-value(M_{ε}, α), Lemma 9 ensures that P-value(M_{ε}, α) = P-value(M, α).
From P-value to score
Similarly, Lemma 9 is used to design an algorithm to compute the score threshold associated with a given P-value. We first show that the score threshold obtained with a round matrix for a P-value gives some insight about the potential score interval for the initial matrix M.
Lemma 12 Let M be a matrix, ε a granularity and E the associated maximal error. Given P, 0 ≤ P ≤ 1, we have
Threshold(M_{ε}, P) ≤ Threshold(M, P) ≤ Threshold(M_{ε}, P) + E
Proof. Let β = Threshold(M_{ε}, P). According to Lemma 8, P-value(M_{ε}, β) ≥ P implies P-value(M, β) ≥ P, which yields β ≤ Threshold(M, P). So it remains to establish that Threshold(M, P) ≤ β + E. If P-value(M, β + E) = 0, then the highest accessible score for M is smaller than β + E. In this case, the expected result is straightforward. Otherwise, there exists β' such that β' is the lowest accessible score for M that is strictly greater than β + E. Since s → P-value(M, s) is a decreasing function in s, we have to verify that P-value(M, β') < P to complete the proof of the lemma. Assume that P-value(M, β') ≥ P. Let γ = min {Score(u, M_{ε}) | u ∈ Σ^{m} ∧ Score(u, M) ≥ β'}. On the one hand, the definition of γ implies that
P-value(M, β') ≤ P-value(M_{ε}, γ)    (5)
On the other hand, γ is an accessible score for M_{ε} that satisfies γ ≥ β' − E > β. By hypothesis on β, it follows that
P-value(M_{ε}, γ) < P    (6)
Equations 5 and 6 contradict the assumption that P-value(M, β') ≥ P. Thus P-value(M, β') < P.
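The interval given by Lemma 12 can be observed numerically on a small matrix. The Python sketch below is only a brute-force check written for illustration: the function names are hypothetical, the threshold is computed by enumerating all words, and the round matrix is assumed to be obtained by rounding coefficients down to multiples of ε.

```python
from fractions import Fraction
from itertools import product
from math import floor

def threshold(M, p, P):
    """Brute-force Threshold(M, P): highest accessible score whose P-value is at least P."""
    dist = {}
    for word in product(*(col.keys() for col in M)):
        s = sum(col[c] for col, c in zip(M, word))
        q = Fraction(1)
        for c in word:
            q *= p[c]
        dist[s] = dist.get(s, 0) + q
    total = 0
    for s in sorted(dist, reverse=True):
        total += dist[s]
        if total >= P:
            return s

def lemma12_bounds(M, p, P, eps):
    """Return (Threshold(M_eps, P), Threshold(M, P), E) for the rounded-down matrix."""
    M_eps = [{x: eps * floor(w / eps) for x, w in col.items()} for col in M]
    E = sum(max(col[x] - r[x] for x in col) for col, r in zip(M, M_eps))
    return threshold(M_eps, p, P), threshold(M, p, P), E
```

In practice this means that the threshold found at a coarse granularity already confines the exact threshold to a window of width E, and only that window needs to be examined at finer granularities.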
Lemma 13 Let M be a matrix, ε the granularity and E the associated maximal error. If P-value(M_{ε}, α − E) = P-value(M_{ε}, α), then P-value(M, α) = P-value(M_{ε}, α).
Proof. This is a corollary of Lemma 9 with M_{ε} in the role of N and N', and BS(M) + E in the role of β.
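Lemma 13 alone already yields a simple exact procedure: refine the granularity until no rounded score falls in [α − E, α[, then return the P-value of the round matrix. The Python sketch below illustrates this simplified idea only. It recomputes the whole distribution at each granularity and does not reproduce the score-range reduction of the algorithms described in this paper. The function names, the default values ε = 0.1 and k = 10, the `max_rounds` guard and the use of exact fractions are assumptions, and every letter is assumed to have a nonzero probability.

```python
from fractions import Fraction
from math import floor

def round_matrix(M, eps):
    """Integer matrix R such that M_eps(i, x) = eps * R[i][x], and the maximal error E."""
    R = [{x: floor(w / eps) for x, w in col.items()} for col in M]
    E = sum(max(w - eps * r[x] for x, w in col.items()) for col, r in zip(M, R))
    return R, E

def distribution(R, p):
    """Score distribution of an integer matrix by dynamic programming."""
    Q = {0: Fraction(1)}
    for col in R:
        nxt = {}
        for s, q in Q.items():
            for x, w in col.items():
                nxt[s + w] = nxt.get(s + w, 0) + q * p[x]
        Q = nxt
    return Q

def exact_pvalue(M, p, alpha, eps=Fraction(1, 10), k=10, max_rounds=12):
    """Return (P-value(M, alpha), granularity at which the test of Lemma 13 succeeded)."""
    for _ in range(max_rounds):
        R, E = round_matrix(M, eps)
        Q = distribution(R, p)
        # Lemma 13: the result is exact if no rounded score lies in [alpha - E, alpha[
        if not any(alpha - E <= eps * r < alpha for r in Q):
            return sum(q for r, q in Q.items() if eps * r >= alpha), eps
        eps /= k
    raise RuntimeError("no plateau found within the allowed number of refinements")
```

When the threshold α sits in a sparse region of the score distribution, the loop stops at a coarse granularity. A finer granularity is only needed when rounded scores crowd the interval just below α.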
Experimental Results
The ideas presented in this paper have been incorporated in a software tool called TFM-PVALUE (TFM stands for Transcription Factor Matrix). The software is written in C++ and implements the FROM PVALUE TO SCORE and FROM SCORE TO PVALUE algorithms as described in Algorithms 5 and 6, together with permuted lookahead scoring. It is available for download at [17]. In the worst case, TFM-PVALUE does not improve the theoretical complexity of the score threshold problem. This was expected from the NP-hardness proof provided in the second section. Nevertheless, experimental results show considerable speedups in practice.
Methods
We chose a multinomial background model with identically and independently distributed character symbols on the four-letter alphabet {A, C, G, T} to conduct our experiments. The decreasing step (k) in the algorithm was set to 10 and the initial granularity (ε) was set to 0.1. The test set is made of the Jaspar database of transcription factor binding sites [1]. It contains 123 matrices, whose length ranges from 4 to 30. The matrices are transformed into log-ratio matrices following the technique given in [18]. For each P-value P, we report only results for matrices whose length is suitable for P: we required that the probability of a single word be smaller than P, since a matrix of length m cannot achieve a non-zero P-value smaller than $\frac{1}{4^{m}}$. For example, matrices of length 4 have not been considered for a P-value equal to 10^{-3}, and matrices of length smaller than 10 have not been considered for a P-value equal to 10^{-6}.
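As a quick check of these length limits, the requirement that a single word be less probable than P can be worked out explicitly for the four-letter alphabet:

```latex
\frac{1}{4^{m}} < P \iff m > \log_{4}\frac{1}{P},
\qquad 4^{4} = 256 < 10^{3} < 1024 = 4^{5},
\qquad 4^{9} = 262144 < 10^{6} < 1048576 = 4^{10}.
```

So a P-value of 10^{-3} requires m ≥ 5, and a P-value of 10^{-6} requires m ≥ 10, which matches the two examples given above.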
Experimental results are concerned with the error rate depending on the chosen granularity. To estimate the error made at a given granularity, we first computed α_{ε}, the score threshold associated with the P-value for the round matrix M_{ε}, and α, the score threshold associated with the P-value for the original matrix M. We then counted the number of words whose score is between α_{ε} and α for M. Concerning the time efficiency, all computation times were measured on a 2.33 GHz Intel Core 2 Duo processor with 2 GB of main memory under Mac OS X 10.4.
Concerning FROM PVALUE TO SCORE, we also compared our results with those of algorithm LAZYDISTRIBUTION described in [6]. To the best of our knowledge, this algorithm is the most efficient algorithm today to compute the score associated with a P-value. It uses the dynamic programming formulas of Equation 1 in a lazy way and takes advantage of permuted lookahead scoring as presented in the previous section. We implemented it in C++, like TFM-PVALUE.
Computation times for a given granularity
In this first experiment, we study the time performance of TFM-PVALUE compared to LAZYDISTRIBUTION when using the same approximation for the score distribution. So in both cases we use round matrices with the same granularity. To set a maximal granularity for TFM-PVALUE, we interrupt the loop of decreasing granularities and output the score threshold found at this granularity. We thus obtain exactly the same score threshold as LAZYDISTRIBUTION.
[Figure panels: granularity 10^{-3} and granularity 10^{-6}]
Ability to compute accurate thresholds
Granularity required for accurate computation with From P-value to score. This table indicates the granularity value that is required for FROM PVALUE TO SCORE to compute the accurate score threshold without any error. Each row of the table corresponds to a P-value: 10^{-3}, 10^{-4}, 10^{-5}, and 10^{-6}. Each cell gives the percentage of matrices for which FROM PVALUE TO SCORE ends at a granularity larger than or equal to that of the corresponding column. For example, 52.4% of the matrices need a granularity larger than or equal to 10^{-3} when computing the threshold for P-value 10^{-5}.
P-value \ Granularity  10^{-1}  10^{-2}  10^{-3}  10^{-4}  10^{-5}  10^{-6}  10^{-7}  10^{-8}  10^{-9}
10^{-3}  9  22.9  39.3  63.1  77.8  88.5  91.8  95.1  100
10^{-4}  7.7  20.2  49  70.2  85.6  92.3  97.1  99  100
10^{-5}  1.2  25.6  52.4  76.8  88.4  94.2  96.5  96.5  100
10^{-6}  5.4  42.7  66.7  82.7  94.7  96  98.7  98.7  100
Discussion and Conclusion
We performed an extensive analysis of the computation of P-values for matrices. We gave a simple proof that the From P-value to score and From score to P-value problems are NP-hard. We then presented two algorithms to solve them efficiently and accurately for real-life examples. As the problem is intrinsically difficult, the worst-case complexity is not changed, and some matrices may therefore require large computation time and memory. Fortunately, our experiments show that this arises only in very few cases. Our algorithms can be of interest for at least two tasks. First, they can be exploited to obtain significantly faster algorithms than existing ones when a loss of precision is allowed. Indeed, for the same computation time and amount of memory, our algorithms perform better than existing ones. This makes it possible to avoid precomputation of scores associated with fixed P-values, as done in some software programs [16], and to compute the desired P-value on the fly, as specified by the user. Secondly, the algorithms can be used where a score threshold must be computed with high precision, with arbitrarily low granularity, in a reasonable amount of space and time. We have thus provided a significant improvement for computing scores and P-values with high accuracy.
When running experiments on the Jaspar database, we chose a value for k, the decreasing step for successive granularities, equal to 10. A different value may be selected. With a lower decreasing step value, the algorithms stop with a more accurate granularity and so may avoid useless computations. But this leads to more iterations and thus globally to a higher runtime. With a larger decreasing step value, there are fewer iterations and the global runtime is lowered. But choosing a very large decreasing step value (more than 10^{3} for example) amounts to computing almost the complete score distribution, and the algorithms become inefficient because they do not take advantage of the reduction of the score range for which exact P-values are computed. As the algorithms are mainly based on the computation of accessible scores, the memory required is almost the same independently of the decreasing step value (as long as the value is not very large).
When we allowed for some error, such as in the first experiment, this implicitly amounts to calculating the exact score distribution, and thus the exact P-value, for the round matrix as described in Definition 3. One can choose an alternative rounding construction for the initial matrix, such as $\epsilon \times \lfloor \frac{M(i,x)}{\epsilon}+0.5\rfloor$, before running TFM-PVALUE. This leaves the course of the algorithms unchanged.
Finally, in the paper, we assumed that the background model is a multinomial model. All results, except permuted lookahead scoring, can be extended to more sophisticated random sources, such as Markov models [19]. The consequence is an increase in the computation time by a factor |Σ|^{n}, where n is the order of the Markov model. But the optimization based on successive decreasing granularities still holds.
Declarations
Acknowledgements
Part of this work was supported by PPF Bioinformatics – University Lille 1. The authors thank Mireille Regnier for fruitful discussions.
References
 1. Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research. 2004, Database issue: D91-4.
 2. Wingender E, Chen X, Hehl R, Karas I, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research. 2000, 28 (1): 316-319. 10.1093/nar/28.1.316
 3. Mount S: A catalogue of splice junction sequences. Nucleic Acids Research. 1982, 10: 459-472. 10.1093/nar/10.2.459
 4. Hulo N, Sigrist C, Saux VL, Langendijk-Genevaux P, Bordoli L, Gattiker A, Castro ED, Bucher P, Bairoch A: Recent improvements to the PROSITE database. Nucleic Acids Research. 2004, 32 (Database issue): D134-D137.
 5. Liefooghe A, Touzet H, Varré JS: Large Scale Matching for Position Weight Matrices. Proceedings 17th Annual Symposium on Combinatorial Pattern Matching (CPM), Volume 4009 of Lecture Notes in Computer Science. 2006, 401-412. Springer Verlag. http://www.springerlink.com/content/7113757vj6205067/
 6. Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006, 7:
 7. Pizzi C, Rastas P, Ukkonen E: Fast Search Algorithms for Position Specific Scoring Matrices. Proceedings of BIRD, Lecture Notes in Computer Science. 2007, 4414: 239-250.
 8. Bejerano G, Friedman N, Tishby N: Efficient Exact p-Value Computation for Small Sample, Sparse, and Surprising Categorical Data. Journal of Computational Biology. 2004, 11 (5): 867-886.
 9. Staden R: Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci. 1989, 5 (2): 89-96.
 10. Claverie JM, Audic S: The statistical significance of nucleotide position-weight matrix matches. Comput Appl Biosci. 1996, 12 (5): 431-9.
 11. Rahmann S: Dynamic Programming Algorithms for Two Statistical Problems in Computational Biology. WABI. 2003
 12. Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang M: Computing exact p-values for DNA motifs (Part I). Bioinformatics. 2007, http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/5/531
 13. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-90. 10.1101/gr.849004
 14. Garey M, Johnson D: Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979, WH Freeman and Company
 15. Malde K, Giegerich R: Calculating PSSM probabilities with lazy dynamic programming. J Funct Program. 2006, 16: 75-81. 10.1017/S0956796805005708
 16. Wu TD, Nevill-Manning CG, Brutlag DL: Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics. 2000, 16 (3): 233-44. http://bioinformatics.oxfordjournals.org/cgi/reprint/16/3/233 10.1093/bioinformatics/16.3.233
 17. TFM-PVALUE. http://bioinfo.lifl.fr/TFM/TFMpvalue
 18. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15 (7–8): 563-77. http://bioinformatics.oxfordjournals.org/cgi/reprint/15/7/563 10.1093/bioinformatics/15.7.563
 19. Huang H, Kao MCJ, Zhou X, Liu JS, Wong WH: Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J Comput Biol. 2004, 11: 1-14. 10.1089/106652704773416858
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.