From now on, we assume that the positions in the sequence are independently distributed. We denote by p(x) the background probability associated with the letter x of the alphabet Σ. By extension, we write p(u) for the probability of the word u = u_1 ... u_m: p(u) = p(u_1) × ⋯ × p(u_m).
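Under this independence assumption, the probability of a word is simply the product of its letter probabilities. As a minimal sketch (the alphabet and probabilities below are hypothetical examples, not taken from the paper):

```python
# Hypothetical DNA-like background model; the probabilities are examples only.
p = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def word_probability(u, p):
    """p(u) = p(u_1) * ... * p(u_m) under positional independence."""
    prob = 1.0
    for x in u:
        prob *= p[x]
    return prob

print(word_probability("ACG", p))  # 0.3 * 0.2 * 0.2, i.e. about 0.012
```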
Definition of the score distribution
The computation of the P-value is done through the computation of the score distribution. This concept is at the core of the large majority of existing algorithms [9–11, 15]. Given a matrix M of length m and a score α, we define Q(M, α) as the probability that the background model achieves a score equal to α. In other words, Q(M, α) is the probability of the set {u ∈ Σ^m | Score(u, M) = α}. In the case where α is not an accessible score, Q(M, α) = 0.
The computation of Q is easily performed by dynamic programming. For that purpose, we need some preliminary notation. Given two integers i, j satisfying 0 ≤ i, j ≤ m, M [i..j] denotes the submatrix of M obtained by selecting only columns from i to j for all character symbols. M [i..j] is called a slice of M. By convention, if i > j, then M [i..j] is an empty matrix.
The score distribution for the slice M[1..i] is expressed from the score distribution of the previous slice M[1..i − 1] as follows.
\begin{array}{lll}Q(M[1..0],s) & = & \left\{\begin{array}{ll}1 & \text{if } s=0\\ 0 & \text{otherwise}\end{array}\right.\\ Q(M[1..i],s) & = & {\displaystyle \sum_{x\in\Sigma} Q(M[1..i-1],\, s-M(i,x))\times p(x)}\end{array}
(1)
The time complexity is in O(m × |Σ| × S), and the space complexity in O(S), where S is the number of scores that have to be visited. If the coefficients of M are natural numbers, then S is bounded by m × max{M(i, x) | x ∈ Σ, 1 ≤ i ≤ m}. Equation 1 makes it possible to solve both the From score to P-value and From P-value to score problems. Given a score α, the P-value is obtained with the relation:
\text{Pvalue}(M,\alpha )={\displaystyle \sum _{s\ge \alpha}Q(M,s)}
Conversely, given a P-value P, Threshold(M, P) is computed from Q by searching downwards from the greatest accessible score until the required P-value is reached.
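A direct implementation of this recurrence can be sketched as follows, with a dictionary holding only the accessible scores; the matrix, alphabet, and all function names here are illustrative stand-ins, not the paper's code:

```python
from collections import defaultdict

# Sketch of Equation 1. Columns of M are dicts mapping letters to scores;
# the data below is a toy example for illustration only.
def score_distribution(M, p):
    Q = {0: 1.0}                         # Q(M[1..0], s): all mass at score 0
    for col in M:
        nxt = defaultdict(float)
        for s, q in Q.items():
            for x, px in p.items():
                nxt[s + col[x]] += q * px    # recurrence of Equation 1
        Q = dict(nxt)
    return Q

def pvalue(M, p, alpha):
    """Pvalue(M, alpha): total mass of accessible scores s >= alpha."""
    return sum(q for s, q in score_distribution(M, p).items() if s >= alpha)

def threshold(M, p, P):
    """Greatest accessible score whose P-value reaches P (downward scan)."""
    Q = score_distribution(M, p)
    acc = 0.0
    for s in sorted(Q, reverse=True):
        acc += Q[s]
        if acc >= P:
            return s

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = [{"A": 1, "C": 0, "G": 0, "T": 0},
     {"A": 1, "C": 0, "G": 0, "T": 0}]
print(pvalue(M, p, 2))        # only "AA" scores 2: 1/16 = 0.0625
print(threshold(M, p, 0.4))   # scores >= 1 have mass 7/16 >= 0.4, so 1
```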
Computing the score distribution for a range of scores
Formula 1 does not explicitly state which score ranges should be taken into account in intermediate steps of the calculation of Q. To this end, we introduce the best score and the worst score of a matrix slice.
Definition 1 (Best and worst scores) Let M be a matrix. The best score of the slice M [i..j] is defined as
\text{BS}(M[i..j])={\displaystyle \sum _{k=i}^{j}\max\{M(k,x)\mid x\in \Sigma \}}
Similarly, the worst score of the slice M [i..j] is defined as
\text{WS}(M[i..j])={\displaystyle \sum _{k=i}^{j}\min\{M(k,x)\mid x\in \Sigma \}}
The notion of best scores is already present in [16], where it is used to speed up the search for occurrences of a matrix in a text. It gives rise to lookahead scoring: since the maximal remaining score is known, best scores make it possible to stop the calculation of Score(u, M) early, as soon as it is guaranteed that the score threshold cannot be achieved. This idea has been exploited in [5, 6] in the same context. Here we adapt it to the score distribution problem. Let α and β be two scores such that α ≤ β. If one wants to compute the score distribution Q for the range [α, β], then given an intermediate score s and a matrix position i, we say that Q(M[1..i], s) is useful if there exists a word v of length m − i such that α ≤ s + Score(v, M[i + 1..m]) ≤ β. Lemma 2 characterizes useful intermediate scores.
Lemma 2 Let M be a matrix of length m, let α and β be two score bounds defining a score range for which we want to compute the score distribution Q. Q(M [1..i], s) is useful if, and only if,
α − BS(M[i + 1..m]) ≤ s ≤ β − WS(M[i + 1..m])
Proof. This is a straightforward consequence of Definition 1.
This result is implemented in Algorithm SCOREDISTRIBUTION, displayed in Figure 3. The algorithm ensures that only accessible scores are visited. In practice, this is done by using a hash table for storing values of Q.
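The pruning of Lemma 2 can be sketched as follows. This is a schematic reading of the idea, not the algorithm of Figure 3 itself: a plain dict plays the role of the hash table, and the matrix, alphabet, and score range are illustrative:

```python
from collections import defaultdict

# Schematic sketch of the pruning in Lemma 2 (the actual SCOREDISTRIBUTION
# algorithm is given in Figure 3). All names and data are illustrative.
def suffix_best_worst(M):
    """BS(M[i+1..m]) and WS(M[i+1..m]) for all i, as suffix sums."""
    m = len(M)
    BS, WS = [0] * (m + 1), [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        BS[i] = BS[i + 1] + max(M[i].values())
        WS[i] = WS[i + 1] + min(M[i].values())
    return BS, WS

def score_distribution_range(M, p, alpha, beta):
    """Q(M, s) restricted to final scores s in [alpha, beta]."""
    BS, WS = suffix_best_worst(M)
    Q = {0: 1.0}
    for i, col in enumerate(M):
        nxt = defaultdict(float)
        lo, hi = alpha - BS[i + 1], beta - WS[i + 1]   # bounds of Lemma 2
        for s, q in Q.items():
            for x, px in p.items():
                t = s + col[x]
                if lo <= t <= hi:          # keep only useful scores
                    nxt[t] += q * px
        Q = dict(nxt)
    return Q

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = [{"A": 2, "C": 0, "G": 0, "T": 0},
     {"A": 2, "C": 1, "G": 0, "T": 0}]
Q = score_distribution_range(M, p, 3, 4)
# Q maps 4 -> 0.0625 (word AA) and 3 -> 0.0625 (word AC); every other
# intermediate score is pruned as useless.
```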
If one wants only to calculate the P-value of a given score, without knowing the full score distribution, Algorithm SCOREDISTRIBUTION can be further improved. We introduce a complementary optimization that leads to a significant speed-up. The idea is that for good words, we can anticipate that the final score will be above the given threshold without calculating it.
Definition 2 (Good words) Let α be a score and i be a position of M. Given u = u_1 ... u_i a word of Σ^i, we say that u is good for α if the following conditions are fulfilled:

1. Score(u, M[1..i]) ≥ α − WS(M[i + 1..m])

2. Score(u_1 ... u_{i−1}, M[1..i − 1]) < α − WS(M[i..m])
Lemma 3 Let u be a good word for α. Then for all v in uΣ^{m−|u|}, we have Score(v, M) ≥ α.
Proof. Let w in Σ^{m−|u|} be such that v = uw and let i be the length of u. We have
\begin{array}{lll}\text{Score}(v,M) & = & \text{Score}(u,M[1..i])+\text{Score}(w,M[i+1..m])\\ & \ge & \text{Score}(u,M[1..i])+\text{WS}(M[i+1..m])\\ & \ge & \alpha \end{array}
Lemma 4 Let u be a string of Σ^m such that Score(u, M) ≥ α. Then there exists a unique prefix v of u such that v is good for α.
Proof. We first remark that if Score(u, M) ≥ α, then Score(u, M) ≥ α − WS(M[m + 1..m]). So there exists at least one prefix of u satisfying the first condition of Definition 2: u itself. Now, consider a prefix v of length i such that Score(v, M[1..i]) ≥ α − WS(M[i + 1..m]). Then for each letter x of Σ, we have Score(vx, M[1..i + 1]) ≥ α − WS(M[i + 2..m]): this follows from the fact that M(i + 1, x) ≥ WS(M[i + 1..m]) − WS(M[i + 2..m]). This property implies that if a prefix v of u satisfies the first condition of Definition 2, then all longer prefixes do too. According to the second condition of Definition 2, it follows that only the shortest prefix v such that Score(v, M[1..i]) ≥ α − WS(M[i + 1..m]) is a good word.
Lemma 5 Let M be a matrix of length m.
\text{Pvalue}(M,\alpha )={\displaystyle \sum _{\begin{array}{c}1\le i\le m,\ x\in \Sigma,\ s<\alpha -\text{WS}(M[i..m])\\ s+M(i,x)\ge \alpha -\text{WS}(M[i+1..m])\end{array}}Q(M[1..i-1],s)\times p(x)}
Proof. We consider the set \mathcal{P}(α) of words whose score is greater than or equal to α: \mathcal{P}(α) = {w ∈ Σ^m | Score(w, M) ≥ α}. According to Lemma 4, each word of \mathcal{P}(α) has a unique prefix that is good for α. Conversely, Lemma 3 ensures that each word with a prefix that is good for α belongs to \mathcal{P}(α). \mathcal{P}(α) can thus be expressed as a union of disjoint sets.
\mathcal{P}(\alpha )={\displaystyle \bigcup_{u\ \text{is good for}\ \alpha} u\,{\Sigma}^{m-|u|}}
(2)
It follows that
\text{Pvalue}(M,\alpha )={\displaystyle \sum _{u\ \text{is good for}\ \alpha} p(u)}
(3)
where p(u) denotes the probability of the string u in the background model. By definition of Q, we can deduce the expected result from Formula 3.
Lemma 5 shows that it is not necessary to build the entire dynamic programming table for Q. Only the values Q(M[1..i], s) such that s < α − WS(M[i + 1..m]) have to be computed. This gives rise to the FASTPVALUE algorithm, described in Figure 4.
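A schematic reading of this shortcut (the actual FASTPVALUE algorithm is the one in Figure 4): at column i, any transition that lifts an intermediate score to at least α − WS(M[i + 1..m]) completes a good word, so its probability mass is credited to the P-value at once and the score is never propagated further. Names and data below are illustrative:

```python
from collections import defaultdict

# Hedged sketch of the good-word shortcut of Lemma 5, not the paper's code.
def fast_pvalue(M, p, alpha):
    m = len(M)
    WS = [0] * (m + 1)                    # WS[i] = WS(M[i+1..m])
    for i in range(m - 1, -1, -1):
        WS[i] = WS[i + 1] + min(M[i].values())
    pv = 0.0
    Q = {0: 1.0}                          # intermediate scores still undecided
    for i, col in enumerate(M):
        nxt = defaultdict(float)
        for s, q in Q.items():
            for x, px in p.items():
                t = s + col[x]
                if t >= alpha - WS[i + 1]:   # good word: every completion wins
                    pv += q * px
                else:
                    nxt[t] += q * px         # undecided: keep propagating
        Q = dict(nxt)
    return pv

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M = [{"A": 1, "C": 0, "G": 0, "T": 0},
     {"A": 1, "C": 0, "G": 0, "T": 0}]
print(fast_pvalue(M, p, 2))   # only "AA" reaches 2: 1/16 = 0.0625
```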
Permuting columns of the matrix
Algorithms 1 and 2 can also be used in combination with permuted lookahead scoring [16]. The matrix M can be transformed by permuting its columns without modifying the overall score distribution. This is possible because the columns of the matrix are assumed to be independent. We show that this transformation is also relevant for P-value calculation.
Lemma 6 Let M and N be two matrices of length m such that there exists a permutation π on {1,..., m} satisfying, for each letter x of Σ, M(i, x) = N(π_i, x). Then for any α, Q(M, α) = Q(N, α).
Proof. Let u be a word of Σ^m and let v = u_{π_1} ... u_{π_m}. By construction of N, we have Score(u, M) = Score(v, N). Since the background model is multinomial, we have p(u) = p(v). This completes the proof.
The question is how to permute the columns of a given matrix to enhance the performance of the algorithms. In [6], it is suggested to sort the columns by decreasing information content. We refine this rule of thumb and propose to minimize the total size of all score ranges involved in the dynamic programming decomposition for Q in Algorithm SCOREDISTRIBUTION. For each i, 1 ≤ i ≤ m, define δ_i as δ_i = BS(M[i..i]) − WS(M[i..i]).
Lemma 7 Let M be a matrix such that δ_1 ≥ ... ≥ δ_m. Then M minimizes the total size of all score ranges amongst all matrices that can be obtained by permutation of M.
Proof. We write SR(M) for the total size of all score ranges of the matrix M. We have
\begin{array}{lll}SR(M) & = & {\displaystyle \sum_{i=1}^{m-1}\left(\beta -\text{WS}(M[i+1..m])-(\alpha -\text{BS}(M[i+1..m]))\right)+(\beta -\alpha )}\\ & = & m(\beta -\alpha )+{\displaystyle \sum_{i=2}^{m}\text{BS}(M[i..m])-\text{WS}(M[i..m])}\\ & = & m(\beta -\alpha )+{\displaystyle \sum_{i=2}^{m}\sum_{j=i}^{m}{\delta}_{j}}\\ & = & m(\beta -\alpha )+{\displaystyle \sum_{i=2}^{m}(i-1){\delta}_{i}}\end{array}
Since a permutation of the matrix induces a permutation of the sequence δ_2, ..., δ_m, the value {\displaystyle \sum_{i=2}^{m}(i-1){\delta}_{i}} is minimal when δ_1 ≥ δ_2 ≥ ... ≥ δ_m.
In the remainder of this paper, we shall always assume that the matrix M has been permuted so that it fulfills the condition on (δ_i)_{1≤i≤m} of Lemma 7. This is simply a preprocessing of the matrix that does not affect the course of the algorithms.
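This preprocessing amounts to sorting the columns by decreasing δ. A minimal sketch, with a hypothetical helper name and toy columns:

```python
# Minimal sketch of the column preprocessing suggested by Lemma 7: order the
# columns so that delta_i = BS(M[i..i]) - WS(M[i..i]) is non-increasing.
# The permutation is returned so positions can be mapped back if needed.
def sort_columns_by_delta(M):
    delta = lambda col: max(col.values()) - min(col.values())
    order = sorted(range(len(M)), key=lambda i: delta(M[i]), reverse=True)
    return [M[i] for i in order], order

M = [{"A": 1, "C": 0}, {"A": 5, "C": 0}, {"A": 3, "C": 0}]
N, perm = sort_columns_by_delta(M)
print(perm)   # [1, 2, 0]: deltas 5, 3, 1 in decreasing order
```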
Efficient algorithms for computing the P-value without error
We now come to the presentation of two exact algorithms, which are the main algorithms of this paper. In Algorithms SCOREDISTRIBUTION and FASTPVALUE, the number of accessible scores plays an essential role in the time and space complexity. As mentioned in the Background section, this number can be as large as |Σ|^m. In practice, it strongly depends on the matrix involved and on the way the score distribution is approximated by round matrices. The choice of the precision is critical: Algorithms SCOREDISTRIBUTION and FASTPVALUE must compromise between accuracy, with a faithful approximation, and efficiency, with a rough approximation.
To overcome this problem, we propose to define successive discretized score distributions with growing accuracy. The key idea is to take advantage of the shape of the score distribution Q, and to use small granularity values only in the portions of the distribution where it is required. This is a kind of selective zooming process. Discretized score distributions are built from round matrices.
Definition 3 (Round matrix)
Let M be a matrix of real coefficient values of length m and let ε be a positive real number. We denote by M_ε the round matrix deduced from M by rounding each value by ε:
{M}_{\epsilon}(i,x)=\epsilon \lfloor \frac{M(i,x)}{\epsilon}\rfloor
ε is called the granularity. Given ε, we can define E, the maximal error induced by M_ε.
E={\displaystyle \sum _{i=1}^{m}\max\{M(i,x)-{M}_{\epsilon}(i,x)\mid x\in \Sigma \}}
(4)
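The rounding of Definition 3 and the error bound of Equation 4 can be sketched directly; the coefficients below are illustrative:

```python
import math

# Sketch of Definition 3 and Equation 4: round each coefficient down to a
# multiple of the granularity eps, then sum the per-column maximal errors.
def round_matrix(M, eps):
    return [{x: eps * math.floor(v / eps) for x, v in col.items()} for col in M]

def max_error(M, M_eps):
    return sum(max(col[x] - col_eps[x] for x in col)
               for col, col_eps in zip(M, M_eps))

M = [{"A": 1.37, "C": -0.52}, {"A": 0.81, "C": 0.09}]
M_eps = round_matrix(M, 0.1)
E = max_error(M, M_eps)
# Each coefficient loses less than eps, so 0 <= E < m * eps; by Lemma 8 the
# score of any word under M_eps underestimates its score under M by at most E.
```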
Lemma 8 Let M be a matrix, ε the granularity, and E the associated maximal error. For each word u of Σ^m, we have 0 ≤ Score(u, M) − Score(u, M_ε) ≤ E.

Proof. This is a straightforward consequence of Definition 3 for M_ε and E.
Lemma 9
Let M, N and N' be three matrices of length m, let E, E' be two nonnegative real numbers, and let α, β be two scores such that α ≤ β, satisfying the following hypotheses:
(i) for each word u in Σ^m, Score(u, N) ≤ Score(u, M) ≤ Score(u, N) + E,
(ii) for each word u in Σ^m, Score(u, N') ≤ Score(u, N) ≤ Score(u, M) ≤ Score(u, N') + E',
(iii) Pvalue(N, α − E) = Pvalue(N, α),
(iv) Pvalue(N', β − E') = Pvalue(N', β),
then
{\displaystyle \sum _{\begin{array}{c}\alpha \le t<\beta \\ t\ \text{accessible}\end{array}}Q(N,t)}={\displaystyle \sum _{\begin{array}{c}\alpha \le t<\beta \\ t\ \text{accessible}\end{array}}Q(M,t)}
Proof. Let u be a string in Σ^m. It is enough to establish that α ≤ Score(u, N) < β if, and only if, α ≤ Score(u, M) < β. The proof is divided into four parts.

If α ≤ Score(u, N), then α ≤ Score(u, M): this is a consequence of Score(u, N) ≤ Score(u, M) in (i).

If α ≤ Score(u, M), then α ≤ Score(u, N): by hypothesis (i) on E, α ≤ Score(u, M) implies α − E ≤ Score(u, N). Since Pvalue(N, α − E) = Pvalue(N, α) by (iii), it follows that α ≤ Score(u, N).

If Score(u, N) < β, then Score(u, M) < β: by hypothesis (ii), Score(u, N) < β implies that Score(u, N') < β. According to (iv), this ensures that Score(u, N') < β − E', which with (ii) guarantees Score(u, M) < β.

If Score(u, M) < β, then Score(u, N) < β: this is a consequence of Score(u, N) ≤ Score(u, M) in (i).
What does this statement tell us? It provides a sufficient condition for the score distribution Q computed with a round matrix to be valid for the initial matrix M. Assume that one can observe two plateaux, ending respectively at α and β, in the score distribution of M_ε. Then the approximation of the total probability for the score range [α, β[ obtained with the round matrix is indeed the exact probability. In other words, there is no need to use smaller granularity values in this region to improve the result.
From score to P-value
Lemma 9 is used through a stepwise algorithm to compute the P-value of a score threshold. Let α be the score for which we want to determine the associated P-value. We estimate the score distribution Q iteratively. For that, we consider a series of round matrices M_ε for decreasing values of ε, and calculate the successive values Pvalue(M_ε, α). The efficiency of the method is guaranteed by two properties. First, we introduce a stop condition that allows us to halt as soon as it is guaranteed that the exact P-value has been reached. Second, we carefully select the relevant portions of the score distribution for which the computation should go on. This tends to restrain the score range to inspect at each step. The algorithm is displayed in Figure 5.
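The overall loop can be sketched as follows. This is only a schematic stand-in for Algorithm 3 of Figure 5: the real algorithm restricts the score ranges at each step instead of naively recomputing the whole distribution, but the stop condition is the same. All helper functions are minimal illustrations in the spirit of Lemma 9:

```python
import math
from collections import defaultdict

# Schematic stand-in for Algorithm 3: refine the granularity until the stop
# condition Pvalue(M_eps, alpha - E) == Pvalue(M_eps, alpha) certifies, via
# Lemma 9, that the computed P-value is exact for M.
def pvalue(M, p, alpha):
    Q = {0.0: 1.0}
    for col in M:
        nxt = defaultdict(float)
        for s, q in Q.items():
            for x, px in p.items():
                nxt[round(s + col[x], 9)] += q * px   # tame float drift in keys
        Q = dict(nxt)
    return sum(q for s, q in Q.items() if s >= alpha)

def pvalue_exact(M, p, alpha, eps0=1.0, k=10.0, max_steps=8):
    eps = eps0
    for _ in range(max_steps):
        M_eps = [{x: eps * math.floor(v / eps) for x, v in c.items()} for c in M]
        E = sum(max(c[x] - ce[x] for x in c) for c, ce in zip(M, M_eps))
        pv = pvalue(M_eps, p, alpha)
        if pvalue(M_eps, p, alpha - E) == pv:
            return pv                       # exact by Lemma 9 (stop condition)
        eps /= k                            # zoom in: smaller granularity
    return pv                               # gave up: eps-limited accuracy

p = {"A": 0.5, "C": 0.5}
M = [{"A": 1.5, "C": 0.0}, {"A": 1.0, "C": 0.25}]
print(pvalue_exact(M, p, 1.7))   # words AA (2.5) and AC (1.75): 0.5
```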
The correctness of the algorithm comes from the next two lemmas. The first one establishes that the loop invariants hold.
Lemma 10 Throughout Algorithm 3, the variables β and P satisfy the invariant relation P = Pvalue(M, β).
Proof. This is a consequence of invariant 1 and invariant 2 in Algorithm 3. Both invariants are valid for the initial conditions, where P = 0 and β = BS(M) + 1: Pvalue(M, BS(M) + 1) = 0. Regarding invariant 2, choose N' = M_ε.

There are two cases to consider for invariant 1.

- If s does not exist: P and β remain unchanged, so we still have P = Pvalue(M, β). Regarding invariant 2, if there exists such a matrix N' at the former step for M_{kε}, then it is still suitable for M_ε.

- If s actually exists: invariant 1 implies that P is updated to Pvalue(M, β) + ∑_{s≤t<β} Q(M_ε, t). According to Lemma 9 and invariant 2, we have ∑_{s≤t<β} Q(M_ε, t) = ∑_{s≤t<β} Q(M, t). Hence P = Pvalue(M, s). Since β is updated to s, it follows that P = Pvalue(M, β). Regarding invariant 2, take N' = M_ε.
The second Lemma shows that when the stop condition is met, the final value of the variable P is indeed the expected result Pvalue(M, α).
Lemma 11 At the end of Algorithm 3, P = Pvalue(M, α).
Proof. When s = α − E, then β = α. According to Lemma 10, it implies P = Pvalue(M_ε, α). Since the stop condition implies that Pvalue(M_ε, α − E) = Pvalue(M_ε, α), Lemma 9 ensures that Pvalue(M_ε, α) = Pvalue(M, α).
From P-value to score
Similarly, Lemma 9 is used to design an algorithm that computes the score threshold associated with a given P-value. We first show that the score threshold obtained with a round matrix for a given P-value provides some insight into the potential score interval for the initial matrix M.
Lemma 12 Let M be a matrix, ε a granularity and E the associated maximal error. Given P, 0 ≤ P ≤ 1, we have

Threshold(M_ε, P) ≤ Threshold(M, P) ≤ Threshold(M_ε, P) + E
Proof. Let β = Threshold(M_ε, P). According to Lemma 8, Pvalue(M_ε, β) ≥ P implies Pvalue(M, β) ≥ P, which yields β ≤ Threshold(M, P). So it remains to establish that Threshold(M, P) ≤ β + E. If Pvalue(M, β + E) = 0, then the highest accessible score for M is smaller than β + E. In this case, the expected result is straightforward. Otherwise, there exists β' such that β' is the lowest accessible score for M that is strictly greater than β + E. Since s → Pvalue(M, s) is a decreasing function of s, we have to verify that Pvalue(M, β') < P to complete the proof of the lemma. Assume that Pvalue(M, β') ≥ P. Let γ = min{Score(u, M_ε) | u ∈ Σ^m ∧ Score(u, M) ≥ β'}. On the one hand, the definition of γ implies that

Pvalue(M, β') ≤ Pvalue(M_ε, γ)
(5)

On the other hand, γ is an accessible score for M_ε that satisfies γ ≥ β' − E > β. By hypothesis on β, it follows that

Pvalue(M_ε, γ) < P
(6)

Inequalities 5 and 6 contradict the assumption that Pvalue(M, β') ≥ P. Thus Pvalue(M, β') < P.
The sketch of the algorithm is as follows. Let P be the desired P-value. We iteratively compute the associated score threshold for successive decreasing values of ε. At each step, we use Lemma 12 to speed up the calculation for the matrix M_ε. This lemma allows us to restrain the computation of the detailed score distribution Q to a small interval of length 2 × E. For the remainder of the distribution, we can use the FASTPVALUE algorithm. Lemma 13 ensures that when Pvalue(M_ε, α − E) = Pvalue(M_ε, α), then α is the required score value for M. The algorithm is displayed in more detail in Figure 6.
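The bracketing of Lemma 12 can be illustrated with naive helpers (schematic stand-ins, not the algorithm of Figure 6): the threshold of the round matrix confines the exact threshold to an interval of length E, which is where the detailed computation has to take place:

```python
import math
from collections import defaultdict

# Illustration of the bracketing of Lemma 12 with naive helper functions:
# Threshold(M_eps, P) <= Threshold(M, P) <= Threshold(M_eps, P) + E.
def distribution(M, p):
    Q = {0.0: 1.0}
    for col in M:
        nxt = defaultdict(float)
        for s, q in Q.items():
            for x, px in p.items():
                nxt[s + col[x]] += q * px
        Q = dict(nxt)
    return Q

def threshold(M, p, P):
    """Greatest accessible score reaching P-value P (downward scan)."""
    Q = distribution(M, p)
    acc = 0.0
    for s in sorted(Q, reverse=True):
        acc += Q[s]
        if acc >= P:
            return s

p = {"A": 0.5, "C": 0.5}
M = [{"A": 1.37, "C": -0.52}, {"A": 0.81, "C": 0.09}]
eps = 0.5
M_eps = [{x: eps * math.floor(v / eps) for x, v in c.items()} for c in M]
E = sum(max(c[x] - ce[x] for x in c) for c, ce in zip(M, M_eps))
t = threshold(M_eps, p, 0.25)
assert t <= threshold(M, p, 0.25) <= t + E   # bracketing of Lemma 12
```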
Lemma 13 Let M be a matrix, ε the granularity and E the associated maximal error. If Pvalue(M_ε, α − E) = Pvalue(M_ε, α), then Pvalue(M, α) = Pvalue(M_ε, α).
Proof. This is a corollary of Lemma 9, with M_ε in the role of both N and N', and BS(M) + E in the role of β.