### Distribution of N = (N_{0}, N_{1})

As the estimators defined in (4) are expressed as functions of **N**_{0} and **N**_{1}, we first study their distribution. Using a Gaussian approximation, we have

\mathcal{L}(N)\simeq \mathcal{N}(E,C)\left(5\right)

where, for *i*, *j* ∈ {0, 1}, **E**_{i} ∈ {\mathbb{R}}^{{d}_{i}} and **C**_{i,j} ∈ {\mathbb{R}}^{{d}_{i}} × {\mathbb{R}}^{{d}_{j}} with *d*_{i} = *k*^{m+i}. One can note that **C**_{0,0} and **C**_{1,1} are symmetric, and that ^{t}(**C**_{1,0}) = **C**_{0,1} (where ^{t} denotes the matrix transpose operator).

In the stationary case, the exact expressions of **E** and **C** can be computed according to [5].

The expectation is simply given, ∀*w* ∈ \mathcal{A}^{m}, by

\begin{array}{cc}{E}_{0}(w)=(n-m+1)\mu (w)& {E}_{1}(wa)=(n-m)\mu (w)\pi (w,a)\phantom{\rule{1em}{0ex}}\forall (w,a)\in {\mathcal{A}}^{m}\times \mathcal{A}\end{array}\left(6\right)

To keep this paper fluid, the expression of the covariance matrix **C** has been moved to appendix A. Before going forward, let us remark that substituting **E** for **N** in (4) immediately gives

\begin{array}{ccc}{\mu}_{E}=\mu & \text{and}& {\pi}_{E}=\left(1-\frac{1}{n-m+1}\right)\pi \left(7\right)\end{array}
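Equations (6) and (7) can be checked numerically. The sketch below assumes the plug-in estimators read *μ*_{N}(*w*) = **N**_{0}(*w*)/(*n* - *m* + 1) and *π*_{N}(*w*, *a*) = **N**_{1}(*wa*)/**N**_{0}(*w*), which is one natural reading of (4); the alphabet, order and parameters are illustrative toy values.

```python
# Numerical check of (6) and (7), assuming the estimators
# mu_N(w) = N0(w)/(n-m+1) and pi_N(w,a) = N1(wa)/N0(w) (our reading of (4)).
from itertools import product

A = "ab"        # toy alphabet (k = 2)
m, n = 1, 100   # Markov order and sequence length (illustrative)

mu = {"a": 0.4, "b": 0.6}                      # law of the m-words
pi = {("a", "a"): 0.7, ("a", "b"): 0.3,
      ("b", "a"): 0.2, ("b", "b"): 0.8}        # transitions w -> a

# Equation (6): expectations of the word counts
E0 = {w: (n - m + 1) * mu[w] for w in A}
E1 = {(w, a): (n - m) * mu[w] * pi[(w, a)] for w, a in product(A, A)}

# Equation (7): substituting E for N in the estimators
mu_E = {w: E0[w] / (n - m + 1) for w in A}
pi_E = {(w, a): E1[(w, a)] / E0[w] for w, a in product(A, A)}
```

With these definitions, `mu_E` equals `mu` exactly and `pi_E` equals `(1 - 1/(n-m+1)) * pi`, as stated in (7).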

### Delta method

Let us start with a simple case. We consider a single pattern which is over-represented (seen more than expected) so we have

\begin{array}{lll}{S}_{N}=-{\mathrm{log}}_{10}{F}^{+}(N)\hfill & \text{with}\hfill & {F}^{+}(N)\triangleq {\mathbb{P}}_{{\mu}_{N},{\pi}_{N}}(N\ge {N}_{\text{obs}})\hfill \end{array}\left(8\right)

where the function *F*^{+} also depends on the sequence length ℓ and the considered pattern.

If *F*^{+} is differentiable, the delta-method (a simple first-order Taylor expansion around **N** = **E**, see [13]) provides the following approximation:

{F}^{+}(N)\simeq {F}^{+}(E)+{}^{t}\nabla {F}^{+}(E)\left(N-E\right)\left(9\right)

and hence, using (7), we have

{S}_{N}\simeq S-\frac{{}^{t}\nabla {F}^{+}(E)\left(N-E\right)}{\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)}\left(10\right)

for *n* large enough. The distribution of *S*_{N} is therefore approximated by

\mathcal{L}\left({S}_{N}\right)\simeq \mathcal{N}(S,{\sigma}^{2})\left(11\right)

with

{\sigma}^{2}=\frac{{}^{t}\nabla {F}^{+}(E)\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}\nabla {F}^{+}(E)}{{\left(\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)\right)}^{2}}\left(12\right)

In consequence, computing *σ* requires us to compute both **C** (done in appendix A) and ∇*F*^{+}(**E**).
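The mechanism of the delta-method can be illustrated in one dimension. The sketch below uses a toy differentiable function `F` (not the pattern statistic itself): if *N* ~ \mathcal{N}(*E*, τ²) and *S*(*N*) = -log_{10} *F*(*N*), the delta-method predicts a standard deviation |*F*'(*E*)| τ / (ln(10) *F*(*E*)), which a Monte Carlo experiment confirms.

```python
# One-dimensional illustration of the delta method: N ~ N(E, tau^2) and
# S(N) = -log10 F(N) is approximately N(S(E), sigma^2) with
# sigma = |F'(E)| * tau / (ln(10) * F(E)).  F is a toy function.
import math
import random

def F(x):           # toy positive differentiable function
    return 1.0 / (1.0 + x * x)

def dF(x):          # its exact derivative
    return -2.0 * x / (1.0 + x * x) ** 2

E, tau = 2.0, 0.05  # mean and standard deviation of N

# delta-method standard deviation of S(N) = -log10 F(N)
sigma_delta = abs(dF(E)) * tau / (math.log(10) * F(E))

# Monte Carlo check of the approximation
random.seed(0)
sample = [-math.log10(F(random.gauss(E, tau))) for _ in range(20000)]
mean = sum(sample) / len(sample)
sigma_mc = math.sqrt(sum((s - mean) ** 2 for s in sample) / (len(sample) - 1))
```

For small τ the empirical standard deviation `sigma_mc` agrees with `sigma_delta` to within a few percent.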

### Single pattern

The exact expression of *F*^{+} is computable through many different methods [1–4] but is too complicated for ∇*F*^{+} to be derived explicitly. To overcome this problem, we propose to consider an approximation of *F*^{+}. As stated in the introduction, many kinds of approximations are available (Gaussian, binomial, compound Poisson or large deviations). In this paper, we have chosen a binomial approximation as it provides an analytically differentiable expression and is known to be a good heuristic for the problem [8].

For a single non-degenerate pattern (*i.e*. a simple word) *W* = *w*_{1} ... *w*_{h} (*w*_{i} ∈ \mathcal{A}) with *h* ≥ *m* + 1, we first denote by

*P*(**N**) = *μ*_{N}(*w*_{1} ... *w*_{m}) × *π*_{N}(*w*_{1} ... *w*_{m}, *w*_{m+1}) × ... × *π*_{N}(*w*_{h-m} ... *w*_{h-1}, *w*_{h}) (13)

the probability for *W* to occur at a given position in the sequence and then we get

{F}^{+}(N)\simeq \mathbb{P}(\mathcal{B}({\ell}_{h},P(N))\ge {N}_{\text{obs}})=\frac{\beta (P(N),{N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\left(14\right)

where \mathcal{B} denotes the binomial distribution, with ℓ_{h} = ℓ - *h* + 1, and where the β functions (complete and incomplete) and their relation to the binomial cumulative distribution function are described in appendix B.
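The binomial tail appearing in (14) is straightforward to evaluate. A minimal stdlib-only sketch, summing the binomial pmf directly rather than going through the beta-function form of appendix B (the two are equivalent):

```python
# Binomial approximation (14): F+(N) ~ P(B(l_h, P(N)) >= N_obs).
# The tail is computed by direct summation of the binomial pmf.
import math

def binom_upper_tail(n_obs, l_h, p):
    """P(B(l_h, p) >= n_obs), computed by summing the pmf."""
    return sum(math.comb(l_h, j) * p ** j * (1 - p) ** (l_h - j)
               for j in range(n_obs, l_h + 1))
```

For large ℓ_{h} the beta-function form of (14) (or a library routine for the regularized incomplete beta function) is preferable to direct summation.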

Note that if we consider non-overlapping occurrences instead of overlapping ones, we can still use a binomial approximation for the distribution of *N*, but the expression of *P*(**N**) is more complicated as it involves the auto-correlation polynomial of the pattern [14]. This point is not developed in this paper.

Replacing *μ*_{N} and *π*_{N} by their expressions easily gives

P(N)=\frac{1}{n-m+1}\prod _{w\in {\mathcal{A}}^{m}}\frac{{\prod }_{a\in \mathcal{A}}{N}_{1}{(wa)}^{{A}_{1}(wa)}}{{N}_{0}{(w)}^{{A}_{0}(w)}}\left(15\right)

where *A*_{1}(*wa*) counts the occurrences of the word *wa* in *W* = *w*_{1} ... *w*_{h} and *A*_{0}(*w*) counts the occurrences of the word *w* in *w*_{2} ... *w*_{h-1}. Note that in the particular case where *h* = *m* + 1, all *A*_{0}(*w*) are null and we simply get (*n* - *m* + 1) × *P*(**N**) = **N**_{1}(*W*).
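The exponents *A*_{0} and *A*_{1} and the product (15) can be sketched as below for a binary alphabet; the counts `N0`, `N1` and the length `n` are hypothetical toy values used only for illustration.

```python
# Sketch of the exponents A0, A1 and of P(N) as in (15), for a word W over
# A = {a, b} with a model of order m.  Overlapping occurrences are counted.
from itertools import product

def count_occ(word, text):
    """Number of (overlapping) occurrences of `word` in `text`."""
    return sum(text[i:i + len(word)] == word
               for i in range(len(text) - len(word) + 1))

def P_of_N(W, m, N0, N1, n):
    """Equation (15): plug-in occurrence probability of W."""
    A = "ab"
    p = 1.0 / (n - m + 1)
    for w in map("".join, product(A, repeat=m)):
        A0 = count_occ(w, W[1:-1])      # occurrences of w in w_2 .. w_{h-1}
        for a in A:
            A1 = count_occ(w + a, W)    # occurrences of wa in W
            p *= N1.get(w + a, 0) ** A1
        p /= N0[w] ** A0
    return p

# hypothetical toy counts for n = 101, m = 1 (illustration only)
N0 = {"a": 60, "b": 40}
N1 = {"aa": 30, "ab": 29, "ba": 29, "bb": 10}
p = P_of_N("aab", 1, N0, N1, 101)
```

For *W* = `aab` and *m* = 1, this reproduces (13): *P*(**N**) = *μ*_{N}(a) *π*_{N}(a, a) *π*_{N}(a, b).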

Using the derivative properties of the incomplete beta function (see appendix B for more details) we hence get

\nabla {F}^{+}(N)\simeq \frac{P{(N)}^{{N}_{\text{obs}}-1}{(1-P(N))}^{{\ell}_{h}-{N}_{\text{obs}}}}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\times \nabla P(N)\left(16\right)

so all we need is to compute ∇*P*(**N**).

For all (*w*, *a*) ∈ \mathcal{A}^{m} × \mathcal{A} we have

\frac{\partial P(N)}{\partial {N}_{0}(w)}=-\frac{{A}_{0}(w)}{{N}_{0}(w)}\times P(N)\left(17\right)

and

\frac{\partial P(N)}{\partial {N}_{1}(wa)}=\frac{{A}_{1}(wa)}{{N}_{1}(wa)}\times P(N)\left(18\right)
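These partial derivatives follow from the power form of (15) and can be checked numerically. A minimal sketch on a single-factor case (toy exponents and values; the general product behaves factor by factor in the same way), comparing the analytic formulas against centred finite differences:

```python
# Finite-difference check of the partial derivatives (17)-(18) on the
# single-factor case P = N1^A1 / ((n-m+1) * N0^A0).
A0, A1, c = 2, 3, 1.0 / 101          # toy exponents and constant 1/(n-m+1)

def P(n0, n1):
    return c * n1 ** A1 / n0 ** A0

n0, n1, eps = 40.0, 25.0, 1e-6

# analytic partials, as in (17) and (18)
dP_dn0 = -A0 / n0 * P(n0, n1)
dP_dn1 = A1 / n1 * P(n0, n1)

# centred finite differences
fd_dn0 = (P(n0 + eps, n1) - P(n0 - eps, n1)) / (2 * eps)
fd_dn1 = (P(n0, n1 + eps) - P(n0, n1 - eps)) / (2 * eps)
```

The check also confirms the signs: increasing **N**_{0}(*w*) decreases *P*(**N**) while increasing **N**_{1}(*wa*) increases it.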

If we denote by

*p* = *μ*(*w*_{1} ... *w*_{m}) × *π*(*w*_{1} ... *w*_{m}, *w*_{m+1}) × ... × *π*(*w*_{h-m} ... *w*_{h-1}, *w*_{h}) (19)

the *true* probability for *W* to occur at a given position in the sequence *X*, then we get, using (7) in (13), that

P(E)=p\times {\left(1-\frac{1}{n-m+1}\right)}^{h-m}\simeq p\left(20\right)

for *n* large enough. We hence get

\nabla {F}^{+}(E)\simeq \frac{{p}^{{N}_{\text{obs}}}{(1-p)}^{{\ell}_{h}-{N}_{\text{obs}}}}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\times G\left(21\right)

where ^{t}**G** = [^{t}**G**_{0} ^{t}**G**_{1}] is defined by

\begin{array}{lll}{G}_{0}(w)=-\frac{{A}_{0}(w)}{{E}_{0}(w)}\hfill & \text{and}\hfill & {G}_{1}(wa)=\frac{{A}_{1}(wa)}{{E}_{1}(wa)}\hfill \end{array}\left(22\right)

Combining the delta-method expression of *σ*^{2} with (21), we finally get

{\sigma}^{2}\simeq {\left({Q}^{+}\right)}^{2}\times {}^{t}G\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}G\left(23\right)

where

{Q}^{+}=\frac{{p}^{{N}_{\text{obs}}}{(1-p)}^{{\ell}_{h}-{N}_{\text{obs}}}}{\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)\phantom{\rule{0.2em}{0ex}}\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\left(24\right)

and then a computation of *σ* is possible by plug-in. Without considering the computation of **E** and **C**, the complexity of this approach is *O*(*h*) (where *h* is the size of the pattern).

When a degenerate pattern (finite set of words) is considered instead of a single word, it is easy to adapt this method by summing the contribution *p* of each word belonging to the pattern. This point is left to the reader.

### Under-represented pattern

In the case of an under-represented pattern we have

\begin{array}{lll}{S}_{N}={\mathrm{log}}_{10}{F}^{-}(N)\hfill & \text{with}\hfill & {F}^{-}(N)\triangleq {\mathbb{P}}_{{\mu}_{N},{\pi}_{N}}(N\le {N}_{\text{obs}})\hfill \end{array}\left(25\right)

Using a binomial approximation we get

{F}^{-}(N)\simeq \mathbb{P}(\mathcal{B}({\ell}_{h},P(N))\le {N}_{\text{obs}})\left(26\right)

and, by the same method as in the over-represented case, we finally have

{\sigma}^{2}\simeq {\left({Q}^{-}\right)}^{2}\times {}^{t}G\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}G\left(27\right)

where *Q*^{-} plays, for *F*^{-}, the role *Q*^{+} plays for *F*^{+}.

### Two distinct patterns

We now consider two patterns *V* and *W* instead of one and want to study the joint distribution of *S*_{N}(*V*) and *S*_{N}(*W*), their corresponding pattern statistics.

With a similar argument to the one used in the section "delta method", it is easy to show that

\mathcal{L}\left({S}_{N}(V),{S}_{N}(W)\right)\simeq \mathcal{N}\left(\left(S(V),S(W)\right),\left(\begin{array}{cc}{\sigma}_{V}^{2}& \rho {\sigma}_{V}{\sigma}_{W}\\ \rho {\sigma}_{V}{\sigma}_{W}& {\sigma}_{W}^{2}\end{array}\right)\right)\left(28\right)

where *σ*_{V} (resp. *σ*_{W}) is the standard deviation *σ* for the pattern *V* (resp. *W*) and where the covariance term is given by

\rho {\sigma}_{V}{\sigma}_{W}={}^{t}\nabla {S}_{V}(E)\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}\nabla {S}_{W}(E)\left(29\right)

where ∇*S*_{V} (resp. ∇*S*_{W}) denotes the gradient at **E** of the pattern statistic of *V* (resp. *W*).

After using the results of sections "single pattern" and "under-represented pattern", we finally get

\rho {\sigma}_{V}{\sigma}_{W}\simeq {Q}_{V}^{\epsilon}{Q}_{W}^{\epsilon}\times {}^{t}{G}_{V}\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}{G}_{W}\left(30\right)

where {Q}_{V}^{\epsilon} (resp. {Q}_{W}^{\epsilon}) and **G**_{V} (resp. **G**_{W}) are the constant *Q* (*Q*^{+} or *Q*^{-}) and the vector **G** for the pattern *V* (resp. *W*).

### Simulations

It is also possible to study the empirical distribution of *S*_{N} (for one or more patterns) through simulations.

In order to do so, we first draw *M* independent sequences *Y*^{j} = {Y}_{1}^{j} ... {Y}_{n}^{j} using an order *m* stationary Markov model of parameters *π*. The complexity of this step is *O*(*M* × *n*).

For each *j* we get the frequencies **N**^{j} = ({N}_{0}^{j},{N}_{1}^{j}) (with complexity *O*(*n*) for each sequence) of the words of size *m* and *m* + 1 in the sequence *Y*^{j} and use them to compute *S*^{j} = {S}_{{N}^{j}} (exact value or approximation). Complexity here depends on the statistical method used to compute *S*^{j} (*e.g*. *O*(*h*) using a binomial approximation).
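The first two steps (drawing a sequence and counting the frequencies) can be sketched as below for an order *m* = 1 model over a binary alphabet; the parameters are illustrative toy values (the chosen *μ* happens to be the stationary law of *π*).

```python
# Sketch of one simulation step: draw a sequence from an order-1 stationary
# Markov model over A = {a, b}, then count the frequencies N0 and N1.
import random

def simulate(n, mu, pi, rng):
    """Draw Y_1 .. Y_n with initial law mu and transition pi."""
    y = [rng.choices(list(mu), weights=list(mu.values()))[0]]
    for _ in range(n - 1):
        w = y[-1]
        y.append(rng.choices("ab", weights=[pi[(w, "a")], pi[(w, "b")]])[0])
    return "".join(y)

def count_N(y, m):
    """Frequencies N0 and N1 of the m- and (m+1)-words of y."""
    N0, N1 = {}, {}
    for i in range(len(y) - m + 1):
        N0[y[i:i + m]] = N0.get(y[i:i + m], 0) + 1
    for i in range(len(y) - m):
        N1[y[i:i + m + 1]] = N1.get(y[i:i + m + 1], 0) + 1
    return N0, N1

rng = random.Random(42)
mu = {"a": 0.4, "b": 0.6}   # stationary law of the toy transition below
pi = {("a", "a"): 0.7, ("a", "b"): 0.3, ("b", "a"): 0.2, ("b", "b"): 0.8}
y = simulate(1000, mu, pi, rng)
N0, N1 = count_N(y, 1)
```

Repeating this *M* times and computing *S*^{j} from each (**N**_{0}^{j}, **N**_{1}^{j}) yields the empirical sample discussed below.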

We now have an *M*-sample *S*^{1}, ..., *S*^{M} of *S*_{N} from which we can easily estimate *σ* and thus validate or invalidate the approximation obtained through the delta-method.

When used with large values of *n* (*e.g*. several millions or more), this approach is slowed down by the drawing of the sequences *Y*^{j}. It is therefore possible to improve the method by simulating the frequencies **N** directly through (5). As this approximation has a very small impact on the distribution of *S*_{N} (data not shown), it may dramatically speed up the computations when considering large *n* or *M*. It is nevertheless important to point out that drawing a Gaussian vector of size *L* requires precomputing the Cholesky decomposition of its covariance matrix, which could be a limiting factor when considering large *L*.