### Distribution of N = (N_{0}, N_{1})

As the estimators defined in (4) are expressed as functions of **N**_{0} and **N**_{1}, we first study their distribution. Using a Gaussian approximation, we have

\mathcal{L}(N)\simeq \mathcal{N}(E,C)\left(5\right)

where, for *i*, *j* ∈ {0, 1}, **E**_{i} ∈ {\mathbb{R}}^{{d}_{i}} and **C**_{i,j} ∈ {\mathbb{R}}^{{d}_{i}} × {\mathbb{R}}^{{d}_{j}} with *d*_{i} = *k*^{m+i}. One can note that **C**_{0,0} and **C**_{1,1} are symmetric, and that ^{t}(**C**_{1,0}) = **C**_{0,1} (where ^{t} denotes the matrix transpose operator).

In the stationary case, the exact expressions of **E** and **C** can be computed according to [5].

The expectation is simply given, ∀*w* ∈ \mathcal{A}^{m}, by

\begin{array}{cc}{E}_{0}(w)=(n-m+1)\mu (w)& {E}_{1}(wa)=(n-m)\mu (w)\pi (w,a)\phantom{\rule{1em}{0ex}}\forall (w,a)\in {\mathcal{A}}^{m}\times \mathcal{A}\end{array}\left(6\right)

To keep this paper fluid, the expression of the covariance matrix **C** has been moved to appendix A. Before going forward, let us remark that substituting **E** for **N** in (4) immediately gives

\begin{array}{ccc}{\mu}_{E}=\mu & \text{and}& {\pi}_{E}=\left(1-\frac{1}{n-m+1}\right)\pi \left(7\right)\end{array}
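Equations (6) and (7) can be checked numerically. The sketch below assumes the plug-in estimators read *μ*_{N}(*w*) = **N**_{0}(*w*)/(*n* - *m* + 1) and *π*_{N}(*w*, *a*) = **N**_{1}(*wa*)/**N**_{0}(*w*), which is one natural reading of (4); the alphabet, order and parameters are illustrative toy values.

```python
# Numerical check of (6) and (7), assuming the estimators
# mu_N(w) = N0(w)/(n-m+1) and pi_N(w,a) = N1(wa)/N0(w) (our reading of (4)).
from itertools import product

A = "ab"        # toy alphabet (k = 2)
m, n = 1, 100   # Markov order and sequence length (illustrative)

mu = {"a": 0.4, "b": 0.6}                      # law of the m-words
pi = {("a", "a"): 0.7, ("a", "b"): 0.3,
      ("b", "a"): 0.2, ("b", "b"): 0.8}        # transitions w -> a

# Equation (6): expectations of the word counts
E0 = {w: (n - m + 1) * mu[w] for w in A}
E1 = {(w, a): (n - m) * mu[w] * pi[(w, a)] for w, a in product(A, A)}

# Equation (7): substituting E for N in the estimators
mu_E = {w: E0[w] / (n - m + 1) for w in A}
pi_E = {(w, a): E1[(w, a)] / E0[w] for w, a in product(A, A)}
```

With these definitions, `mu_E` equals `mu` exactly and `pi_E` equals `(1 - 1/(n-m+1)) * pi`, as stated in (7).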

### Delta method

Let us start with a simple case. We consider a single pattern which is over-represented (seen more than expected) so we have

\begin{array}{lll}{S}_{N}=-{\mathrm{log}}_{10}{F}^{+}(N)\hfill & \text{with}\hfill & {F}^{+}(N)\triangleq {\mathbb{P}}_{{\mu}_{N},{\pi}_{N}}(N\ge {N}_{\text{obs}})\hfill \end{array}\left(8\right)

where the function *F*^{+} also depends on the sequence length ℓ and the considered pattern.

If *F*^{+} is differentiable, the delta-method (a simple first-order Taylor expansion around **N** = **E**, see [13]) provides the following approximation:

{F}^{+}(N)\simeq {F}^{+}(E)+{}^{t}\nabla {F}^{+}(E)\left(N-E\right)\left(9\right)

and hence, using (7), we have

{S}_{N}\simeq S-\frac{{}^{t}\nabla {F}^{+}(E)\left(N-E\right)}{\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)}\left(10\right)

for *n* large enough. The distribution of *S*_{N} is therefore approximated by

\mathcal{L}\left({S}_{N}\right)\simeq \mathcal{N}(S,{\sigma}^{2})\left(11\right)

with

{\sigma}^{2}=\frac{{}^{t}\nabla {F}^{+}(E)\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}\nabla {F}^{+}(E)}{{\left(\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)\right)}^{2}}\left(12\right)

In consequence, computing *σ* requires us to compute both **C** (done in appendix A) and ∇*F*^{+}(**E**).
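The mechanism of the delta-method can be illustrated in one dimension. The sketch below uses a toy differentiable function `F` (not the pattern statistic itself): if *N* ~ \mathcal{N}(*E*, τ²) and *S*(*N*) = -log_{10} *F*(*N*), the delta-method predicts a standard deviation |*F*'(*E*)| τ / (ln(10) *F*(*E*)), which a Monte Carlo experiment confirms.

```python
# One-dimensional illustration of the delta method: N ~ N(E, tau^2) and
# S(N) = -log10 F(N) is approximately N(S(E), sigma^2) with
# sigma = |F'(E)| * tau / (ln(10) * F(E)).  F is a toy function.
import math
import random

def F(x):           # toy positive differentiable function
    return 1.0 / (1.0 + x * x)

def dF(x):          # its exact derivative
    return -2.0 * x / (1.0 + x * x) ** 2

E, tau = 2.0, 0.05  # mean and standard deviation of N

# delta-method standard deviation of S(N) = -log10 F(N)
sigma_delta = abs(dF(E)) * tau / (math.log(10) * F(E))

# Monte Carlo check of the approximation
random.seed(0)
sample = [-math.log10(F(random.gauss(E, tau))) for _ in range(20000)]
mean = sum(sample) / len(sample)
sigma_mc = math.sqrt(sum((s - mean) ** 2 for s in sample) / (len(sample) - 1))
```

For small τ the empirical standard deviation `sigma_mc` agrees with `sigma_delta` to within a few percent.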

### Single pattern

The exact expression of *F*^{+} is computable through many different methods [1–4] but is too complicated for ∇*F*^{+} to be derived explicitly. To overcome this problem, we propose to consider an approximation of *F*^{+}. As stated in the introduction, many kinds of approximations are available (Gaussian, binomial, compound Poisson or large deviations). In this paper, we have chosen a binomial approximation as it provides an analytically differentiable expression and is known to be a good heuristic for the problem [8].

For a single non-degenerate pattern (*i.e*. a simple word) *W* = *w*_{1} ... *w*_{h} (*w*_{i} ∈ \mathcal{A}) with *h* ≥ *m* + 1, we first denote by

*P*(**N**) = *μ*_{N}(*w*_{1} ... *w*_{m}) × *π*_{N}(*w*_{1} ... *w*_{m}, *w*_{m+1}) × ... × *π*_{N}(*w*_{h-m} ... *w*_{h-1}, *w*_{h}) (13)

the probability for *W* to occur at a given position in the sequence and then we get

{F}^{+}(N)\simeq \mathbb{P}(\mathcal{B}({\ell}_{h},P(N))\ge {N}_{\text{obs}})=\frac{\beta (P(N),{N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\left(14\right)

where \mathcal{B} denotes the binomial distribution, with ℓ_{h} = ℓ - *h* + 1, and where the β functions (complete and incomplete) and their relation to the binomial cumulative distribution function are described in appendix B.
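The binomial tail appearing in (14) is straightforward to evaluate. A minimal stdlib-only sketch, summing the binomial pmf directly rather than going through the beta-function form of appendix B (the two are equivalent):

```python
# Binomial approximation (14): F+(N) ~ P(B(l_h, P(N)) >= N_obs).
# The tail is computed by direct summation of the binomial pmf.
import math

def binom_upper_tail(n_obs, l_h, p):
    """P(B(l_h, p) >= n_obs), computed by summing the pmf."""
    return sum(math.comb(l_h, j) * p ** j * (1 - p) ** (l_h - j)
               for j in range(n_obs, l_h + 1))
```

For large ℓ_{h} the beta-function form of (14) (or a library routine for the regularized incomplete beta function) is preferable to direct summation.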

Note that if we consider non-overlapping occurrences instead of overlapping ones, we can still use a binomial approximation for the distribution of *N*, but the expression of *P*(**N**) is more complicated as it involves the auto-correlation polynomial of the pattern [14]. This point is not developed in this paper.

Replacing *μ*_{N} and *π*_{N} by their expressions easily gives

P(N)=\frac{1}{n-m+1}\prod _{w\in {\mathcal{A}}^{m}}\frac{{\prod }_{a\in \mathcal{A}}{N}_{1}{(wa)}^{{A}_{1}(wa)}}{{N}_{0}{(w)}^{{A}_{0}(w)}}\left(15\right)

where *A*_{1}(*wa*) counts the occurrences of the word *wa* in *W* = *w*_{1} ... *w*_{h} and *A*_{0}(*w*) counts the occurrences of the word *w* in *w*_{2} ... *w*_{h-1}. Note that in the particular case where *h* = *m* + 1, all *A*_{0}(*w*) are null and we simply get (*n* - *m* + 1) × *P*(**N**) = **N**_{1}(*W*).
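The exponents *A*_{0} and *A*_{1} and the product (15) can be sketched as below for a binary alphabet; the counts `N0`, `N1` and the length `n` are hypothetical toy values used only for illustration.

```python
# Sketch of the exponents A0, A1 and of P(N) as in (15), for a word W over
# A = {a, b} with a model of order m.  Overlapping occurrences are counted.
from itertools import product

def count_occ(word, text):
    """Number of (overlapping) occurrences of `word` in `text`."""
    return sum(text[i:i + len(word)] == word
               for i in range(len(text) - len(word) + 1))

def P_of_N(W, m, N0, N1, n):
    """Equation (15): plug-in occurrence probability of W."""
    A = "ab"
    p = 1.0 / (n - m + 1)
    for w in map("".join, product(A, repeat=m)):
        A0 = count_occ(w, W[1:-1])      # occurrences of w in w_2 .. w_{h-1}
        for a in A:
            A1 = count_occ(w + a, W)    # occurrences of wa in W
            p *= N1.get(w + a, 0) ** A1
        p /= N0[w] ** A0
    return p

# hypothetical toy counts for n = 101, m = 1 (illustration only)
N0 = {"a": 60, "b": 40}
N1 = {"aa": 30, "ab": 29, "ba": 29, "bb": 10}
p = P_of_N("aab", 1, N0, N1, 101)
```

For *W* = `aab` and *m* = 1, this reproduces (13): *P*(**N**) = *μ*_{N}(a) *π*_{N}(a, a) *π*_{N}(a, b).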

Using the derivative properties of the incomplete beta function (see appendix B for more details) we hence get

\nabla {F}^{+}(N)\simeq \frac{P{(N)}^{{N}_{\text{obs}}-1}{(1-P(N))}^{{\ell}_{h}-{N}_{\text{obs}}}}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\times \nabla P(N)\left(16\right)

so all we need is to compute ∇*P*(**N**).

For all (*w*, *a*) ∈ \mathcal{A}^{m} × \mathcal{A} we have

\frac{\partial P(N)}{\partial {N}_{0}(w)}=-\frac{{A}_{0}(w)}{{N}_{0}(w)}\times P(N)\left(17\right)

and

\frac{\partial P(N)}{\partial {N}_{1}(wa)}=\frac{{A}_{1}(wa)}{{N}_{1}(wa)}\times P(N)\left(18\right)
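These partial derivatives follow from the power form of (15) and can be checked numerically. A minimal sketch on a single-factor case (toy exponents and values; the general product behaves factor by factor in the same way), comparing the analytic formulas against centred finite differences:

```python
# Finite-difference check of the partial derivatives (17)-(18) on the
# single-factor case P = N1^A1 / ((n-m+1) * N0^A0).
A0, A1, c = 2, 3, 1.0 / 101          # toy exponents and constant 1/(n-m+1)

def P(n0, n1):
    return c * n1 ** A1 / n0 ** A0

n0, n1, eps = 40.0, 25.0, 1e-6

# analytic partials, as in (17) and (18)
dP_dn0 = -A0 / n0 * P(n0, n1)
dP_dn1 = A1 / n1 * P(n0, n1)

# centred finite differences
fd_dn0 = (P(n0 + eps, n1) - P(n0 - eps, n1)) / (2 * eps)
fd_dn1 = (P(n0, n1 + eps) - P(n0, n1 - eps)) / (2 * eps)
```

The check also confirms the signs: increasing **N**_{0}(*w*) decreases *P*(**N**) while increasing **N**_{1}(*wa*) increases it.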

If we denote by

*p* = *μ*(*w*_{1} ... *w*_{m}) × *π*(*w*_{1} ... *w*_{m}, *w*_{m+1}) × ... × *π*(*w*_{h-m} ... *w*_{h-1}, *w*_{h}) (19)

the *true* probability for *W* to occur at a given position in the sequence *X*, then we get, using (7) in (13), that

P(E)=p\times {\left(1-\frac{1}{n-m+1}\right)}^{h-m}\simeq p\left(20\right)

for *n* large enough. We hence get

\nabla {F}^{+}(E)\simeq \frac{{p}^{{N}_{\text{obs}}}{(1-p)}^{{\ell}_{h}-{N}_{\text{obs}}}}{\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\times G\left(21\right)

where ^{t}**G** = [^{t}**G**_{0} ^{t}**G**_{1}] is defined by

\begin{array}{lll}{G}_{0}(w)=-\frac{{A}_{0}(w)}{{E}_{0}(w)}\hfill & \text{and}\hfill & {G}_{1}(wa)=\frac{{A}_{1}(wa)}{{E}_{1}(wa)}\hfill \end{array}\left(22\right)

Combining the delta-method expression of *σ*^{2} with (21), we finally get

{\sigma}^{2}\simeq {\left({Q}^{+}\right)}^{2}\times {}^{t}G\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}G\left(23\right)

where

{Q}^{+}=\frac{{p}^{{N}_{\text{obs}}}{(1-p)}^{{\ell}_{h}-{N}_{\text{obs}}}}{\mathrm{ln}(10)\phantom{\rule{0.2em}{0ex}}{F}^{+}(E)\phantom{\rule{0.2em}{0ex}}\beta ({N}_{\text{obs}},{\ell}_{h}-{N}_{\text{obs}}+1)}\left(24\right)

and then a computation of *σ* is possible by plug-in. Without considering the computation of **E** and **C**, the complexity of this approach is *O*(*h*) (where *h* is the size of the pattern).

When a degenerate pattern (finite set of words) is considered instead of a single word, it is easy to adapt this method by summing the contribution *p* of each word belonging to the pattern. This point is left to the reader.

### Under-represented pattern

In the case of an under-represented pattern we have

\begin{array}{lll}{S}_{N}={\mathrm{log}}_{10}{F}^{-}(N)\hfill & \text{with}\hfill & {F}^{-}(N)\triangleq {\mathbb{P}}_{{\mu}_{N},{\pi}_{N}}(N\le {N}_{\text{obs}})\hfill \end{array}\left(25\right)

Using a binomial approximation we get

{F}^{-}(N)\simeq \mathbb{P}(\mathcal{B}({\ell}_{h},P(N))\le {N}_{\text{obs}})\left(26\right)

and, by the same method as in the over-represented case, we finally have

{\sigma}^{2}\simeq {\left({Q}^{-}\right)}^{2}\times {}^{t}G\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}G\left(27\right)

where *Q*^{-} plays, for *F*^{-}, the role *Q*^{+} plays for *F*^{+}.

### Two distinct patterns

We now consider two patterns *V* and *W* instead of one and want to study the joint distribution of *S*_{N}(*V*) and *S*_{N}(*W*), their corresponding pattern statistics.

With a similar argument to the one used in the section "delta method", it is easy to show that

\mathcal{L}\left({S}_{N}(V),{S}_{N}(W)\right)\simeq \mathcal{N}\left(\left(S(V),S(W)\right),\left(\begin{array}{cc}{\sigma}_{V}^{2}& \rho {\sigma}_{V}{\sigma}_{W}\\ \rho {\sigma}_{V}{\sigma}_{W}& {\sigma}_{W}^{2}\end{array}\right)\right)\left(28\right)

where *σ*_{V} (resp. *σ*_{W}) is the standard deviation *σ* for the pattern *V* (resp. *W*) and where the covariance term is given by

\rho {\sigma}_{V}{\sigma}_{W}={}^{t}\nabla {S}_{V}(E)\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}\nabla {S}_{W}(E)\left(29\right)

where ∇*S*_{V} (resp. ∇*S*_{W}) denotes the gradient at **E** of the pattern statistic of *V* (resp. *W*).

After using the results of sections "single pattern" and "under-represented pattern", we finally get

\rho {\sigma}_{V}{\sigma}_{W}\simeq {Q}_{V}^{\epsilon}{Q}_{W}^{\epsilon}\times {}^{t}{G}_{V}\phantom{\rule{0.2em}{0ex}}C\phantom{\rule{0.2em}{0ex}}{G}_{W}\left(30\right)

where {Q}_{V}^{\epsilon} (resp. {Q}_{W}^{\epsilon}) and **G**_{V} (resp. **G**_{W}) are the constant *Q* (*Q*^{+} or *Q*^{-}) and the vector **G** for the pattern *V* (resp. *W*).

### Simulations

It is also possible to study the empirical distribution of *S*_{N} (for one or more patterns) through simulations.

In order to do so, we first draw *M* independent sequences *Y*^{j} = {Y}_{1}^{j} ... {Y}_{n}^{j} using an order *m* stationary Markov model of parameters *π*. The complexity of this step is *O*(*M* × *n*).

For each *j* we get the frequencies **N**^{j} = ({N}_{0}^{j},{N}_{1}^{j}) (with complexity *O*(*n*) for each sequence) of the words of size *m* and *m* + 1 in the sequence *Y*^{j} and use them to compute *S*^{j} = {S}_{{N}^{j}} (exact value or approximation). Complexity here depends on the statistical method used to compute *S*^{j} (*e.g*. *O*(*h*) using a binomial approximation).
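The first two steps (drawing a sequence and counting the frequencies) can be sketched as below for an order *m* = 1 model over a binary alphabet; the parameters are illustrative toy values (the chosen *μ* happens to be the stationary law of *π*).

```python
# Sketch of one simulation step: draw a sequence from an order-1 stationary
# Markov model over A = {a, b}, then count the frequencies N0 and N1.
import random

def simulate(n, mu, pi, rng):
    """Draw Y_1 .. Y_n with initial law mu and transition pi."""
    y = [rng.choices(list(mu), weights=list(mu.values()))[0]]
    for _ in range(n - 1):
        w = y[-1]
        y.append(rng.choices("ab", weights=[pi[(w, "a")], pi[(w, "b")]])[0])
    return "".join(y)

def count_N(y, m):
    """Frequencies N0 and N1 of the m- and (m+1)-words of y."""
    N0, N1 = {}, {}
    for i in range(len(y) - m + 1):
        N0[y[i:i + m]] = N0.get(y[i:i + m], 0) + 1
    for i in range(len(y) - m):
        N1[y[i:i + m + 1]] = N1.get(y[i:i + m + 1], 0) + 1
    return N0, N1

rng = random.Random(42)
mu = {"a": 0.4, "b": 0.6}   # stationary law of the toy transition below
pi = {("a", "a"): 0.7, ("a", "b"): 0.3, ("b", "a"): 0.2, ("b", "b"): 0.8}
y = simulate(1000, mu, pi, rng)
N0, N1 = count_N(y, 1)
```

Repeating this *M* times and computing *S*^{j} from each (**N**_{0}^{j}, **N**_{1}^{j}) yields the empirical sample discussed below.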

We now have an *M*-sample *S*^{1}, ..., *S*^{M} of *S*_{N} from which we can easily estimate *σ* and thus validate or invalidate the approximation obtained through the delta-method.

When used with large values of *n* (*e.g*. several millions or more), this approach is slowed down by the drawing of the sequences *Y*^{j}. It is therefore possible to improve the method by simulating the frequencies **N** directly through (5). As this approximation has a very small impact on the distribution of *S*_{N} (data not shown), it may dramatically speed up the computations when considering large *n* or *M*. It is nevertheless important to point out that drawing a Gaussian vector of size *L* requires precomputing the Cholesky decomposition of its covariance matrix, which could be a limiting factor when considering large *L*.