Using Theorem 1, one could estimate the match probability *p*, and thereby the average number of substitutions per position, from the *empirical* average length of the *k*-mismatch common substrings returned by *kmacs*, in a moment-based approach similar to the one proposed in [21].

A problem with this moment-based approach is that, for realistic values of *L* and *p*, one has \(P(j^*=i) \ll P(j^*\not =i)\), so the above sum is heavily dominated by the ‘background’ part, i.e. by the second summand in (10). For the parameter values used in Fig. 2, for example, only 1% of the matches returned by *kmacs* represent homologies, while 99% are background noise. There are, in principle, two ways to circumvent this problem. First, one could try to separate homologous from background matches using a suitable threshold value, similar to what we did in our *Filtered Spaced Word Matches* approach [29]. This is more difficult for *k*-mismatch common substrings, however, since there can be much more overlap between homologous and background matches than for *Spaced-Word* matches, see Fig. 2.

There is an alternative to this moment-based approach, however. As can be seen in Fig. 2, the length distribution of the *k*-mismatch longest common substrings is *bimodal*, with the first peak corresponding to the background matches and the second peak corresponding to the homologous matches. We show that the number of substitutions per position can be easily estimated from the position of this second peak.

To simplify the following calculations, we ignore the longest exact match in Eq. (9), and consider only the length of the gap-free ‘extension’ of this match, see Fig. 1 for an illustration. To model the length of these *k*-mismatch *extensions*, we define random variables

$$\begin{aligned} \hat{X}^{(k)}_i = \tilde{X}_{i}^{(k+1)} - X_i = X^{(k)}_{i+X_i+1,j^*+X_i+1} \end{aligned}$$

(16)

In other words, for a position *i* in sequence \(S_1\), we are looking for the longest substring starting at *i* that exactly matches a substring of \(S_2\). If \(j^*\) is the starting position of this substring of \(S_2\), we define \(\hat{X}^{(k)}_i\) as the length of the longest possible substring of \(S_1\) starting at position \(i+ X_i + 1\) that matches a substring of \(S_2\) starting at position \(j^* + X_i + 1\) with a Hamming distance of *k*.
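To make this definition concrete, the quantities involved can be computed by brute force as in the following Python sketch. This is for illustration only, not the algorithm used by *kmacs*; the function names and toy sequences are made up for this example:

```python
def exact_match_len(s1, s2, i, j):
    """Length of the longest common prefix of s1[i:] and s2[j:]."""
    m = 0
    while i + m < len(s1) and j + m < len(s2) and s1[i + m] == s2[j + m]:
        m += 1
    return m

def k_mismatch_len(s1, s2, i, j, k):
    """Length of the longest prefix match of s1[i:] and s2[j:]
    containing at most k mismatches."""
    mismatches, m = 0, 0
    while i + m < len(s1) and j + m < len(s2):
        if s1[i + m] != s2[j + m]:
            mismatches += 1
            if mismatches > k:
                break
        m += 1
    return m

def k_mismatch_extension(s1, s2, i, k):
    """The k-mismatch extension from (16): find the longest exact match
    X_i starting at i and its start position j* in s2, then return the
    length of the gap-free k-mismatch match starting one position behind
    the mismatch that terminates the exact match."""
    j_star = max(range(len(s2)), key=lambda j: exact_match_len(s1, s2, i, j))
    x_i = exact_match_len(s1, s2, i, j_star)
    return k_mismatch_len(s1, s2, i + x_i + 1, j_star + x_i + 1, k)
```

For example, for \(S_1 = \texttt{AAACCC}\), \(S_2 = \texttt{AAATCC}\), \(i=0\) and \(k=1\), the longest exact match has length 3 with \(j^*=0\), and the 1-mismatch extension behind the terminating mismatch has length 2.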

### Theorem 2

*Let \(\hat{X}^{(k)}_i\) be defined as in (16). Then the distribution of \(\hat{X}^{(k)}_i\) is the sum of two unimodal contributions, a ‘homologous’ and a ‘background’ contribution, and the maximum of the ‘homologous’ contribution is reached at*

$$\begin{aligned} m_H = \left\lceil \frac{k}{1-p} -1 \right\rceil \end{aligned}$$

*and the maximum of the ‘background’ contribution is reached at*

$$\begin{aligned} m_B = \left\lceil \frac{k}{1-q} -1 \right\rceil \end{aligned}$$

*Proof*

As in (5), the distribution of \(\hat{X}^{(k)}_i\) conditional on \(j^*=i\) or \(j^*\not =i\), respectively, can be easily calculated as

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m | j^*=i \right) = P\left( X^{(k)}_{i+ X_i+1,i+ X_i+1} = m \right) = \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

and

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m | j^*\not = i \right) = \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

so we have

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m\right)&= P(j^* = i) \binom{m}{k} p^{m-k} (1-p)^{k+1} \nonumber \\& \quad + P(j^*\not = i) \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

(17)

For the *homologous* part

$$\begin{aligned} H_k(m) = P(j^*= i) \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

we obtain the recursion

$$\begin{aligned} H_k(m+1) = \frac{m+1}{m+1-k}\cdot p \cdot H_k(m) \end{aligned}$$

so we have \(H_k(m) \le H_k(m+1)\) if and only if

$$\begin{aligned} \frac{ m+1-k }{m+1} \le p \end{aligned}$$

(18)

Similarly, the ‘background contribution’

$$\begin{aligned} B_k(m) = P(j^*\not = i) \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

is increasing as long as

$$\begin{aligned} \frac{ m+1-k }{m+1} \le q \end{aligned}$$

holds, which concludes the proof of the theorem. □
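The statement of the theorem can be checked numerically. The following sketch uses arbitrarily chosen example values \(k=10\), \(p=0.85\) and \(q=0.25\) (the latter being the background match probability for DNA under a uniform model); the constant factors \(P(j^*=i)\) and \(P(j^*\not=i)\) cancel in the argmax and are omitted:

```python
import math

def term(m, k, match_prob):
    """H_k(m) or B_k(m) up to the constant factor P(j*=i) or P(j* != i)."""
    return math.comb(m, k) * match_prob**(m - k) * (1 - match_prob)**(k + 1)

k = 10
p = 0.85   # assumed match probability at homologous positions
q = 0.25   # background match probability for DNA, uniform model

# recursion H_k(m+1) = (m+1)/(m+1-k) * p * H_k(m)
for m in range(k, 120):
    assert abs(term(m + 1, k, p)
               - (m + 1) / (m + 1 - k) * p * term(m, k, p)) < 1e-12

# both peaks sit at ceil(k/(1 - match_prob) - 1)
m_H = max(range(k, 400), key=lambda m: term(m, k, p))
m_B = max(range(k, 400), key=lambda m: term(m, k, q))
assert m_H == math.ceil(k / (1 - p) - 1)  # 66 for these values
assert m_B == math.ceil(k / (1 - q) - 1)  # 13 for these values
```

With these values the background peak lies at \(m_B=13\) while the homologous peak lies at \(m_H=66\), consistent with the well-separated bimodal shape seen in Fig. 2.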

The proof of Theorem 2 gives us lower and upper bounds for *p* and an easy way to estimate *p* from the empirical length distribution of the *k*-mismatch extensions calculated by *kmacs*. Let \(m_{\max }\) be the position of the maximum of the *homologous* part of the distribution of \(\hat{X}^{(k)}_i\), i.e. we define

$$\begin{aligned} m_{\max } = \mathop{\text{argmax}}_m \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

Then, by inserting \(m_{\max }-1\) and \(m_{\max }\) into inequality (18), we obtain

$$\begin{aligned} \frac{m_{\max }-k}{m_{\max }} \le p \le \frac{ m_{\max }+1-k }{m_{\max }+1} \end{aligned}$$

Finally, we use (18) to estimate *p* from the position \(m_E\) of the second peak of the empirical distribution of \(\hat{X}^{(k)}_i\) as

$$\begin{aligned} \hat{p} \approx \frac{ m_E +1-k }{m_E+1} \end{aligned}$$

(19)
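A small deterministic sketch of this estimator, using the model distribution (17) in place of an empirical histogram; the weight of 1% homologous matches (cf. Fig. 2) and the values of *p*, *q* and *k* are assumptions chosen purely for illustration:

```python
import math

k = 10
p_true, q = 0.85, 0.25   # assumed homologous / background match probabilities
w_hom = 0.01             # assumed fraction of homologous matches (cf. Fig. 2)

def mixture(m):
    """Model distribution (17) with P(j* = i) replaced by w_hom."""
    hom = w_hom * math.comb(m, k) * p_true**(m - k) * (1 - p_true)**(k + 1)
    bg = (1 - w_hom) * math.comb(m, k) * q**(m - k) * (1 - q)**(k + 1)
    return hom + bg

# locate the second peak: search to the right of the background mode m_B
m_B = math.ceil(k / (1 - q) - 1)
m_E = max(range(2 * m_B, 400), key=mixture)

# estimator (19)
p_hat = (m_E + 1 - k) / (m_E + 1)
```

For these values the second peak lies at \(m_E = 66\), giving \(\hat{p} = 57/67 \approx 0.851\), close to the true value \(p = 0.85\).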

For completeness, we calculate the probability \(P(j^* = i)\). First we note that, by definition, for all *i*, we have

$$\begin{aligned} P(j^* = i) = P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i\right) \end{aligned}$$

so, with the law of total probability, Eq. (2) and the assumption that the variables \(X_{i,j}\) are independent of each other, we obtain

$$\begin{aligned} P(j^* = i) & = P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i\right) \nonumber \\ &= \sum _m P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i| X_{i,i} = m \right) P( X_{i,i} = m) \nonumber \\ & = \sum _m P\left( X_{i,j} < m \quad \text { for all } j\not =i \right) P( X_{i,i} = m) \nonumber \\ & = \sum _m \prod _{j\not =i}\, P( X_{i,j} < m)\, P( X_{i,i} = m) \nonumber \\ & = \sum _m (1-q^m)^{L -1} p^m (1-p) \end{aligned}$$

(20)
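Equation (20) can be evaluated numerically; a short sketch with illustrative parameter values (the truncation point of the infinite sum is an implementation choice, valid once \(p^m\) is negligible):

```python
def prob_homologous(p, q, L, max_m=5000):
    """Numerically evaluate Eq. (20):
    P(j* = i) = sum_m (1 - q^m)^(L-1) * p^m * (1 - p),
    truncated at max_m, where p^max_m is negligible."""
    return sum((1 - q**m)**(L - 1) * p**m * (1 - p) for m in range(max_m))

# illustrative values: DNA background q = 0.25, sequence length L = 10000
prob = prob_homologous(p=0.85, q=0.25, L=10_000)
```

The result is the probability that the longest exact match starting at *i* is the homologous one; it grows with *p* and shrinks with the sequence length *L*, in line with the observation \(P(j^*=i)\ll P(j^*\not=i)\) made above.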