Using Theorem 1, one could estimate the match probability *p*, and thereby the average number of substitutions per position, from the *empirical* average length of the *k*-mismatch common substrings returned by *kmacs*, in a moment-based approach similar to the one proposed in [21].

A problem with this moment-based approach is that, for realistic values of *L* and *p*, one has \(P(j^*=i) \ll P(j^*\not =i)\), so the above sum is heavily dominated by the ‘background’ part, i.e. by the second summand in (10). For the parameter values used in Fig. 2, for example, only 1% of the matches returned by *kmacs* represent homologies, while 99% are background noise. There are, in principle, two ways to circumvent this problem. First, one could try to separate homologous from background matches using a suitable threshold value, similar to what we did in our *Filtered Spaced Word Matches* approach [29]. This is more difficult for *k*-mismatch common substrings, however, since there can be much more overlap between homologous and background matches than for *Spaced-Word* matches, see Fig. 2.

There is an alternative to this moment-based approach, however. As can be seen in Fig. 2, the length distribution of the *k*-mismatch longest common substrings is *bimodal*, with the first peak corresponding to the background matches and the second peak corresponding to the homologous matches. We show that the number of substitutions per position can be easily estimated from the position of this second peak.

To simplify the following calculations, we ignore the longest exact match in Eq. (9), and consider only the length of the gap-free ‘extension’ of this match, see Fig. 1 for an illustration. To model the length of these *k*-mismatch *extensions*, we define random variables

$$\begin{aligned} \hat{X}^{(k)}_i = \tilde{X}_{i}^{(k+1)} - X_i = X^{(k)}_{i+X_i+1,j^*+X_i+1} \end{aligned}$$

(16)

In other words, for a position *i* in sequence \(S_1\), we are looking for the longest substring starting at *i* that exactly matches a substring of \(S_2\). If \(j^*\) is the starting position of this substring of \(S_2\), we define \(\hat{X}^{(k)}_i\) as the length of the longest possible substring of \(S_1\) starting at position \(i+ X_i + 1\) that matches a substring of \(S_2\) starting at position \(j^* + X_i + 1\) with a Hamming distance of *k*.
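To make this definition concrete, the quantities involved can be computed by brute force as in the following Python sketch. This is for illustration only, not the algorithm used by *kmacs*; the function names and toy sequences are made up for this example:

```python
def exact_match_len(s1, s2, i, j):
    """Length of the longest common prefix of s1[i:] and s2[j:]."""
    m = 0
    while i + m < len(s1) and j + m < len(s2) and s1[i + m] == s2[j + m]:
        m += 1
    return m

def k_mismatch_len(s1, s2, i, j, k):
    """Length of the longest prefix match of s1[i:] and s2[j:]
    containing at most k mismatches."""
    mismatches, m = 0, 0
    while i + m < len(s1) and j + m < len(s2):
        if s1[i + m] != s2[j + m]:
            mismatches += 1
            if mismatches > k:
                break
        m += 1
    return m

def k_mismatch_extension(s1, s2, i, k):
    """The k-mismatch extension from (16): find the longest exact match
    X_i starting at i and its start position j* in s2, then return the
    length of the gap-free k-mismatch match starting one position behind
    the mismatch that terminates the exact match."""
    j_star = max(range(len(s2)), key=lambda j: exact_match_len(s1, s2, i, j))
    x_i = exact_match_len(s1, s2, i, j_star)
    return k_mismatch_len(s1, s2, i + x_i + 1, j_star + x_i + 1, k)
```

For example, for \(S_1 = \texttt{AAACCC}\), \(S_2 = \texttt{AAATCC}\), \(i=0\) and \(k=1\), the longest exact match has length 3 with \(j^*=0\), and the 1-mismatch extension behind the terminating mismatch has length 2.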

### Theorem 2

*Let \(\hat{X}^{(k)}_i\) be defined as in (16). Then the distribution of \(\hat{X}^{(k)}_i\) is the sum of two unimodal contributions, a ‘homologous’ and a ‘background’ contribution, and the maximum of the ‘homologous’ contribution is reached at*

$$\begin{aligned} m_H = \left\lceil \frac{k}{1-p} -1 \right\rceil \end{aligned}$$

*and the maximum of the ‘background’ contribution is reached at*

$$\begin{aligned} m_B = \left\lceil \frac{k}{1-q} -1 \right\rceil \end{aligned}$$

*Proof*

As in (5), the distribution of \(\hat{X}^{(k)}_i\) conditional on \(j^*=i\) or \(j^*\not =i\), respectively, can be easily calculated as

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m | j^*=i \right) = P\left( X^{(k)}_{i+ X_i+1,i+ X_i+1} = m \right) = \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

and

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m | j^*\not = i \right) = \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

so we have

$$\begin{aligned} P\left( \hat{X}^{(k)}_i = m\right)&= P(j^* = i) \binom{m}{k} p^{m-k} (1-p)^{k+1} \nonumber \\& \quad + P(j^*\not = i) \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

(17)

For the *homologous* part

$$\begin{aligned} H_k(m) = P(j^*= i) \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

we obtain the recursion

$$\begin{aligned} H_k(m+1) = \frac{m+1}{m+1-k}\cdot p \cdot H_k(m) \end{aligned}$$

so we have \(H_k(m) \le H_k(m+1)\) if and only if

$$\begin{aligned} \frac{ m+1-k }{m+1} \le p \end{aligned}$$

(18)

Similarly, the ‘background contribution’

$$\begin{aligned} B_k(m) = P(j^*\not = i) \binom{m}{k} q^{m-k} (1-q)^{k+1} \end{aligned}$$

is increasing as long as

$$\begin{aligned} \frac{ m+1-k }{m+1} \le q \end{aligned}$$

holds, which concludes the proof of the theorem. □
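The statement of the theorem can be checked numerically. The following sketch uses arbitrarily chosen example values \(k=10\), \(p=0.85\) and \(q=0.25\) (the latter being the background match probability for DNA under a uniform model); the constant factors \(P(j^*=i)\) and \(P(j^*\not=i)\) cancel in the argmax and are omitted:

```python
import math

def term(m, k, match_prob):
    """H_k(m) or B_k(m) up to the constant factor P(j*=i) or P(j* != i)."""
    return math.comb(m, k) * match_prob**(m - k) * (1 - match_prob)**(k + 1)

k = 10
p = 0.85   # assumed match probability at homologous positions
q = 0.25   # background match probability for DNA, uniform model

# recursion H_k(m+1) = (m+1)/(m+1-k) * p * H_k(m)
for m in range(k, 120):
    assert abs(term(m + 1, k, p)
               - (m + 1) / (m + 1 - k) * p * term(m, k, p)) < 1e-12

# both peaks sit at ceil(k/(1 - match_prob) - 1)
m_H = max(range(k, 400), key=lambda m: term(m, k, p))
m_B = max(range(k, 400), key=lambda m: term(m, k, q))
assert m_H == math.ceil(k / (1 - p) - 1)  # 66 for these values
assert m_B == math.ceil(k / (1 - q) - 1)  # 13 for these values
```

With these values the background peak lies at \(m_B=13\) while the homologous peak lies at \(m_H=66\), consistent with the well-separated bimodal shape seen in Fig. 2.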

The proof of Theorem 2 gives us lower and upper bounds for *p* and an easy way to estimate *p* from the empirical length distribution of the *k*-mismatch extensions calculated by *kmacs*. Let \(m_{\max }\) be the position of the maximum of the *homologous* part of the distribution of \(\hat{X}^{(k)}_i\), i.e. we define

$$\begin{aligned} m_{\max } = \mathop{\text{argmax}}_m \binom{m}{k} p^{m-k} (1-p)^{k+1} \end{aligned}$$

Then, by inserting \(m_{\max }-1\) and \(m_{\max }\) into inequality (18), we obtain

$$\begin{aligned} \frac{m_{\max }-k}{m_{\max }} \le p \le \frac{ m_{\max }+1-k }{m_{\max }+1} \end{aligned}$$

Finally, we use (18) to estimate *p* from the position \(m_E\) of the second peak of the empirical distribution of \(\hat{X}^{(k)}_i\) as

$$\begin{aligned} \hat{p} \approx \frac{ m_E +1-k }{m_E+1} \end{aligned}$$

(19)
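A small deterministic sketch of this estimator, using the model distribution (17) in place of an empirical histogram; the weight of 1% homologous matches (cf. Fig. 2) and the values of *p*, *q* and *k* are assumptions chosen purely for illustration:

```python
import math

k = 10
p_true, q = 0.85, 0.25   # assumed homologous / background match probabilities
w_hom = 0.01             # assumed fraction of homologous matches (cf. Fig. 2)

def mixture(m):
    """Model distribution (17) with P(j* = i) replaced by w_hom."""
    hom = w_hom * math.comb(m, k) * p_true**(m - k) * (1 - p_true)**(k + 1)
    bg = (1 - w_hom) * math.comb(m, k) * q**(m - k) * (1 - q)**(k + 1)
    return hom + bg

# locate the second peak: search to the right of the background mode m_B
m_B = math.ceil(k / (1 - q) - 1)
m_E = max(range(2 * m_B, 400), key=mixture)

# estimator (19)
p_hat = (m_E + 1 - k) / (m_E + 1)
```

For these values the second peak lies at \(m_E = 66\), giving \(\hat{p} = 57/67 \approx 0.851\), close to the true value \(p = 0.85\).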

For completeness, we calculate the probability \(P(j^* = i)\). First we note that, by definition, for all *i*, we have

$$\begin{aligned} P(j^* = i) = P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i\right) \end{aligned}$$

so, with the law of total probability, Eq. (2) and the assumption that the variables \(X_{i,j}\) are independent of each other, we obtain

$$\begin{aligned} P(j^* = i) & = P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i\right) \nonumber \\ &= \sum _m P\left( X_{i,j} < X_{i,i} \quad \text { for all } j\not =i| X_{i,i} = m \right) P( X_{i,i} = m) \nonumber \\ & = \sum _m P\left( X_{i,j} < m \quad \text { for all } j\not =i \right) P( X_{i,i} = m) \nonumber \\ & = \sum _m \prod _{j\not =i}\, P( X_{i,j} < m)\, P( X_{i,i} = m) \nonumber \\ & = \sum _m (1-q^m)^{L -1} p^m (1-p) \end{aligned}$$

(20)
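Equation (20) can be evaluated numerically; a short sketch with illustrative parameter values (the truncation point of the infinite sum is an implementation choice, valid once \(p^m\) is negligible):

```python
def prob_homologous(p, q, L, max_m=5000):
    """Numerically evaluate Eq. (20):
    P(j* = i) = sum_m (1 - q^m)^(L-1) * p^m * (1 - p),
    truncated at max_m, where p^max_m is negligible."""
    return sum((1 - q**m)**(L - 1) * p**m * (1 - p) for m in range(max_m))

# illustrative values: DNA background q = 0.25, sequence length L = 10000
prob = prob_homologous(p=0.85, q=0.25, L=10_000)
```

The result is the probability that the longest exact match starting at *i* is the homologous one; it grows with *p* and shrinks with the sequence length *L*, in line with the observation \(P(j^*=i)\ll P(j^*\not=i)\) made above.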