Fig. 1 | Algorithms for Molecular Biology

Fig. 1

From: Phylogeny reconstruction based on the length distribution of k-mismatch common substrings

k-mismatch common substrings with \(k=2\). For position \(i=5\) in \(S_1\), kmacs searches for the longest substring of \(S_1\) starting at \(i\) that exactly matches a substring of \(S_2\). This is the substring starting at \(i^*=2\) in \(S_2\) (matching substrings shown in red). It then extends this match without gaps until the \((k+1)\)-th mismatch is reached. In this example, the k-mismatch common substring consists of the red, blue and green substrings and has length 12. In the paper, the lengths of these k-mismatch common substrings are modelled by the random variables \(X_i^{(k)}\), defined in (1). The original version of kmacs uses the average length of these k-mismatch common substrings to assign a distance value to a pair of sequences. In our modified implementation of kmacs, we instead consider the k-mismatch extension of the longest common substring at \(i\). That is, the program returns the length of the k-mismatch substring match that starts after the first mismatch following the longest common substring. In our example, for \(i=5\), this is the substring match starting with ‘T’ at position 11 in \(S_1\) and at position 8 in \(S_2\), consisting of the blue, green and orange matches; the length of this k-mismatch extension is 9. The lengths of these k-mismatch extensions are modelled by the random variables \(\hat{X}_i^{(k)}\), defined in (16).
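The two quantities described in the caption can be illustrated with a minimal Python sketch. Several things here are assumptions rather than part of the paper: the function names are hypothetical, the search for the longest exact match is a naive quadratic scan rather than the index structure kmacs itself uses, mismatch positions are assumed to count towards the reported length, and the toy sequences `S1` and `S2` are invented so that they reproduce the positions and lengths quoted in the caption (they are not the sequences drawn in the figure). Indices in the code are 0-based, so position \(i=5\) in the caption corresponds to index 4.

```python
def longest_exact_match(s1, s2, i):
    """Return (length, j): the longest substring of s1 starting at i that
    exactly matches a substring of s2, and its start j in s2 (naive scan)."""
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        length = 0
        while (i + length < len(s1) and j + length < len(s2)
               and s1[i + length] == s2[j + length]):
            length += 1
        if length > best_len:
            best_len, best_j = length, j
    return best_len, best_j


def extend_with_mismatches(s1, s2, p1, p2, k):
    """Length of the gap-free match starting at (p1, p2) that stops right
    before the (k+1)-th mismatch, or at the end of either sequence.
    Assumption: the k tolerated mismatch positions count towards the length."""
    length = 0
    mismatches = 0
    while p1 + length < len(s1) and p2 + length < len(s2):
        if s1[p1 + length] != s2[p2 + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def kmismatch_common_substring_length(s1, s2, i, k):
    """Original kmacs-style quantity: exact match at i, extended with up to
    k mismatches."""
    _, j = longest_exact_match(s1, s2, i)
    if j < 0:
        return 0
    return extend_with_mismatches(s1, s2, i, j, k)


def kmismatch_extension_length(s1, s2, i, k):
    """Modified quantity: skip the exact match and the first mismatch after
    it, then measure the k-mismatch match that starts there."""
    exact_len, j = longest_exact_match(s1, s2, i)
    if j < 0:
        return 0
    return extend_with_mismatches(s1, s2, i + exact_len + 1, j + exact_len + 1, k)


# Hypothetical toy sequences, constructed to mirror the numbers in the caption.
S1 = "GGGGCACGAATGCCAGGCTA"
S2 = "TCACGATTGCGAGTCTC"

print(longest_exact_match(S1, S2, 4))                   # (5, 1)
print(kmismatch_common_substring_length(S1, S2, 4, 2))  # 12
print(kmismatch_extension_length(S1, S2, 4, 2))         # 9
```

In this toy example the extension starts at index 10 in `S1` and index 7 in `S2` (positions 11 and 8 in the caption's 1-based numbering), and both sequences have a ‘T’ there.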
