- Research
- Open Access
Estimating evolutionary distances between genomic sequences from spaced-word matches
- Burkhard Morgenstern^{1, 2}Email author,
- Bingyao Zhu^{3},
- Sebastian Horwege^{1} and
- Chris André Leimeister^{1}
https://doi.org/10.1186/s13015-015-0032-x
© Morgenstern et al.; licensee BioMed Central. 2015
- Received: 19 November 2014
- Accepted: 6 January 2015
- Published: 11 February 2015
Abstract
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d _{ N } of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
Keywords
- k-mers
- Spaced words
- Alignment-free
- Phylogeny
- Word frequency
- Distance estimation
- Variance
- Genome comparison
Background
Alignment-free methods are increasingly used for DNA and protein sequence comparison since they are much faster than traditional alignment-based approaches [1]. Applications of alignment-free approaches include protein classification [2-5], read alignment [6-8], isoform quantification from RNAseq reads [9], sequence assembly [10], read-binning in metagenomics [11-16] or analysis of regulatory elements [17-20]. Most alignment-free algorithms are based on the word or k-mer composition of the sequences under study [21]. To measure pairwise distances between genomic or protein sequences, it is common practice to apply standard metrics such as the Euclidean or the Jensen-Shannon (JS) distance [22] to the relative word frequency vectors of the sequences.
Recently, we proposed an alternative approach to alignment-free sequence comparison. Instead of considering contiguous subwords of the input sequences, our approach considers spaced words, i.e. words containing wildcard or don’t care characters at positions defined by a pre-defined pattern P. This is similar as in the spaced-seeds approach that is used in database searching [23]. As in existing alignment-free methods, we compared the (relative) frequencies of these spaced words using standard distance measures [24]. In [25], we extended this approach by using whole sets \(\mathcal {P} = \{P_{1},\dots,P_{m}\}\) of patterns and calculating the spaced-word frequencies with respect to all patterns in . In this multiple-pattern approach, the distance between two sequences is defined as the average of the distances based on the individual patterns \(P_{i}\in {\mathcal {P}}\), see also [26]. ‘Spaced words’ have been proposed simultaneously by Onodera and Shibuya for protein classification [27] and by Ghandi et al. to study regulatory elements [28,29].
Phylogeny reconstruction is an important application of alignment-free sequence comparison. Consequently, most alignment-free methods were benchmarked by applying them to phylogeny problems [30-35]. The distance metrics used by these methods, however, are only rough measures of dissimilarity, not derived from any explicit model of molecular evolution. This may be one reason why distances calculated by alignment-free algorithms are usually not directly evaluated, but are used as input for distance-based phylogeny methods such as Neighbour-Joining [36]. The resulting tree topologies are then compared to trusted reference topologies. For applications to genomic sequences, we modified our distance d _{ N } by taking into account that sequences can contain repeats and homologies on different strands. Obviously, this is only a very rough way of evaluating sequence-comparison methods, since the resulting tree topologies not only depend on the distance values calculated by the evaluated methods, but also on the tree-reconstruction method that is applied to them. Also, comparing topologies ignores branch lengths, so the results of these benchmark studies depend only indirectly on the distance values calculated by the alignment-free methods that are to be evaluated.
Three remarkable exceptions are the papers describing K _{ r } [37], Co-phylog [38] and andi [39]. K _{ r } estimates evolutionary distances based on shortest unique substrings, Co-phylog uses so-called microalignments defined by spaced-word matches and considers the don’t care positions to estimate distances, while andi uses gap-free local alignments bounded by maximal unique matches. To our knowledge, these approaches are the only alignment-free methods so far that try to estimate phylogenetic distances in a rigorous way, based on a probabilistic model of evolution. Consequently, the authors of K _{ r }, Co-phylog and andy compared the distance values calculated by their methods directly to reference distances. Haubold et al. could show that K _{ r } can correctly estimate evolutionary distances between DNA sequences up to around 0.5 mutations per site [37].
In previous papers, we have shown, that our spaced-word approach is useful for phylogeny reconstruction. Tree topologies calculated with Neighbour-Joining based on spaced-word frequency vectors are usually superior to topologies calculated from the contiguous word frequency vectors that are used by traditional alignment-free methods [25]. Moreover, we could show that the ‘multiple-pattern approach’ leads to much better results than the ‘single-pattern approach’; these results were confirmed by Noé and Martin [40]. We also showed experimentally that distance values and tree topologies produced from spaced-word frequencies are statistically more stable than those based on contiguous words. In fact, the main difference between our spaced words and the commonly used contiguous words is that spaced-word matches at neighbouring positions are statistically less dependent on each other.
Since the aim of our previous papers was to compare (multiple) spaced-word frequencies to contiguous word frequencies, we applied the same distance metrics to our spaced-word frequencies that are applied by standard methods to k-mer frequencies, namely Jensen-Shannon and the Euclidean distance. In the present paper, we propose a new pairwise distance measure based on a probabilistic model of DNA evolution. We estimate the evolutionary distance between two nucleic-acid sequences based on the number N of space-word matches between them. We show that this distance measure is more accurate and works for more distantly related sequences than existing alignment-free distance measures. In addition, we calculate the variance of N for contiguous k-mers, as well as for spaced words using our single and multiple pattern approaches. We show that the variance of N is lower for spaced words than for contiguous words and that the variance is further reduced if multiple patterns are used.
This paper is an extended version of a manuscript that was first published in the proceedings of Workshop on Algorithms in Bioinformatics (WABI) 2013 in Wroclaw, Poland [41]. We added two extensions to our WABI paper that are crucial if our method is to be applied to real-world genomic sequences. (a) While the original version of our distance function assumed that homologies are located on the same strand of two genomes under comparison, we modified our original distance measure to account for homologies that are on different strands. (b) The number N of spaced-word matches is highly sensitive to repeats in the compared sequences, and our previously defined distance function could grossly under-estimate phylogenetic distances in the present of repeats. We therefore propose a simple modification of this distance function that is insensitive to repeats. Finally, we added more test data sets to evaluate our method.
Motifs and spaced words
As usual, for an alphabet Σ and ℓ∈N, Σ ^{ ℓ } denotes the set of all sequences of length ℓ over Σ. For a sequence S∈Σ ^{ ℓ } and 0<i≤ℓ, S[i] denotes the i-th character of S. A pattern of length ℓ is a word P∈{0,1}^{ ℓ }, i.e. a sequence over {0,1} of length ℓ. In the context of our work, a position i with P[i]=1 is called a match position while a position i with P[i]=0 is called a don’t care position. The number of all match positions in a patterns P is called the weight of P. For a pattern P of weight k, \(\hat {P}_{i}\) denotes the i-th match position and \(\hat {P} = \{\hat {P}_{1}, \dots \hat {P}_{k}\}, \hat {P}_{i} < \hat {P}_{i+1},\) denotes the set of all match positions.
A pattern is called contiguous if it consists of match positions only, a spaced word is called contiguous if the underlying pattern is contiguous. So a ‘contiguous spaced word’ is just a ‘word’ in the usual sense.
The number N of spaced-word matches for a pair of sequences with respect to a set of patterns
If q _{ a } is the relative frequency of a single character \(a\in {\mathcal {A}}\), \(q = \sum _{a\in {\mathcal {A}}}{q_{a}^{2}}\) is the background match probability, and p≥q is the match probability for a pair of ‘homologous’ positions.
Estimating evolutionary distances from the number N of spaced-word matches
If the weight of the patterns – i.e. the number of match positions – in the spaced-words approach is sufficiently large, random space-word matches can be ignored. In this case, the Jensen-Shannon distance between two DNA sequences approximates the number of (spaced) words that occur in one of the compared sequences, but not in the other one. Thus, if two sequences of length L are compared and N is the number of (spaced) words that two sequences have in common, their Jenson-Shannon distance can be approximated by L−N. Accordingly, the Euclidean distances between two sequences can be approximated by the square root of this value if the distance is small and k is large enough. For small evolutionary distances, the Jensen-Shannon distance grows therefore roughly linearly with the distance between two sequences, and this explains why it is possible to produce reasonable phylogenies based on this metric. It is clear, however, that the Jensen-Shannon distance is far from linear to the real distance for larger distances. We therefore propose an alternative estimator of the evolutionary distance between two sequences in terms of the number N of spaced-word matches between them.
as an estimator for the distance d between the sequences S _{1} and S _{2}. Note that for a model without insertions and deletions, it is, of course, not necessary to estimate the mismatch probability p from the number N of spaced-word matches. In this case, one could simply count the mismatches between two sequences and use their relative frequency as an estimator for p. The reason why we want to estimate p based on the number of spaced-word matches is that this estimate can be easily adapted to a model with insertions and deletions.
Local homologies, homologies on different strands and repeats
as an estimator for the distance between S _{1} and S _{2}. It is straight-forward, though somewhat cumbersome, to extend this estimator to the case where the homologous region contains insertions and deletions.
Note that, if local homologies between input sequences are known to the user, the best thing would be to remove the non-homologous regions of the sequences and to use the distance d _{ N } defined by equation (4) to the remaining homologous regions (which are then ‘globally’ related to each other). Nevertheless, the distance d _{ loc } might be useful in situations where the extent of homology between genomic sequences can be estimated, even though the precise location and boundaries of these homologies are unknown.
So far, we considered the case where homologies between two genomic sequences S _{1} and S _{2} are located on the same respective strand. For realistic applications, we have to take into account that homologies can occur on both strands of the DNA double helix. More importantly, we have to consider the case where a region of homology is located on one strand of S _{1}, but on the reverse strand on S _{2}. Let L _{1} and L _{2} be the lengths of S _{1} and S _{2}, respectively, with L _{1}≤L _{2}. For simplicity, we assume that the entire sequence S _{1} is homologous to a contiguous segment of S _{2} and we ignore insertions and deletions. We now assume, however, that some segments of S _{1} may align to their homologous counterpart in S _{2} while other segments of S _{2} may align to the reverse complement of their counterparts in S _{2}. The more general situation involving local homology and indels can be accounted for as discussed above.
as an estimator for the distance d between the sequences S _{1} and S _{2} if homologies on the reverse complement are taken into account.
The variance of N
Our new distance measure and other word-based distance measures depend on the number N of (spaced) word matches between sequences. To study the stability of these measures, we want to calculate the variance of N. To do so, we adapt results on the occurrence of words in a sequence as outlined in [45]. Since N can be written as the sum of all random variables \(X_{i,j}^{P}\), we need to calculate the covariances of these random variables. To simplify this, we make a further assumption on our sequence model: we assume that the four nucleotides occur with the same probability 0.25. In this case, the covariance of two random variables \(X_{i,j}^{P}\) and \(X_{i',j'}^{P'}\) can be non-zero only if i ^{′}−i=j ^{′}−j holds (note that this is not true if nucleotides have different probabilities to occur). In particular, for random variables \(X\in \mathcal {X}_{\textit {Hom}}\) and \(X'\in \mathcal {X}_{\textit {BG}}\), their covariance is zero. Thus, we only need to consider covariances of pairs of random variables \(X_{i,j}^{P}\) and \(X_{i+s,j+s}^{P'}\).
for all \(X_{i,j}^{P}, X_{i+s,j+s}^{P'}\)
Test results
Simulated DNA sequences
To evaluate the distance function d _{ N } defined by equation (4), we simulated pairs of DNA sequences with an (average) length of 100,000 and with an average of d substitutions per sequence position. More precisely, we generated sequence pairs by generating a first sequence using a Bernoulli model with probability 0.25 for each nucleotide. A second sequence was then generated from the first sequence by substituting nucleotides with a probability corresponding to the substitution frequency d, as calculated with Jukes-Cantor. We varied d between 0 and 1 and compared the distances estimated by our distance measure and by various other alignment-free programs to the ‘real’ distance d. We performed these experiments for sequence pairs without insertions and deletions and for sequence pairs where we included insertions and deletions with a probability of 1% at every position. The length of indels was randomly chosen between 1 and 50 with uniform probability.
As mentioned, K _{ r } [46] estimates evolutionary distances on the basis of a probabilistic model of evolution. In our study, K _{ r } correctly estimated the true distance d for values of around d≤0.6, this precisely corresponds to the results reported by the authors of the program. For larger distances, K _{ r } grossly overestimates the distance d, though, and the variance strongly increases. The distances estimated by Co-phylog [38] nearly coincide with the substitution rate d for values d≤0.7, then the curve flattens. Moreover, it appears that the distances calculated with Co-phylog are not much affected by indels. The distance values calculated by the program k mismatch average common substring (kmacs) that we previously developed [47] are roughly linear to the real distances d for values of up to around d=0.3. From around d=0.5 on, the curve becomes flat. With k=30 mismatches, the performance of kmacs was better than with k=0, in which case kmacs corresponds to the Average Common Substring (ACS) approach [30].
Real-world genomes
Next, we applied various distance measures to a set of 27 mitochondrial genomes from different primates that were previously used by [46] as a benchmark data set for alignment-free approaches. We used our multiple spaced-words approach with the parameters that we used in [25], that is with a pattern weight (number of match positions) of k=9 and with pattern lengths ℓ between 9 and 39, i.e. with up to 30 don’t-care positions in the patterns. For each value of ℓ, we randomly generated sets of m=100 patterns. For this data set, we used the distance d _{ RC } defined in equation (8) that takes the reverse complement of the input sequences into account. (We did not use the ‘binary’ version \(d_{\textit {RC}}^{bin}\) of our distance function, since these sequences do not contain major repeats). In addition, we used the competing approaches FFP [32], CVTree [48], K _{ r } [37], kmacs [47], ACS [30] and Co-phylog [38]. For some of these methods, program parameters need to be defined, e.g. a predefined word length or the number of allowed mismatches. For these methods we tested various parameters and used the best performing values.
To evaluate our new method for larger sequences we used two prokaryotic data sets. The first data set consists of 26 E. coli and Shigella genomes, which are very closely related and the second data set consist of 32 Roseobacter genomes, which are far more divergent. For these sequences, we used our ‘repeat-aware’ distance function \(d_{\textit {RC}}^{bin}\). As for the primate mitochondrial genomes, we calculated distance matrices using the same alignment-free methods, constructed trees with Neighbor-Joining and compared the resulting tree topologies to a benchmark tree using the RF metric. For the E.coli/Shigella genomes we used the tree proposed by [53] as reference which is based on concatenated alignments of the 2034 core genes. For the Roseobacter genomes we used the tree by [54] as reference. This benchmark tree was constructed based on alignments of 70 universal single-copy genes.
The variance of N: experimental results
Discussion
In this paper, we proposed a new estimator d _{ N } for the evolutionary distance between two DNA sequences that is based on the number N of spaced-word matches between them. While most alignment-free methods use ad-hoc distance measures, the distance function that we defined is based on a probabilistic model of evolution and seems to be a good estimator for the number of substitutions per site that have occurred since two sequences have evolved separately. For simplicity, we used a model of evolution without insertions and deletions. Nevertheless, our test results show that our distance function is still a reasonable estimator if the input sequences contain a moderate number of insertions and deletions although, in this case, distances between the input sequences are overestimated since the number N of spaced-word matches is smaller than it would be for sequences without indels.
The model that we used to derive our distance d _{ N } assumes that two sequences are globally related. If sequences share only local homology, the number N of spaced-word matches would be smaller than for globally related sequences with the same length and rate of mismatches, so their distance would be over-estimated by our distance measure d _{ N }. This is clearly a limitation of our approach. However, as indicated in section Estimating evolutionary distances from the number N of spaced-word matches, our distance function can be adapted to the case of local homologies if the length of these homologies and the number of gaps in the homologous regions can be estimated. In principle, it should therefore be possible to apply our method to locally related sequences by first estimating the extent of their shared (local) homology and then using the distance d _{ loc } defined in equation (5) instead of d _{ N }.
The distance measures introduced in this paper and other distances that we previously used for our spaced words approach depend on the number N of space-word matches between two sequences with respect to a set of patterns of ‘match’ and ‘don’t care’ positions. This is similar for more traditional alignment-free methods that calculated distances based on k-mer frequencies. While the expected number of (spaced) word matches is essentially the same for contiguous words and for spaced words of the corresponding weight, we have showed that the variance of N is considerably lower for spaced-words than for the traditionally used contiguous words. Moreover, with our multiple-pattern approach the variance of the normalized number of spaced-word matches is further reduced. This seems to be the main reason why our multiple spaced words approach outperforms the single-pattern approach that we previously introduced as well as the classical k-mer approach when used for phylogeny reconstruction.
As we have shown, the variance of N depends on the number of overlapping ‘match’ positions if patterns from are shifted against each other. Consequently, in our single-pattern approach, the variance of N is higher for periodic patterns than for non-periodic patterns. For example, for the periodic pattern 101010…, the variance is equal to the variance of the contiguous pattern of the corresponding weight. In our previous benchmark studies, we could experimentally confirm that spaced words performs better with non-periodic patterns than with periodic patterns. The theoretical results of this study may be useful to find patterns or sets of patterns that minimize the variance of N to further improve our spaced-words approach.
Declarations
Acknowledgements
We would like to thank Marcus Boden, Sebastian Lindner, Alec Guyomard and Claudine Devauchelle for help with the program evaluation and Gilles Didier for help with the software to compare distance matrices. We thank Matteo Comin, Ruth Kantorovitz, Saurabh Sinha and an unknown WABI reviewer for pointing out an an error regarding the covariance of spaced-word matches in the previous version of this manuscript that was published at WABI 2014.
Authors’ Affiliations
References
- Vinga S. Editorial: Alignment-free methods in computational biology. Briefings Bioinf. 2014; 15:341–2.View ArticleGoogle Scholar
- Leslie C, Eskin E, Noble WSS. The spectrum kernel: a string kernel for SVM protein classification. In: Pacific Symposium on Biocomputing. Singapore: World Scientific Publishing: 2002. p. 566–75.Google Scholar
- Lingner T, Meinicke P. Remote homology detection based on oligomer distances. Bioinformatics. 2006; 22:2224–31.View ArticlePubMedGoogle Scholar
- Lingner T, Meinicke P. Word correlation matrices for protein sequence analysis and remote homology detection. BMC Bioinf. 2008; 9:259.View ArticleGoogle Scholar
- Comin M, Verzotto D. The irredundant class method for remote homology detection of protein sequences. J Comput Biol. 2011; 18:1819–29.View ArticlePubMedGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008; 24:713–4.View ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol. 2009; 10:25.View ArticleGoogle Scholar
- Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X. Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 2011; 40:1.Google Scholar
- Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014; 32:462–4.View ArticlePubMed CentralPubMedGoogle Scholar
- Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18:821–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner F. Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences. BMC Bioinf. 2004; 5:163.View ArticleGoogle Scholar
- Chatterji S, Yamazaki I, Bai Z, Eisen JA. Compostbin: A DNA composition-based algorithm for binning environmental shotgun reads. In: Research in Computational Molecular Biology, 12th Annual International Conference, RECOMB 2008, Singapore, March 30 - April 2, 2008. Proceedings. Berlin, Heidelberg: Springer: 2008. p. 17–28.Google Scholar
- Wu Y-W, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011; 18:523–34.View ArticlePubMed CentralPubMedGoogle Scholar
- Tanaseichuk O, Borneman J, Jiang T. Separating metagenomic short reads into genomes via clustering. Algorithms Mol Biol. 2012; 7:27.View ArticlePubMed CentralPubMedGoogle Scholar
- Leung HCM, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, et al. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011; 27:1489–95.View ArticlePubMedGoogle Scholar
- Wang Y, Leung HCM, Yiu SM, Chin FYL. Metacluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012; 28:356–62.View ArticleGoogle Scholar
- Meinicke P, Tech M, Morgenstern B, Merkl R. Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinf. 2004; 5:169.View ArticleGoogle Scholar
- Kantorovitz M, Robinson G, Sinha S. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007; 23:249–55.View ArticleGoogle Scholar
- Leung G, Eisen MB. Identifying cis-regulatory sequences by word profile similarity. PloS one. 2009; 4(9):6901.View ArticleGoogle Scholar
- Federico M, Leoncini M, Montangero M, Valente P. Direct vs 2-stage approaches to structured motif finding. Algorithms Mol Biol. 2012; 7:20.View ArticlePubMed CentralPubMedGoogle Scholar
- Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Nat Acad Sci USA. 1986; 83:5155–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Lin J. Divergence measures based on the shannon entropy. IEEE Trans Inf theory. 1991; 37:145–51.View ArticleGoogle Scholar
- Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002; 18:440–5.View ArticlePubMedGoogle Scholar
- Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C-A, Morgenstern B. German Conference on Bioinformatics 2013 In: Beißbarth T, Kollmar M, Leha A, Morgenstern B, Schultz A-K, Waack S, Wingender E, editors. OpenAccess Series in Informatics (OASIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik: 2013. p. 24–34. http://drops.dagstuhl.de/opus/volltexte/2013/4233.Google Scholar
- Leimeister C-A, Boden M, Horwege S, Lindner S, Morgenstern B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics. 2014; 30:1991–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister C-A, et al. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 2014; 42:W7–W11.View ArticlePubMed CentralPubMedGoogle Scholar
- Onodera T, Shibuya T. The gapped spectrum kernel for support vector machines In: Perner P, editor. Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science. Berlin,Heidelberg: Springer: 2013.Google Scholar
- Ghandi M, Mohammad-Noori M, Beer MA. Robust k-mer frequency estimation using gapped k-mers. J Math Biol. 2014; 69:469–500.View ArticlePubMedGoogle Scholar
- Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014; 10(7):1003711.View ArticleGoogle Scholar
- Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006; 13:336–50.View ArticlePubMedGoogle Scholar
- Didier G, Debomy L, Pupin M, Zhang M, Grossmann A, Devauchelle C, et al. Comparing sequences without using alignments: application to HIV/SIV subtyping. BMC Bioinf. 2007; 8:1.View ArticleGoogle Scholar
- Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Nat Acad Sci. 2009; 106:2677–82.View ArticlePubMed CentralPubMedGoogle Scholar
- Domazet-Loso M, Haubold B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011; 27(11):1466–72.View ArticlePubMedGoogle Scholar
- Haubold B, Reed FA, Pfaffelhuber P. Alignment-free estimation of nucleotide diversity. Bioinformatics. 2011; 27:449–55.View ArticlePubMedGoogle Scholar
- Comin M, Verzotto D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol. 2012; 7:34.View ArticlePubMed CentralPubMedGoogle Scholar
- Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987; 4:406–25.PubMedGoogle Scholar
- Haubold B, Pierstorff N, Möller F, Wiehe T. Genome comparison without alignment using shortest unique substrings. BMC Bioinf. 2005; 6:123.View ArticleGoogle Scholar
- Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013; 41:75.View ArticleGoogle Scholar
- Haubold B, Klötzl F, Pfaffelhuber P. andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics.doi:10.1093/bioinformatics/btu815.Google Scholar
- Noé L, Martin DEK. A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances. J Comput Biol. 2014; 12:947–63.View ArticleGoogle Scholar
- Morgenstern B, Zhu B, Horwege S, Leimeister C. Estimating evolutionary distances from spaced-word matches. In: Proc. Workshop on Algorithms in Bioinformatics (WABI’14). Lecture Notes in Bioinformatics. Berlin Heidelberg.: Springer: 2014. p. 161–73.Google Scholar
- Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of k-word matches between two random sequences. Proc Nat Acad Sci. 2002; 99:13980–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (i): Statistics and power. J Comput Biol. 2009; 16:1615–34.View ArticlePubMed CentralPubMedGoogle Scholar
- Jukes TH, Cantor CR. Evolution of Protein Molecules: Academy Press, NY; 1969.Google Scholar
- Robin S, Rodolphe F, Schbath S. DNA, Words and Models: Statistics of Exceptional Words. Cambridge: Cambridge University Press; 2005.Google Scholar
- Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T. Estimating mutation distances from unaligned genomes. J Comput Biol. 2009; 16:1487–500.View ArticlePubMedGoogle Scholar
- Leimeister C-A, Morgenstern B. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014; 30:2000–8.View ArticlePubMed CentralPubMedGoogle Scholar
- Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004; 32(suppl 2):45–7.View ArticleGoogle Scholar
- Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989; 5:164–6.Google Scholar
- Bonnet E, de Peer YV. zt: A sofware tool for simple and partial mantel tests. J Stat Software. 2002; 7:1–12.Google Scholar
- Didier G, Laprevotte I, Pupin M, Hénaut A. Local decoding of sequences and alignment-free comparison. J Comput Biol. 2006; 13:1465–76.View ArticlePubMedGoogle Scholar
- Robinson D, Foulds L. Comparison of phylogenetic trees. Mathematical Biosciences. 1981; 53:131–47.View ArticleGoogle Scholar
- Zhou Z, Li X, Liu B, Beutin L, Xu J, Ren Y, et al. Derivation of Escherichia coli O157:H7 from Its O55:H7 Precursor. PLOS One. 2010; 5:8700.View ArticleGoogle Scholar
- Newton RJ, Griffin LE, Bowles KM, Meile C, Gifford S, Givens CE, et al. Genome characteristics of a generalist marine bacterial lineage. ISME J. 2010; 4:784–98.View ArticlePubMedGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.