Separating metagenomic short reads into genomes via clustering
© Tanaseichuk et al.; licensee BioMed Central Ltd. 2012
Received: 4 January 2012
Accepted: 14 September 2012
Published: 26 September 2012
The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high complexity datasets, where in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve the sequencing efficiency and cost. On the other hand, they result in shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels.
In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm can handle very short reads and sequencing errors. It is initially designed for genomes with similar abundance levels and then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm.
Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with a high precision and sensitivity.
Keywords: Metagenomics, NGS, short reads, genome separation, clustering
Metagenomics is a new field of study that provides a deeper insight into the microbial world compared to the traditional single-genome sequencing technologies. Traditional methods for studying individual genomes are well developed. However, they are not appropriate for studying microbial samples from the environment because traditional methods rely upon cultivated clonal cultures while more than 99% of bacteria are unknown and cannot be cultivated and isolated. Metagenomics uses technologies that sequence uncultured bacterial genomes in an environmental sample directly, and thus makes it possible to study organisms which cannot be isolated or are difficult to grow in a lab. It provides hope for a better understanding of natural microbial diversity as well as their roles and interactions. It also opens new opportunities for medicine, biotechnology, agricultural studies and ecology.
Many well-known metagenomics projects use the whole genome shotgun sequencing approach in combination with Sanger sequencing technologies. This approach has produced datasets from the Sargasso Sea, Human Gut Microbiome and Acid Mine Drainage Biofilm. However, new sequencing technologies have evolved over the past few years. The sequencing process has been greatly parallelized, producing millions of reads with much faster speed and lower cost. Since NGS technologies are much cheaper, they allow sequencing to be performed at a much greater depth. The only drawback is that read length is reduced: NGS reads are usually 25-150 bps long (Illumina/SOLiD), compared to 800-1000 bps for Sanger reads.
The primary goals of metagenomics are to describe the populations of microorganisms and to identify their roles in the environment. Ideally, we want to identify complete genomic sequences of all organisms present in a sample. However, metagenomic data is very complex, containing a large number of sequence reads from many species. The number of species and their abundance levels are unknown. The assembly of a single genome is already a difficult problem, complicated by repeats and sequencing errors which may lead to high fragmentation of contigs and misassembly. In metagenomic data, in addition to repeats within individual genomes, genomes of closely related species may also share homologous sequences, which could lead to even more complex repeat patterns that are very difficult to resolve. A lot of research has been done on assembling single genomes [7–10]. But due to the lack of research on metagenomic assemblers, assemblers designed for individual genomes are routinely used in metagenomic projects [4, 6]. It has been shown that these assemblers may lead not only to misassembly, but also to severe fragmentation of contigs. A plausible approach to improve the performance of such assemblers is to separate reads from different organisms present in a dataset before the assembly.
Many computational tools have been developed for separating reads from different species or groups of related species (we will refer to the problem as the clustering of reads). Some of the tools also estimate the abundance levels and genome sizes of species. These tools are usually classified as similarity-based (or phylogeny-based) and composition-based. The purpose of similarity-based methods is to analyze the taxonomic content of a sample. Small-scale approaches involving 16S rRNAs and 18S rRNAs are commonly used to determine evolutionary relationships by analyzing fragments that contain marker genes and comparing them with known marker genes. These methods take advantage of the small number of fragments containing marker genes and require reads to have at least 1000 bps. Two other tools handle a larger number of fragments: MEGAN and CARMA. MEGAN aligns reads to databases of known sequences using BLAST and assigns reads to taxa by the lowest common ancestor approach. CARMA performs phylogenetic classification of unassembled reads using all Pfam domains and protein families as phylogenetic markers. These two methods work for very short reads (as short as 35 bps for MEGAN and 80 bps for CARMA). However, a large fraction of sequences may remain unclassified by these methods because of the absence of closely related sequences in the databases.
The second class of methods uses compositional properties of the fragments (or reads). These methods are based on the fact that some composition properties, such as GC content and oligonucleotide frequencies, are preserved across sufficiently long fragments of the same genome, and vary significantly between fragments from different organisms. K-mer frequency is the most widely used characteristic for binning. For example, one barcode-based method utilizes the property that each genome has a stable distribution of k-mer frequencies for k = 1, …, 6 in fragments as short as 1000 bps. It shows that these fragments have very similar “barcodes” and thus can be clustered based on their barcode similarities. Barcode similarity also correlates with phylogenetic closeness between genomes. The main challenge in the k-mer frequency approach is that these frequencies produce large feature vectors, which can be even larger than the sizes of fragments. Different methods have been proposed to deal with this problem. CompostBin, which uses hexamer frequencies, adopts a modified principal component analysis to extract the top three meaningful components and then clusters the reads based on principal component values. Self-organizing maps (SOMs) are another way to reduce dimensionality by mapping multidimensional data to two-dimensional space, and they have been applied to tri- and tetranucleotide frequency vectors. In TETRA, z-scores are computed for tetranucleotide frequencies and fragments are classified by the Pearson correlation of their z-scores. MetaCluster 3.0 uses the Spearman Footrule distance between k-mer feature vectors. Another composition feature is used in TACOA: the ratio between observed oligonucleotide frequencies and expected frequencies given the GC content. To cluster fragments, the k-NN approach is combined with the Gaussian kernel function. The main limitation of composition-based methods is that the length of fragments may significantly influence their performance. In general, these methods are not suitable for fragments shorter than 1000 bps.
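As a rough illustration of the composition features discussed above, the following Python sketch computes a normalized k-mer frequency vector for one fragment. The function name `kmer_frequencies` and the decision to skip windows containing characters other than A, C, G, T are assumptions made for this example; none of the tools mentioned above is claimed to compute its features exactly this way.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the normalized frequency vector of all 4**k k-mers in seq.

    Illustrative only: windows with characters outside ACGT are skipped,
    and reverse complements are not merged.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Vector entries follow the lexicographic order of `kmers`.
    return [counts[w] / total if total else 0.0 for w in kmers]
```

For k = 4 the vector has 256 entries, while an 80 bps read contains only 77 tetramer windows, which illustrates why such vectors become sparse and unreliable on very short reads.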
AbundanceBin is a recently developed tool for binning reads that uses an approach different from the above similarity and composition based techniques. It is designed to separate reads from genomes that have different abundance levels. It computes frequencies of all l-mers in a metagenomic dataset and, assuming that these frequencies come from a mixture of Poisson distributions, predicts the abundance levels of genomes and clusters l-mers according to their frequencies. Then reads are clustered based on the frequencies of their l-mers. This method is suitable for very short NGS reads. The limitation is that genomes whose abundance levels do not differ very much (within ratio 1:2) will not be separated.
In this paper, we present a two-phase heuristic algorithm for separating short paired-end reads from different organisms in a metagenomic dataset, called TOSS (i.e., TOol for Separating Short reads). The basic algorithm is developed to separate genomes with similar abundance levels. It is based on several interesting observations about unique and repeated l-mers in a metagenomic dataset, which enable us to separate unique l-mers (each of which belongs to only one genome and is not repeated) from repeats (l-mers which are repeated in one or more genomes) at the beginning of the first phase of the algorithm. During the first phase, unique l-mers are clustered so that each cluster consists of l-mers from only one of the genomes. This is possible due to the observation that most l-mers are unique within a genome and, moreover, within a metagenomic dataset. During the second phase, we find connections between clusters through repeated regions and then merge clusters of l-mers that are likely to belong to the same organism. Finally, reads are assigned to clusters. In order to handle metagenomic datasets with genomes of arbitrary abundance ratios, we combine the method with AbundanceBin, which attempts to separate l-mers from genomes with significantly different abundance levels. The integrated method works for very short reads, and is able to handle multiple genomes with arbitrary abundance levels and sequencing errors. We test the method on a large number of simulated metagenomic datasets for microbial species with various phylogenetic closeness according to the NCBI taxonomy [24, 25] and show that genomes can be separated if the number of common repeats is less than the number of genome-specific repeats. For example, genomes of different species of the same genus often have a large number of common repeats and thus are very hard to separate. In the tests, our method is able to separate fewer than half of the groups of such closely related genomes. However, with the decrease in the fraction of common repeats, the ability to accurately separate genomes significantly increases. Due to the lack of appropriate short read clustering tools for comparison, we modify a well-known genome assembly software, Velvet, to make it behave like a genome separation tool and compare our clustering results with those of the modified Velvet.
The paper is organized as follows. In the “Methods” section, we consider properties of l-mers in a metagenomic dataset and make several observations which form the intuition behind the algorithm, present the main algorithm, and extend the algorithm to handle arbitrary abundance ratios. The “Experimental results” section gives the comparison with the modified Velvet on short reads and comparison with the well-known composition-based tool CompostBin on longer reads.
The algorithm we are going to present is based on l-mers from metagenomic reads. In this section, we will discuss some properties of l-mers that are important for our algorithm, and also make some important observations that lead to the intuition behind the algorithm.
First, let us analyze the expected number of occurrences of l-mers in reads sequenced from a single genome of length G. Let the number of paired-end reads be N (which corresponds to 2N read sequences) and the read length be L. In shotgun sequencing projects, as well as NGS, the reads are randomly distributed across the genome. Since reads may begin at any position of the genome with equal probability, Lander and Waterman suggested that the left ends of reads follow a Poisson distribution, which means that the expected number of reads beginning at a given position of the genome is α = 2N/(G − L + 1) and the number of reads starting at each position has a Poisson distribution with parameter α. Consider a substring w_i of length l that begins at the i-th position of the genome. Let x(w_i) be the number of reads that cover this particular l-mer. Since there are L − l + 1 possible starting positions for such reads, x(w_i) has a Poisson distribution with parameter λ = α(L − l + 1) (this parameter represents the effective coverage [27, 28]). This analysis assumes that the l-mer w_i occurs uniquely in the genome, but in general, an l-mer may occur multiple times. Suppose that an l-mer w has n(w) copies in the genome located at positions i_1, …, i_{n(w)}. Then the total number of reads containing w is $x(w) = \sum_{j=1}^{n(w)} x(w_{i_j})$. If we assume that a read covers at most one copy of w, then $x(w_{i_j})$, $j = 1, \ldots, n(w)$, are independent and identically distributed. So by the additivity property of the Poisson distribution, the total number of occurrences of w in the reads, x(w), follows a Poisson distribution with parameter α(L − l + 1)n(w). This model has also been used to find repeat families for a single genome, where a repeat family is a collection of l-mers that have the same number of copies in the genome.
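The count model above can be made concrete with a small Python sketch. It computes the effective coverage λ = α(L − l + 1) and the Poisson probability of seeing an l-mer k times when it has a given number of copies. The function names and the `copies` parameter (standing in for n(w)) are illustrative choices, not part of TOSS.

```python
import math


def effective_coverage(num_pairs, genome_len, read_len, l):
    """lambda = alpha * (L - l + 1), with alpha = 2N / (G - L + 1)."""
    alpha = 2 * num_pairs / (genome_len - read_len + 1)
    return alpha * (read_len - l + 1)


def lmer_count_pmf(k, lam, copies=1):
    """P(x(w) = k) when x(w) is Poisson with parameter lam * copies.

    Evaluated in log space so that large k or lam do not overflow.
    """
    mean = lam * copies
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
```

For instance, with N = 100 read pairs, G = 1099, L = 100 and l = 21, α = 200/1000 = 0.2 and λ = 0.2 × 80 = 16, so a unique l-mer is expected to appear in about 16 reads and a two-copy repeat in about 32.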
Finding unique l-mers
Before performing the first phase of the algorithm, which clusters the unique l-mers, l-mers have to be separated into unique l-mers and repeats. This is done by choosing a threshold value K for the counts of l-mers so that l-mers with counts less than K are most likely unique and the remaining are most likely repeats. Below, we discuss how to choose K.
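The thresholding step can be sketched in Python as follows. This is a minimal illustration that takes K as a given input and does not show how K is chosen; the helper names are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def count_lmers(reads, l):
    """Count every l-mer occurrence over all read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def split_by_threshold(counts, K):
    """l-mers with count < K are treated as unique, the rest as repeats."""
    unique = {w for w, c in counts.items() if c < K}
    repeats = {w for w, c in counts.items() if c >= K}
    return unique, repeats
```

With the toy reads "ACGTA" and "CGTAC", l = 4 and K = 2, the l-mer CGTA occurs twice and is labeled a repeat, while ACGT and GTAC are labeled unique.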
The set U of unique l-mers is then used to construct a graph which can help detect more repeats and will be used to do the clustering. The nodes of the graph G correspond to the elements of U and there is an edge between two nodes if both l-mers are contained in the same read. To remove previously undetected repeats, we use the fact that nodes that correspond to truly unique l-mers cannot have more than 2(L − l) neighbors.
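A minimal Python sketch of this graph construction and of the degree filter is given below. It uses plain dictionaries of neighbor sets, which is an assumption made for readability; a practical implementation would need a far more compact representation, and the function names are hypothetical.

```python
from itertools import combinations


def build_lmer_graph(reads, unique, l):
    """Connect two unique l-mers whenever they occur in the same read."""
    adj = {}
    for read in reads:
        present = {read[i:i + l] for i in range(len(read) - l + 1)} & unique
        for w in present:
            adj.setdefault(w, set())
        for u, v in combinations(present, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def drop_high_degree(adj, read_len, l):
    """Remove nodes with more than 2(L - l) neighbors as likely repeats."""
    limit = 2 * (read_len - l)
    bad = {u for u, nbrs in adj.items() if len(nbrs) > limit}
    return {u: nbrs - bad for u, nbrs in adj.items() if u not in bad}
```

The bound 2(L − l) reflects that a truly unique l-mer can share a read only with l-mers starting at most L − l positions to its left or right.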
Clustering the unique l-mers
Threshold T (the minimum required number of edges between an unclustered node and the nodes in a cluster so that the node can be added to this cluster) is chosen to make the expected number of coverage gaps less than one. Recall that the effective coverage is Cov = 2N(L − (l + T) + 1)/(G − L + 1) and the expected number of gaps is $2N e^{-\mathrm{Cov}}$.
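The gap condition can be evaluated numerically. The Python sketch below computes the expected number of gaps for a given T and searches for the largest T that keeps it below one; picking the largest admissible T is an assumption of this example, since the text only requires the expected number of gaps to be less than one.

```python
import math


def expected_gaps(num_pairs, genome_len, read_len, l, T):
    """Expected number of gaps 2N * exp(-Cov) for threshold T."""
    cov = 2 * num_pairs * (read_len - (l + T) + 1) / (genome_len - read_len + 1)
    return 2 * num_pairs * math.exp(-cov)


def largest_threshold(num_pairs, genome_len, read_len, l):
    """Largest T >= 1 with fewer than one expected gap, or None."""
    best = None
    for T in range(1, read_len - l + 1):
        # Cov decreases as T grows, so the gap count is increasing in T.
        if expected_gaps(num_pairs, genome_len, read_len, l, T) < 1.0:
            best = T
        else:
            break
    return best
```

For example, with N = 250000 read pairs of length 80 from a 1 Mbps genome and l = 20, the condition holds up to T = 34, whereas with N = 100000 no T satisfies it because the coverage is too low.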
Merging clusters and the final clustering of metagenomic reads
Now we discuss how to define big and valid m-clusters. The minimum size of a big m-cluster is specified by the user based on the minimum expected length of a genome. Valid m-clusters are chosen from big m-clusters in the following way. Let W_jj and W_ii be the total weights of the connections within each of the m-clusters j and i, and W_ij the total weight of the connections between these two m-clusters. The big m-cluster i is defined to be valid if for every other big m-cluster j, the inequality holds. The threshold of $10^{-3}$ is chosen empirically.
In the final step of the algorithm, the reads are assigned to the resultant clusters of unique l-mers. An iterative algorithm is used to assign the reads. In the first step, each read that corresponds to some cluster is assigned to this cluster. During the second step, unassigned reads that have assigned mates are assigned to the same clusters as their mates. In the third step, for each cluster of unique l-mers we add all the l-mers from the reads assigned to the cluster. We iteratively repeat the three steps for the unassigned reads until no more reads can be assigned. If a read corresponds to several clusters, we assign it to one of the clusters.
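The three-step iteration described above can be sketched as follows. This Python example is an illustration under simplifying assumptions: a read "corresponds" to a cluster if it shares at least one l-mer with it, ties between several clusters are broken by the largest number of shared l-mers (the text only says that one of the clusters is chosen), and the names `assign_reads` and `lmers` are hypothetical.

```python
def lmers(read, l):
    """Set of all l-mers contained in a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}


def assign_reads(pairs, clusters, l):
    """Assign paired-end reads to clusters of l-mers.

    pairs: list of (read, mate) strings; clusters: list of l-mer sets.
    Returns one cluster index (or None) per read, in flattened pair order.
    """
    clusters = [set(c) for c in clusters]       # work on copies
    reads = [r for pair in pairs for r in pair]  # the mate of read i is i ^ 1
    assigned = [None] * len(reads)
    changed = True
    while changed:
        changed = False
        # Step 1: assign reads sharing l-mers with some cluster.
        for i, read in enumerate(reads):
            if assigned[i] is None:
                hits = [len(lmers(read, l) & c) for c in clusters]
                if hits and max(hits) > 0:
                    assigned[i] = hits.index(max(hits))
                    changed = True
        # Step 2: unassigned reads follow their assigned mates.
        for i in range(len(reads)):
            if assigned[i] is None and assigned[i ^ 1] is not None:
                assigned[i] = assigned[i ^ 1]
                changed = True
        # Step 3: extend each cluster with the l-mers of its reads.
        for i, read in enumerate(reads):
            if assigned[i] is not None:
                clusters[assigned[i]] |= lmers(read, l)
    return assigned
```

The loop stops once a full pass assigns no new read, so it terminates after at most as many passes as there are reads.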
Handling genomes with arbitrary abundance levels
We would like to extend the above algorithm to metagenomic data containing genomes with different abundance levels. If the abundance level difference is not significant, the above algorithm would still work well. In this case, the number of wrongly determined unique l-mers and repeats in the first phase of the algorithm may slightly increase, but the clustering of l-mers based on their counts using the Poisson mixture model may incur a significantly higher drop of performance. For genomes with significantly different abundance levels, it makes sense to first separate reads according to genome abundance levels. Otherwise, repeats from genomes with lower abundance levels will not be detected, which could lead to a significant increase of granularity in the clustering result produced by the first phase of the above algorithm. For this reason, we propose to use the algorithm AbundanceBin for the initial abundance-based binning of reads. Then we run the first phase of our method for each of the subsets of reads. For the second phase, we use all the reads to find the connections between clusters so that connections between clusters from genomes with low abundance levels are properly recovered, but MCL is performed on each subset separately.
A key question is which ratios of abundance levels should be considered significant. This ratio depends on the actual values of abundance levels and also on the sizes of the genomes. Given abundance levels λ_1 and λ_2 (λ_1 < λ_2), genome sizes G_1 and G_2, and a threshold K for classifying l-mers into the two genomes based on count frequencies, we can estimate the expected rate of misclassified l-mers from the count distributions of the l-mers in these two genomes as discussed in the “Methods” section. More specifically, the shaded area in Figure 6 represents the expected fraction of misclassification for two distributions. The number of l-mers in this area is . So, we first use AbundanceBin to predict the parameters of count distributions (i.e., the abundance ratios and genome sizes) and then compute the expected rate of misclassification. If this rate is unacceptable (we used 3% as the threshold in the experiments), it means that the abundance levels are not significantly different and thus we do not run AbundanceBin.
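The expression for the number of l-mers in the shaded area is not reproduced here, so the following Python sketch rests on one plausible reading and should be treated as an assumption: l-mers of the less abundant genome are misclassified when their count is at least K, and l-mers of the more abundant genome are misclassified when their count is below K, with counts following Poisson distributions with parameters λ_1 and λ_2. The function names are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0.0 for k < 0."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
               for i in range(k + 1))


def misclassification_rate(lam1, lam2, g1, g2, K):
    """Expected fraction of misclassified l-mers (assumed model).

    Genome 1 (abundance lam1 < lam2) is on the low side of K.
    """
    wrong_from_1 = g1 * (1.0 - poisson_cdf(K - 1, lam1))  # counts >= K
    wrong_from_2 = g2 * poisson_cdf(K - 1, lam2)          # counts < K
    return (wrong_from_1 + wrong_from_2) / (g1 + g2)
```

Under this reading, abundance levels 5 and 50 with K = 20 give a rate far below the 3% threshold, while levels 10 and 12 with K = 11 give a rate well above it, in which case abundance-based binning would be skipped.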
We test the performance of our algorithm on a variety of synthetic datasets with different numbers of species, phylogenetic distances between species, abundance ratios and sequencing error rates. Although simulated datasets do not capture all characteristics of real metagenomic data, there are no real benchmark datasets for NGS metagenomic projects and thus they are the only available option. Also, to the best of our knowledge, there are no algorithms that are designed specifically for separating short NGS reads from different genomes. Although similarity-based methods work on short reads, they explore the taxonomic content of metagenomic data according to known genomes rather than classifying reads. AbundanceBin classifies reads, but it does not separate genomes with similar abundance levels. Therefore, we modify a well-known genome assembly software, Velvet, to make it behave like a genome separation tool and compare our clustering results with those of the modified Velvet. In addition, we compare the performance of our algorithm with the well-known composition-based method CompostBin on simulated metagenomic Sanger reads. We also apply the algorithm to a real metagenomic dataset obtained from gut bacteriocytes of the glassy-winged sharpshooter and achieve results consistent with the original study.
Simulated data sets
We use MetaSim to simulate paired-end Illumina reads for various bacterial genomes to form metagenomic datasets. MetaSim is a software tool for generating metagenomic datasets with controllable parameters, such as the abundance level of each genome, read length, sequencing error rate and distribution of errors. Thus, it can be used to simulate different sequencing technologies and generate reads from available completely sequenced genomes (for example, those in the NCBI database). In our experiments, paired-end reads of length 80 bps are considered, with a mean insert size of 500 bps and a deviation of 20 bps. The number of reads for each experiment is adjusted to produce sufficient coverage depth (ranging between 15 and 30). The sequencing error model is set according to the error profile of 80 bps Illumina reads. A detailed description of MetaSim parameters is provided in Additional file 1.
The first experiment is designed to test the performance of our method on a large number of datasets of varying phylogenetic distances. For this experiment, we create 182 synthetic datasets of 4 categories. Each dataset of the first category contains genomes from the same genus but different species. Datasets in the second category consist of genomes from the same family but different genera, datasets in the third category involve genomes from the same order but different families, and datasets in the fourth category involve genomes from the same class but different orders. Genomes in each test are randomly chosen according to a category of phylogenetic distances and assumed to have the same abundance levels. The number of genomes in the datasets varies from 2 to 10 and depends on the number of available complete sequences for each taxonomic group and on the level of the group. Tests on genomes from the same genus typically involve 2 to 4 genomes since such genomes are similar to each other and hard to separate, while tests on genomes from the same class may involve up to 10 genomes. In total, we have 79 experiments concerning a genus, 66 concerning a family, 29 concerning an order, and 8 concerning a class. These datasets involve 515 complete genomes from the NCBI.
We also performed some small-scale experiments to test the performance on genomes with different abundance levels and on reads with sequencing errors. For each of the experiments, we choose 10 random sets of genomes from the 182 datasets. For each set of genomes, two metagenomic datasets are simulated, one with abundance ratio 1:2 and the second with the error model but abundance ratio 1:1. Finally, we test the performance of the combination of our algorithm and AbundanceBin on a dataset of 4 genomes with abundances 1:1:4:4. The exact species combinations used in all simulated datasets are listed in Additional file 2.
Comparison with modified Velvet
Due to the lack of methods for separating short NGS reads into genomes, we modify a well-known genome assembler, Velvet, so it behaves like a genome separator. Genome assemblers such as Velvet often work with metagenomic data and produce contigs that may actually correspond to sections of individual genomes. Hence, we run Velvet to obtain a set of contigs and use each contig to define a cluster of l-mers. This is equivalent to the first phase of our algorithm. The only difference is that all l-mers (instead of unique l-mers) are clustered. For each read contained in a cluster, we add the l-mers in the mate of the read to the cluster, and then construct a weighted graph whose nodes represent clusters and edges are weighted by the number of common l-mers shared by the clusters connected by each edge. Finally, we apply our merging algorithm to the constructed graph. Based on a series of experiments with the Velvet parameters, we chose l-mer length as 31, which results in the highest N50 in most of the experiments. We also set the coverage cutoff to half the coverage (i.e., abundance level) of the least abundant genome in the dataset, so that Velvet can deal with genomes with different abundance levels without filtering out low coverage contigs.
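The weighted cluster graph described above can be sketched in a few lines of Python. The example assumes each cluster is already available as a set of l-mers (for instance, the l-mers of one contig plus those of the mates of its reads); the function name `cluster_graph` is hypothetical, and the all-pairs comparison is quadratic in the number of clusters, so it is meant only to show the edge-weight definition.

```python
from itertools import combinations


def cluster_graph(clusters):
    """Return {(i, j): weight} where weight is the number of shared l-mers.

    clusters: list of l-mer sets; only pairs sharing at least one l-mer
    receive an edge.
    """
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(clusters), 2):
        weight = len(a & b)
        if weight > 0:
            edges[(i, j)] = weight
    return edges
```

A merging procedure would then operate on the returned edge weights.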
To evaluate the results of clustering, there are a number of factors that should be considered. First of all, we would like most of the reads from each genome to be located in one cluster. In other words, each genome should correspond to a unique cluster that contains most of its reads. We say that a genome has been broken if there is no cluster that contains more than half of all its reads. It may happen that several genomes correspond to the same cluster. In this case, we assign the cluster to all the genomes, and say that the genomes are not separated. We will measure the performance of our algorithm in terms of pairwise separability. For example, if a dataset contains 5 genomes, where 3 of them are located in one cluster, and each of the other two is located in its own cluster, then in the pairwise evaluation, we consider the separability of all 10 pairs of genomes. Since 3 pairs of genomes are not separated while the other 7 are separated, the separability rate is 70%. During the separability analysis, we remove broken genomes from consideration. Besides separability, we are interested in the precision and sensitivity of our algorithm on the separated genomes. Since we assign a genome to the cluster that has most of its reads, it is also interesting to know how many of its reads are wrongly assigned to other clusters. This is what sensitivity measures. One way to estimate sensitivity is to compute how many reads are correctly assigned to each cluster and divide it by the total number of reads that should be in this cluster. Here, true positives are the reads from all genomes located in this cluster. However, consider the case when we have two genomes in a cluster, of lengths 1 Mbps and 5 Mbps respectively. Then, even if sensitivity is very low for the first genome, the overall sensitivity (for all genomes in the cluster) will not be significantly affected. Another way to normalize sensitivity is to compute sensitivity for each genome in the cluster separately and then find the average of these sensitivities. We use the second approach. To compute the precision of a cluster, we find the ratio of the reads that are wrongly assigned to the cluster to the total number of reads in the cluster and subtract this ratio from one.
To summarize the results for a set of experiments, we compute separability based on the total number of pairs of genomes in all the experiments. For the precision and sensitivity, we take the average values for all the clusters from all the experiments.
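The evaluation measures can be illustrated with a short Python sketch. It assumes that every read carries a true genome label and a cluster label, and the function names are hypothetical. Precision is computed here as the fraction of reads in a cluster that come from genomes assigned to that cluster.

```python
from collections import Counter
from itertools import combinations


def dominant_clusters(assignments):
    """Map each unbroken genome to the cluster holding > half of its reads.

    assignments: list of (genome, cluster) pairs, one per read.
    """
    per_genome = {}
    for genome, cluster in assignments:
        per_genome.setdefault(genome, Counter())[cluster] += 1
    home = {}
    for genome, counts in per_genome.items():
        cluster, n = counts.most_common(1)[0]
        if n > sum(counts.values()) / 2:   # otherwise the genome is broken
            home[genome] = cluster
    return home


def separability(home):
    """Fraction of genome pairs whose dominant clusters differ."""
    pairs = list(combinations(home, 2))
    if not pairs:
        return None
    return sum(home[a] != home[b] for a, b in pairs) / len(pairs)


def genome_sensitivity(assignments, home, genome):
    """Fraction of a genome's reads that lie in its dominant cluster."""
    clusters = [c for g, c in assignments if g == genome]
    return sum(c == home[genome] for c in clusters) / len(clusters)


def cluster_precision(assignments, home, cluster):
    """Fraction of reads in a cluster that belong to its genomes."""
    owners = {g for g, c in home.items() if c == cluster}
    members = [g for g, c in assignments if c == cluster]
    if not members:
        return None
    return sum(g in owners for g in members) / len(members)
```

Applied to the five-genome example in the text, `separability({"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 2})` counts 7 separated pairs out of 10.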
Experiments on genomes separated by different phylogenetic distances
Table: Performance of our method and the Velvet-based approach on pairs of genomes with different phylogenetic distances (columns include # of genomes and # of pairs).
Handling sequencing errors
Our approach for handling sequencing errors is very simple: we filter out l-mers with counts lower than a certain threshold, since these infrequent l-mers are likely to contain errors. However, there is a simple intuition behind it. We can aggressively remove potential errors without attempting to correct them or being afraid to lose important information, assuming that the genomes are sufficiently covered by the reads. Note that we could be more aggressive than genome assemblers in throwing out infrequent l-mers here because (i) when the genomes are sufficiently covered, the filtration will not lead to many more gaps, and (ii) we are less concerned with the fragmentation of genomes.
Table: Performance of the method on data with and without sequencing errors (columns include # of genomes and # of pairs).
The issue of abundance levels
Table: Performance on synthetic datasets with abundance ratio 1:2 (columns include # of genomes and # of pairs).
Table: Performance on synthetic datasets with abundance ratio 1:1:4:4 (columns include # of genomes and # of pairs).
Comparison with CompostBin
Table: Comparison with CompostBin on previously described datasets. Surviving row labels: Bacillus halodurans & Bacillus subtilis; Escherichia coli & Yersinia pestis; Methanocaldococcus jannaschii & …; Pyrobaculum aerophilum & …; Gluconobacter oxydans & …; Granulibacter bethesdensis & …; Escherichia coli, Pseudomonas putida & …; Escherichia coli, Pseudomonas putida, …; Bacillus anthracis & Bacillus subtilis. Surviving taxonomic-level label: Family and Order.
Performance on a real dataset
A metagenomic dataset obtained from gut bacteriocytes of the glassy-winged sharpshooter, Homalodisca coagulata, is known to consist of (Sanger) reads from Baumannia cicadellinicola, Sulcia muelleri and some miscellaneous unclassified reads, and it has been studied previously. We apply our algorithm, adapted to handle Sanger reads as discussed in the previous section, to the dataset. As in the earlier study, we only measure our ability to separate the reads from Baumannia cicadellinicola and Sulcia muelleri. The sensitivity of the classification achieved by our algorithm is 92.21% and the normalized error rate is 1.59%, which is lower than the normalized error rate of 9.04% reported for CompostBin.
Conclusions and Discussions
While NGS technologies are very promising for metagenomic projects due to their great sequencing depths and low costs, they also present new challenges in the analysis of metagenomic data because of their short read lengths. In this paper, we developed an algorithm for separating short paired-end NGS reads from different bacterial genomes of similar abundance levels and then combined the algorithm with AbundanceBin to handle arbitrary abundance ratios. We have shown that our algorithm is able to separate genomes when the number of common repeats is small compared to the number of genome-specific repeats. Since the fraction of common repeats in genomes correlates with the phylogenetic distance between the genomes, it is hard to separate genomes of closely related species. However, genomes that are separated by sufficient phylogenetic distance share few l-mers and can be separated with high precision and sensitivity.
Our algorithm, called TOSS, was coded in C. Its running time and memory requirement depend on the total length of all the genomes present in a metagenomic dataset and on the number of reads. The first phase of the algorithm is the most time and memory consuming. In this phase, a graph of l-mers is constructed and the clustering of unique l-mers is performed. The size of the graph is proportional to the total size of the genomes, and 0.5 GB of RAM is required for every million bases of the genomes. In the experiments, we ran the algorithm on a single 2.8 GHz AMD CPU of a machine with 64 GB of RAM. Each of the small-scale tests involving 2-4 genomes of total length 2-6 Mbps was completed within 1-3 hours and required 2-4 GB of RAM. A test on 15 genomes with a total length of 40 Mbps ran for 14 hours and required 20 GB of RAM.
Our algorithm can be further improved. In this paper, to separate the input reads, we construct a graph by using the information about l-mers from all the reads. After clustering the unique l-mers, some clusters are merged if they potentially belong to the same genome. To find connections between clusters, paired-end reads and common repeats are used. However, we believe that additional information can be used to improve the algorithm’s ability in predicting whether two clusters potentially belong to the same genome. For example, the compositional properties of the clusters of unique l-mers may be used to complement the repeat-based information.
In future work, we plan to explore the compositional properties of the clusters of unique l-mers and try to improve the performance of our algorithm by combining the compositional properties with the distribution of l-mers in reads.
This is a WABI’2011 special issue invited paper.
The research was supported in part by NSF grant IIS-0711129 and NIH grant AI078885.
- Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem & Biol 1998, 5(10):R245–R249.
- Rappé MS, Giovannoni SJ: The uncultured microbial majority. Annu Rev Microbiol 2003, 57:369–394.
- Béjà O, Suzuki MT, Koonin EV, Aravind L, Hadd A, Nguyen LP, Villacorta R, Amjadi M, Garrigues C, Jovanovich SB, Feldman RA, DeLong EF: Construction and analysis of bacterial artificial chromosome libraries from a marine microbial assemblage. Environ Microbiol 2000, 2(5):516–529.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science 2004, 304(5667):66–74.
- Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic Analysis of the Human Distal Gut Microbiome. Science 2006, 312(5778):1355–1359.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428(6978):37–43.
- Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res 2008, 18(2):324–330.
- Warren RL, Sutton GG, Jones SJM, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23(4):500–501.
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 2007, 17(11):1697–1706.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Res 2009, 19(6):1117–1123.
- Charuvaka A, Rangwala H: Evaluation of short read metagenomic assembly. BMC Genomics 2011, 12(Suppl 2):S8+.
- Chakravorty S, Helb D, Burday M, Connell N, Alland D: A detailed analysis of 16s ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods 2007, 69(2):330–339.
- Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17(3):377–386.
- Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J: Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res 2008, 36(7):2230–2239.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
- Zhou F, Olman V, Xu Y: Barcodes for genomes and applications. BMC Bioinformatics 2008, 9:546+.
- Chatterji S, Yamazaki I, Bai Z, Eisen J: CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads. In Research in Computational Molecular Biology, Volume 4955 of Lecture Notes in Computer Science. Edited by: Vingron M, Wong L. Berlin, Heidelberg: Springer Berlin/Heidelberg; 2008:17–28.
- Chan CK, Hsu A, Halgamuge S, Tang SL: Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics 2008, 9:215+.
- Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner F: TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 2004, 5:163+.
- Leung HCM, Yiu SM, Yang B, et al.: A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 2011, 27(11):1489–1495.
- Diaz N, Krause L, Goesmann A, Niehaus K, Nattkemper T: TACOA - Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics 2009, 10:56+.
- Bentley SD, Parkhill J: Comparative genomic structure of prokaryotes. Annu Rev Genet 2004, 38:771–791.
- Wu YW, Ye Y: A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples. In Research in Computational Molecular Biology, Volume 6044 of Lecture Notes in Computer Science. Edited by: Berger B. Berlin, Heidelberg: Springer Berlin/Heidelberg; 2010:535–549.
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, 35(Database issue):D173–D180.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2009, 37(Database issue):D26–D31.
- Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18(5):821–829.
- Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988, 2(3):231–239.
- Wendl M, Waterston R: Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. Genome Res 2002, 12:1943–1949.
- Li X, Waterman MS: Estimating the Repeat Structure and Length of DNA Sequences Using l-Tuples. Genome Res 2003, 13(8):1916–1922.
- van Dongen S: Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
- Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, Khouri H, Tallon LJ, Zaborsky JM, Dunbar HE, Tran PL, Moran NA, Eisen JA: Metabolic Complementarity and Genomics of the Dual Bacterial Symbiosis of Sharpshooters. PLoS Biol 2006, 4(6):e188+.
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim: a Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 2008, 3(10):e3373+.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.