WordCluster: detecting clusters of DNA words and genomic elements
© Hackenberg et al; licensee BioMed Central Ltd. 2011
Received: 30 August 2010
Accepted: 24 January 2011
Published: 24 January 2011
Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or on arbitrarily chosen distance thresholds.
We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method as a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.
WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features such as the detection of co-localization with gene regions and an annotation enrichment tool for functional analysis of overlapped genes.
Genome entities as diverse as genes, CpG dinucleotides, transcription factor binding sites (TFBSs) or ultra-conserved non-coding regions usually form clusters along the chromosome sequence. Such spatial clustering often translates into genome structures with a clear functional and/or evolutionary meaning: gene clusters encoding the same or similar products and originating through gene duplication events, CpG islands, cis-regulatory modules, etc. Thus, the spatial clustering of functional genome elements (in general, words or k-mers) is somewhat reminiscent of the situation in literary texts, where keywords show a strong clustering, whereas common words are randomly distributed.
Despite its potential importance, no algorithm exists to detect the clustering of DNA words in a rigorous way. Most current methods are based on densities and sliding-window approaches or on arbitrary distances. For example, the Galaxy work suite (http://main.g2.bx.psu.edu/) implements an algorithm in which the user fixes the maximum distance between two entities and the minimum number of entities in the cluster. Recently, we developed an algorithm that detects clusters of CpG dinucleotides in DNA sequences based on the distance between neighboring CpGs and then assigns a statistical significance to each cluster. Now, we generalize the method to any k-mer or any arbitrary combination of them, as well as to any other genome entity defined by its chromosome coordinates.
The WordCluster algorithm allows the detection of clusters of DNA words (k-mers) and genomic elements (genes, transposons, SINEs, TFBSs, etc.). The algorithm is based on the distances between the entities and an assigned p-value. It proceeds in four steps:
1) Detection of all k-mer copies in the chromosomes, storing their coordinates (this step is specific to the detection of k-mer clusters, as genomic elements are already defined by their coordinates). The copies are detected in a non-overlapping way, i.e. once a copy is found the search is resumed at the end of the word, thus preventing the detection of overlapping copies.
2) Calculation of the distances between consecutive copies. The distance is defined as: "start coordinate of the downstream copy" minus "end coordinate of the upstream copy". This implies that the minimum distance is 1, which occurs when the two entities are located directly next to each other.
3) Detection of the clusters, defined as those chromosomal regions where all distances are equal to or below a given maximum distance. A cluster is defined by its start and end coordinates and the number of k-mers or genomic elements it contains.
4) Calculation of the statistical significance of each cluster by means of the negative binomial distribution. A p-value threshold is then used to filter out those clusters which are not statistically significant.
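The first three steps can be illustrated with a minimal Python sketch. It is not the WordCluster implementation (the core program is written in Java); the function names and the `min_copies` parameter are hypothetical and introduced only for illustration. Coordinates are 1-based and inclusive, so that two directly adjacent copies have distance 1, as stated above.

```python
from typing import List, Tuple

Copy = Tuple[int, int]  # 1-based inclusive (start, end)

def find_copies(seq: str, words: List[str]) -> List[Copy]:
    """Step 1: locate non-overlapping copies of any target word."""
    seq = seq.upper()
    targets = {w.upper() for w in words}
    lengths = sorted({len(w) for w in targets})
    copies, i = [], 0
    while i < len(seq):
        for k in lengths:
            if seq[i:i + k] in targets:
                copies.append((i + 1, i + k))
                i += k  # resume the search at the end of the word
                break
        else:
            i += 1  # no target word starts here
    return copies

def distances(copies: List[Copy]) -> List[int]:
    """Step 2: downstream start minus upstream end, for consecutive copies."""
    return [b[0] - a[1] for a, b in zip(copies, copies[1:])]

def detect_clusters(copies: List[Copy], max_dist: int,
                    min_copies: int = 2) -> List[Tuple[int, int, int]]:
    """Step 3: chain copies whose distances are all <= max_dist.

    Returns (start, end, number_of_copies) per cluster. `min_copies`
    is an assumption of this sketch, not a documented parameter.
    """
    clusters: List[Tuple[int, int, int]] = []
    if not copies:
        return clusters
    start, end, n = copies[0][0], copies[0][1], 1
    for s, e in copies[1:]:
        if s - end <= max_dist:
            end, n = e, n + 1
        else:
            if n >= min_copies:
                clusters.append((start, end, n))
            start, end, n = s, e, 1
    if n >= min_copies:
        clusters.append((start, end, n))
    return clusters
```

For example, `detect_clusters(find_copies("CGCGTTTTTTTTCGACG", ["CG"]), max_dist=2)` groups the four CG copies into two clusters separated by the run of Ts.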
A main difference from the originally described algorithm is the way N-runs in the DNA sequence (ambiguous sequence sites that may be occupied by any nucleotide) are treated. While the original CpGcluster method allows up to 10 Ns between two consecutive CpGs, WordCluster detects the DNA words and the distances strictly within the contigs, i.e. not a single N is allowed to lie between two copies.
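One simple way to enforce this restriction is to split the sequence into N-free contigs before searching for words, so that no distance is ever measured across an N. The sketch below assumes this strategy; `split_contigs` is a hypothetical helper, not part of WordCluster.

```python
import re
from typing import Iterator, Tuple

def split_contigs(seq: str) -> Iterator[Tuple[int, str]]:
    """Yield (0-based offset, contig) for each maximal stretch without N."""
    for match in re.finditer(r"[^Nn]+", seq):
        yield match.start(), match.group()
```

Words and distances would then be computed separately for each contig, adding the offset to recover chromosome coordinates.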
From now on, we will have to use the word k-mer in different contexts. Therefore, to avoid confusion, we define as "target k-mer(s)" the k-mer(s) which are being analysed, i.e. those for which the clusters are going to be detected. In contrast, "no-target k-mer(s)" are all the remaining k-mer(s). We use k-mer in a generic way, referring to all DNA words of length k.
In the negative binomial distribution, n is the number of target k-mers within the cluster and n_f the number of "failures", i.e. the number of no-target k-mers. For example, if we are detecting clusters of AGCT, all k-mers other than AGCT would be considered as failures. Finally, p is the success probability, i.e. the probability of finding a target k-mer or genomic element within the DNA sequence. Note that the equation uses (n-1) instead of n, as the first appearance of a target k-mer within the cluster is trivial (i.e. all clusters start with a target k-mer). While the negative binomial distribution can be defined in the same way for k-mers and genomic elements, differences exist in the way the number of "failures" and the success probability are calculated.
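As an illustration of how such a p-value could be computed, the sketch below takes the cumulative negative binomial probability of observing at most n_f failures before the (n-1)-th success. This cumulative form is an assumption of the sketch, since the exact equation is not reproduced in this text, and the function name is hypothetical. The terms are accumulated with a recurrence to avoid very large binomial coefficients.

```python
def cluster_p_value(n: int, n_f: int, p: float) -> float:
    """P(at most n_f failures before the (n-1)-th success), success prob. p.

    Assumed form: sum over j = 0..n_f of C(j + r - 1, j) * p**r * (1-p)**j
    with r = n - 1 (the first target copy of a cluster is not counted).
    """
    r = n - 1
    if r < 1:
        return 1.0  # a single copy carries no clustering evidence
    term = p ** r  # j = 0 term
    total = term
    for j in range(n_f):
        # ratio of consecutive negative binomial terms
        term *= (j + r) / (j + 1) * (1.0 - p)
        total += term
    return min(total, 1.0)
```

A cluster with many target copies and few failures, at a low success probability, yields a small value and would pass a threshold such as 1E-5.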
For k-mers, the success probability is p = N / (L_s - k + 1 - N(k - 1)), where N is the number of non-overlapping occurrences of the target k-mers in the sequence, k the length of the k-mer and L_s the sequence length. The formula is simply the number of target k-mers in the sequence divided by the total number of k-mers in the sequence. Because overlapping instances are not considered, N(k-1) is subtracted from the total number of k-mers (L_s - k + 1), since those sequence positions cannot start a further copy.
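The verbal description above translates directly into a small function. This is a sketch following that description; the function name is hypothetical.

```python
def kmer_success_probability(N: int, k: int, L_s: int) -> float:
    """Target copies divided by the number of usable k-mer positions.

    The total number of k-mers is L_s - k + 1; N * (k - 1) positions are
    removed because overlapping instances are not considered.
    """
    usable = (L_s - k + 1) - N * (k - 1)
    if usable <= 0:
        raise ValueError("sequence too short for the given N and k")
    return N / usable
```

For instance, 10 non-overlapping copies of a dinucleotide in a 101 bp sequence give 10 / (100 - 10) = 1/9.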
For genomic elements, L_s is the length of the sequence, L_mean the mean length of the genomic elements and N the number of genomic elements.
Percentile distance: The distance corresponding to a given percentile of the observed distance distribution is calculated and used as the maximum distance threshold.
Chromosomal intersection: The distance corresponding to the intersection between the observed and the expected distributions is used as the maximum distance (see Figure 1).
Genome intersection: The distance distributions of all chromosomes are merged, and the distance corresponding to the "genome intersection point" is calculated. If this distance model is chosen, the success probability (i.e. the probability of finding the target k-mers) is not calculated for each chromosome separately, as in the two models above, but genome-wide.
Fixed distance: the user can set the distance threshold.
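The percentile model, for instance, can be sketched as follows. The nearest-rank definition of the percentile is an assumption of this sketch; the text does not state which percentile convention WordCluster uses, and the function name is hypothetical.

```python
from math import ceil
from typing import Sequence

def percentile_distance(dists: Sequence[int], percentile: float = 50.0) -> int:
    """Maximum-distance threshold from the observed distance distribution.

    Uses the nearest-rank percentile; percentile=50 approximates the median.
    """
    if not dists:
        raise ValueError("no distances observed")
    ordered = sorted(dists)
    rank = max(1, ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

The returned value plays the role of the maximum distance in the cluster detection step; a fixed distance simply bypasses this calculation.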
We implemented the described algorithm as a web server. The tool uses PHP for the interaction with the user and for access to the core program (written in Java) and the MySQL database. Two types of input data can be supplied: 1) a group of k-mers and a genomic sequence to be scanned by the program (users can upload their own sequence or choose one of the 24 genome assemblies stored in our database - see below); and 2) a file in BED format [8, 9] with the coordinates of the genomic elements whose clustering properties should be analyzed. No mandatory input parameters exist, but the user can select between different distance models (the default is the chromosomal intersection) and set the cut-off for the statistical significance (the default is p-value ≤ 1E-5).
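For the second input type, the element coordinates have to be read from the BED file before distances can be computed. The sketch below reads the first three BED columns and converts the 0-based, half-open BED intervals to 1-based inclusive coordinates, so that the distance definition used above (adjacent elements have distance 1) still holds. Whether the web server performs exactly this conversion is an assumption, and `parse_bed` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def parse_bed(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Return {chromosome: sorted [(start, end), ...]} in 1-based inclusive form."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        # skip blank lines, comments and BED header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start) + 1, int(end)))
    return {chrom: sorted(elems) for chrom, elems in by_chrom.items()}
```

Each chromosome's sorted list can then be passed to the distance and cluster detection steps in place of the k-mer copies.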
The output generated by the web server depends on whether the user chooses a genome assembly from our database or supplies an anonymous sequence. The minimum output consists of the basic statistics of the clusters (base composition, entity composition and statistical significance) and the statistics by chromosome. Furthermore, for all species in the database, the co-localization of detected clusters with different gene regions (promoters, introns, etc.) is reported.
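A co-localization check of this kind reduces to interval overlap. The sketch below uses the promoter definition given in the abbreviations of this article (TSS-1500 bp to TSS+500 bp); the mirrored treatment of the minus strand and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end)

def promoter_region(tss: int, strand: str,
                    upstream: int = 1500, downstream: int = 500) -> Interval:
    """Promoter interval around a TSS, mirrored for the minus strand (assumed)."""
    if strand == "+":
        return (max(1, tss - upstream), tss + downstream)
    return (max(1, tss - downstream), tss + upstream)

def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def clusters_in_promoters(clusters: List[Interval],
                          genes: List[Tuple[str, int, str]]) -> List[Tuple[Interval, str]]:
    """Pair each cluster with every gene (name, tss, strand) whose promoter it overlaps."""
    hits = []
    for cluster in clusters:
        for name, tss, strand in genes:
            if overlaps(cluster, promoter_region(tss, strand)):
                hits.append((cluster, name))
    return hits
```

The quadratic loop is adequate for a sketch; a genome-wide analysis would normally sort both lists or use an interval index.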
Finally, for some species (human, mouse, rat, cow, C. elegans, zebrafish and chicken) an enrichment/depletion analysis for the genes overlapped by the clusters is carried out using the Gene Ontology and the Annotation-Modules database [11, 12].
Currently, 24 genome assemblies are stored in our database. The following sequences were downloaded from the UCSC genome browser or the corresponding project homepages (plant genomes): Human (hg18, hg19), Mouse (mm8, mm9), Rat (rn4), Fruit fly (dm3), Anopheles gambiae (anogam1), Honey bee (apimel2), Cow (bosTau4), Dog (canFam2), C. briggsae (cb3), C. elegans (ce6), Sea squirt (ci2), Zebrafish (danrer5), Chicken (galgal3), Stickleback (gasacu1), Medaka (orylat2), Chimp (pantro2), Rhesus macaque (rhemac2), S. cerevisiae (saccer1), Tetraodon (tetnig1), Arabidopsis thaliana (tair8, tair9), and Zea mays (zm1). To determine the co-localization with genes, we used RefSeq genes whenever they were available and Ensembl genes otherwise.
To demonstrate the ability of our algorithm to find biologically significant and relevant clusters in the genome, and at the same time to illustrate the different distance models, we carried out three analyses: 1) detection of clusters of CpGs (CpG islands) using different distance models, 2) detection of clusters of the word CWG (where W = A, T) and 3) detection of clusters of olfactory receptor genes on human chromosome 11.
WordCluster predictions of CpG clusters*
Length ± SD      GC ± SD       OE ± SD
273.2 ± 246.4    63.8 ± 7.5    0.855 ± 0.265
218.7 ± 200.1    65.6 ± 7.7    0.916 ± 0.273
202.6 ± 183.8    66.3 ± 7.5    0.930 ± 0.274
Biological meaning of WordCluster predictions*
Independently of this open question, we can summarize: 1) the chromosome intersection seems to be a good replacement for the median and furthermore removes one input parameter from the method, as the intersection is a fixed statistical property of the chromosome; 2) the genome intersection may be used when the expected clusters are known to be not dependent on the chromosome. The CpG islands are probably not dependent on the chromosome, as the biological mechanisms forming and maintaining them are probably the same for all chromosomes. This may suggest the use of the genome intersection, which is confirmed by producing slightly better results than the other two tested distance models.
Clusters of CWG trinucleotides* [table rows: Genome coverage (bp); Average length (bp); No. of clusters co-locating with gene regions (TSS ± 100 bp)]
Clusters of OR genes in human chromosome 11*
WordCluster generalizes the previous CpGcluster algorithm to any word or genomic element in the genome, while associating a statistical significance with each cluster found. It outperforms current methods relying on densities and sliding-window approaches or on arbitrarily chosen distance thresholds. The implementation as a web server connected to a MySQL backend allows for co-localization studies with different gene regions, as well as for genome-wide enrichment/depletion analysis of functional terms (GO).
The WordCluster web server (http://bioinfo2.ugr.es/wordCluster/wordCluster.php) is freely available. No registration is required, but every access is logged. For large jobs, a long-lived web link to the results is provided.
TSS: Transcription Start Site
TFBS: Transcription Factor Binding Site
promoter region [TSS-1500 bp, TSS+500 bp].
The Spanish Government grants BIO2008-01353 to JLO, mobility PR2009-0285 to PC, Spanish Junta de Andalucía grants P07-FQM3163 to PC and P06-FQM1858 to PB are acknowledged. The Spanish 'Juan de la Cierva' grant to MH and Basque Country 'Programa de formación de investigadores del Departamento de Educación, Universidades e Investigación' grant to GB are also acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.