Periodic pattern detection in sparse boolean sequences
 Ivan Junier^{1, 2},
 Joan Hérisson^{2} and
 François Képès^{2}Email author
DOI: 10.1186/17487188531
© Junier et al; licensee BioMed Central Ltd. 2010
Received: 26 March 2010
Accepted: 10 September 2010
Published: 10 September 2010
Abstract
Background
The specific position of functionally related genes along the DNA has been shown to reflect the interplay between chromosome structure and genetic regulation. By investigating the statistical properties of the distances separating such genes, several studies have highlighted various periodic trends. In many cases, however, groups built up from cofunctional or coregulated genes are small and contain wrong information (data contamination) so that the statistics is poorly exploitable. In addition, gene positions are not expected to satisfy a perfectly ordered pattern along the DNA. Within this scope, we present an algorithm that aims to highlight periodic patterns in sparse boolean sequences, i.e. sequences of the type 010011011010... where the ratio of the number of 1's (denoting here the transcription start of a gene) to 0's is small.
Results
The algorithm is particularly robust with respect to strong signal distortions such as the addition of 1's at arbitrary positions (contaminated data), the deletion of existing 1's in the sequence (missing data) and the presence of disorder in the position of the 1's (noise). This robustness property stems from an appropriate exploitation of the remarkable alignment properties of periodic points in solenoidal coordinates.
Conclusions
The efficiency of the algorithm is demonstrated in situations where standard Fourierbased spectral methods are poorly adapted. We also show how the proposed framework allows to identify the 1's that participate in the periodic trends, i.e. how the framework allows to allocate a positional score to genes, in the same spirit of the sequence score. The software is available for public use at http://www.issb.genopole.fr/MEGA/Softwares/iSSB_SolenoidalApplication.zip.
Background
There is increasing evidence that the organization of the genome plays a crucial role in the interplay between genetic regulation and chromosome structure. At the smallest scale, several experimental studies have highlighted the importance of the positions of the transcription factor binding sites in the functioning of small transcriptional regulatory networks [1–3]. At a larger  but still local  scale, in bacteria many transcription units are known to be located along the DNA close to the gene that encodes their regulating transcription factors [4–6]. At the global scale of the chromosome, both in Escherichia coli and in Saccharomyces cerevisiae, it has been previously realized that the genes that are regulated by the same transcription factor have a tendency to be periodically spaced along the DNA [7, 8]. Recently, the relative positions of phylogenetically conserved gene pairs were also shown to tend to periodically organize along the DNA in E. coli[9]. Such periodic organization has been proposed to be responsible for the spatial colocalization of coregulated genes [10]; indeed, a periodic ordering along the DNA of distal binding sites that can be crosslinked by a bivalent transcription factor (or a larger complex), just as in the case of the lac operon or of the λ bacteriophage repressor, leads to a quick and homogeneous formation of transcription factories [11].
More generally, in any kind of signals, the presence of periodic regularities reveals an underlying notion of order. As such, this can provide hints about the signal genesis and/or a base for a further processing of the information, just as in crystallographic experiments. However, the detection of periodic patterns can be drastically hampered by signal distortions [12, 13]. Specific techniques, which depend on the nature of the signal, therefore need to be developed  see e.g.[14, 15] in the context of gene expression data. In this article, we present a method to detect periodic patterns in boolean sequences, i.e., the signal X(l) is a onedimensional signal that takes values in {0, 1}, the coordinate l is discrete and takes values in ℕ. More particularly, we address the question of sparse sequences, that is the ratio of the number of 1's to 0's is much smaller than 1. A prototypic example concerns the organization of genes along DNA. For instance, the human genome contains approximately 3 × 10^{4} genes that are distributed along a 3 × 10^{9} basepair long DNA  in this case, l stands for the position of the basepairs forming the DNA. Hence, the ratio 1/0 is on the order of 10^{5}.
One of the major difficulties of periodic detection, especially in the case of sparse data, lies in the robustness of the method with respect to noise, data contamination and missing data. Noise leads to positions of 1's that are different from the ideal periodic case. This is a ubiquitous source of signal distortion since perfect periodic patterns stem from specific types of phenomena, e.g. the ordering of atoms in crystals. Data contamination, often referred to as false positives, refers to the points {l, X(l) = 1} that come from wrong information. Such contamination is commonplace in bioinformatics, especially when predicting features using datasets that are built from genomewide experiments [16]. Preventing it mostly leads to missing true findings (missing data), that is X(l) = 0 for values of l such that X(l) should be equal to 1, which is often referred to as false negatives. As a result, datasets may contain both false positives and false negatives  they always do in datasets coming from highthroughput biological experiments [16].
Within this scope, we present a periodic pattern detection method that is particularly robust with respect to noise, data contamination and missing data. The method has two facets, namely, i) it highlights the presence of periodic patterns and ii) it identifies the points that participate in the periodic trends, which are discussed in the two next sections. As an illustration, using both artificial and real datasets, we then show the limitations of standard Fourierbased spectral methods in situations where the present tool is fully efficient.
Highlighting periodic regularities in boolean sequences
Scoring function
Solenoidal spectra
This allows to quantitatively evaluate the statistical significance of a pSoS, though the independence of the peaks, if any, may be a delicate point to prove. In any case, the probability for such a spectrum to occur by chance is lower than $1{(1{p}^{\ast})}^{{N}_{p}}$.
Identifying the periodic points
In a periodical dataset, all the sites are not expected to be positioned accordingly to the apparent periodicity. In particular, in addition to some wrongly predicted positions, datasets may contain sites that are generated from different sources or that belong to different families (e.g., different families of coregulated genes). The positions of these sites are therefore not expected to be correlated. Interestingly, the SCM allows to determine which sites are concerned by a given periodicity. More precisely, a positional score can be defined for each site, which is related to the likelihood for the site to be periodically positioned with respect to the other sites of the dataset.
Given a period P, the positional score of any site i is calculated by analyzing the position of the nearest neighbors modulo P, i.e. the nearest neighbors on the Psolenoid face view. The principle consists in rewarding the sites that are located in the dense regions of the solenoid face view. To this end, we use, again, a quantity akin to the information content related to the distances of the nearest neighbors. Each pair of sites on each side of i is allocated with a probability p_{ < } for the sites to be separated by a distance that is inferior to their current distance, supposing the sites to be randomly drawn according to a uniform distribution. This leads to the information content [ log(p_{ < } )]. Next, a scoring function ${\mathcal{S}}^{\prime}(i,P)$ associated to i at the period P is defined. It is equal to the maximum information content obtained from the nearest neighbors, i.e. from the pair of nearest neighbors that has the lowest p_{ < } .
Implementation
Periodic pattern detection: generating the solenoidal spectra
The next paragraph provides some details about the scoring function that is at the core of the SCM. The second paragraph provides further details about the computation of the pvalues that are involved in the scoring function.
Scoring function
J represents the maximum number of nearest neighbors to be considered. Hence, the computation is all the faster that J is small. However, J must increase with N in order to efficiently detect dense regions. In this regard, all the reported results in this article have been obtained by choosing J = max(E[N/16], 1) where the function E[x] gives the integer part of x. We have observed that the precise dependence of J on N does actually not affect the detection.
pvalues ${p}_{j}^{N}\left({x}_{i,j}^{P}\right)$
For all i and j, ${p}_{j}^{N}\left({x}_{i,j}^{P}\right)$ is the probability for generating a distance as extreme as ${x}_{i,j}^{P}$ when the sites are independently drawn according to a uniform law. In the case of dense regions, respectively poor regions, this corresponds to generate distances that are smaller, larger respectively, than ${x}_{i,j}^{P}$. This can be explicitly written in terms of the probability density ${\rho}_{2j1}^{N}(x)$ of the random variable associated to the distance between any pair of sites that are separated by 2j 1 sites, which can be readily computed ∀j ∈ {1,...,N/2} as explained now.
where ${F}_{2j1}^{N}$ stands for the repartition function associated to ${\rho}_{2j1}^{N}$, that is ${F}_{2j1}^{N}(x)={\displaystyle {\int}_{0}^{x}dy}{\rho}_{2j1}^{N}(y)$. Dense regions correspond to small values of ${F}_{2j1}^{N}\left({x}_{i,j}^{P}\right)$ so that Eq. 4 leads to the right value of ${p}_{j}^{N}\left({x}_{i,j}^{P}\right)$ in the limit of small distances, that is ${p}_{j}^{N}\left({x}_{i,j}^{P}\right)={F}_{2j1}^{N}\left({x}_{i,j}^{P}\right)$. On the other hand, poor regions correspond to values of ${F}_{2j1}^{N}\left({x}_{i,j}^{P}\right)$ close to 1 so that we also recover ${p}_{j}^{N}\left({x}_{i,j}^{P}\right)=\left(1{F}_{2j1}^{N}\left({x}_{i,j}^{P}\right)\right)$ in the limit of large distances. Intermediate values of ${x}_{i,j}^{P}$, i.e. close to the maximum of ${\rho}_{2j1}^{N}$, do not play any crucial role for highlighting clusters, and hence, they are not crucial for the periodic detection method.
Positional score
The positional score is calculated by analyzing the position of the nearest neighbors modulo P, i.e. the nearest neighbors on the Psolenoid face view. This means to compute the pvalue p_{<} for two sites to be separated by a distance that is inferior to their current distance supposing that the sites have been randomly drawn according to a uniform distribution.
where ⟨j, k⟩ _{ J } stands for the set of pairs composed of two sites that lie on the J first nearest neighbors on each side of i.
Results and discussion
Periodic pattern detection
Two methods are often used to highlight the presence of periodic patterns. The first one is mostly used in the case of sparse boolean sequences, which is the case treated here. It consists in computing the histogram of the distances that separate each pair of points. The histogram is then analyzed thanks to a (discrete) Fourier transform. The second one is a standard procedure for analyzing continuous signals. It consists in computing an autocorrelation function, which is then analyzed thanks to a Fourier transform, too.
SCM versus pairdistance histograms  Fig. 3
Each row of Fig. 3 corresponds to the analysis of a specific set of positions, which is indicated at the top of the row. The first column gives the pairdistance histograms by reporting the number of occurrence N_{ o }of the distances (bin = 50). The second column gives the discrete Fourier transform F of the pairdistance histogram. Finally, the third column gives the pvalue Solenoidal Spectrum (pSoS) using a semilog scale. The first row shows the equivalence between F and pSoS for a Dirac comb with period P = 10000, i.e. a set of sites that are regularly spaced by a distance P = 10000. In both spectra, the peaks are harmonics of a main peak (period P = 10000). The second row shows the results for a set of positions that consists of a periodic succession (8 times here) of a complex pattern (red points). The period is still 10000. In F, the main peak is obtained at P ~ 10000/6 whereas the pSoS still provides the main peak at P = 10000 (the other main peaks are harmonics of this period). In the third row, noise is added by drawing the positions according to a uniform distribution of amplitude A, which is centered around the sites of the second row (i.e., the second row corresponds to A = 0). For A/P = 10%, unlike the Fourier transform, most of the pSoS's still provide a main peak at P = 10000. The fourth row shows the results for the same set of positions as in the third row, except that 10 points (of the 40 initial ones) have been deleted (false negatives) and replaced by 10 points at random locations (false positives). One can see that the SCM is still able to detect the presence of the periodic pattern, which demonstrates the robustness of the method with respect to data contamination.
The last two rows show the results for positions resulting from the combination of two periodic patterns having different periods (blue and red points). In the fifth row, positions correspond to a succession of the periodic motifs up to the position 80000, resulting in 56 points. The Fourier spectrum of the pairdistance histogram is flat around one of the main period (P = 10000) whereas all the peaks in the pSoS are harmonics of the two main periods P = 7270 and P = 10000, which are respectively indicated by the green and blue dashed vertical lines. The last row gives the curves that result from an average over 100 sets of positions drawn by adding noise to the previous case. In contrast to the Fourier spectrum of the pairdistance histogram, the two main periods are revealed by two sharp peaks in the pSoS, plus one main harmonic peak for each of them.
To summarize, the pairdistance histogram method is poorly efficient to highlight the periodic presence of a complex motif. More strikingly, the mixing of two motifs having two different periods lead to flat Fourier spectra of the pairdistance histograms around the expected periods. On the contrary, even in the presence of noise, the pSoS leads to welldefined peaks that clearly reveal the two different periods.
SCM versus autocorrelation function  Fig. 4
The autocorrelation function C(x) of an infinitely long sequence X is defined by $C(x)=\underset{L\to \infty}{\mathrm{lim}}\frac{\mathcal{N}}{L}{\displaystyle \sum _{i=0}^{L}X(i)}\cdot X(i+x)$ where the normalization factor $\mathcal{N}$ is such that C(0) = 1. For small and sparse sequences, one needs first to smooth the sequence to avoid the product X(i) ·X(i + x) to be mostly equal to 0. We call $\tilde{X}$ this smoothed sequence and further suppose $\tilde{X}$ to represent the best smoothing procedure for highlighting the periodic trend. In this case, the autocorrelation function is defined by $C(x)=\frac{\mathcal{N}}{Lx}{\displaystyle \sum _{i=0}^{Lx}\tilde{X}(i)}\cdot \tilde{X}(i+x)$. The SCM is compared to the autocorrelation method in Fig. 4. The first (second) row corresponds to the case treated in the fourth (sixth) row of Fig. 3. The last row reports the analysis of positions coming from genomic studies.
Just as in the case of the pairdistance histogram technique, the two first rows demonstrate that autocorrelation functions are poorly efficient to highlight the periodic presence of complex motifs or the periodic presence of motifs having different periods. The results reported here have been obtained by smoothing the sequence using a 1000 long square window. Different smoothing procedures (Gaussian, different window sizes,...) can be checked to have little, if any, positive impact on the results.
In the last row, 90 genes of the 4639675 basepair long Escherichia coli genome were analyzed. The positions were taken from the Regulon DataBase [18]. They correspond to the genes that present experimental evidence for being transcriptionally regulated by the transcription factor CRP, which is the transcription factor that regulates the most genes in E. coli. The Fourier transform of the autocorrelation function leads to two significant high peaks at periods 9508 and 27782 (green circles) whereas the pSoS leads to four significant high peaks at periods 6581, 9507, 19015 and 27782 (blue circles). In particular, the highest peak in the pSoS, i.e. the peak at the period 19015, has no counterpart in the autocorrelation function. These different results would lead to different interpretations of the genomic organization, and hence, to different predictions of the spatial organization of DNA [11].
Positional scores
A useful way for distinguishing the points that belong to different periodic trends then consists in plotting the quantity ${10}^{{}^{{\mathcal{S}}_{pos}(P)}}$, i.e. ${p}_{v}({\mathcal{S}}^{\prime}(P))$ in Eq. 2, computed at the period P = 10000 versus the same quantity computed at the period P = 7270, which is done in Fig. 5b. In this plot, one can clearly distinguish the points that belong to the 10000periodicity (points along the xaxis) from those that belong to the 7270periodicity (points along the yaxis). Interestingly, this representation also allows to distinguish the different points for a sequence that is distorted. In Fig. 5c, the distortion consisted in adding noise to the sequence used in Fig. 5a and 5b. This was done by drawing the positions according to a uniform distribution of amplitude 727 (i.e. 10% of 7270), which is centered around the original sites. This hence corresponds to the situation of the last row of Fig. 3. In contrast, in Fig. 5d, we report the quantity 10^{Spos(P)}computed at P = 8700 versus the same quantity computed at P = 5300, i.e. at periods where no regularities are expected. In this situation, the points are no more separated.
Conclusion
Pairdistance histograms and autocorrelation functions, either analyzed by discrete or continuous Fourier transforms, may be poorly appropriate for highlighting the presence of periodic patterns in sparse and noisy sequences. More importantly, both methods do not succeed in disentangling multiple patterns having different periods so that the corresponding Fourier spectra are flat at the periods supposedly characterizing the sequence (Fig. 3 and 4). In contrast, the solenoid coordinate method (SCM) has been built in order to be particularly sensitive to any periodic patterns, even in the case of overlapping patterns with different periods. Its robustness to signal distortion, which can be due to the presence of noise, false positives or/and false negatives, stems from the remarkable alignment properties of periodic sites when they are represented in a solenoidal coordinate system with the right period (Fig. 1). It must also be noted that the SCM does not need any smoothing of the original sequence as in the case of the autocorrelation function. Finally, thanks to the definition of a positional score, we have shown that the SCM framework further allows to identify the sites that participate most in a periodic tendency. This should be particularly useful for identifying periodic genes, and hence, for investigating their functional properties.
The present method is suited to sparse (boolean) sequences that contain a rather small number of sites (1's). More precisely, the computational time for running a spectrum of a sequence containing N sites scales as JN ~ N^{2} (see Eq. 3). The method is therefore poorly scalable in its present form. Different improvements along this direction can be contemplated. A possible one would consist in computing the KullbackLeibler divergence (with respect to a uniform distribution) of the density distribution of the sites modulo the periods, i.e. the KullbackLeibler divergence of the density distribution along the solenoid face views. This cannot be done when the number of sites is too small, which was the case treated here.
Availability and requirements
The software is available for public use at http://www.issb.genopole.fr/MEGA/Softwares/iSSB_SolenoidalApplication.zip.
List of abbreviations used
 SCM :

solenoidal coordinate method
 SoS :

solenoidal spectrum
 pSoS p :

valued Solenoidal Spectrum
Declarations
Acknowledgements
This work was supported by the Sixth European Research Framework (project number 034952, GENNETEC project), PRES UniverSud Paris, CNRS and Genopole.
Authors’ Affiliations
References
 Hochschild A, Ptashne M: Cooperative binding of λ repressors to sites separated by integral turns of the DNA helix. Cell. 1986, 44 (5): 6817. 10.1016/00928674(86)908330PubMedView ArticleGoogle Scholar
 ColladoVides J, Magasanik B, Gralla JD: Control site location and transcriptional regulation in Escherichia coli. Microbiol Rev. 1991, 55 (3): 37194.PubMedPubMed CentralGoogle Scholar
 Müller J, Oehler S, MüllerHill B: Repression of lac promoter as a function of distance, phase and quality of an auxiliary lac operator. J Mol Biol. 1996, 257: 219. 10.1006/jmbi.1996.0143PubMedView ArticleGoogle Scholar
 Korbel JO, Jensen LJ, von Mering C, Bork P: Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotech. 2004, 22 (7): 9117. 10.1038/nbt988View ArticleGoogle Scholar
 Warren PB, ten Wolde PR: Statistical analysis of the spatial distribution of operons in the transcriptional regulation network of Escherichia coli. J Mol Biol. 2004, 342 (5): 137990. 10.1016/j.jmb.2004.07.074PubMedView ArticleGoogle Scholar
 Kolesov G, Wunderlich Z, Laikova ON, Gelfand MS, Mirny LA: How gene order is influenced by the biophysics of transcription regulation. Proc Natl Acad Sci USA. 2007, 104 (35): 13948 10.1073/pnas.0700672104PubMedPubMed CentralView ArticleGoogle Scholar
 Képès F: Periodic epiorganization of the yeast genome revealed by the distribution of promoter sites. J Mol Biol. 2003, 329 (5): 859865. 10.1016/S00222836(03)005357PubMedView ArticleGoogle Scholar
 Képès F: Periodic transcriptional organization of the E. coli genome. J Mol Biol. 2004, 340 (5): 957964. 10.1016/j.jmb.2004.05.039PubMedView ArticleGoogle Scholar
 Wright M, Kharchenko P, Church G, Segrè D: Chromosomal periodicity of evolutionarily conserved gene pairs. Proc Natl Acad Sci USA. 2007, 104 (25): 10559 10.1073/pnas.0610776104PubMedPubMed CentralView ArticleGoogle Scholar
 Képès F, Vaillant C: Transcriptionbased solenoidal model of chromosomes. Complexus. 2003, 1 (4): 171180. 10.1159/000082184View ArticleGoogle Scholar
 Junier I, Martin O, Képès F: Spatial and topological organization of DNA chains induced by gene colocalization. PLoS Comput Biol. 2010, 6 (2): e1000678 10.1371/journal.pcbi.1000678PubMedPubMed CentralView ArticleGoogle Scholar
 Kanjilal PP, Bhattacharya J, Saha G: Robust method for periodicity detection and characterization of irregular cyclical series in terms of embedded periodic components. Phys Rev E. 1999, 59 (4): 40134025. 10.1103/PhysRevE.59.4013View ArticleGoogle Scholar
 Ghil M, Allen MR, Dettinger MD, Ide K, Kondrashov D, Mann ME, Robertson AW, Saunders A, Tian Y, Varadi F: Advanced spectral methods for climatic time series. Rev Geophys. 2002, 40: 100310.1029/2000RG000092. 10.1029/2000RG000092View ArticleGoogle Scholar
 Ahdesmäki M, Lähdesmäki H, Gracey A, Shmulevich L, YliHarja O: Robust regression for periodicity detection in nonuniformly sampled timecourse gene expression data. BMC Bioinformatics. 2007, 8: 233 10.1186/147121058233PubMedPubMed CentralView ArticleGoogle Scholar
 Liang KC, Wang X, Li TH: Robust discovery of periodically expressed genes using the laplace periodogram. BMC Bioinformatics. 2009, 10: 15 10.1186/147121051015PubMedPubMed CentralView ArticleGoogle Scholar
 Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 94405. 10.1073/pnas.1530509100PubMedPubMed CentralView ArticleGoogle Scholar
 Shannon CE, Weaver W: The mathematical theory of communication. 1975, Urbana: University of Illinois Press,Google Scholar
 GamaCastro S, JimenezJacinto V, PeraltaGil M, SantosZavaleta A, PenalozaSpinola M, ContrerasMoreira B, SeguraSalazar J, MunizRascado L, MartinezFlores I, Salgado H, BonavidesMartinez C, AbreuGoodger C, RodriguezPenagos C, MirandaRios J, Morett E, Merino E, Huerta A, TrevinoQuintanilla L, ColladoVides J: RegulonDB (version 6.0): gene regulation model of Escherichia coli K12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research. 2007, gkm994v1, http://nar.oxfordjournals.org/cgi/content/full/gkm994v1Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.