Prediction of plant promoters based on hexamers and random triplet pair analysis
© Azad et al; licensee BioMed Central Ltd. 2011
Received: 20 January 2011
Accepted: 28 June 2011
Published: 28 June 2011
With an increasing number of plant genome sequences, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, prediction accuracy of these still requires further improvement. The limitations of these methods can be addressed by selecting appropriate features for distinguishing promoters and non-promoters.
In this study, we proposed two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet-pairs (hexamer sequences) were selected based on the difference in the frequency of hexamers between promoters and non-promoters. In RTPFSGA, random triplet-pairs (RTPs) were selected by exploiting a genetic algorithm that distinguishes frequencies of non-adjacent triplet pairs between promoters and non-promoters. Then, a support vector machine (SVM), a nonlinear machine-learning algorithm, was used to classify promoters and non-promoters by combining these two feature selection approaches. We referred to this novel algorithm as PromoBot.
Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA of PlantGDB and plant miRNA of miRBase. Then, in order to validate the proposed algorithm, we applied a 5-fold cross validation test. Training data sets were used to select features based on FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity.
We compared PromoBot with five other algorithms. Its sensitivity and specificity were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet pairs, can be successfully incorporated into a supervised machine-learning method for the promoter classification problem. As such, we expect that PromoBot can be used to help identify new plant promoters. Source code and analysis results of this work can be provided upon request.
Promoters are non-coding regions in genomic DNA that contain information crucial to the activation or repression of downstream genes. Located upstream of the transcription start site (TSS) of a gene, the promoter region consists of certain short conserved DNA sequences known as cis-elements or motifs, which are recognized and bound by specific transcription factors. Transcriptional regulation of gene expression thus depends on various interactions between these cis-elements and their respective transcription factors.
The accurate identification of promoters and TSS localization remains a major challenge in bioinformatics due to the great diversity observed in the gene- and species-specific architectures of such regulatory sequences. The first comprehensive review of publicly available promoter prediction tools was made by Fickett and Hatzigeorgiou. However, these programs demonstrated a high rate of false positive predictions, mainly because they relied on only one or two sequence features of the promoter region, such as the presence of a TATA box or Initiator element. Ohler et al. then integrated some physical properties of DNA, such as DNA bendability and CpG content, along with the sequence features in their proposed method (referred to as McPromoter), though their approach was developed for only one particular species, Drosophila. Knudsen developed Promoter 2.0 by combining a neural network and a genetic algorithm; it recognized all five promoter sites on the positive strand of a complete Adenovirus genome, but also made 30 false predictions. Another eukaryotic promoter prediction algorithm, TSSW, had 42% accuracy with one false positive per 789 bp. It should also be noted that most of these algorithms were trained exclusively for a specific animal species, and as such their prediction reliability decreased further when applied to distant species, particularly plants.
The first promoter prediction tool trained and adapted for plants was TSSP-TCM, created by Shahmuradov et al. It used confidence estimation along with a support vector machine (SVM) to predict plant promoters. TSSP-TCM correctly identified 35 out of 40 test TATA promoters and 21 out of 25 TATA-less promoters, with the predicted TSSs deviating 5-14 bp from their true positions. However, recent studies have shown that TATA boxes and Initiators are not universal features for characterizing plant promoters, and that other motifs such as Y patches may play a major role in the transcription process in plants. For example, around 50% of rice genes contain Y patches in their promoter regions. Moreover, identification of the true promoter region in long genomic sequences using known regulatory motifs, such as the TATA box or Y patch, is extremely difficult due to the short length and degenerate nature of these elements. Hence, prediction methods based on a few known elements may not provide the best results for identifying promoters in plant genomes.
In order to devise a more effective approach for identifying plant promoters, several structural and sequence-dependent properties, such as curvature and periodicity in experimentally validated promoters (both TATA-plus and TATA-less types), were analyzed by Pandey and Krishnamachari. The analysis revealed that the DNA curvature in promoter regions was greater than that in gene-containing regions, indicating the possibility of distant sequences being brought nearer to the core promoter elements and thus affecting regulation of gene expression in the promoter region. To improve promoter prediction, the use of DNA structural properties such as bendability, B-DNA twist, and duplex-free energy has been further explored for several eukaryotic genomes, including plants [10, 11]. Although each of these approaches has shown that a distinct structural profile is associated with core promoter regions, it is still unknown to what extent such DNA structural properties are related to the presence of known or novel regulatory elements in the plant promoter. Hence, the possibility of distal elements underlying such distinct structural patterns needs to be further explored in order to more fully characterize the actual promoter regions.
In most of the promoter prediction approaches currently available, only protein-coding sequences are used as a non-promoter dataset for training. However, there are other regions in genomic DNA that are neither coding regions nor promoters. For example, miRNA, ribosomal RNA, and tRNA genes are not translated to proteins but have their own promoters. These genes constitute a significant part of the genome that belongs to non-promoter regions. Hence, building a non-promoter dataset that consists of such RNA genes, along with the protein-coding sequences, may improve program efficiency in discriminating between promoter and non-promoter sequences.
Recently, a novel approach (PromMachine) used a characteristic tetramer frequency analysis along with SVM to predict plant promoters. In this approach, all possible tetramer combinations of the nucleotides A, T, G, and C (4^4 = 256) were generated. The most significant tetramers (128 in total) were then taken as discriminating features between the promoters and non-promoters. This approach was not dependent on the presence of TATA boxes or Initiator motifs, though it also had several drawbacks. For example, the non-promoter dataset used for training was built only from protein-coding sequences, with no other non-promoter sequences included, such as non-coding RNA gene sequences. Also, the program could not locate the TSS position when the TATA box was not present. This limits the utility of PromMachine in detecting TSSs for a huge number of plant promoters, as only ~19% of rice genes and 29% of Arabidopsis genes contain a TATA box in their core promoters [8, 13]. Since the prediction accuracy of PromMachine using 7-fold cross-validation was ~83.91%, the achievement of better accuracy still remains a challenge.

The development of a standard validation protocol is therefore important in order to determine the best performing promoter prediction program. To this end, Abeel et al. proposed a set of validation protocols for the fair evaluation of promoter prediction programs, aiming to identify a gold standard. Among these protocols, two were based on a binning approach (bins of 500 bp) in which each bin was checked to see whether it overlapped with an experimentally known transcription start region (TSR) or a known start position of a gene. The remaining protocols were based on distance, in which a prediction was considered to be correct if the distance to the closest TSR was smaller than 500 bp. Based on their investigation, they proposed a standard for evaluating promoter prediction software and identified four highly performing software programs, although each of these programs works on different principles and was designed for different tasks.
In this study, we proposed two approaches for feature selection that can improve prediction accuracies and analyze the concept of frequently occurring triplet pairs in sequences. The first feature selection approach is the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA), in which we counted the frequency of hexamers (adjacent triplet pairs) in a dataset. The second approach is the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA), in which we used a genetic algorithm to find random triplet pairs (RTPs), each of which pairs two non-adjacent triplets at random. It should be noted that the distribution of triplet frequencies has been analyzed in many previous studies to identify genes, as the significance of nucleotide triplets that act as codons in coding sequences is universally known. Recent studies have also found that distant amino acids in protein sequences may become adjacent in the tertiary structure and form local spatial patterns (LSP), which may play an important role in the protein's biological functionality [15, 16]. Hence, the distribution of triplet frequency may also be useful for identifying promoter regions, as differential patterns of triplet over/under-representation have been discovered in a large number of genomes from diverse species over the last few years [17–19].
These observations support the concept of using RTPs as a discriminative feature. In our proposed RTPFSGA, the triplets in each pair are non-adjacent, both to facilitate the analysis of distant triplets that may become adjacent and act as pairs in three-dimensional structures, and to enable identification of significant RTP distributions in coding and non-coding promoter sequences for classification purposes. By combining the distinct features selected by FDAFSA and RTPFSGA with an SVM for classification of promoter and non-promoter sequences, we developed PromoBot as an alternative technique for promoter identification. PromoBot was found to be comparable to, and in some cases to outperform, other existing algorithms in classifying plant promoters.
Two datasets were used in selecting features and estimating the performance of the promoter classification algorithm: the plant promoter sequence dataset, and the non-promoter sequence dataset.
Plant promoter sequence database
For this study, 305 experimentally validated plant promoter sequences, collected from the PlantProm database, were used as a positive dataset. PlantProm is an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II from different plant species. In the PlantProm database, all promoter sequences have experimentally verified TSSs, and sequence segments are from -200 to +51 bp relative to the TSS.
Non-promoter sequence database
A set of non-redundant plant mRNA, tRNA, and rRNA sequences of various species extracted from PlantGDB, as well as miRNA precursor sequences downloaded from miRBase, were used to construct the negative dataset. We collected 305 sequences of ≥ 251 bp in length from a list of different plant species (Additional File 1). We chose a random start position in each non-promoter sequence and then extracted 251 bp, so that all promoter and non-promoter sequences were of the same length.
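The window extraction described above can be sketched as follows. This is a minimal illustration rather than the code used in the study; the function name `random_window` and the choice of a uniform start position are assumptions.

```python
import random

WINDOW = 251  # length of a PlantProm segment (-200 to +51 bp relative to the TSS)

def random_window(seq, length=WINDOW, rng=random):
    """Return a randomly placed contiguous window of `length` bases.

    Returns None when the sequence is shorter than `length`, mirroring the
    requirement that non-promoter sequences be at least 251 bp long.
    """
    if len(seq) < length:
        return None
    # Uniform start position (an assumption; the text only says "random").
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]
```

Passing a seeded `random.Random` instance as `rng` makes the extraction reproducible.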
Support vector machine
Support vector machine (SVM) is a supervised machine-learning algorithm that is used to solve classification and regression problems. For binary classification, the candidate input datasets are assumed to be two sets of vectors in an n-dimensional space. SVM constructs a separating hyper-plane in this space that maximizes the margin between the two sets of vectors. Two parallel hyper-planes, one on each side of the separating hyper-plane, are constructed to calculate the margin. In this method, a good classification depends on a good separation of the two classes, which is accomplished via a hyper-plane that ensures a maximum distance to the neighboring data points of both classes. In this study, we used LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
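As an illustration of how sequences might be handed to LIBSVM, the sketch below writes one example in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based ascending indices). The encoding of each sequence as per-feature occurrence counts is an assumption, since the text does not state how feature values were computed, and both function names are hypothetical.

```python
def hexamer_counts(seq, features):
    """Count overlapping occurrences of each selected hexamer in one sequence."""
    counts = dict.fromkeys(features, 0)
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        if hexamer in counts:
            counts[hexamer] += 1
    return [counts[f] for f in features]

def to_libsvm_line(label, values):
    """Format one example as a LIBSVM input line, omitting zero-valued features."""
    parts = [f"{label:+d}"]  # +1 for promoter, -1 for non-promoter
    for index, value in enumerate(values, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

For example, `to_libsvm_line(1, hexamer_counts("ACGTACGTAC", ["ACGTAC", "CGTACG", "TTTTTT"]))` yields `+1 1:2 2:1`.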
The success of SVM classification largely depends on the features chosen. In this study, two different approaches were proposed for feature selection: FDAFSA and RTPFSGA. The final version, PromoBot, was built after being trained using the SVM-TRAIN tool of LIBSVM, based on the distinct features extracted by these two feature-selection approaches. For the 5-fold cross validation test, the promoter and non-promoter datasets were each partitioned into 5 groups; 4 groups were used for selecting features and training, and the remaining group was used for testing. Each training set contained 244 promoters and 244 non-promoters, and each test set contained 61 promoters and 61 non-promoters.
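The partitioning can be sketched as below. This is an illustrative sketch only: the shuffling, the seed, and the function name `five_fold_splits` are assumptions, as the text does not say how sequences were assigned to groups.

```python
import random

def five_fold_splits(promoters, non_promoters, k=5, seed=0):
    """Yield ((train_pos, train_neg), (test_pos, test_neg)) for each of k folds."""
    rng = random.Random(seed)
    pos, neg = list(promoters), list(non_promoters)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Strided slicing gives k groups of equal size when len is a multiple of k.
    pos_groups = [pos[i::k] for i in range(k)]
    neg_groups = [neg[i::k] for i in range(k)]
    for held_out in range(k):
        train_pos = [s for g in range(k) if g != held_out for s in pos_groups[g]]
        train_neg = [s for g in range(k) if g != held_out for s in neg_groups[g]]
        yield (train_pos, train_neg), (pos_groups[held_out], neg_groups[held_out])
```

With 305 sequences per class, each fold holds out 61 promoters and 61 non-promoters and trains on the remaining 244 of each, matching the sizes given above.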
We next sorted the hexamers based on Diff i, and finally obtained hexamer_set k, which was defined as the collection of 4,096 features obtained from each training_data k.
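A minimal sketch of this ranking step follows. The exact definition of Diff i is not given in this text, so the sketch assumes it is the absolute difference between a hexamer's total count in the promoters and in the non-promoters of a training set; the function names and the `top_fraction` helper (for the top 25% cut-off used later) are likewise illustrative.

```python
from collections import Counter
from itertools import product

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]  # 4**6 = 4096

def hexamer_frequencies(seqs):
    """Count every overlapping hexamer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts

def rank_hexamers(promoters, non_promoters):
    """Sort all 4,096 hexamers by decreasing |promoter count - non-promoter count|."""
    fp = hexamer_frequencies(promoters)
    fn = hexamer_frequencies(non_promoters)
    # Assumed form of Diff_i; raw counts are comparable here because both
    # classes have the same number of sequences of the same length.
    diff = {h: abs(fp[h] - fn[h]) for h in HEXAMERS}
    return sorted(HEXAMERS, key=lambda h: diff[h], reverse=True)

def top_fraction(ranked, fraction=0.25):
    """Keep the leading fraction of a ranked feature list."""
    return ranked[:int(len(ranked) * fraction)]
```

Under these assumptions, the top 25% of the ranked list contains 1,024 hexamers.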
The motivation for using a genetic algorithm in this approach was to iteratively select distantly related triplet (trimer) pairs. A total of 64 possible triplets were generated and randomly paired during the initialization phase of the genetic algorithm. To build the initial population, we considered a fixed number of random triplet pairs (RTPs) as an individual set of the initial population. The frequency of each candidate triplet in RTP i was counted in all promoters and non-promoters in training_data k; the minimum of these frequencies was then taken as the frequency of that RTP i. Observing both promoter and non-promoter sequences in each training_data k, each RTP i had two frequency values, defined as X 1 and X 2, respectively. For a particular RTP i, these two frequency values were analyzed by a fitness function, which in turn provided a fitness value for that RTP i. In the fitness function, a two-tailed Student's t-test was applied to these two frequency datasets, where n 1 was the number of elements in X 1 and n 2 was the number of elements in X 2. The p-value was then considered as the fitness value for a particular RTP i. The assumption was that any RTP i having a smaller p-value than the others has a greater discriminating power. Thus, any RTP i having a smaller p-value was considered a better fit than the others for the next generation of the genetic algorithm, where "Tournament Selection" was used for the survival selection. The best-fit individual between two randomly taken individuals was chosen as the first parent P 1, and the second parent P 2 was chosen in the same way.
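The fitness evaluation and tournament step can be sketched as follows, under several stated assumptions. X 1 and X 2 are taken to be per-sequence RTP frequencies over the promoters and non-promoters; the t statistic is assumed to be the equal-variance (pooled) form of Student's t-test, since the formula itself does not appear in this text; and because the Python standard library has no t-distribution, the sketch ranks RTPs by |t| instead of the p-value. For a fixed number of degrees of freedom (n 1 + n 2 - 2), a larger |t| corresponds to a smaller two-tailed p-value, so the ordering is the same.

```python
import math
import random

def count_overlapping(seq, triplet):
    """Number of (overlapping) occurrences of a triplet in a sequence."""
    return sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] == triplet)

def rtp_frequency(seq, rtp):
    """Frequency of an RTP in one sequence: the minimum of its two triplet counts."""
    first, second = rtp
    return min(count_overlapping(seq, first), count_overlapping(seq, second))

def pooled_t(x1, x2):
    """Two-sample Student's t statistic with pooled variance (assumed form)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    se = math.sqrt((ss1 + ss2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0 if m1 == m2 else math.copysign(math.inf, m1 - m2)
    return (m1 - m2) / se

def fitness(rtp, promoters, non_promoters):
    """|t| between promoter and non-promoter RTP frequencies (larger is better)."""
    x1 = [rtp_frequency(s, rtp) for s in promoters]
    x2 = [rtp_frequency(s, rtp) for s in non_promoters]
    return abs(pooled_t(x1, x2))

def tournament(population, scores, rng=random):
    """Pick two individuals at random and return the fitter one as a parent."""
    a, b = rng.sample(population, 2)
    return a if scores[a] >= scores[b] else b
```

Calling `tournament` twice gives the two parents P 1 and P 2 described above.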
Two types of reproduction operators were used in this algorithm: crossover and mutation. The threshold for crossover probability used here was 0.8 and the mutation probability was 0.05. We generated random double values to simulate these probabilities and compared them with the corresponding thresholds. At each step of reproduction, two parent RTPs were checked for crossover. If the random value was less than the crossover threshold, the triplets of both RTPs were swapped with each other. After every crossover action, the mutation probability was checked for every offspring. If the random value was less than the mutation probability, we mutated the offspring. The mutation logic was very simple. First, the part to be mutated was randomly selected, and we then randomly selected a triplet to replace the mutated part. However, we ensured that each mutated RTP was distinct within the current population: if a mutated RTP was already in the current population, we discarded the choice and searched for a new mutated part. The threshold for mutation probability was intentionally set to a smaller value than that of crossover so that mutation happened less frequently than crossover.
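The two operators can be sketched as below. The text does not specify which triplets are exchanged during crossover, so the sketch assumes that the parents exchange their second triplets; the `max_tries` guard and all function and parameter names are also illustrative assumptions.

```python
import random
from itertools import product

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 triplets
P_CROSSOVER = 0.8
P_MUTATION = 0.05

def crossover(p1, p2, rng, p_crossover=P_CROSSOVER):
    """Swap the second triplets of two parent RTPs with probability p_crossover."""
    if rng.random() < p_crossover:
        return (p1[0], p2[1]), (p2[0], p1[1])
    return p1, p2

def mutate(child, population, rng, p_mutation=P_MUTATION, max_tries=100):
    """Replace one randomly chosen triplet, retrying if the result already exists."""
    if rng.random() >= p_mutation:
        return child
    for _ in range(max_tries):
        part = rng.randrange(2)              # which triplet of the pair to replace
        mutated = list(child)
        mutated[part] = rng.choice(TRIPLETS)
        candidate = tuple(mutated)
        if candidate not in population:      # keep RTPs distinct in the population
            return candidate
    return child  # hypothetical fallback if no new RTP is found
```

Here `population` is expected to be a set of RTP tuples so that the membership test is cheap.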
After the reproduction phase, a fitness value was assigned to each child using the same fitness function (as described above), and two different populations were created: a parent or current population (μ), and a child population (Ω). For the selection of survivors, the (μ + Ω) → μ mapping approach was used instead of (μ, Ω) → μ, which means that the best-fit individuals (RTPs) among μ and Ω together were selected for the next generation, instead of considering only μ or only Ω. The other parameter values of the genetic algorithm, apart from the crossover and mutation probabilities, were as follows: the maximum population size in one generation was 1,000, the number of reproductions in one generation was 500, the maximum child limit in one generation was 500, and the maximum number of generations was 1,000. These parameter values were fixed after tuning several times (data not shown).
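The difference between the two survivor-selection schemes can be illustrated with a short sketch. It assumes that fitness is the p-value (smaller is better), as stated above; the function names and the de-duplication step are illustrative.

```python
def plus_selection(parents, children, p_value, mu):
    """(mu + omega) -> mu: keep the mu best individuals from parents and children."""
    pool = list(dict.fromkeys(list(parents) + list(children)))  # drop duplicates
    pool.sort(key=p_value)  # smaller p-value means a better fit
    return pool[:mu]

def comma_selection(children, p_value, mu):
    """(mu, omega) -> mu: keep the mu best individuals from the children only."""
    return sorted(dict.fromkeys(children), key=p_value)[:mu]
```

With plus selection, a strong parent is never lost to a weaker child, so the best p-value found so far cannot get worse from one generation to the next.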
Selection of significant features from FDAFSA
[Table: Top 10 common hexamers among the top 25% of FDAFSA features from the 5 data sets of the 5-fold cross validation. Column: Common hexamers extracted from all 5 data sets (top 25%).]
[Table: FDAFSA vs. PromMachine. Columns: Methods (n-mers used) | Average sensitivity of 5-fold cross validation (%) | Average specificity of 5-fold cross validation (%).]
Selection of significant features from RTPFSGA
[Table: 10 common RTPs among the RTPs having p-value < 0.000001 in all 5 data sets of the 5-fold cross validation. Column: Random Triplet Pair.]
The specificity of FDAFSA was significantly higher than that of RTPFSGA. As shown in Figures 1 and 2, when we chose the top 25% of features from FDAFSA, the average specificity of the prediction was 0.86, whereas the average specificity for features selected by RTPFSGA using a p-value < 0.000001 was 0.59. In contrast, the features selected by RTPFSGA had a higher average sensitivity than those from FDAFSA (0.94 and 0.84, respectively). Then, in an attempt to increase both the sensitivity and specificity, we merged the two feature sets in PromoBot. For each set of training_data k we had two feature sets: hexamer_set' k and RTP_set' k. We selected only distinct features from these two feature sets to build PromoBot. As RTPs were triplet pairs, two hexamers could be formed from each RTP in RTP_set' k. In order to construct a unique set of features, the hexamer_set' k from FDAFSA was checked for the presence of hexamers obtained from RTPs, and these hexamers were subsequently excluded from hexamer_set' k. Finally, we built combined_feature_set k from each training_data k; the numbers of features in the five combined sets were 1077, 1115, 1096, 1071, and 1097, respectively.
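The merging step can be sketched as follows. It assumes that the two hexamers formed from an RTP are the concatenations of its triplets in both orders, which the text does not state explicitly; the function name `combine_features` is illustrative.

```python
def combine_features(hexamer_set, rtp_set):
    """Merge FDAFSA hexamers with RTPs, dropping hexamers that an RTP can form."""
    rtp_hexamers = set()
    for first, second in rtp_set:
        # Assumed: each RTP yields two hexamers, one per triplet order.
        rtp_hexamers.add(first + second)
        rtp_hexamers.add(second + first)
    kept = [h for h in hexamer_set if h not in rtp_hexamers]
    return kept + list(rtp_set)
```

For example, combining the hexamers `["TATAAA", "AAACCC", "GGGTTT"]` with the single RTP `("AAA", "CCC")` removes `"AAACCC"` and keeps the other two hexamers plus the RTP itself.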
Results of prediction test with combined features from FDAFSA and RTPFSGA.
[Table: Comparative accuracy of PromoBot with FDAFSA and RTPFSGA. Columns: Algorithm for feature selection | Average sensitivity for 5-fold cross validation (%) | Average specificity for 5-fold cross validation (%). Row label: [FDAFSA + RTPFSGA].]
Comparison with other methods
[Table: Comparison with other methods. Column: Statistical Measure (%). Labels: NNPP 2.2 (threshold = 0.8); Promoter Scan Version.]
Performance evaluation using experimentally validated new promoters
[Table: Performance evaluation using 271 experimentally validated promoters. Columns: No. of sequences | No. of accurate predictions.]
Comparison of promoter prediction performance using different negative datasets
We also evaluated the effect of using different types of negative datasets on promoter prediction. For this comparison, we collected plant miRNA sequences from miRBase and took 305 sequences having a length greater than or equal to 240 bp. Similarly, we collected mRNA and rRNA sequences from PlantGDB, selecting 305 sequences from each. In the case of rRNA, we removed sequences having 80% redundancy using Jalview version 2 and considered sequences having a length greater than or equal to 140 bp.
[Table: Comparative assessment of performance using different negative datasets. Column: Statistical Measure (%). Row label: [miRNA + mRNA + rRNA + tRNA].]
In PromoBot, which used a combined negative dataset in which only 40 non-redundant rRNA sequences were included, the overall performance was higher than when only mRNA or miRNA was used as the negative set. The results show the effectiveness of combining mRNA, rRNA, miRNA, and tRNA in the construction of the negative set. When only miRNA was used as the negative dataset, the specificities of both programs decreased, though the specificity of TSSP-TCM was significantly better than that of PromoBot (Table 8). Since discriminating mRNA promoters from miRNA is a difficult but important challenge, further extensive investigation is required. We did not include tRNA sequences for this analysis because there were very few non-redundant tRNA sequences in PlantGDB, with considerable variance in sequence length.
Discussion and conclusions
The comparative improvement of the accuracy rate of promoter predictions by PromoBot indicates that using the frequency distribution of hexamer sequences in combination with RTP analysis can be effective in identifying promoters in plant genomes. This method also has the potential to achieve improved accuracy in promoter identification if extended to genomes of other eukaryotic species.
Besides using two different algorithms for feature selection, the prediction model in PromoBot has been trained with an experimentally identified promoter dataset as well as a negative dataset derived from four different sources, i.e. miRNA, tRNA, rRNA and protein-coding mRNA genes. With the availability of a large number of plant genome sequences, the accurate discrimination of promoter regions from such non-coding RNA genes is becoming important. Our analysis showed that the performance of PromoBot varied depending on the negative dataset and that the second highest sensitivity and specificity were achieved when the combination of mRNA, miRNA, rRNA and tRNA gene sequences was used for the negative set (Table 8). Although the use of rRNA alone as the negative data yielded the highest sensitivity and specificity, this might be due to features selected from highly conserved and redundant rRNA sequences. In the case of the negative dataset consisting of only miRNA genes, the prediction performance decreased. One of the reasons for this low performance might be the length of miRNA precursor sequences. Plant miRNA precursors are highly variable, with lengths ranging from 55 to 930 bp (average ~146 bp). Such variation limited our attempt to collect enough miRNA precursor sequences having lengths equal to that of the experimentally verified promoters. Features collected from such sequences might be insufficient for accurate discrimination of RNA pol II plant promoters from miRNA genes. Also, miRNA genes may have other strong features that are unrecognized by FDAFSA and RTPFSGA in PromoBot. In the future, statistical and biological features of miRNA genes will be studied in detail so that these features can be fully utilized to improve the prediction algorithm.
Recently, a hierarchical stochastic language algorithm that utilizes the analysis of hexamer occurrence frequencies in DNA sequences has been shown to be successful in accurately recognizing transcriptional regulatory regions in several species, including Arabidopsis and rice. This usefulness of hexamers in identifying promoter sequences is also confirmed by our results (Table 5), which demonstrate high sensitivity and specificity (84% and 86%, respectively) in the case of FDAFSA. Also, the utilization of RTPs alone in discriminating promoter and non-promoter datasets resulted in highly improved sensitivity (94%) in the test datasets. However, unlike hexamers, the use of RTP information did not yield high specificity. This may be due to several reasons. First, the protein-coding sequences in the training dataset were obtained from multiple species. While this approach is useful for avoiding species specificity in the prediction method, it also means that there was no specific codon usage bias present in the collected protein-coding sequences. Second, our non-promoter dataset contained protein-coding sequences and other non-coding gene sequences such as tRNA and miRNA; such diversity may have caused noise in the RTP analysis, and it is quite possible that the RTP analysis would have shown more specificity for non-promoter sequences if the coding sequences had been taken from a single species. Nevertheless, we assumed from the results that RTPs may also have some other significance in the promoter regions of the genome, as it was found that the DNA curvature of promoters is higher than that of coding regions. Thus, distal elements may become proximal to the core promoter elements and contribute to the regulation of gene expression. However, a more detailed study is required in order to explore and identify the significance of RTPs in promoter regions.
This work was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Education, Science and Technology (2010-0003597).
- de Boer GJ, Testerink C, Pielage G, Nijkamp HJ, Stuitje AR: Sequences surrounding the transcription initiation site of the Arabidopsis enoyl-acyl carrier protein reductase gene control seed expression in transgenic tobacco. Plant Mol Biol. 1999, 39 (6): 1197-1207. 10.1023/A:1006129924683
- Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition. Genome Res. 1997, 7 (9): 861-878.
- Ohler U, Niemann H, Liao G, Rubin GM: Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics. 2001, 17 (Suppl 1): S199-206. 10.1093/bioinformatics/17.suppl_1.S199
- Knudsen S: Promoter2.0: for the recognition of PolII promoter sequences. Bioinformatics. 1999, 15 (5): 356-361. 10.1093/bioinformatics/15.5.356
- Solovyev V, Salamov A: The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 294-302.
- Shahmuradov IA, Solovyev VV, Gammerman AJ: Plant promoter prediction with confidence estimation. Nucleic Acids Res. 2005, 33 (3): 1069-1076. 10.1093/nar/gki247
- Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J: Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res. 2007, 35 (18): 6219-6226. 10.1093/nar/gkm685
- Civan P, Svec M: Genome-wide analysis of rice (Oryza sativa L. subsp. japonica) TATA box and Y Patch promoter elements. Genome. 2009, 52 (3): 294-297. 10.1139/G09-001
- Pandey SP, Krishnamachari A: Computational analysis of plant RNA Pol-II promoters. Biosystems. 2006, 83 (1): 38-50. 10.1016/j.biosystems.2005.09.001
- Abeel T, Saeys Y, Bonnet E, Rouze P, Van de Peer Y: Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res. 2008, 18 (2): 310-323. 10.1101/gr.6991408
- Gan Y, Guan J, Zhou S: A pattern-based nearest neighbor search approach for promoter prediction using DNA structural profiles. Bioinformatics. 2009, 25 (16): 2006-2012. 10.1093/bioinformatics/btp359
- Anwar F, Baker SM, Jabid T, Mehedi Hasan M, Shoyaib M, Khan H, Walshe R: Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach. BMC Bioinformatics. 2008, 9: 414. 10.1186/1471-2105-9-414
- Molina C, Grotewold E: Genome wide analysis of Arabidopsis core promoters. BMC Genomics. 2005, 6 (1): 25. 10.1186/1471-2164-6-25
- Abeel T, Van de Peer Y, Saeys Y: Toward a gold standard for promoter prediction evaluation. Bioinformatics. 2009, 25 (12): i313-i320. 10.1093/bioinformatics/btp191
- Kornev AP, Taylor SS, Ten Eyck LF: A helix scaffold for the assembly of active protein kinases. Proc Natl Acad Sci USA. 2008, 105 (38): 14377-14382. 10.1073/pnas.0807988105
- Ten Eyck LF, Taylor SS, Kornev AP: Conserved spatial patterns across the protein kinase family. Biochim Biophys Acta. 2008, 1784 (1): 238-243.
- Gorban AN, Zinovyev AY, Popova TG: Seven clusters in genomic triplet distributions. In Silico Biol. 2003, 3 (4): 471-482.
- Majewski J, Ott J: Distribution and characterization of regulatory elements in the human genome. Genome Res. 2002, 12 (12): 1827-1836. 10.1101/gr.606402
- Albrecht-Buehler G: The three classes of triplet profiles of natural genomes. Genomics. 2007, 89 (5): 596-601. 10.1016/j.ygeno.2006.12.009
- Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV: PlantProm: a database of plant promoter sequences. Nucleic Acids Res. 2003, 31 (1): 114-117. 10.1093/nar/gkg041
- Dong Q, Schlueter SD, Brendel V: PlantGDB, plant genome database and analysis tools. Nucleic Acids Res. 2004, 32 (Database): D354-359.
- Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, 36 (Database): D154-158.
- Boser BE, Guyon IM, Vapnik VN: A Training Algorithm for Optimal Margin Classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. 1992, 144-152. Pittsburgh: ACM Press
- Reese MG: Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput Chem. 2001, 26 (1): 51-56. 10.1016/S0097-8485(01)00099-7
- Prestridge DS: Predicting Pol II promoter sequences using transcription factor binding sites. J Mol Biol. 1995, 249 (5): 923-932. 10.1006/jmbi.1995.0349
- Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009, 25 (9): 1189-1191. 10.1093/bioinformatics/btp033
- Thakur V, Wanchana S, Xu M, Bruskiewich R, Quick W, Mosig A, Zhu XG: Characterization of statistical features for plant microRNA prediction. BMC Genomics. 2011, 12 (1): 108. 10.1186/1471-2164-12-108
- Wang Q, Wan L, Li D, Zhu L, Qian M, Deng M: Searching for bidirectional promoters in Arabidopsis thaliana. BMC Bioinformatics. 2009, 10 (Suppl): S29. 10.1186/1471-2105-10-S1-S29
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.