iTriplet, a rule-based nucleic acid sequence motif finder
© Ho et al; licensee BioMed Central Ltd. 2009
Received: 07 May 2009
Accepted: 29 October 2009
Published: 29 October 2009
With the advent of high throughput sequencing techniques, large amounts of sequencing data are readily available for analysis. Natural biological signals are intrinsically highly variable making their complete identification a computationally challenging problem. Many attempts in using statistical or combinatorial approaches have been made with great success in the past. However, identifying highly degenerate and long (>20 nucleotides) motifs still remains an unmet challenge as high degeneracy will diminish statistical significance of biological signals and increasing motif size will cause combinatorial explosion. In this report, we present a novel rule-based method that is focused on finding degenerate and long motifs. Our proposed method, named iTriplet, avoids costly enumeration present in existing combinatorial methods and is amenable to parallel processing.
We have conducted a comprehensive assessment on the performance and sensitivity-specificity of iTriplet in analyzing artificial and real biological sequences in various genomic regions. The results show that iTriplet is able to solve challenging cases. Furthermore we have confirmed the utility of iTriplet by showing it accurately predicts polyA-site-related motifs using a dual Luciferase reporter assay.
iTriplet is a novel rule-based combinatorial or enumerative motif finding method that is able to process highly degenerate and long motifs that have resisted analysis by other methods. In addition, iTriplet is distinguished from other methods of the same family by its parallelizability, which allows it to leverage the power of today's readily available high-performance computing systems.
Here we present a rule-based method to identify degenerate and long motifs in nucleic acid sequences. The widely accepted sequence motif finding problem formulation proposed by Pevzner and Sze in  is adopted in this article. We call an oligomer of length l, an l mer. A motif model is denoted by <l, d>, where l is the length of the motif, and d is the maximum number of mutations allowed with respect to the motif. An instance of a motif is termed d-mutant. Two d-mutants of the same motif must not differ by more than 2d differences. We call two l mers neighbors if their difference is ≤ 2d. Given n sequences, each of length L (could be of variable length), the goal is to locate the set of d-mutants in each sequence from the sample where the largest difference between any pair of d-mutants in the set is ≤ 2d. In the following we will briefly summarize two major motif finding approaches, viz. statistical and combinatorial. Readers who are interested in a more comprehensive survey about motif finding can refer to [2, 3].
Position weight matrix is often used as a statistical scoring system to identify biological signals from background. This technique implies that biological signals consist in part of conserved nucleotides that are critically important for their potency. As a result, motifs discovered by this approach tend to contain relatively invariant nucleotides at a few positions. Many transcription factor binding site prediction methods are developed based on this approach. Gibbs sampling and expectation maximization are typical techniques employed by MEME [4, 5], AlignACE , BioProspector , MDScan  and MotifSampler . The primary advantage of this approach is its speedy runtime and minimal memory consumption. However, statistical overrepresentation will vanish when the size of the motif to the number of mutations ratio decreases. One improvement of this approach is to incorporate phylogenetic information in background estimation. Well-known examples of this approach include FootPrinter  and PhyloGibbs . However, such an approach is challenged by multiple substitutions occurring in distant species or motif searching in a single species. Some other methods train a Markov model to capture nucleotide dependency information of known binding sites in order to make prediction for unseen cases. One extension of the Markov model was reported in . The authors incorporated several features, such as gaps and polyadic sequence elements, to handle diversified transcription factor binding sites.
An alternative to a statistical approach is the combinatorial or enumerative approach  where the observable biological signals are believed to be the variation of a hidden motif, and they do not exhibit conspicuous conservation at any particular position, and yet they are similar to each other. This approach is suitable for families of biological signals where the targeting proteins do not rely on a few conserved nucleotides at fixed positions. Instead the overall binding affinity is determined cooperatively by nucleotides in a region. Many such examples are found in precursor RNA processing signals including the pyrimidine-rich region near 3' splice sites and the U/GU-rich region downstream of polyadenylation sites. One fundamental problem faced by the enumerative approach is the exponential growth of computing resources when the size of the motif increases. To circumvent this, existing methods such as WINNOWER , MotifEnumerator , MITRA , TIERESIAS , Gemoda  and PMSprune , employ various elegant pruning strategies to abandon unpromising pursuits as early as possible.
Both enumerative and statistical approaches have proven to be valuable in analyzing real biological examples and both approaches are complementary to each other. In most situations when little prior knowledge is known about the motif, we believe both approaches should be considered. Since our focus is on solving degenerate and long motifs, we adopt the enumerative approach that is guaranteed to find the optimal motif by applying a novel rule-based algorithm to identify all optimal motif candidates without the expense of exploring the entire 4l space exhaustively. In addition, our algorithm is designed to be highly parallelizable so as to exploit today's parallel computing technology in handling massive biological data. As a proof of concept, we will evaluate our algorithm using the simulated data described in . Also we will show our method is able to identify motifs in real promoter sequences, and 5' and 3' untranslated regions (UTR) from different species. Results show that our method can solve highly degenerate and/or longer motifs that overwhelm the capabilities of other methods. Furthermore, we have compared the prediction accuracy of our method with the statistical motif finding methods mentioned above and find that our method is equal to and sometimes better than these methods. Besides in-silico simulations, we have also verified our prediction of downstream polyadenylation motifs for three human genes using a dual Luciferase assay. Our software is developed in C++ and standard template library (STL). It has been tested on Linux platform. Interested readers can download the software freely from this website http://www.rci.rutgers.edu/~gundersn/iTriplet.
Our rule-based enumerative algorithm is named iTriplet. It stands for inter-sequence triplets. A triplet consists of three neighboring l mers (less than 2d differences from each other) sampled from three different sequences. The 'inter-sequence' part of the iTriplet algorithm systematically explores tripartite combinations of l mers from different sequences in order to identify motif(s) that span all sequences in the sample. The span of a motif refers to the number of sequences containing its d-mutant. For clarity, we will explain our method by limiting to only one motif in the sample, and every sequence contains at least one occurrence d-mutant of the motif even though our method can deal with multiple motifs and 10-20% of contamination. We will describe our iTriplet algorithm in two parts: the 'inter-sequence' part will be discussed first, followed by the Triplet algorithm.
The inter-sequence part of iTriplet
If sufficient number of sequences are given and the motif model is not highly degenerate, i.e. small d with respect to l, the likelihood that an l-sized motif can span through all sequences by chance is rare. Based on this insight, we utilize the span of a motif as the indicator to identify unusual motifs in a sample.
The inter-sequence part of iTriplet consists of two stages: initialization stage and expansion-pruning stage. Below is the description of the procedure.
Given a set of n sequences and a motif model <l, d>, randomly designate two sequences from the sample as reference sequences, namely R1 and R2, and the rest as non reference sequences S1, S2, ..., Sn-2.
Expansion-pruning stage: Randomly select an unprocessed non-reference sequence, say Sj. Similar to the initialization stage, identify all triplets based on r1, l mers from sequences R2 and Sj. Identify the set of common motifs of all triplets using Triplet algorithm and store them in the hash table. Prune the hash table by removing motifs with span not covering all processed sequences so far. If the hash table is not empty after pruning, repeat the expansion-pruning stage with the next unprocessed non-reference sequence. If the hash table is empty after pruning, return to the initialization stage, randomly pick a different l mer (r1) from R1, and repeat the same two-stage inter-sequence process again until all l mers in R1 have been processed. If all non-reference sequences have been processed and the hash table is not empty, then return motif(s) in the hash table to the calling program.
As described above, the processing of different l mer r1 in R1 are completely independent of each other. It means that they can be executed simultaneously wherein not even a single synchronization point is required. Therefore, given M processors, the algorithm can trigger up to (M-1) concurrent processes simultaneously. Theoretically, the performance gain by parallelizing this step is (M-1) times for a M-processor system where one processor is designated for overall coordination purposes. Our current parallel version of iTriplet is implemented based on this idea.
The Triplet part of iTriplet
The purpose of this part of the algorithm is to uncover the complete set of motifs common to all members of the triplet in a deterministic and efficient way. The clues solely come from the similarities and differences among the three l mers rather than the enumeration of all possible l mers. It is efficient because the number of motifs shared among all three l mers should be small. By example, the estimated probability of any three l mers to share at least one common motif for models <12,3> and <30,9>, is 5.47 × 10-4 and 2.97 × 10-4, respectively.
Triplet algorithm consists of three stages: 1) centroid l mer construction, 2) exploratory scheme discovery, and 3) motif generation. Below is the description:
Stage 1: centroid l mer construction. Given a triplet of three l mers from the calling program, identify the three column types Pi, Pmn and Pnc as discussed above. Check if the triplet satisfies this inequality: l-d ≤ |P i | + |P mn |*2/3+|P nc |*1/3 (for derivation, see Additional file 1) where |Pi,|, |Pmn| and |Pnc| denote the number of Pi, Pmn and Pnc patterns respectively. If the given triplet fails to satisfy this inequality, return no common motif and exit. Otherwise take these three steps to construct the initial move and score vectors: i) take the common nucleotides from columns Pi, ii) take the dominant nucleotides from Pmn, and iii) for columns Pnc, take the nucleotides from the l mer which is currently farthest from the work-in-progress centroid l mer produced by the previous two steps. Pass the newly created move and score vectors to stage 2 for further processing.
Stage 2: exploratory scheme discovery. Based on the excess score(s) (> l-d) in one or more of the three values in the initial score vector, formulate alternative ways to select nucleotides from Pi, Pmn and Pnc patterns through the 61 rules (will be discussed later). An execution of a rule produces a new set of move vector(s) and its associated score vector. Repeat stage 2 processing of the new move vector(s) until all newly generated score vector(s) becomes [l-d, l-d, l-d] i.e. no excess score. Pass all move and score vectors generated in this stage to stage 3.
Stage 3: motif generation. Generate motif by going through each value in the move vector, and select the specified number of column patterns and associated nucleotides accordingly. When all move vectors are processed, return all motifs to the calling program.
Five basic operations for triplet processing of iTriplet algorithm
Examples based on Figure 2D if possible
Instead of choosing the dominant nucleotide from Pmn column, choose the odd nucleotide.
sac(P12), take 'G' at position 3 from l 3 instead of 'C' from l 1 or l 2
Instead of choosing the dominant or odd nucleotide from Pmn column, choose nucleotides complementary to them.
Apply on the 2nd column, compl(P23), take nucleotides complementary to 'G' and 'T', i.e. choose 'A' or 'C' for position 2.
Instead of taking nucleotide from l meri, choose from l merj in a Pnc column.
Apply nc(3,1) to position 11. Instead of choose 'A' from l 3, choose 'T' from l 1 at position 11.
Instead of taking nucleotide from l meri, choose from the complementary nucleotide of a Pnc column.
Apply nc(3,0) to position 11. Instead of choose 'A' from l 3, assign the complementary nucleotide 'G' to position 11.
Instead of keeping the nucleotide identical to all lmers in the triplet, take the three complementary nucleotides.
Apply sac_i(Pi) to position 1. Take 'A', 'G' or 'T' instead of 'C' at position 1.
Regarding stage 3, one move vector may generate more than one motif. For the example in Figure 2D, the new move vector due to rule 13 is [5,2,1,2,1,0,0,1,0,0,0]. The first value specifies to select the nucleotides from the five P i column patterns which are found in positions 1, 5, 6, 8 and 10 (see Figure 2B). Since there are exactly five P i column patterns, only one way is possible. The second value of the move vector specifies to choose dominant nucleotides from two P12 column patterns out of three and to choose the odd nucleotide from the remaining one. It will generate three possibilities. The rest of the values in the move vector will be processed similarly.
We have given the full description of iTriplet algorithm. Regarding the correctness of the algorithm, at this stage, we have not come up with a theoretical proof yet, however we have conducted extensive testing of more than 14,000 cases including models <11,2>, <12,3>, <13,3>, <15,4>, <28,8> and <40,12>; over 2,000 cases per model. In each case, we had generated 20 sequences each of length 600 with all nucleotides occurring equally likely. In each sequence, a single l-size d-mutant was planted at a random location. After each run, we checked whether the returned motif from iTriplet was the same as planted or not. iTriplet performed correctly for all cases.
Time and Space Complexities of iTriplet
The inter-sequence part of iTriplet mainly iterates all combinations of triplets among sequences. Therefore, for model <l, d>, we estimate the time complexity of the inter-sequence part of iTriplet to be O(nL3pl) where n, L and p are the number of sequences, length of sequence and probability to form a triplet that shares at least one motif. As discussed before, we estimate p should be in the range of 10-4, and L should normally be 102. Therefore, the effective time complexity of the inter-sequence part ranges from O(nLl) to O(nL2l). Stage 2 of Triplet part should generate all possible score vectors as long as the score value between each l mer and the centroid l mer is at least l-d. In the worst case scenario, there are d3 score vectors. The generation of actual motifs based on the move vector in step 3 should depend on the size of the motif l. Therefore the time complexity of Triplet is O(d3l). Hence the overall time complexity of iTriplet is O(nL3pl2d3). For PMSprune, the time complexity is O(nL2N(l, d)), where N(l, d) is . After eliminating the common terms, the main difference lies in the growth of Lpl2d3 and N(l, d) in iTriplet and PMSprune, respectively. When the motif model is small, N(l, d) is smaller than Lpl2d3. However, when l increases, the combinations of N(l, d) grows exponentially. iTriplet's space complexity depends on the degeneracy of the model, therefore it is O(N(l, d)) before pruning. After pruning, the space requirement will shrink.
Results and Discussion
Methods comparison on simulated datasets.
For short motifs (<16 nucleotides) iTriplet is comparable to the fastest (PMSprune) and is significantly faster than MotifEnumerator and RISOTTO. When motif length is longer than 16, all other methods take longer than 6 hours to process. Note that iTriplet is able to process highly degenerate <18,6> and <24,8> models which cannot be handled by these other three methods as well as other statistical based methods such as MEME, MotifSampler and BioProspector. Based on these results, we learned that the performance of all methods depends on l and d, but to a different extent. Intriguingly, the runtime of PMSprune quadrupled, though still very fast, when l increased from 12 to 15 even though the neighborhood probability remained relatively at the same level. A similar trend is also observed in RISOTTO but with even higher fold increment in runtime. Such a phenomenon is not observed in our method. When neighborhood probability is doubled in models <12,3> versus <14,4>, and <14,4> versus <16,5>, the runtime of PMSprune increased 15 and 13.5 times respectively and RISOTTO increased 12 and 10 times respectively whereas iTriplet only increased 6 and 9 times, respectively. Based on these observations, we can understand that the algorithms employed by RISOTTO and PMSprune are quite sensitive to both l and d even when the neighboring probability remains at the same level. Thus RISOTTO and PMSprune take a longer time to search for the optimal motif; whereas the combined effect of l and d on performance was less severe for iTriplet. This explains why RISOTTO and PMSprune encountered difficulty in handling longer motif models. This does not exclude that iTriplet is unaffected by large d (high degeneracy). But one distinctive feature of our algorithm is that it can split the task into smaller subtasks which can be run independently in parallel. When comparing sequential and parallel versions of iTriplet, the parallel version averaged 1.77 times performance gain in a three-node cluster that is quite close to the theoretical gain 2.0. Testing based on the simulated data revealed that different methods have different tradeoffs in tackling the general <l, d> motif problem therefore further investigation is still needed to cope with various challenges of this problem.
Real biological sequences
iTriplet prediction using real biological sequences.
Preproinsulin (IEB1) promoter+5' UTR
Transfac ID: R04457
DHFR (promoter+5' UTR)
ATTTCG TG GGCA
TGCAATTTC GCGCCA AAC
Transfac ID: R01928
Metallothionein promoter+5' UTR
TTTTGCRC TCG YCCC
CTCTGC AC ACGG CCC
Transfac ID: R08298
c-fos serum response element promoter+5' UTR
CCA AATTT G
Transfac ID: R00466
3'UTR Regulatory Elements
iTriplet Prediction only
Cytoplasmic Polyadenylation element (CPE)
TTTTAT and TTTTAAT
Pumillio binding element (PBE)
T KT WAATA
Multiple motifs are often identified by iTriplet for real biological sequences. Four reasons account for this: 1) the number of sequences considered is small, mostly 4 in our test therefore resulting in a higher chance to encounter random span, 2) a naturally occurring recognition site is not necessarily represented by one consensus, 3) it is possible for the biological sequence to carry more than one signal especially in the 3' UTR, and 4) the presence of low complexity repeats.
Therefore we need a scoring system to filter out random from genuine motifs. Since only a small number of sequences are given, the set of true motif instances must resemble each other more than a set of random l mers; otherwise no conclusion can be made. As we have discussed in the inter-sequence algorithm section, if members of the triplet are very similar to each other, the intersection will become big, i.e. high numbers of common motifs. Based on this property, we derived a straightforward scoring system based on the numbers of common motifs uncovered to support whether the finding is statistically significant. Due to this, the 5' and 3' overlapping neighbors of the true motif are often included as part of the prediction as well. Therefore in some cases of the genes listed in Table 3, the predicted motif is longer than the model specified. Each prediction is a consensus of a number of common motifs. The method of constructing the consensus is similar to the frequency plot of Weblogo . Nucleotides with frequency at a position greater than 30% will be included in the consensus sequence. As can be seen from Table 3, our predictions for promoter and 5' UTR sequences, and 3' UTR regulatory elements are largely consistent with published experimental data.
Sensitivity and specificity test
Prediction accuracy of iTriplet versus four others motif finding methods.
In vitro verification of predicted polyA downstream elements
To test whether the TCTGATTT motifs identified by iTriplet were functional, the dual Luciferase reporter system was used where Renilla Luciferase mRNA contained the entire 3'UTR plus sequences past the PAS of the gene of interest. A co-transfected Firefly Luciferase reporter was included that serves as an internal normalization control. As diagrammed in Figure 3, the plasmid pRL-GAPDHwt was made from a standard pRL-SV40 Renilla expression plasmid by replacing the SV40-derived 3'UTR and downstream polyA signal sequences with the human GAPDH 3'UTR and polyA signal region (NM_002046) including 116 nt past the polyA site. iTriplet predicted that GAPDH has a motif we call GAPDH Motif A that would potentially be important for polyA site activity. To determine if GAPDH Motif A is functional, we mutated it as shown to make plasmid pRL-GAPDHmt. Plasmids were transfected into HeLa cells and Luciferase activity was measured; values for Renilla Luciferase were normalized to those obtained from the co-transfected Firefly Luciferase control plasmid. The pRL-GAPDHmt plasmid expresses 43% less Renilla Luciferase than pRL-GAPDHwt, indicating Motif A enhances Renilla Luciferase expression by about 2.2-fold.
The same analysis was done in panels B and C but for the human RAF and human U1A genes, respectively. As can be seen the RAF Motif A enhances expression 3.2 fold and the U1A Motif A enhances expression by 5.1 fold. Here we have demonstrated the predictive power of iTriplet for these three genes however we do not exclude the existence of other binding sites that can also affect polyA activity of these genes.
We have presented a novel rule-based algorithm called iTriplet to solve the challenging degenerate and long motif finding problem that was unsolved before. In addition, we have confirmed our prediction for real biological signals experimentally. The runtime of iTriplet is comparable to other well-known methods of the same design philosophy and is significantly better at analyzing longer motifs (>16 nucleotides). To our knowledge, iTriplet is the most parallelizable motif finding method in the family of guaranteed optimal motif finding algorithms developed so far. Furthermore we have shown that our method is very competitive in prediction accuracy when compared with other popular motif finding methods. Overall, our method has the superiority like other exact optimal motif finding methods to find the optimal motif in the absence of statistical overrepresentation and yet without sacrificing prediction accuracy. That said, no single method or approach is able to solve the general <l, d> motif problem completely in terms of guaranteed solution, speed, memory consumption and prediction accuracy. Thus, further research effort is needed to overcome various hurdles of this problem.
We generated multiple sets of simulated sequences according to the <l, d> motif model formulated by Pevzner and Sze . Each dataset consists of 20 sequences, each 600 nucleotides long. All nucleotides occur equally likely. In each sequence, a single l-size d-mutant is planted at a random location. We have prepared datasets for a wide range of <l, d> motif models, i.e. <11,2>, <12,3>, <13,3>, <14,4>, <15,4>, <16,5>, <18,6>, <19,6>, <24,8>, <28,8>, <30,9>, <38,12>and <40,12>.
Untranslated region sequence data
In addition to simulated data, we also prepared and tested several sets of real biological data that can be split into two groups, one 5' upstream of the start codon, i.e. 5' UTR and promoter; and the other from the 3' UTR. For the 5' UTR-promoter group, we chose four genes that are commonly tested in other motif finding algorithms [10, 14, 17], namely, preproinsulin, DHFR, metallothionine, and c-fos. Homologous regions from four species were included for analysis using the Homologene database http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene from NCBI. To obtain the upstream promoter region, BLAT  was used to map the cDNA to the species genome provided by Genome browser . Based on the 5' starting point of the cDNA, we then extracted promoter sequence from the genome. Further details of this sequence set are found in the Additional file 1.
Another set of real biological sequences is taken from the 3' UTR where AU-rich elements (AREs), cytoplasmic polyadenylation elements (CPEs), and Pumillio binding elements (PBEs)  were chosen. The AREs were derived from 30 experimentally validated human and mouse 3' UTRs . These genes were also confirmed by the ARE database ARED 2.0 http://brp.kfshrc.edu.sa/ARED/. Based on the accession numbers provided by ARED, we retrieved the cDNA sequences from NCBI's RefSeq database . The 5' end of the 3'UTR begins right after the stop codon, However, the 3' end of the 3' UTR is not obvious because we have found that most of the cDNA sequences deposited in RefSeq database lack a poly(A) tail. In order to accurately determine the 3' end of the 3' UTR, we utilized expressed sequence tag (EST) data from the UCSC Genome Browser . We first mapped each cDNA to the genome using BLAT. The true end of the 3' UTR should coincide with the endpoint of the EST. The more ESTs that end at the same spot as the cDNA, the higher confidence we have about the true end of the 3' UTR. The set of sequences we obtained are variable in length ranging from 92 to 1608 bases bringing the total sequence space to 23,022 bases. For CPEs and PBEs, we have used the five cyclin genes, B1, B2, B3, B4, and B5 from Xenopus laevis . Complete annotation of the sequences can be found in the Additional file 1.
We compared the performance of our method with three other methods with the same enumerative design philosophy, viz. MotifEumerator, RISOTTO and PMSprune. Source codes were downloaded from these sites, MotifEnumerator from http://faculty.cs.tamu.edu/shsze/motifenumerator/, RISOTTO from http://kdbio.inesc-id.pt/~asmc/pub/software/RISO/riso-me-src.zip, and PMSprune from http://www.engr.uconn.edu/~jid02003/Jaime/pmsprune.c. They were compiled in the Linux ×86 platform according to the instructions documented in the respective websites.
Transfection and Luciferase assays
Cell culture and transfections were done as previously described in . For Luciferase assays, the cells were harvested after 24 hours and Luciferase measured using the Promega dual Luciferase kit (Promega, Madison, WI) measured on a Turner BioSystems Luminometer (Turner BioSystems, Sunnyvale, CA).
Sensitivity and specificity test
For the sensitivity and specificity test, we adopted the three-level (nucleotide, binding site, and motif) testing framework proposed by the Kihara group . Two sets of data, ECRDB70 and ECRDB62A, were downloaded from their website http://dragon.bio.purdue.edu/pmotif. These data were originally derived from the RegulonDB database . The ECRDB62A dataset comprises 713 intergenic sequences containing binding sites for 62 transcription factors in E. Coli K-12. We filtered out duplicated sequences, transformed reverse strands into forward direction, and dropped transcription factors with less than three binding sequences. The final reconstructed dataset contains 379 distinct sequences from 36 transcription factors. At the nucleotide and binding site level, four different assessments were performed. Sensitivity (Sn) is defined as , where TP, FN stands for true positive and false negative respectively. Specificity (Sp) is defined as , where FP is false positive. We followed two other assessments that were described in  to combine Sn and Sp. Performance coefficient (PC) is defined as , which was originally proposed in [1, 22]. The last assessment is called F-measure (F), which tends to penalize the imbalance of Sn and Sp. F is defined as . Both PC and F fall into the range of [0,1], with value 1 indicating perfect prediction. In addition to the nucleotide and binding site levels, the Kihara group proposed two other accuracy measurements viz. sequence accuracy (sSr) and motif accuracy (mSr). sSr is defined as where Ns is the number of sequences having their motifs correctly predicted, and N is the total number of binding sequences of a transcription factor. The overall sSr is the average sSr of all transcription factors. mSr is defined as , where Np is the number of transcription factors with at least one correctly predicted binding site in the binding sequence set and M is the total number of transcription factors in the dataset. We compared our method with WEEDER  and the top three best-performing methods previously evaluated in : MEME, BioProspector and MotifSampler. MEME, BioProspector, MotifSampler and WEEDER were download from http://meme.nbcr.net/downloads/, http://motif.stanford.edu/distributions/bioprospector/, http://homes.esat.kuleuven.be/~thijs/download/linux_3.2/MotifSampler and http://220.127.116.11/modtools/downloads/weeder1.3.1.tar.gz respectively. For WEEDER, we specified the organism to be E. Coli K12 "BEC" and the type of analysis "large".
Availability and Requirements
Source code of iTriplet is freely available and requires no license requirement. The current version of iTriplet is only available on RedHat linux platform. It requires GNU g++ compiler version 3.4.4 or above and Python version 2.5.1 or above. Interested readers can download the software from our website via the following link http://www.rci.rutgers.edu/~gundersn/iTriplet. Detailed information about how to build and run the software, description of parameters and output can be found in the captioned website.
This work is supported by NIH grant GM57286 to SIG and a fellowship from Rutgers IGERT program on Biointerfaces to ESH from NSF DGE 0333196 (PI: P. Moghe).
- Pevzner PA, Sze SH: Combinatorial approaches to finding subtle signals in DNA sequences. Proc Int Conf Intell Syst Mol Biol. 2000, 8: 269-78.PubMedGoogle Scholar
- Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007, 8 (Suppl 7): S21-PubMedPubMed CentralView ArticleGoogle Scholar
- Rajasekaran S: Algorithms for motif search. Handbook of Computational Biology. Edited by: Srinivas Aluru. 2006, 37: 1-21. Chapman & Hall/CRCGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. 1994, 28-36.Google Scholar
- Bailey TL, Elkan C: Unsupervised Learning of Multiple Motifs in Biopolymers using EM. Machine Learning. 1995, 21 (1-2): 51-80.View ArticleGoogle Scholar
- Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology. 1998, 10: 939-45.View ArticleGoogle Scholar
- Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001, 6: 127-38.Google Scholar
- Liu X, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology. 2002, 20: 835-9.PubMedView ArticleGoogle Scholar
- Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouzé P, Moreau Y: A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. Journal of Computational Biology. 2002, 9: 447-64.PubMedView ArticleGoogle Scholar
- Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Research. 2002, 5: 739-48.View ArticleGoogle Scholar
- Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology. 2005, 7: e67-View ArticleGoogle Scholar
- Wang J, Hannenhalli S: Generalizations of Markov model to characterize biological sequences. BMC Bioinformatics. 2005, 6: 219-PubMedPubMed CentralView ArticleGoogle Scholar
- Sze SH, Zhao X: Improved pattern-driven algorithms for motif finding in DNA sequences. Proceedings of the 2005 Joint RECOMB Satellite Workshops on Systems Biology and Regulatory Genomics. Lecture Notes in Bioinformatics. 2006, 4023: 198-211.Google Scholar
- Eskin E, Pevzner PA: Finding composite regulatory patterns in DNA sequences. Bioinformatics. 2002, 18 (Suppl 1): S354-63.PubMedView ArticleGoogle Scholar
- Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics. 1998, 14 (1): 55-67.PubMedView ArticleGoogle Scholar
- Jensen KL, Styczynski MP, Rigoutsos I, Stephanopoulos GN: A generic motif discovery algorithm for sequential data. Bioinformatics. 2006, 22 (1): 21-8.PubMedView ArticleGoogle Scholar
- Davila J, Balla S, Rajasekaran S: Fast and practical algorithms for planted (l, d) motif search. IEEE/ACM Trans Computational Biology & Bioinformatics. 2007, 4: 544-52.View ArticleGoogle Scholar
- Pisanti N, Carvalho AM, Marsan L, Oliveira AL, Sagot MF: RISOTTO: Fast extraction of motifs with mismatches. Proceedings of the 7th Latin American Theoretical Informatics Symposium. 2006, 3887: 757-768.Google Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Research. 2004, 14: 1188-1190.PubMedPubMed CentralView ArticleGoogle Scholar
- Salgado H, Gama-Castro S, Martinez-Antonio A, Diaz-Peredo E, Sanchez-Solano F, Peralta-Gil M, Garcia-Alonso D, Jimenez-Jacinto V, Santos-Zavaleta A, Bonavides-Martinez C, Cllado-Vides J: RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Research. 2004, 32: D303-306.PubMedPubMed CentralView ArticleGoogle Scholar
- Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Research. 2005, 33 (15): 4899-913.PubMedPubMed CentralView ArticleGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology. 2005, 23 (1): 137-44.PubMedView ArticleGoogle Scholar
- Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research. 2004, W199-203. 32 Web Server
- Tian B, Hu J, Zhang H, Lutz CS: A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Research. 2005, 33 (1): 201-12.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao J, Hyman L, Moore C: Formation of mRNA 3' ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiol Mol Biol Rev. 1999, 63 (2): 405-45.PubMedPubMed CentralGoogle Scholar
- Salisbury J, Hutchison KW, Graber JH: A multispecies comparison of the metazoan 3'-processing downstream elements and the CstF-64 RNA recognition motif. BMC Genomics. 2006, 7 (1): 55-PubMedPubMed CentralView ArticleGoogle Scholar
- Chen F, Wilusz J: Auxiliary downstream elements are required for efficient polyadenylation of mammalian pre-mRNAs. Nucleic Acids Research. 1998, 26 (12): 2891-8.PubMedPubMed CentralView ArticleGoogle Scholar
- Perez Canadillas JM, Varani G: Recognition of GU-rich polyadenylation regulatory elements by human CstF-64 protein. EMBO J. 2003, 22 (11): 2821-30.PubMedPubMed CentralView ArticleGoogle Scholar
- Kent WJ: BLAT - The BLAST-Like Alignment Tool. Genome Research. 2002, 12 (4): 656-664.PubMedPubMed CentralView ArticleGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Research. 2002, 12 (6): 996-1006.PubMedPubMed CentralView ArticleGoogle Scholar
- Piqué M, López JM, Foissac S, Guigó R, Méndez R: A combinatorial code for CPE-mediated translational control. Cell. 2008, 132 (3): 434-48.PubMedView ArticleGoogle Scholar
- Chen CY, Shyu AB: AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem Sci. 1995, 11: 465-70.View ArticleGoogle Scholar
- Bakheet T, Frevel M, Williams BR, Greer W, Khabar KS: ARED: human AU-rich element-containing mRNA database reveals an unexpectedly diverse functional repertoire of encoded proteins. Nucleic Acids Research. 2001, 1: 246-54.View ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research. 2007, D61-5. 35 Database
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Research. 2003, 31 (1): 51-54.PubMedPubMed CentralView ArticleGoogle Scholar
- Goraczniak R, Gunderson SI: The regulatory element in the 3'-untranslated region of human papillomavirus 16 inhibits expression by binding CUG-binding protein 1. J Biol Chem. 2008, 283 (4): 2286-96.PubMedView ArticleGoogle Scholar
- Wingender E, Dietze P, Karas H, Knüppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research. 1996, 24 (1): 238-41.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.