MCOIN: a novel heuristic for determining transcription factor binding site motif width
- Alastair M Kilpatrick^{1}Email author,
- Bruce Ward^{2} and
- Stuart Aitken^{1}
https://doi.org/10.1186/1748-7188-8-16
© Kilpatrick et al.; licensee BioMed Central Ltd. 2013
Received: 18 December 2012
Accepted: 24 June 2013
Published: 27 June 2013
Abstract
Background
In transcription factor binding site discovery, the true width of the motif to be discovered is generally not known a priori. The ability to compute the most likely width of a motif is therefore a highly desirable property for motif discovery algorithms. However, this is a challenging computational problem as a result of changing model dimensionality at changing motif widths. The complexity of the problem is increased as the discovered model at the true motif width need not be the most statistically significant in a set of candidate motif models. Further, the core motif discovery algorithm used cannot guarantee to return the best possible result at each candidate width.
Results
We present MCOIN, a novel heuristic for automatically determining transcription factor binding site motif width, based on motif containment and information content. Using realistic synthetic data and previously characterised prokaryotic data, we show that MCOIN outperforms the current most popular method (E-value of the resulting multiple alignment) as a predictor of motif width, based on mean absolute error. MCOIN is also shown to choose models which better match known sites at higher levels of motif conservation, based on ROC analysis.
Conclusions
We demonstrate the performance of MCOIN as part of a deterministic motif discovery algorithm and conclude that MCOIN outperforms current methods for determining motif width.
Keywords
Introduction
Recent advances in biology have led to a huge increase in the amount of data available for study. Of considerable interest to biologists are transcription factor binding site (TFBS) motifs; short DNA sequence patterns that have important roles in gene transcription and regulation. Discovery and further analysis of these sequences remains an important task in the wider challenge of understanding the mechanisms of gene expression (examples from the recent ENCODE project include [1–3]). Consequently, there is much continuing interest in developing algorithms which can automatically discover TFBS motifs [4].
Automatically determining the width of a novel TFBS motif is a desirable property for motif discovery algorithms since the true motif width is generally not known a priori. An ideal algorithm would be executed over a range of reasonable candidate widths and return the most likely result based on some criterion. This is an important but challenging computational problem, as the likelihood function maximised by motif discovery algorithms cannot be used directly to compare models with different motif widths [5]. The difficulty partially stems from the fact that the maximum value of the joint likelihood of the model given the data and the missing information is bound to increase with increasing motif width as a consequence of the increasing number of free parameters [5–7]. The complexity of the problem is increased when additional constraints on the parameters (e.g. the palindrome constraint in the popular MEME algorithm) are employed, as the maximum likelihood value of models with parameter constraints will be lower than unconstrained models of the same motif width. To some degree, this problem corresponds to the more general problem of model selection in statistics. A number of general model selection criteria which incorporate adjustments for model dimensionality have been used in other areas with success [8, 9]. However, these criteria have generally not performed well at determining motif width in known datasets [5].
Attempts at a heuristic to automatically determine motif width in a deterministic (Expectation-Maximization, or EM-based) algorithm have included functions based on the maximum likelihood ratio test (LRT) [11], methods based on V -fold cross-validation [7] and the Bayesian Information Criterion (BIC) [12]. However, in practice, estimators based on the E-value of the resulting multiple alignment are used instead [4]. The E-value of the multiple alignment of predicted motif occurrences is an approximate p-value for testing the hypothesis that the predicted motif occurrences were generated from the predicted model against the null hypothesis that the predicted occurrences were generated by the background model. Typically, E-values are calculated for models at each candidate width and the model with the minimum E-value chosen.
Here, we validate a novel heuristic for automatically determining the width of a motif in deterministic motif discovery algorithms, based on motif containment and information content (MCOIN). Based on tests with previously characterised prokaryotic TFBS motifs, we show that MCOIN outperforms the E-value of the resulting multiple alignment as a predictor of motif width, using mean absolute error. MCOIN is also shown to improve the overall correctness of results, based on receiver operating characteristic (ROC) analysis. Finally, we show that the performance of MCOIN will improve as the performance of the core motif discovery algorithm improves.
Approach
The MCOIN heuristic is based on the concepts of motif containment and mean information content per column. If it is assumed that the motif discovery algorithm discovers the true motif within the dataset as well as possible at every candidate width {w_{ min },…,w_{ max }}, then the algorithm discovers the true motif exactly at the true width w^{∗}. It follows that, at candidate widths smaller than the true width (that is, {w_{ min },…,w^{∗} - 1}), only a portion of the true motif is discovered while at candidate widths larger than the true width (that is, {w^{∗} + 1,…,w_{ max }}), the full motif is discovered, along with a number of background positions. Clearly, these models must be similar and are describing the same underlying motif. If we know that the models for widths w-1 and w are describing the same motif and also assume that model selection criteria (e.g. BIC) will choose the shorter model due to it having fewer free parameters, then the model with width w-1 can be removed from the set of candidate models as the width-w model also describes the same motif.
Retaining the assumption that the motif discovery algorithm discovers the true motif as well as possible at every candidate width, it follows that the model at the true width w^{∗} will also be removed as a result of it being contained within the model at width w^{∗} + 1. The result of discarding models based only on containment would be to discard all but the longest model. Clearly, we would prefer models at widths w_{ min } to w^{∗} - 1 to be discarded in favour of the model at width w^{∗}, but this model not to be discarded in favour of longer models. Calculating the mean information content per column (IC/col) for each model allows a method of stopping containment at widths greater than w^{∗}. If, for example, the IC/col of the model at width w^{∗} is B bits, the model at width w^{∗} + 1 will have these same columns plus an additional background column, which will have a very low information content (if each nucleotide in the background model is equiprobable, the information content of this column will be 0 bits); the low information content of this additional background column will make the IC/col of the model at w^{∗} + 1 less than B bits. We can therefore modify our model selection process, discarding a shorter model in favour of a longer model only if the shorter model is contained within the longer model and the IC/col of the longer model is similar to that of the shorter model.
At a high level, this is implemented as follows: the position weight matrix (PWM; the probabilistic model of a motif used in motif discovery algorithms) of the shortest model (w_{ min }) is tested against each longer model (w_{ min } + 1,…,w_{ max }), calculating the mean root Jensen-Shannon divergence per column (JSD/col) for each comparison. The Jensen-Shannon divergence [13] is used a measure of similarity; intuitively, the lower the JSD/col, the more similar the PWMs are. The IC/col ratio of the longer model to the shorter model is then calculated. If this is significantly lower than 1, we can assume that the additional column in the longer model is not information-rich and the longer model is longer than the true motif width. If the shorter model is ‘contained’ within the longer model (that is, the minimum JSD/col is smaller than some similarity threshold t_{ sim }, where 0 ≤ t_{ sim } ≤ 1) and the models have similar information (that is, the IC/col ratio of the longer model to the shorter model is greater than some information threshold t_{ info }), the shorter model is removed from the set of candidate models. The process is repeated for model widths w_{ min } + 1 to w_{ max } - 1 (the longest model is always kept in the set of candidate models). The remaining model with the lowest BIC score is chosen as our best estimate of motif width.
Method
where I(k,a) is an indicator function which is 1 if and only if a = a_{ k } and 0 otherwise and x_{ i,j } is the nucleotide in the j th position of sample x_{ i }. Let x_{ pred } be the set of non-overlapping predicted motif occurrences and n_{ pred } be the number of predicted motif occurrences |x_{ pred }|.
Calculating the BIC for candidate models
We run our motif discovery algorithm over a number of reasonable candidate widths and return a model ϕ = {θ,λ} for each width. We assume that the unknown true motif width w^{∗} is within the range of tested candidate widths, that is, w_{ min } ≤ w^{ ∗ } ≤ w_{ max }.
where M is the number of free parameters in the model, equivalent to 3(w + 1). We now have a set of models$\left\{{\varphi}^{\left({w}_{\mathit{\text{min}}}\right)},\dots ,{\varphi}^{\left({w}^{\ast}\right)},\dots ,{\varphi}^{\left({w}_{\mathit{\text{max}}}\right)}\right\}$; each model with its own BIC score, based on its log likelihood (calculated using its set of predicted sites) and the number of model parameters. We now apply MCOIN, as described in the next section.
MCOIN heuristic
MCOIN relies on two threshold parameters, t_{ sim } and t_{ info }. The value of t_{ sim } may be chosen to be anywhere between 0 and 1. Choosing a good value for t_{ sim } is important. If this value is too small, smaller models are required to match longer models more exactly before being discarded. Therefore, fewer models are discarded and MCOIN tends to choose models of shorter widths, leading to an underestimation of the true motif width. In contrast, if the value of t_{ sim } is too large, shorter models may be discarded in favour of longer models when they are dissimilar, leading to an overestimation of true motif width. The optimal value of t_{ sim } was calculated using tests on the realistic synthetic data collection described in the Data section; root mean squared error was minimised at t_{ sim } = 0.32. Tests using the previously characterised E. coli data described in the Data section validated this parameter value: root mean squared error was minimised when 0.30≤t_{ sim } ≤ 0.32. We therefore recommend t_{ sim } = 0.32; this is intuitively reasonable as we would prefer to keep the value of t_{ sim } low in order to ensure that two models are reasonably similar before discarding the shorter in favour of the longer. Tests which removed the motif discovery phase of the algorithm showed that the mean information content per column ratio alone was sufficient to choose the true motif width. That is, the value of t_{ sim } had no effect. From this, we may conclude that, as motif predictions become stronger, the exact value of t_{ sim } becomes less important. However, at current motif discovery algorithm performance levels, a value of 0.32 gives optimal results. It may be possible to change this value data-adaptively, but so far results have not shown this to be required.
E-value of the resulting multiple alignment
The E-value of the multiple alignment of predicted motif occurrences [14] is an approximate p-value for testing the hypothesis that the predicted motif occurrences were generated from the predicted model against the null hypothesis that the predicted occurrences were generated by the background model. The E-value is then an estimate of the expected number of multiple alignments with statistical significance as great or greater than the observed alignment. Briefly, the E-value is calculated by computing the log-likelihood ratio of each column of the resulting multiple alignment of predicted sites and computing the p-value for each. The p-value of the product of column p-values is computed and then multiplied by the number of possible ways to select positions for the given number of sites in the set of input sequences to give the E-value. The E-value is calculated for models at each candidate width and minimised to select the best estimate of motif width [4, 12].
Data
Realistic synthetic data
Five data collections, each consisting of 1,000 datasets, were created in order to test MCOIN. Each dataset contained 20 input sequences of length 200bp. Input sequences were created by extracting 200bp from the EcoGene [15] database of E. coli intergenic sequences, representing ‘background’ positions. Datasets were created so that each data collection had different mean levels of motif conservation, ranging from 0.51 to 2.00 bits/col: Motif positions within each sequence were chosen at random and a synthetic motif inserted. Synthetic motifs were created by choosing nucleotides (A, C, G, T) at random and randomly mutating positions in the motif occurrences so that the levels of conservation at each position could be controlled. Motif width was chosen to be 12 bp each time. A comparison of methods for determining motif width in [12] used datasets containing real (human) motifs with a minimum mean information content of 0.76 bits/col; the realistic synthetic data used in this study contains many motifs at lower levels of motif conservation, as analysis of known E. coli TFBS motifs indicated that significant numbers of motifs had mean conservation levels of less than 0.76 bits/col.
E. coli data
E. coli motifs
High conservation | Low conservation | ||||
---|---|---|---|---|---|
Name | w ^{∗} | N | Name | w ^{∗} | N |
Ada | 13 | 4 | ArgR | 18 | 35 |
CaiF | 16 | 8 | DeoR | 16 | 7 |
CueR | 19 | 3 | FruR | 18 | 18 |
IlvY | 21 | 4 | Fur | 19 | 99 |
LacI | 21 | 3 | GntR | 20 | 17 |
MalI | 12 | 2 | MalT | 10 | 20 |
MelR | 18 | 11 | Nac | 15 | 18 |
MetR | 13 | 7 | RcsB | 14 | 11 |
PurR | 16 | 20 | |||
SoxR | 19 | 2 | |||
TorR | 10 | 10 | |||
XylR | 18 | 4 |
The number of motif occurrences in RegulonDB defined the number of input sequences; the mean number of input sequences was 15, ranging from 2 to 99 input sequences. The median number of input sequences was 9. Using known motif occurrences allows realistic motif conservation. The mean motif conservation was 1.13 bits/col, ranging from 0.49 to 2.00 bits/col. The median motif conservation was 1.04 bits/col. The mean motif width was 16bp, ranging from 10 to 21bp. The median motif width was 17 bp.
The data collection was split into two groups based on mean information content per column. The split was made at a value of 1 bit/col, producing a ‘high conservation’ group containing 12 datasets and a ‘low conservation’ group containing 8 datasets. In the ‘high conservation’ group, the mean number of input sequences was 7, ranging from 2 to 20 input sequences. The median number of input sequences was 4. The mean motif conservation was 1.36 bits/col, ranging from 1.02 to 2.00 bits/col. The median motif conservation was 1.31 bits/col. The mean motif width was 16bp, ranging from 10 to 21bp. The median motif width was 17bp. In the ‘low conservation’ group, the mean number of input sequences was 28, ranging from 7 to 99 input sequences. The median number of input sequences was 18. The mean motif conservation was 0.78 bits/col, ranging from 0.49 to 0.99 bits/col. The median motif conservation was 0.79 bits/col. The mean motif width was 16bp, ranging from 10 to 20bp. The median motif width was 17bp. Table 1 illustrates some of the diversity within the chosenE. coli motifs. The sequence logos of selected motifs (Figure 1) illustrate this further.
Prokaryotic ChIP data
Prokaryotic ChIP motifs
Species | Name | w ^{∗} | N |
---|---|---|---|
E. coli | CRP | 22 | 34 |
E. coli | LexA | 20 | 25 |
E. coli | PurR | 16 | 28 |
E. coli | RutR | 16 | 19 |
V. cholerae | Fur | 21 | 55 |
V. cholerae | RpoN | 15 | 37 |
M. tuberculosis | DosR | 18 | 24 |
M. tuberculosis | LexA | 18 | 23 |
B. subtilis | Spo0A | 12 | 94 |
The number of motif occurrences for each motif defined the number of input sequences; the mean number of input sequences was 38, ranging from 19 to 94 input sequences. The median number of input sequences was 28. Again, using known motif occurrences allows realistic motif conservation. The mean motif conservation was 0.99 bits/col, ranging from 0.56 to 1.25 bits/col. The median motif conservation was 1.04 bits/col. The mean motif width was 16bp, ranging from 10 to 22bp. The median motif width was 16 bp.
Measuring performance
The performance of the heuristic on a data collection is assessed through its mean site-level sensitivity (sSn), mean site-level positive predictive value (sPPV) and the area under the receiver operating characteristic (ROC) curve (AUC). Following [26], we define a predicted site as a ‘true positive’ result if it overlaps the true site by at least a quarter of the true width.sSn (also known asrecall in machine learning literature) measures the proportion of true positive sites which are correctly predicted as such.sPPV (also known asprecision) measures the proportion of predicted positive sites which are actually true positives. For our purposes,sSn is defined as the fraction of true sites which are predicted andsPPV is defined as the fraction of predicted sites which are known to be true (see also [27]); that is:$\mathit{\text{sSn}}=\frac{\mathit{\text{sTP}}}{\mathit{\text{sTP}}+\mathit{\text{sFN}}}$ and$\mathit{\text{sPPV}}=\frac{\mathit{\text{sTP}}}{\mathit{\text{sTP}}+\mathit{\text{sFP}}}$.
AUC is the integral of the ROC curve plottingsSn against the site-level false positive rate ($\mathit{\text{sFPR}}=\frac{\mathit{\text{sFP}}}{(\mathit{\text{sFP}}+\mathit{\text{sTN}})}$). The ROC curve is constructed by computing the probability of each possible site being an occurrence of the motifp(Z_{ i j } = 1|X_{i,j},θ) and ranking each possible site based on this value.s S n ands F P R are plotted for all possible thresholds ofp(Z_{i,j} = 1|X_{i,j},θ) and AUC calculated using the trapezoid rule. This is implemented using the ROCR R package [28].
While the above classification statistics provide an indication of how well the predicted sites associated with a motif model match the true sites, they give no indication of how well a heuristic estimates motif width. This performance is assessed here through the mean absolute error (MAE) and root mean squared error (RMSE), comparing the predicted motif width to the known width. RMSE is a commonly used measure but tends to exaggerate the effect of estimations which are further from the true value; in contrast, MAE treats all error sizes equally according to their magnitude. In most practical situations, the best estimator remains the best regardless of which error method is used [29].
Results and discussion
In general, mean site-level sensitivity (sSn) and positive predictive value (sPPV) decrease with decreasing motif conservation. The decrease insSn is a result of the motif discovery algorithm predicting fewer sites overall. That is, at lower motif conservations, fewer sites score highly enough such thats(x_{ i }) > t (see Equation 3). This leads to an increase in the number of false negative results (sites incorrectly classified as ‘background’) and therefore a decrease insSn. The decrease insPPV is attributable to background sites better matching the weaker motif model; as the model becomes weaker, the difference in scores between true motif occurrences and spurious background sites decreases. This can lead to an increase in the number of false positive results (sites incorrectly classified as motif occurrences) and therefore a decrease insPPV.
Width determination without discovery
Tests without motif discovery: classification-based results
Conservation | Known width (w^{∗}) | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||||||
---|---|---|---|---|---|---|---|---|---|
(mean bits/col) | sSn | sPPV | AUC | sSn | sPPV | AUC | sSn | sPPV | AUC |
2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
1.49 | 0.98 | 0.94 | 1.00 | 0.98 | 0.94 | 1.00 | 0.97 | 0.93 | 1.00 |
1.08 | 0.80 | 0.93 | 1.00 | 0.80 | 0.93 | 1.00 | 0.82 | 0.79 | 1.00 |
0.76 | 0.49 | 0.89 | 0.99 | 0.49 | 0.89 | 0.99 | 0.56 | 0.71 | 0.99 |
0.51 | 0.23 | 0.79 | 0.99 | 0.23 | 0.79 | 0.99 | 0.23 | 0.77 | 0.98 |
Tests without motif discovery: mean error in motif width
Conservation | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||
---|---|---|---|---|
(mean bits/col) | MAE | RMSE | MAE | RMSE |
2.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1.49 | 0.00 | 0.00 | 0.12 | 0.50 |
1.08 | 0.00 | 0.06 | 1.55 | 1.84 |
0.76 | 0.01 | 0.09 | 1.79 | 2.04 |
0.51 | 0.07 | 0.39 | 3.33 | 3.60 |
We first note from Table 4 that the width predicted by MCOIN closely matches the true width in almost all cases; the error in the predicted width increases slightly as mean motif conservation is decreased. The E-values estimator initially matches MCOIN but quickly begins to underestimate motif width, leading to a much larger increase in the error in predicted width. MCOIN shows a clear performance advantage in terms of predicted width at all conservation levels.
Given that the widths predicted by MCOIN generally match the known width, it is unsurprising that the classification-based results (Table 3) match those in the case where the width is known. As noted above,sSn decreases with decreasing motif conservation. A similar, but less sharp, decrease is seen insPPV. Although the E-values estimator slightly outperforms MCOIN in terms ofsSn for the data collections with mean motif conservation of 1.08 bits/col and 0.76 bits/col (0.82 compared to 0.80 and 0.56 compared to 0.49, respectively), the corresponding values ofsPPV are outperformed by MCOIN (0.93 compared to 0.79 and 0.89 compared to 0.71, respectively). Combining these results with the results presented in Table 3, this is likely a result of the E-values estimator choosing models at non-optimal widths which predict more sites overall at the expense of more false positive predictions.
Realistic synthetic data
Realistic synthetic data: classification-based results
Conservation | Known width (w^{∗}) | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||||||
---|---|---|---|---|---|---|---|---|---|
(mean bits/col) | sSn | sPPV | AUC | sSn | sPPV | AUC | sSn | sPPV | AUC |
2.00 | 0.84 | 0.25 | 0.99 | 0.93 | 0.42 | 1.00 | 0.91 | 0.79 | 0.99 |
1.49 | 0.26 | 0.07 | 0.98 | 0.28 | 0.15 | 0.99 | 0.21 | 0.45 | 0.98 |
1.08 | 0.02 | 0.01 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 | 0.23 | 0.96 |
0.76 | 0.00 | 0.00 | 0.94 | 0.00 | 0.00 | 0.93 | 0.00 | 0.12 | 0.94 |
0.51 | 0.00 | 0.00 | 0.93 | 0.00 | 0.00 | 0.93 | 0.00 | 0.09 | 0.93 |
Realistic synthetic data: mean error in motif width
Conservation | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||
---|---|---|---|---|
(mean bits/col) | MAE | RMSE | MAE | RMSE |
2.00 | 1.60 | 2.06 | 1.80 | 2.28 |
1.49 | 1.59 | 2.08 | 2.46 | 2.82 |
1.08 | 1.97 | 2.42 | 2.16 | 2.51 |
0.76 | 2.38 | 2.74 | 1.84 | 2.22 |
0.51 | 2.38 | 2.71 | 1.95 | 2.32 |
We note that the results for predictions at the known width are generally lower than when the motif discovery phase of the algorithm was removed. These results illustrate the fact that the core motif discovery algorithm is far from perfect: even when the true motif width is known, classification-based results may be low. In all data collections, both MCOIN and the E-values estimator are shown to have a performance similar to or better than that at the known width in terms of classification-based measures. As noted by [12], this may be attributed to the fact that predicted sites are only required to overlap the known site by a quarter in order to be counted as a true positive.
As noted above, results for all three classification-based measures generally decrease as mean motif conservation also decreases (Table 5). At higher levels of motif conservation, MCOIN is shown to outperform the E-values estimator in terms ofsSn. In this test, MCOIN generally chooses models which increasesSn, at the expense ofsPPV. That is, MCOIN chooses models which tend to predict more false positive sites. While we would prefer to have few false results (that is, higher values for bothsSn andsPPV) overall, it may be preferable to increasesSn at the expense ofsPPV. For example, when searching for putative binding sites to be verified experimentally, it may be more useful to have more false positives than false negatives. The E-values estimator is shown to achieve a highersPPV in all cases; this matches the findings of [12], where the E-values estimator was shown to achieve a slightly highersPPV than other estimators on datasets containing human TFBS motifs. At higher levels of motif conservation, MCOIN is also shown to outperform the E-values estimator in terms of AUC.
While MCOIN generally matches the E-values estimator in terms of overall correctness based on AUC values, this does not represent the full picture. It follows from the above that an estimator may appear to perform well even if the chosen width does not match the true width [12]. Errors in the predicted width are presented in Table 6. We note from these results that the error in width predicted by both estimators generally increases as mean motif conservation is decreased. However, at higher levels of motif conservation, MCOIN outperforms the E-values estimator using both error measures.
E. coli and prokaryotic ChIP data
E. coli data: classification-based results
Conservation | Known width (w^{ ∗ }) | MCOIN (w^{∗} ± 4) | E-values (w^{∗x} ± 4) | ||||||
---|---|---|---|---|---|---|---|---|---|
(mean bits/col) | sSn | sPPV | AUC | sSn | sPPV | AUC | sSn | sPPV | AUC |
‘high’ (1.36) | 0.81 | 0.22 | 0.96 | 0.72 | 0.29 | 0.96 | 0.70 | 0.17 | 0.95 |
‘low’ (0.78) | 0.63 | 0.41 | 0.96 | 0.69 | 0.51 | 0.98 | 0.66 | 0.32 | 0.97 |
overall (1.13) | 0.74 | 0.30 | 0.96 | 0.71 | 0.38 | 0.96 | 0.68 | 0.23 | 0.96 |
E. coli data: mean error in motif width
Conservation | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||
---|---|---|---|---|
(mean bits/col) | MAE | RMSE | MAE | RMSE |
‘high’ (1.36) | 2.08 | 2.43 | 2.92 | 3.12 |
‘low’ (0.78) | 1.75 | 2.06 | 3.00 | 3.20 |
overall (1.13) | 1.95 | 2.29 | 2.95 | 3.15 |
Prokaryotic ChIP data: classification-based results
Conservation | Known width (w^{∗}) | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||||||
---|---|---|---|---|---|---|---|---|---|
(mean bits/col) | sSn | sPPV | AUC | sSn | sPPV | AUC | sSn | sPPV | AUC |
0.99 | 0.75 | 0.67 | 0.99 | 0.75 | 0.68 | 0.99 | 0.73 | 0.67 | 0.99 |
Prokaryotic ChIP data: mean error in motif width
Conservation | MCOIN (w^{∗} ± 4) | E-values (w^{∗} ± 4) | ||
---|---|---|---|---|
(mean bits/col) | MAE | RMSE | MAE | RMSE |
0.99 | 1.44 | 1.86 | 2.33 | 2.73 |
The heterogeneity of the motifs in both data collections may suggest that results for these datasets could be equally varied. However, both MCOIN and the E-values estimator are reasonably robust in terms of predicted sites (Tables 7 and 9). While thesSn results for the low conservation group in theE. coli data collection are lower than that for the high conservation group,sPPV increases with decreasing motif conservation. This is a result of the smaller set of predicted sites containing fewer false positive results and can be attributed to the small number of datasets tested. When combined, the reduction in the number of false positive predictions and the consistently high AUC values suggest that models are chosen where true motif occurrences are predicted with greater confidence. For both theE. coli and prokaryotic ChIP data collections, MCOIN outperforms the E-values estimator in terms of classification-based results. The prokaryotic ChIP data collection shows a slight improvement insSn andsPPV values (Table 9); this improvement is greater in theE. coli data collection (Table 7). It is also noted that the classification-based results for MCOIN better match those at the known width than the results of the E-values estimator.
Tables 8 and 10 present the mean error in motif width based on both data collections. MCOIN is shown to outperform the E-values estimator for both data collections. Although the mean error in motif width for models predicted by MCOIN appears to decrease with decreasing motif conservation in theE. coli data collection, this is explained by the small number of datasets tested. The small number of datasets tested also accounts for the fact that the error in motif widths predicted by the E-values estimator is relatively high for both real data collections, given the results previously obtained on realistic synthetic data.
Conclusions
Determining the width of a TFBS motif is an important and challenging problem with direct relevance to computational motif discovery. MCOIN is a novel heuristic for determining the width of a motif, based on motif containment and information content. Results of tests on two data collections of previously characterised prokaryotic motifs show that MCOIN outperforms the E-value of the resulting multiple alignment (currently the most widely used estimator) as a predictor of motif width, using mean absolute error and root mean squared error. MCOIN is also shown to choose models which improve the overall correctness of predicted motif sites, based on site-level sensitivity, positive predictive value and the area under the ROC curve.
MCOIN also has a clear advantage over methods based on cross-validation with limited numbers of folds, as all available data is used for motif discovery, improving discovery results. Further, the results of experiments which removed the motif discovery phase of the algorithm show that, as the performance of this phase improves, the performance of MCOIN as a predictor of motif width also improves: as the discovered model becomes stronger and better models the true motif, the error in the width estimated by MCOIN will decrease.
Declarations
Acknowledgements
AMK is supported by an EPSRC Doctoral Studentship. SA and AMK were funded by BBSRC grant BB/I023461/1 (Bayesian evidence analysis tools for systems biology). AMK also acknowledges the support of an ECCB’12 Conference Fellowship, awarded by the SIB Swiss Institute of Bioinformatics.
Authors’ Affiliations
References
- Yip K, Cheng C, Bhardwaj N, Brown J, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, Gerstein M: Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012, 13: R48.PubMed CentralView ArticlePubMedGoogle Scholar
- Spivakov M, Akhtar J, Kheradpour P, Beal K, Girardot C, Koscielny G, Herrero J, Kellis M, Furlong E, Birney E: Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 2012, 13: R49.PubMed CentralView ArticlePubMedGoogle Scholar
- Whitfield T, Wang J, Collins P, Partridge EC, Aldred S, Trinklein N, Myers R, Weng Z: Functional analysis of transcription factor binding sites in human promoters. Genome Biol. 2012, 13: R50.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Bodén M, Whitington T, Machanick P: The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics. 2010, 11: 179.PubMed CentralView ArticlePubMedGoogle Scholar
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-14.View ArticlePubMedGoogle Scholar
- Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology: 16-19 July 1995: Cambridge, UK. Edited by: Rawlings C, Rawlings C .The AAAI Press,1995, 21-29.Google Scholar
- Keles S, van der Laan MJ, Dudoit S, Xing B, Eisen MB: Supervised detection of regulatory motifs in DNA sequences. Stat Appl Genet Mol Biol. 2003, 2 (1): Article 5.Google Scholar
- Akaike H: A new look at the statistical model identification. IEEE Trans Automatic Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
- Schwarz G: Estimating the Dimension of a Model. Ann Stat. 1978, 6: 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
- Bi C: A Monte Carlo EM algorithm for De Novo Motif discovery in biomolecular sequences. IEEE/ACM Trans Comput Biol Bioinformatics. 2009, 6: 370-386.View ArticleGoogle Scholar
- Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn. 1995, 21: 51-80.Google Scholar
- Bembom O, Keles S, Van Der Laan M: Supervised detection of conserved motifs in DNA sequences with cosmo. Stat Appli Genet Mol Biol. 2007, 6 (1): Article 8.Google Scholar
- Lin J: Divergence measures based on the Shannon entropy. IEEE Trans Inform Theory. 1991, 37: 145-151. 10.1109/18.61115.View ArticleGoogle Scholar
- Hertz G, Stormo G: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577.View ArticlePubMedGoogle Scholar
- Rudd KE: EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res. 2000, 28: 60-64.PubMed CentralView ArticlePubMedGoogle Scholar
- Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muñiz Rascado, Solano-Lira H, Jimenez-Jacinto V, Weiss V, García-Sotelo JS, López-Fuentes A, Porrón-Sotelo L, Alquicira-Hernández S, Medina-Rivera A, Martínez-Flores I, Alquicira-Hernández K, Martínez-Adame R, Bonavides-Martínez C, Miranda-Ríos J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J: RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res. 2011, 39 (Suppl 1): D98—D105PubMed CentralPubMedGoogle Scholar
- Grainger DC, Hurd D, Harrison M, Holdstock J, Busby SJW: Studies of the distribution of Escherichia coli cAMP-receptor protein and RNA polymerase along the E. coli chromosome. Proc Nat Acad Sci USA. 2005, 102: 17693-17698.PubMed CentralView ArticlePubMedGoogle Scholar
- Wade JT, Reppas NB, Church GM, Struhl K: Genomic analysis of LexA binding reveals the permissive nature of the Escherichia coli genome and identifies unconventional target sites. Genes & Devx. 2005, 19: 2619-2630.View ArticleGoogle Scholar
- Cho BK, Federowicz SA, Embree M, Park YS, Kim D: Palsson BØ: The PurR regulon in Escherichia coli K-12 MG1655. Nucleic Acids Res. 2011, 39: 6456-6464.PubMed CentralView ArticlePubMedGoogle Scholar
- Shimada T, Ishihama A, Busby SJW, Grainger DC: The Escherichia coli RutR transcription factor binds at targets within genes as well as intergenic regions. Nucleic Acids Res. 2008, 36: 3950-3955.PubMed CentralView ArticlePubMedGoogle Scholar
- Davies BW, Bogard RW, Mekalanos JJ: Mapping the regulon of Vibrio cholerae ferric uptake regulator expands its known network of gene regulation. Proc Nat Acad Sci. 2011, 108 (30): 12467-72.PubMed CentralView ArticlePubMedGoogle Scholar
- Dong TG, Mekalanos JJ: Characterization of the RpoN regulon reveals differential regulation of T6SS and new flagellar operons in Vibrio cholerae O37 strain V52. Nucleic Acids Res. 2012, 40: 7766-7775.PubMed CentralView ArticlePubMedGoogle Scholar
- Lun D, Sherrid A, Weiner B, Sherman D, Galagan J: A blind deconvolution approach to high-resolution mapping of transcription factor binding sites from ChIP-seq data. Genome Biol. 2009, 10: R142.PubMed CentralView ArticlePubMedGoogle Scholar
- Smollett KL, Smith KM, Kahramanoglou C, Arnvig KB, Buxton RS, Davis EO: Global analysis of the Regulon of the transcriptional repressor LexA, a key component of SOS response in Mycobacterium tuberculosis. J Bioll Chem. 2012, 287: 22004-22014. 10.1074/jbc.M112.357715.View ArticleGoogle Scholar
- Molle V, Fujita M, Jensen ST, Eichenberger P, González-Pastor JE, Liu JS, Losick R: The Spo0A regulon of Bacillus subtilis. Mol Microbiol. 2003, 50: 1683-1701.View ArticlePubMedGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-44.View ArticlePubMedGoogle Scholar
- Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005, 33: 4899-4913.PubMed CentralView ArticlePubMedGoogle Scholar
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics. 2005, 21: 3940-3941.View ArticlePubMedGoogle Scholar
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques (2nd Ed.). 2005, Morgan Kaufmann: San Francisco.Google Scholar
- Eisen M: All motifs are NOT created equal: structural properties of transcription factor-DNA interactions and the inference of sequence specificity. Genome Biol. 2005, 6: P7-10.1186/gb-2005-6-5-p7.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.