# Analysis of computational approaches for motif discovery

- Nan Li^{1}
- Martin Tompa^{1}

**1**:8

https://doi.org/10.1186/1748-7188-1-8

© Li and Tompa; licensee BioMed Central Ltd. 2006

**Received:** 10 March 2006

**Accepted:** 19 May 2006

**Published:** 19 May 2006

## Abstract

Recently, we performed an assessment of 13 popular computational tools for discovery of transcription factor binding sites (M. Tompa, N. Li, et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites", Nature Biotechnology, Jan. 2005). This paper contains follow-up analysis of the assessment results, and raises and discusses some important issues concerning the state of the art in motif discovery methods: 1. We categorize the objective functions used by existing tools, and design experiments to evaluate whether any of these objective functions is the right one to optimize. 2. We examine various features of the data sets that were used in the assessment, such as sequence length and motif degeneracy, and identify which features make data sets hard for current motif discovery tools. 3. We identify an important feature that has not yet been used by existing tools and propose a new objective function that incorporates this feature.

## Keywords

For the past decade, research on identifying regulatory elements, notably the binding sites for transcription factors, has been very intense. The problem, usually abstracted as a search problem, takes as the input a set of sequences, which encode the regulatory regions of genes that are putatively co-regulated. The output consists of the regulatory elements (short words in the input sequences) and a motif model that profiles them.

Numerous computational tools have been developed for this task. Naturally, evaluation of these tools is becoming vital in this area. Recently, Tompa et al. [1] reported the results of one such assessment, in which some popular tools were tested on datasets from four species: human, mouse, fly and yeast. Each dataset contains a set of sequences planted with binding sites of one transcription factor. The binding sites are provided in the TRANSFAC database [2]. Details of the datasets are explained in [1].

Besides the results of the assessment, this work also raises questions about the approaches used by these tools. We discuss some interesting questions that arise from further analysis of the assessment in [1]. We believe that the search techniques that have been adopted are very powerful, as proven by these eminent tools. But the definition of the search problem, especially the formulation of objective functions, leaves room for substantial improvement in the performance of motif discovery tools.

## 1 Are the objective functions informative?

The first step in designing a new algorithm for the motif discovery problem is to choose a proper objective function. This is critical because the objective function implements the designer's understanding of the protein-DNA interaction model. Searching for candidates that optimize the objective function is a major step in pulling the candidate binding sites out of the background sequences. An ideal objective function should assign the optimal score to the true motif binding sites and nowhere else.

Although numerous tools are available, there are surprisingly few types of objective function. Here we examine three popular ones. In theory, for each objective function we would test whether the score of the planted binding sites is superior to the scores of all other sets of words in the background sequences, which are false positive predictions. This, of course, is impractical. In practice, we chose one tool that applies the objective function and compared the tool's prediction, which unfortunately is often a false positive, with the planted motif. If the planted motif has a better score, then the gap between the two scores is a lower bound on how far the tool falls short of the global optimum of the objective function. On the other hand, if the prediction scores higher, it suggests that the objective function is not accurate enough to model the true binding sites.

### Log likelihood ratio

This ratio and its associated forms are used by most alignment-driven algorithms to assess the significance of motif candidates. When the candidates are of different lengths, the *p*-value of the ratio is used. A method to compute the *p*-value is described in [3]. The log likelihood ratio of the predicted motif *m* is

$$llr(m) = \log \frac{Pr(X \mid \varphi, Z)}{Pr(X \mid p_{0})}$$

where *X* is the set of sequences in the dataset, *Pr*(*X*|*φ*, *Z*) is the likelihood of the sequences *X* given the motif model *φ* and its binding sites *Z*, and *Pr*(*X*|*p*_{0}) gives the likelihood of the sequences assuming the background model *p*_{0}.
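As a rough illustration of the ratio defined above, the following Python sketch scores a set of equal-width, ungapped sites against a background distribution. The assumptions are illustrative and are not taken from the assessment: the motif model *φ* is the column-wise maximum-likelihood frequency matrix of the sites, the background *p*_{0} is a single zero-order nucleotide distribution, and positions outside the sites contribute equally to numerator and denominator and therefore cancel. The function name `log_likelihood_ratio` is hypothetical, and the sketch does not compute the *p*-value described in [3].

```python
import math
from collections import Counter


def log_likelihood_ratio(sites, background):
    """Log likelihood ratio of aligned sites versus a zero-order background.

    sites      -- list of equal-length strings (the binding sites Z)
    background -- dict mapping each base to its background probability p0
    """
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")

    n_sites = len(sites)
    llr = 0.0
    for column in range(width):
        counts = Counter(site[column] for site in sites)
        for base, count in counts.items():
            # Maximum-likelihood estimate of phi for this column and base.
            phi = count / n_sites
            # Each of the `count` sites contributes log(phi / p0) here.
            llr += count * math.log(phi / background[base])
    return llr


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = log_likelihood_ratio(["ACGT", "ACGT", "ACGT"], uniform)
degenerate = log_likelihood_ratio(["A", "C", "G", "T"], uniform)
```

A perfectly conserved motif gets the highest score available for its width and number of sites, while a column whose base frequencies match the background contributes nothing.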

We compared the scores (*p*-values of the log likelihood ratio, on a negative logarithmic scale) of MEME's [4] predictions with those of the planted binding sites. For most datasets, the predictions of MEME have higher scores than the planted motifs. We conclude that even an algorithm guaranteeing the globally optimal solution for the log likelihood ratio function will miss the true binding sites in these datasets, because this objective function does not accurately capture the nature of the binding sites.

Now, consider one dataset in detail. The dataset is an example for which the planted motif has a higher log likelihood ratio score than MEME's prediction, yet we argue that log likelihood ratio still doesn't work well as an objective function in this case.

In a way, the motif-searching problem is a classification problem: all the words of a certain length appearing in the sequences should be partitioned into two classes, the binding sites and all the others. Training the optimal classifier equates to searching for the optimal candidate motif model. When the log likelihood ratio is applied as the objective function, the ultimate classifier would be a threshold on the log likelihood ratio score such that all the binding sites are above the threshold and all the others are below it. A classifier corresponding to a good prediction achieves a decent balance between the false positives and false negatives of the classification. Conversely, if no threshold classifies the words satisfactorily, no good prediction can be found under this motif model.
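The threshold view of classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every word already has a score under some motif model, the helper names `threshold_errors` and `best_threshold` are hypothetical, and "best" is taken to mean the smallest total number of misclassified words, which is one of several reasonable ways to balance false positives against false negatives.

```python
def threshold_errors(site_scores, background_scores, threshold):
    """Return (false_positives, false_negatives) for a score threshold.

    Words scoring at or above the threshold are called binding sites.
    """
    false_negatives = sum(1 for score in site_scores if score < threshold)
    false_positives = sum(1 for score in background_scores if score >= threshold)
    return false_positives, false_negatives


def best_threshold(site_scores, background_scores):
    """Pick the observed score that minimizes the total number of errors."""
    candidates = sorted(set(site_scores) | set(background_scores))
    return min(
        candidates,
        key=lambda t: sum(threshold_errors(site_scores, background_scores, t)),
    )


# Toy scores: one background word outscores two of the three true sites.
sites = [5.0, 6.0, 7.0]
background = [1.0, 2.0, 3.0, 6.5]
chosen = best_threshold(sites, background)
errors = threshold_errors(sites, background, chosen)
```

When no threshold brings both error counts close to zero, as in the toy data, the scores alone cannot separate the sites from the background words.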

It is therefore fair to say that log likelihood ratio alone will not be able to separate the true motif from the background noise. We will return to it later.

### Z-score

The Z-score measures the significance of predictions based on their over-representation. YMF [5] searches a restricted space defined by a consensus motif model and finds the candidates with the top Z-scores. The form of the Z-score is as follows:

$$Z(m) = \frac{obs(m) - E(m)}{\sigma(m)}$$

where *obs*(*m*) is the actual number of occurrences of the motif *m*, *E*(*m*) is the expected number of its occurrences in the background model, and *σ*(*m*) is the standard deviation.
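A minimal Python sketch of the Z-score for a single exact word follows. It is not YMF's procedure: YMF uses its own background model and variance calculation, whereas this sketch assumes an independent, identically distributed background and approximates the variance by the binomial variance, ignoring the dependence between overlapping occurrences. The function names are hypothetical.

```python
import math


def count_occurrences(word, sequence):
    """Count occurrences of word in sequence, overlaps included."""
    width = len(word)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == word
    )


def z_score_iid(word, sequences, background):
    """Z = (obs - E) / sigma under an i.i.d. background (illustrative only)."""
    observed = sum(count_occurrences(word, seq) for seq in sequences)
    # Number of start positions at which the word could occur.
    positions = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    probability = 1.0
    for base in word:
        probability *= background[base]
    expected = positions * probability
    # Binomial approximation: treats start positions as independent trials.
    variance = positions * probability * (1.0 - probability)
    if variance <= 0:
        raise ValueError("variance is zero; Z-score is undefined")
    return (observed - expected) / math.sqrt(variance)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
z = z_score_iid("ACG", ["ACGTACGT"], uniform)
```

In the toy call the word occurs twice at six possible start positions, far more often than the expected 6/64 occurrences, so the score is large and positive.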

### Sequence specificity

Another type of objective function emphasizes the likelihood that most, if not all, sequences are potentially bound by the transcription factor. That means a prediction having multiple binding sites in one sequence and none in the others is much less significant than a prediction having a balanced number of binding sites in each sequence. This idea is designed into ANN-Spec [6] and Weeder [7]. The objective function, named sequence specificity, is defined in [7] as follows.

where *E*_{i}(*m*|*p*_{0}) is the expected number of motif *m*'s occurrences in sequence *i* assuming the background model *p*_{0}, and *L* is the total number of sequences in the dataset.

Comparing Figure 4 with the results for the other objective functions (Figures 1 and 3) suggests that the sequence specificity score may often lead to the true binding sites. Judged purely as an objective function, sequence specificity seems to have the edge on our datasets. An assumption of this objective function is that most sequences in a dataset contain binding sites of the motif. Although our data show that tools such as Weeder and ANN-Spec are not too sensitive to slight departures from this assumption, we have not tested them on datasets that deviate from it more strongly. The Z-score function is based solely on statistical over-representation, without any reference to biological theory. The log likelihood ratio relies on high-quality non-gapped alignments, but it is not clear that non-gapped alignments are powerful enough to model the true binding sites. No objective function meets our standard that all planted motifs should have scores at least as high as those of the predictions. We need to understand better the conservation information hidden among those binding sites.

## 2 Is this a hard dataset?

Among the questions arising from the assessment project, a particularly interesting one is this: what makes a particular dataset so hard to solve? The answer would be helpful to both users and designers of the tools. For the users, it would save time and money if some assurance about the predictions were provided; for the designers, attention could be focused on the factors that account for some of the poor performance of current methods.

Some features of the datasets obviously show correlations with the tools' performance. For instance, a dataset of a large size intuitively is not easy to handle. But, when any feature is studied alone, its correlation with the performance of the tools is always too weak to be convincing, as the effects of all but this feature are ignored.

We applied multiple linear regression [8], a method of estimating the conditional expected value of one variable *Y* (the dependent variable) in terms of a set of other variables *X* (predictor variables). It is a special type of regression in which the dependent variable is a linear function of the "missing data" (regression coefficients *W*) in the model. A general form of multiple regression can be expressed as

*E*(*Y*|*X*) = *f*(*W*, *X*) + *ε*

where *f* is a linear function of *W*, a simple example of which is *f*(*W*, *X*) = *W*·*X*. *ε* is called the regression residual. It has expected value 0 and is independent of *X* ("equality of variance").

The goodness-of-fit of regression is measured by the coefficient of determination *R*^{2}. This is the proportion of the total variation in *Y* that can be explained or accounted for by the variation in the predictor variables {*X*}. The higher the value of *R*^{2}, the better the model fits the data. Often *R*^{2} is adjusted for the bias brought by the degrees of freedom of the model and the limited number of observations as $\bar{R}^{2} = R^{2} - p \times \frac{1 - R^{2}}{n - p - 1}$, where *n* is the number of observations, and *p* is the number of predictors.

*Y* is the performance of the tools on a dataset, measured by the highest nucleotide-level correlation coefficient score *nCC* (see [9]) among all the tools. The reason for using the highest score is to smooth out the disadvantages of each individual tool. The predictor variables are a set of features of a dataset that we think may be relevant factors. These features include:

1. the total size of a dataset;
2. the median length of a single sequence in a dataset;
3. the number of binding sites in a dataset;
4. the density of the binding sites, which equals the number of binding sites divided by the total size of a dataset;
5. the fraction of null sequences (ones that do not contain a binding site) in a dataset;
6. relative entropy of binding sites in a dataset;
7. the relative entropy-density in a dataset, which is the overall relative entropy times the density of the binding sites;
8. the uniformity of the binding site locations within the sequences in a dataset. We quantified this position distribution information by performing a Kolmogorov-Smirnov test [10] against a uniform distribution and calculating its *p*-value.
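The position-uniformity feature can be illustrated with a short Python sketch of a one-sample Kolmogorov-Smirnov test against the uniform distribution on [0, 1]. Several choices here are assumptions rather than details from the assessment: site positions are assumed to have been rescaled to relative positions in [0, 1], the *p*-value uses a commonly quoted asymptotic series with a small-sample correction factor, and the function names are hypothetical.

```python
import math


def ks_statistic_uniform(values):
    """Largest gap between the empirical CDF and the uniform CDF on [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    statistic = 0.0
    for i, x in enumerate(ordered, start=1):
        # Compare the uniform CDF (x) with the empirical CDF just after
        # and just before the i-th sorted point.
        statistic = max(statistic, i / n - x, x - (i - 1) / n)
    return statistic


def ks_p_value(statistic, n, terms=100):
    """Asymptotic p-value of the KS statistic (approximation, assumed form)."""
    root_n = math.sqrt(n)
    lam = (root_n + 0.12 + 0.11 / root_n) * statistic
    if lam < 0.04:
        # The survival function is indistinguishable from 1 this close to 0.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


spread = [0.1, 0.3, 0.5, 0.7, 0.9]
clustered = [0.01 * k for k in range(1, 11)]
p_spread = ks_p_value(ks_statistic_uniform(spread), len(spread))
p_clustered = ks_p_value(ks_statistic_uniform(clustered), len(clustered))
```

Evenly spread positions give a *p*-value close to 1, while positions clustered near one end of the sequences give a very small *p*-value, which is the signal of positional conservation.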

We used least squares fitting to calculate the regression coefficients. Its most common forms are least squares fitting of lines and least squares fitting of polynomials. In the former, only the first-order terms of the predictor variables are involved in the regression model; in the latter, higher-order polynomial terms are also used. Because the number of observations available (about thirty "Generic" and "Markov" datasets in the analysis) is limited compared to the number of features, we confined ourselves to the simplest form of linear regression, in which only the first-order terms are used in the fitting. As we will discuss below, this simplification does not affect the regression result much.
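First-order least squares fitting can be sketched in plain Python by solving the normal equations. This is a generic illustration, not the procedure or software used in the analysis: the function names are hypothetical, an intercept term is assumed, and the adjusted *R*^{2} uses the standard form *R*^{2} - *p*(1 - *R*^{2})/(*n* - *p* - 1) for *n* observations and *p* predictors.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_linear_regression(features, y):
    """Return [intercept, w1, ..., wp] minimizing the squared error."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(features, y, weights):
    """Proportion of the variation in y explained by the fitted model."""
    fitted = [
        weights[0] + sum(w * x for w, x in zip(weights[1:], f)) for f in features
    ]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for the number of predictors p and observations n."""
    return r2 - p * (1.0 - r2) / (n - p - 1)


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2.
toy_features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
toy_y = [1.0, 3.0, -2.0, 0.0, 2.0]
toy_weights = fit_linear_regression(toy_features, toy_y)
```

With about thirty observations and a handful of predictors the adjustment is small but not negligible: an *R*^{2} of 0.9 with three predictors and thirty observations adjusts to roughly 0.888.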

Some features are obviously not independent. For example, relative entropy-density is a non-linear function (the product) of two other *X* variables, relative entropy and density. Each set of features that are highly correlated with each other was replaced by the subset with the highest adjusted coefficient of determination (adjusted *R*^{2}).

*p*-value less than 0.001. The plot of the regression residuals versus the estimated response (Figure 5(b)) does not indicate evident inequality of variance. Equality of variance is an important assumption of linear regression: it requires that the regression residuals be independent of the expected value of *Y*.

We also tried the transformations of the power family on the dependent variable *Y* using the Box-Cox method [11]. A lambda value other than 1 improves the adjusted *R*^{2} to about 90%. The three features mentioned above are again significant in the model. However, some other features that were dropped from the first model, particularly the fraction of null sequences and the density, have an impact here. This confirms that the three features are likely to be important in affecting performance, but we cannot rule out other features.
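For reference, the power-family transformation of Box and Cox [11] has a simple closed form, sketched below in Python. The function name is hypothetical, and the sketch assumes strictly positive inputs; how *Y* was shifted or scaled before transformation in the analysis is not stated, so any such preprocessing would be an additional assumption.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transform: (y**lam - 1) / lam, or log(y) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


identity_like = box_cox([1.0, 2.0, 4.0], 1.0)   # lam = 1 only shifts the data
square_root_like = box_cox([4.0], 0.5)
log_like = box_cox([1.0, math.e], 0.0)
```

A lambda of 1 leaves the shape of the data unchanged, which is why an improvement at some other lambda indicates that the response is better modeled on a transformed scale.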

It's no surprise that the sequence conservation (relative entropy) is key to the hardness of a dataset. It turns out that tools are actually quite robust with respect to the size of the dataset in a large range (up to 10,000 bp). Rather, the length of each single sequence has a bigger impact. This is somewhat supported by our discussion of the objective functions that sequences in a dataset should be considered as individuals. Also, it is connected to the position distribution information, as the longer each single sequence is, the more significant it becomes that the binding sites are not uniformly distributed in the sequences.

## 3 Can other information help?

The result of the multiple regression suggests a type of information that may help capture the hidden information in the motif's binding sites: the conservation of the binding sites' positions in the promoter sequences. It has been discussed in previous work (see [12]), but never integrated into the objective functions by the commonly used tools.

*p*-value assuming a uniform distribution as the background model. Now on the 2D plane, the axes correspond to the motifs' conservation in both sequence and position. It is easy to see that even a straight-line classifier *y* - *ax* - *b* = 0 will separate the two sets decently. Let *Pr*_{llr} be the *y* value, the negative log *p*-value of the log likelihood ratio, and let *Pr*_{pos} be the *x* value, the negative log *p*-value of the Kolmogorov-Smirnov test as explained above. Most true binding sites will fit *aPr*_{pos} - *Pr*_{llr} + *b* > 0, and most false predictions of MEME will fit *aPr*_{pos} - *Pr*_{llr} + *b* < 0. The straight line in Figure 6(b) has parameters *a* = 13.5, *b* = 21.

This interesting result suggests a new form of objective function

*aPr*_{pos} - *Pr*_{llr}

where *a* and *b* can vary from data set to data set.
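The straight-line classifier and the resulting score can be written down directly. The Python sketch below uses the parameters quoted for Figure 6(b), *a* = 13.5 and *b* = 21, as defaults purely for illustration, since they can vary from data set to data set. The function names are hypothetical, and the inputs are assumed to be negative log *p*-values that have already been computed.

```python
def combined_score(pr_pos, pr_llr, a=13.5, b=21.0):
    """Signed distance-like score a * Pr_pos - Pr_llr + b.

    pr_pos -- negative log p-value of the positional (Kolmogorov-Smirnov) test
    pr_llr -- negative log p-value of the log likelihood ratio
    """
    return a * pr_pos - pr_llr + b


def on_true_site_side(pr_pos, pr_llr, a=13.5, b=21.0):
    """True when the point falls on the side where most true sites lie."""
    return combined_score(pr_pos, pr_llr, a, b) > 0


# Two candidates with the same sequence conservation but different
# positional conservation (made-up numbers).
positionally_conserved = combined_score(2.0, 40.0)
positionally_uniform = combined_score(0.5, 40.0)
```

With equal sequence-level scores, the candidate whose sites are more conserved in position lands on the true-site side of the line and the other does not.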

## References

- Tompa M, Li N, et al.: Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites. Nature Biotechnology. 2005, 23 (1): 137-144. 10.1038/nbt1053.
- Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-378. 10.1093/nar/gkg108.
- Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.
- Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. 1995, 21-29. AAAI Press, Menlo Park, CA.
- Sinha S, Tompa M: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003, 31: 3586-3588. 10.1093/nar/gkg618.
- Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000, 467-478.
- Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004, 32: W199-W203.
- McCullagh P, Nelder JA: Generalized Linear Models. 1989.
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics. 1996, 34: 353-367. 10.1006/geno.1996.0298.
- Neal DK: Goodness of Fit Tests for Normality. Mathematica Educ Res. 1996, 5: 23-30. http://library.wolfram.com/infocenter/Articles/1379/
- Box GEP, Cox DR: An analysis of transformations. Journal of the Royal Statistical Society, Series B. 1964, 26: 211-252.
- Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.