The first step in designing a new algorithm for the motif discovery problem is to choose a proper objective function. This choice is critical because the objective function encodes the designer's understanding of the protein-DNA interaction model. Searching for candidates that optimize the objective function is a major step in separating the candidate binding sites from the background sequences. An ideal objective function should assign the optimal score to the true motif binding sites and to nothing else.

Although numerous tools are available, surprisingly few types of objective functions are in use. Here we examined three popular objective functions. In theory, for each objective function we would test whether the score of the planted binding sites is superior to the scores of all other sets of words in the background sequences, which are false positive predictions. This, of course, is impractical. In practice, we chose one tool that applies the objective function and compared the tool's prediction, which unfortunately is often a false positive, with the planted motif. If the planted motif has a better score, then the gap between the two scores gives a lower bound on how far the tool falls short of the global optimum of the objective function. On the other hand, if the prediction scores higher, this suggests that the objective function is not accurate enough to model the true binding sites.

### Log likelihood ratio

This ratio and its associated forms are used by most alignment-driven algorithms to assess the significance of motif candidates. When the candidates are of different lengths, the *p*-value of the ratio is used. A method to compute the *p*-value is described in [3]. The log likelihood ratio of the predicted motif *m* is

$$
\mathrm{LLR}(m) = \log \frac{\Pr(X \mid \varphi, Z)}{\Pr(X \mid p_0)}
$$

where *X* is the set of sequences in the dataset, *Pr*(*X*|*φ*, *Z*) is the likelihood of the sequences *X* given the motif model *φ* and its binding sites *Z*, and *Pr*(*X*|*p*_{0}) is the likelihood of the sequences under the background model *p*_{0}.
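As a rough illustration, the ratio can be computed for a set of aligned, ungapped sites. The sketch below makes several assumptions that are not taken from the text: the motif model *φ* is a position-specific frequency matrix estimated from the sites with a pseudocount, the background *p*_{0} is a zeroth-order (single-nucleotide) model, and positions outside the sites are scored by the background under both hypotheses, so they cancel and only the site positions contribute. The function names `estimate_pwm` and `log_likelihood_ratio` are hypothetical.

```python
import math
from collections import Counter


def estimate_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate per-position base frequencies from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        # Counter returns 0 for bases that never occur at this position.
        pwm.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return pwm


def log_likelihood_ratio(sites, pwm, background):
    """Sum of log(phi_j(b) / p0(b)) over every position of every site.

    Assumes non-site positions follow the background under both models,
    so their contributions to the ratio cancel.
    """
    return sum(
        math.log(pwm[j][base] / background[base])
        for site in sites
        for j, base in enumerate(site)
    )
```

Under these assumptions a well-conserved set of sites yields a large positive value, while a motif model identical to the background yields zero.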

MEME [4] carries out an EM-based procedure to search for a model that maximizes the likelihood ratio. Local optima can sometimes be avoided by rerunning the program with different initializations. Figure 1 depicts, for each dataset from [1], the scores (the *p*-values of the log likelihood ratio on a negative logarithmic scale) of MEME's predictions and of the planted binding sites. For most datasets, the predictions of MEME have higher scores than the planted motifs. We conclude that even an algorithm guaranteed to find the globally optimal solution for the log likelihood ratio function would miss the true binding sites in these datasets, because this objective function does not accurately capture the nature of the binding sites.

We now consider one dataset in detail. In this dataset the planted motif has a higher log likelihood ratio score than MEME's prediction, yet we argue that the log likelihood ratio still does not work well as an objective function in this case.

In a way, the motif-searching problem is a classification problem: all the words of a certain length appearing in the sequences should be partitioned into two classes, the binding sites and all the others. Training the optimal classifier is equivalent to searching for the optimal candidate motif model. When the log likelihood ratio is applied as the objective function, the ideal classifier would be a threshold on the log likelihood ratio score such that all the binding sites are above the threshold and all the others are below it. A classifier corresponding to a good prediction achieves a reasonable balance between the false positives and false negatives of the classification. Conversely, if no threshold classifies the words satisfactorily, no good prediction can be found under this motif model.

To test the classifiability of this dataset, we calculated the log likelihood ratio scores of all the words in it, including the true binding sites, and tried various threshold values to classify the words. Among the words scoring above each threshold, we counted how many belong to binding sites and how many belong to the background sequences. Figure 2 indicates that no threshold achieves an acceptable balance between the sensitivity and the specificity of the classification. For example, to correctly classify all 11 true binding sites, the threshold must be set so low that 130 false positives are also classified as binding sites of the motif.
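The threshold test described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes each word has already been given a score and a label saying whether it is a true binding site, and the function names `sweep_thresholds` and `false_positives_at_full_recall` are hypothetical.

```python
def sweep_thresholds(scores, is_site):
    """Return (threshold, true_positives, false_positives) for each distinct score.

    A word is called a binding site when its score is >= the threshold.
    """
    results = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, site in zip(scores, is_site) if site and s >= threshold)
        fp = sum(1 for s, site in zip(scores, is_site) if not site and s >= threshold)
        results.append((threshold, tp, fp))
    return results


def false_positives_at_full_recall(scores, is_site):
    """Count background words admitted by the highest threshold that keeps every true site."""
    lowest_site_score = min(s for s, site in zip(scores, is_site) if site)
    return sum(
        1 for s, site in zip(scores, is_site) if not site and s >= lowest_site_score
    )
```

Plotting the true and false positive counts returned by `sweep_thresholds` against the threshold gives the kind of trade-off curve the text attributes to Figure 2, and `false_positives_at_full_recall` corresponds to the count of false positives incurred when every true site is recovered.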

It is therefore fair to say that log likelihood ratio alone will not be able to separate the true motif from the background noise. We will return to it later.

### Z-score

The Z-score measures the significance of predictions based on their over-representation. YMF [5] searches a restricted space defined by a consensus motif model and finds the candidates with the top Z-scores. The form of the Z-score is as follows:

$$
Z(m) = \frac{\mathit{obs}(m) - E(m)}{\sigma(m)}
$$

where *obs*(*m*) is the actual number of occurrences of the motif *m*, *E*(*m*) is the expected number of its occurrences under the background model, and *σ*(*m*) is the standard deviation of that number.
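A minimal numerical sketch of this score follows. It rests on simplifying assumptions that are not from the text: the background is a zeroth-order model, each candidate position matches the word independently with the same probability, and overlaps between occurrences are ignored, so the occurrence count is treated as binomial. Tools such as YMF may compute the expectation and variance differently. The function names are hypothetical.

```python
import math


def word_probability(word, background):
    """Probability of the word at a fixed position under a zeroth-order background."""
    return math.prod(background[base] for base in word)


def z_score(obs, n_positions, p_match):
    """Z-score of an observed count under a binomial approximation.

    expected = n * p and sigma = sqrt(n * p * (1 - p)); word overlaps are ignored.
    """
    expected = n_positions * p_match
    sigma = math.sqrt(n_positions * p_match * (1.0 - p_match))
    return (obs - expected) / sigma
```

For a consensus with several matching words, `p_match` would be the sum of the individual word probabilities under the same assumptions.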

Consensus-based algorithms such as YMF are sometimes criticized for not being able to incorporate the true binding sites into the motif model. To focus on the objective function and set aside the limitation imposed by the consensus motif model, we construct for each dataset an idealized motif model that comprises the planted binding sites completely and exclusively. We calculate the Z-scores of the predictions and the planted motifs for selected datasets, as shown in Figure 3. Note that the comparison is not entirely fair: with an expanded motif search space, the new optimum should be at least as high as the current prediction. Nevertheless, we treat the Z-score of the prediction as a touchstone: any score lower than it will not be competitive in the new search space. Figure 3 shows that this is exactly what happens in nearly all of the tested datasets. As with the results in Figure 1, the test indicates that statistical over-representation as measured by the Z-score does not necessarily imply binding preference either.

### Sequence specificity

Another type of objective function emphasizes the likelihood that most, if not all, sequences are potentially bound by the transcription factor. That means a prediction having multiple binding sites in one sequence and none in the others is much less significant than a prediction having a balanced number of binding sites in each sequence. This idea is built into ANN-Spec [6] and Weeder [7]. The objective function, named sequence specificity, is defined in [7] as follows:

$$
\mathit{Seq\_spec}(m) = \frac{1}{L} \sum_{i=1}^{L} \log \frac{1}{E_i(m \mid p_0)}
$$

where *E*_{i}(*m*|*p*_{0}) is the expected number of motif *m*'s occurrences in sequence *i* assuming the background model *p*_{0}, and *L* is the total number of sequences in the dataset.

We calculated the scores of the predictions of Weeder and ANN-Spec and of the planted motifs. The planted motif has a higher score than the predictions of the tools for most datasets, as illustrated in Figure 4. The clear gap between the scores of the planted binding sites and those of the predictions reflects the suboptimality of the search strategies adopted by these tools. Recall that ANN-Spec is a generalized version of SEM (Stochastic EM), and Weeder uses a greedy, heuristic search method.

Compared with the results for the other objective functions (Figures 1 and 3), Figure 4 suggests that the sequence specificity score may often lead to the true binding sites. Purely from the standpoint of the objective function, sequence specificity seems to have the edge for our datasets. An assumption of this objective function is that most sequences in the datasets have binding sites of the motif. Although our data show that tools such as Weeder and ANN-Spec are not too sensitive to slight departures from this assumption, we have not tested them on datasets that deviate from it more strongly. The Z-score function is based solely on statistical over-representation, without any reference to biological theory. The log likelihood ratio relies on high-quality non-gapped alignments, but it is not clear that non-gapped alignments are powerful enough to model the true binding sites. No objective function meets our standard that all planted motifs should have scores at least as high as those of the predictions. We need a better understanding of the conservation information hidden among the binding sites.