Like the MONKEY method [12], PhyloScan uses the phylogenetic model of Neyman [38] and the efficient algorithm of Felsenstein [39] to evaluate the probability that a site in observed multiply-aligned sequence data is consistent with a transcription factor's motif model. With either MONKEY or PhyloScan, each position of the motif is evaluated, and the computed probabilities for the motif positions are then multiplied together to give the strength of the site. Via the approach of Staden [10], the probability that such strength would arise by chance is precisely computed.

PhyloScan goes beyond MONKEY in several key ways. First, PhyloScan combines the information from multiple sites within an intergenic region, so that evidence from weak sites that would not be significant in isolation is combined, to identify a statistically significant find. Second, information from more-distant sequences, both non-alignable isolated sequences and clades of alignable sequences, is incorporated so as to further increase sensitivity, without an accompanying increase in false predictions. Third, we signify strength of a find by reporting its *q*-value, the fraction of predictions of this probability or better that are expected to be false, rather than its *p*-value, the fraction of false sites that are expected to demonstrate this probability or better.

Descriptions of the three main differences between the two algorithms are provided below.

### Combining evidence across sites within an intergenic region

PhyloScan combines information from multiple predictions via a weighted Bonferroni test in a manner similar to that of Neuwald and Green [40]. Specifically, for a user-supplied value *k*, which defaults to 2, and user-supplied weights (*w*_{1},..., *w*_{
k
}), which default to (0.9, 0.1), PhyloScan conservatively computes an intergenic region's *p*-value as

{P}_{\text{intergenic}}=\mathrm{min}\left\{1,\frac{{p}_{1}}{{w}_{1}},\frac{{p}_{2}}{{w}_{2}},\mathrm{...},\frac{{p}_{k}}{{w}_{k}}\right\}

where the weights (*w*_{1},..., *w*_{
k
}) are nonnegative and sum to one, and *p*_{
i
}is the probability that a randomly generated, intergenic sequence alignment of the same size would have its *i* th best site as good as or better than the *i* th best site in the intergenic sequence data under consideration. The calculation is conservative because the underlying events whose probabilities are (*p*_{1},..., *p*_{
k
}) are not statistically disjoint [40].

Thus, an intergenic region with a strong site will make its presence known via a strong (*i.e*., low) value for the *p*_{1}/*w*_{1} term, and an intergenic region that does not have a strong site, but that does have an *i* th best site that is surprisingly strong (given its rank as *i* th best), will be detected through a strong value for the *p*_{
i
}/*w*_{
i
}term. This enables us to detect both transcription factors that tend to bind strongly but in isolation and transcription factors that tend to bind multiply but weakly.

An alternate approach for combining the contributions of multiple binding sites, that of seeking the *p*-value of the sum of their log-likelihoods [41], is not employed by PhyloScan.

### Combining evidence from more-distant sequences

As described above, a *P*_{intergenic} *p*-value is generated for each sequence alignment of an intergenic region, but a true site's value may still be too weak to distinguish that site from the false positives in a vast genome. To address this problem, we combine this *p*-value with the *p*-values for the same intergenic region that come from sequence alignments of more distantly-related species. That is, we partition the input sequences for orthologous promoters into clades such that each clade is either an isolated sequence or contains sequences that can be reliably, multiply aligned; we compute the *P*_{intergenic} value for each clade as above; and we combine these *p*-values using the formula of Bailey and Gribskov [32]. When there are *n* such clades whose *P*_{intergenic} values are *P*_{1}, *P*_{2},..., *P*_{
n
}then we compute:

\begin{array}{lll}{P}_{\text{product}}\hfill & =\hfill & {\displaystyle \prod _{c=1}^{n}{P}_{c}}\hfill \\ {P}_{\text{combined}}\hfill & =\hfill & {P}_{\text{product}}{\displaystyle \sum _{i=0}^{n-1}\frac{{(-\mathrm{ln}({P}_{\text{product}}))}^{i}}{i!}.}\hfill \end{array}

This formula precisely computes the *p*-value for the product of *n* values drawn randomly from the interval [0, 1]. An example of this calculation is available in the Supplementary Materials [see Additional file 1].

PhyloScan allows a *p*-value cutoff *α*, which defaults to 0.05, such that sites in a user-specified clade of interest that are worse than this cutoff are not permitted to be strengthened by data from the other species via the combination process. This feature allows the user to concentrate on a single clade or species rather than the entire tree of species. Because of this cutoff, it is appropriate to modify the above formula for sites that survive the cutoff:

\begin{array}{ccc}{P}_{\text{combined}}& =& {P}_{\text{product}}{\displaystyle \sum _{i=0}^{n-1}\frac{{\left(-\mathrm{ln}\left(\frac{{P}_{\text{product}}}{\alpha}\right)\right)}^{i}}{i!}.}\end{array}

### Utility of *q*-value over *p*-value

The *p*-value, the probability that a negative control would appear positive, must be used with great care because genomes are vast relative to regulatory sequence elements. For instance, in many other situations a *p*-value of 10^{-6} is considered excellent, but when there are on the order of 10^{9} places where a transcription factor binding site is not likely to bind, such a "strong" *p*-value can leave us with 1,000 false positives – or even more, in the usual case that some of the biology has not been incorporated into the statistical model. Thus, to properly interpret a *p*-value, the researcher must be on guard to quantify the number of negative cases.

The *q*-value (or False Discovery Rate [16]) explicitly incorporates the vastness of the genome in the calculation. The *q*-value of a transcription factor binding site tells us the proportion of sites of that strength or better that we expect to be false positives. Under ideal circumstances, the researcher who chooses a *q*-value threshold of 0.001 expects only one in 1,000 of the reported sites to be a false positive regardless of the genome size. (However, because we do not pretend to have statistically modeled all the relevant biology, the false discovery rate will generally be higher than the specified threshold.)

### Real data inputs

The collection of orthologous intergenic regions, the division of species into clades, the multiple alignments, the phylogenetic trees, and the motif models needed as input to PhyloScan (or other similar algorithms) can be difficult to construct, and are unique to an individual's research interests and applications. We discuss our approaches in the following. The flowchart in Figure 6 depicts a high-level view of the intergenic sequence database generation and the application of PhyloScan to these data.

It is our belief that PhyloScan (and, *e.g*., MONKEY) are fairly robust to typical levels of error in these inputs, though further exploration is required to substantiate this claim.

#### Locating orthologous sequences

Genome sequence data and annotations were downloaded from the NCBI RefSeq database [42]: *Escherichia coli* K12 (NC_000913.1), *Salmonella enterica* serovar Typhi (*S. typhi*)(NC_003198), *Yersinia pestis* CO92 (NC_003143), *Haemophilus influenzae* Rd (NC_000907), *Vibrio cholerae* El Tor (NC_002505 and NC_002506), *Shewanella oneidensis* MR-1 (NC_004347 and NC_004349), and *Pseudomonas aeruginosa* PA01 (NC_002516). Orthologs for each of the annotated *E. coli* genes were identified in each of the remaining six species, using INPARANOID v.1.35 [43]. This program uses BLAST [44] to compare the complete set of predicted protein sequences from one genome with that of another, and identifies the reciprocal best hits. We set the parameters to use the BLOSUM62 matrix and a minimum bit score of 30, and we required that the alignment cover at least 50% of both proteins.

In the examples presented in this study, *E. coli* was the primary species of interest; we therefore identified a set of *E. coli* promoter-containing sequences by identifying each *E. coli* protein-coding gene (excluding 111 genes encoded on transposons or prophage elements) that has at least 20 bp of upstream intergenic sequence. By these criteria, there are 2379 *E. coli* intergenic regions of interest. Orthologous upstream intergenic-sequence data files were then generated for this set of 2379 *E. coli* regions, using the results from INPARANOID to identify orthologs, and the seven genome annotations to define intergenic boundaries. In the Supplementary Materials are a table with these data [see Additional file 2] and a caption for the table [see Additional file 1].

#### Designating clades

Among the species included in this study, only *E. coli* and *S. typhi* exhibit extensive homology (70% identity on average) in the promoter regions [26]. The phylogenetic distance of two sequences that share this level of homology is 0.384, assuming the nucleotide substitution model of Jukes & Cantor [45] (and the value would be similar under a variety of more current models); thus, we assumed this phylogenetic distance between *E. coli* and *S. typhi*, and data from these two species are taken to form one clade for PhyloScan. Each of the remaining species formed a separate clade of unaligned sequence data, since these species do not exhibit sequence identity with *E. coli* or with each other [26].

Generally, we would combine sequences into a single clade if their pairwise phylogenetic distances were comparable to that between *E. coli* and *S. typhi*, or nearer.

#### Constructing multiple alignments

With only two closely related species in our set, we chose the Smith-Waterman [46] pairwise, gapped local alignment algorithm (implemented as BestFit in the Wisconsin Package Version 10.3, Accelrys Inc., San Diego, CA) to align their orthologous intergenic regions, using default parameters (match = 10.000; mismatch = -9.000; gap creation penalty = 50; gap extension penalty = 3). The alignment of *E. coli* and *S. typhi* orthologous upstream intergenic sequences resulted in 1662 unique aligned sequence pairs. The upstream intergenic sequences for an additional 836 *E. coli* genes that did not have orthologs in *S. typhi* remained. The combination of these two datasets (1662 + 836 = 2498) does not equal the above number of *E. coli* intergenic regions of interest (2379 sequences), due to the complication of divergently transcribed genes. Specifically, we observed that for some divergently transcribed genes in *E. coli*, the orthologous genes in *S. typhi* are not syntenic, thus *S. typhi* provided two separate intergenic regions for alignment to a single intergenic region of *E. coli*.

To perform the real-data tests, three databases representing the reference species clade were generated for scanning: (1) a database containing the 2379 *E. coli* intergenic regions of interest, (2) a database containing only *E. coli* data ("*E. coli* reduced"), where 1662 *E. coli* intergenic regions have been reduced in sequence space by alignment with *S. typhi* orthologous data plus an additional 836 *E. coli* sequences for which there was no orthologous *S. typhi* data, and (3) a database containing 1662 *E. coli-S. typhi* aligned orthologous intergenic regions plus an additional 836 *E. coli* sequences for which there was no orthologous *S. typhi* data.

#### Producing a phylogenetic tree

We constructed the phylogenetic tree for the more complicated, synthetic sequence data set using 16S rRNA gene data via MUSCLE [30] and PHYLIP [31], scaling tree branch lengths up by a factor of 13.5, as described above – see Synthetic Sequence Data in the Results section. A tree constructed in this manner is not definitive but should be sufficient for use with PhyloScan.

#### Obtaining binding site motif models

*E. coli* Crp and PurR binding sites that have been experimentally identified by DNase I footprinting were extracted from the literature and available databases, RegulonDB [47] and DPInteract [48]. The 87 Crp sites (from 65 *E. coli* intergenic regions) and 22 PurR sites (from 20 *E. coli* intergenic regions), were aligned using the Gibbs Recursive Sampler [49] specifying palindromic models (total width of 16–24 bp), to generate a PurR motif (Figure 7) and a Crp motif (Figure 1). These figures show both the nucleotide equilibrium and the information content for each position of the motif [9].

### Generation of the weak synthetic sequence data

To test the sensitivity and specificity of PhyloScan when seeking binding sites that are weaker than *E. coli* Crp binding sites, we generated "1/2-strength" and "1/3-strength" Crp sites. The 1/2-strength Crp motif was designed to have an average information content per column that is half the average information content of the full-strength Crp motif; we did this by raising each probability of a nucleotide to its 0.637^{th} power, with subsequent scaling so that the probabilities of the four nucleotides for any motif column sum to 1.0. Likewise, the 1/3-strength Crp sites were generated from a 1/3-strength Crp motif to give one-third the average information content, using an exponent of 0.507. See Figure 1 and its legend for more information.

## Comments

View archived comments (1)