EXMOTIF: efficient structured motif extraction
 Yongqiang Zhang^{1} and
 Mohammed J Zaki^{1}Email author
DOI: 10.1186/17487188121
© Zhang and Zaki; licensee BioMed Central Ltd. 2006
Received: 23 July 2006
Accepted: 16 November 2006
Published: 16 November 2006
Abstract
Background
Extracting motifs from sequences is a mainstay of bioinformatics. We look at the problem of mining structured motifs, which allow variable length gaps between simple motif components. We propose an efficient algorithm, called EXMOTIF, that given some sequence(s), and a structured motif template, extracts all frequent structured motifs that have quorum q. Potential applications of our method include the extraction of single/composite regulatory binding sites in DNA sequences.
Results
EXMOTIF is efficient in terms of both time and space and is shown empirically to outperform RISO, a stateoftheart algorithm. It is also successful in finding potential single/composite transcription factor binding sites.
Conclusion
EXMOTIF is a useful and efficient tool in discovering structured motifs, especially in DNA sequences. The algorithm is available as opensource at: http://www.cs.rpi.edu/~zaki/software/exMotif/.
Introduction
Analyzing and interpreting sequence data is an important task in bioinformatics. One critical aspect of such interpretation is to extract important motifs (patterns) from sequences. The challenges for motif extraction problem are twofold: one is to design an efficient algorithm to enumerate the frequent motifs; the other is to statistically validate the extracted motifs and report the significant ones.
Motifs can be classified into two main types. If no variable gaps are allowed in the motif, it is called a simple motif. For example, in the genome of Saccharomyces cerevisiae, the binding sites of transcription factor, GAL4, have as consensus [1], the simple motif, CGG[11,11]CCG. Here [11,11] means that there is a fixed "gap" (or don't care characters), 11 positions long. If variable gaps are allowed in a motif, it is called a structured motif. A structured motif can be regarded as an ordered collection of simple motifs with gap constraints between each pair of adjacent simple motifs. For example, many retrotransposons in the Ty1copia group [2] have as consensus the structured motif: MT[115,136]MTNTAYGG[121,151]GTNGAYGAY. Here MT, MTNTAYGG and GTNGAYGAY are three simple motifs; [115,136] and [121,151] are variable gap constraints ([minimum gap, maximum gap]) allowed between the adjacent simple motifs. More formally, a structured motif, $\mathcal{M}$, is specified in the form:
M_{1}[l_{1}, u_{1}]M_{2}[l_{2}, u_{2}]M_{3} ... M_{k1}[l_{k1}, u_{k1}]M_{ k }
where M_{ i }, 1 ≤ i ≤ k, is a simple motif component, and l_{ i }and u_{ i }(for 1 ≤ i <k and where 0 ≤ l_{ i }≤ u_{ i }), are the minimum and maximum number of gaps allowed between M_{ i }and M_{i+1}, respectively. Note that a gap is defined to be the number of intervening positions after M_{ i }but before M_{i+1}. In other words, if s_{ i }and e_{ i }represent the start and end positions of component M_{ i }, then for i ∈ [1, k  1], the number of gaps is given as g_{ i }= e_{i+1} s_{ i } 1, and we require that g_{ i }∈ [l_{ i }, u_{ i }]. The number of simple motif components, k, is also called the length of $\mathcal{M}$. Let W_{ i }, 1 ≤ i <k, denote the span of the gap range, [l_{ i }, u_{ i }], which is calculated as: W_{ i }= u_{ i } l_{ i }+ 1.
In the structured motif extraction problem, the component motifs M_{ i }are unknown before the extraction. However, we do provide some known parameters to restrict the structured motifs to be extracted, including: (i) k – the length of $\mathcal{M}$; (ii) M_{ i } – the length of each component M_{ i }∈ $\mathcal{M}$, for 1 ≤ i ≤ k; and (iii) [l_{ i }, u_{ i }] – the gap range between M_{ i }and M_{i+1}, for 1 ≤ i <k. All these parameters define a structured motif template, $\mathcal{T}$, for the structured motifs to be extracted from a set of sequences $\mathcal{S}$. A structured motif $\mathcal{M}$ matching the template $\mathcal{T}$ in $\mathcal{S}$ is called an instance of $\mathcal{T}$. We use K to denote the number of symbols (not counting gaps) in $\mathcal{M}$ and use $\mathcal{M}$[j] (with 1 ≤ j ≤ K) to denote the j th symbol of $\mathcal{M}$.
Let δ_{ S }($\mathcal{M}$) denote the number of occurrences of an instance motif $\mathcal{M}$ in a sequence S ∈ $\mathcal{S}$. Let d_{ S }($\mathcal{M}$) = 1 if δ_{ S }($\mathcal{M}$) > 0 and d_{ S }($\mathcal{M}$) = 0 if δ_{ S }($\mathcal{M}$) = 0. The support of motif $\mathcal{M}$ in the is defined as $\pi (\mathcal{M})={\displaystyle {\sum}_{S\in \mathcal{S}}{d}_{S}}(\mathcal{M})$, i.e., the number of sequences in $\mathcal{S}$ that contain at least one occurrence of $\mathcal{M}$. The weighted support of $\mathcal{M}$ is defined as ${\pi}_{w}(\mathcal{M})={\displaystyle {\sum}_{S\in \mathcal{S}}{\delta}_{S}}(\mathcal{M})$, i.e., total number of occurrences of $\mathcal{M}$ over all sequences in $\mathcal{S}$. We use $\mathcal{O}$ ($\mathcal{M}$) to denote the set of all occurrences of a structured motif $\mathcal{M}$. Given a userspecified quorum threshold q ≥ 1, a motif that occurs at least q times will be called frequent.
There are two main tasks in the structured motif extraction problem: a) Common Motifs – find all motifs $\mathcal{M}$ in a set of sequences $\mathcal{S}$, such that the support of $\mathcal{M}$ is at least q, b) Repeated Motifs – find all motifs in a single sequence S, such that the weighted support of $\mathcal{M}$ is at least q. Furthermore, the structured motif extraction problem allows several variations:

Substitutions: $\mathcal{O}$ may consist of similar motifs, as measured by Hamming Distance[3], instead of exact matches, to the simple motifs in $\mathcal{M}$. We can either allow for at most ε_{ i }errors for each simple motif M_{ i }, 1 ≤ i ≤ k, or at most ε errors for the whole structured motif $\mathcal{M}$.

Overlapping Components: The variable gap constraints (l_{ i }and u_{ i }) can take on a limited range of negative values, allowing search for overlapping simple motifs. We allow two adjacent components M_{ i }and M_{i+1}to overlap, but we require that M_{i+1}does not precede M_{ i }. This condition can be satisfied by the following constraints on the gap range [l_{ i }, u_{ i }]: M_{ i } ≤ l_{ i }≤ u_{ i }, for i ∈ [l, k). For example the search for motif template NNN[2,2]NNN (where 'N' stands for any of the four DNA bases: A,C,G,T), may discover the pattern ACG[2,2]CGA, representing an overlapped occurrence, ACGA, as well as a nonoverlapped occurrence, ACGCGA, at the two extremes of the gap range.

Motif Length Ranges: Each simple motif M_{ i }in a template $\mathcal{M}$ can be of a range of lengths, i.e., M_{ i } ∈ [l_{ a }, l_{ b }], where l_{ a }and l_{ b }are the lower and upper bounds on the desired length.
Structured motif extraction.
Sequence S_{1} (∈ $\mathcal{S}$):  CCGTACCGAACCTCAAA 

Sequence S_{2} (∈ $\mathcal{S}$):  CCG TTAT A G G A AC CA TT 
Sequence S_{3} (∈ $\mathcal{S}$):  TAT GG AACCA TCTT 
Sequence S_{4} (∈ $\mathcal{S}$):  TAA CGG ATCCCT TT 
Structured Motif Template ($\mathcal{T}$):  NNN[0,3]NN[1,3]NNNN 
Quorum (q):  2 
In this paper, we propose EXMOTIF, an efficient algorithm for both the structured motif extraction problems. It uses an inverted index of symbol positions, and it enumerates all structured motifs by positional joins over this index. The variable gap constraints are also considered at the same time as the joins, resulting in considerable efficiency. In order to save time and space, we only keep the start positions of each intermediate pattern during the positional join.
Related work
Many simple motif extraction algorithms have been proposed primarily for extracting the transcription factor binding sites, where each motif consists of a unique binding site [4–10] or two binding sites separated by a fixed number of gaps [11–13]. A pattern with a single component is also called a monad pattern. Structured motif extraction problems, in which variable number of gaps are allowed, have attracted much attention recently, where the structured motifs can be extracted either from multiple sequences [14–21] or from a single sequence [22, 23]. In many cases, more than one transcription factor may cooperatively regulate a gene. Such patterns are called composite regulatory patterns. To detect the composite regulatory patterns, one may apply single binding site identification algorithms to detect each component separately. However, this solution may fail when some components are not very strong (significant). Thus it is necessary to detect the whole composite regulatory patterns (even with weak components) directly, whose gaps and other possibly strong components can increase its significance.
Several algorithms have been used to address the composite pattern discovery with two components, which are called dyad patterns. Helden et al. [11] propose a method for dyad analysis, which exhaustively counts the number of occurrences of each possible pair of patterns in the sequences and then assesses their statistical significance. This method can only deal with fixed number of gaps between the two components. MITRA [12] first casts the composite pattern discovery problem as a larger monad discovery problem and then applies an exhaustive monad discovery algorithm. It can handle several mismatches but can only handle sequences less than 60 kilobases long. CoBind [24] models composite transcription factors with Position Weight Matrices (PWMs) and finds PWMs that maximize the joint likelihood of occurrences of the two binding site components. CoBind uses Gibbs sampling to select binding sites and then refines the PWMs for a fixed number of times. CoBind may miss some binding sites since not all patterns in the sequences are considered. Moreover, using a fixed number of iterations for improvement may not converge to the global optimal dyad PWM.
SMILE [14] describes four variants of increasing generality for common structured motif extraction, and proposes two solutions for them. The two approaches for the first problem, in which the structured motif template consists of two components with a gap range between them, both start by building a generalized suffix tree for the input sequences and extracting the first component. Then in the first approach, the second component is extracted by simply jumping in the sequences from the end of the first one to the second within the gap range. In the second approach, the suffix tree is temporarily modified so as to extract the second component from the modified suffix tree directly. The drawback of SMILE is that its time and space complexity are exponential in the number of gaps between the two components. In order to reduce the time during the extraction of the structured motifs, [18] presents a parallel algorithm, PSmile, based on SMILE, where the search space is wellpartitioned among the available processors.
RISO [15–17] improves SMILE in two aspects. First, instead of building the whole suffix tree for the input sequences, RISO builds a suffix tree only up to a certain level l, called a factor tree, which leads to a large space saving. Second, a new data structure called boxlink is proposed to store the information about how to jump within the DNA sequences from one simple component (box) to the subsequent one in the structured motif. This accelerates the extraction process and avoids exponential time and space consumption (in the gaps) as in SMILE. In RISO, after the generalized factor tree is built, the boxlinks are constructed by exhaustively enumerating all the possible structured motifs in the sequences and are added to the leaves of the factor tree. Then the extraction process begins during which the factor tree may be temporarily and partially modified so as to extract the subsequent simple motifs. Since during the boxlink construction, the structured motif occurrences are exhaustively enumerated and the frequency threshold is never used to prune the candidate structured motifs, RISO needs a lot of computation during this step.
For repeated structured motif identification problem, the frequency closure property that "all the subsequences of a frequent sequence must be frequent", doesn't hold any more since the frequency of a pattern can exceed the frequency of its subpatterns. [22] introduces an closurelike property which can help prune the patterns without missing the frequent patterns. The two algorithms proposed in [22] can extract within one sequence all frequent patterns of length no greater than a length threshold, which can be either manually specified or automatically determined. However, this method requires that all the gap ranges [l_{ i }, u_{ i }], between adjacent symbols in the structured motif be the same, i.e., [l_{ i }, u_{ i }] = [l, u] for all i ∈ [1, k  1]. Moreover, approximate matches are not allowed for the structured motif.
The EXMOTIF algorithm
We first introduce our basic approach for common structured motif extraction problem. We then successively optimize it for various practical scenarios.
The basic approach
Poslists.
X  poslists 

A  {1,6,5,9,10,15,16,17, 2,5,6,8,11,12,15, 3,4,2,6,7,10, 4,3,2,3,7} 
C  {1,7,1,2,6,7,11,12,14, 2,4,1,2,13,14, 3,3,8,9,12, 4,4,4,9,10,11} 
G  {1,2,3,8, 2,3,3,9,10, 3,2,4,5, 4, 2,5,6} 
T  {1,2,4,13, 2,5,4,5,7,16,17 3,5,1,3,11,13,14, 4,5,1,8,12,13,14} 
Positional joins
We first extend the notion of poslists to cover structured motifs. The poslist of $\mathcal{M}$ in S_{ i }∈ $\mathcal{S}$ is given as the set of start positions of all the matches of $\mathcal{M}$ in S_{ i }. Let X, Y ∈ Σ_{DNA} be any two symbols, and let $\mathcal{M}$ = X[l, u]Y be a structured motif. Given the poslists of X and Y in S_{ i }for 1 ≤ i ≤ n, namely, $\mathcal{\text{P}}$(X, S_{ i }) and $\mathcal{\text{P}}$(Y, S_{ i }), the poslist of $\mathcal{M}$ in S_{ i }can be obtained by a positional join as follows: for a position x ∈ $\mathcal{\text{P}}$(X, S_{ i }), if there exists a position y ∈ $\mathcal{\text{P}}$(Y, S_{ i }), such that l ≤ y  x  1 ≤ u, it means that Y follows X within the variable gap range [l, u] in the sequence S_{ i }, and thus we can add x to the poslist of motif X[l, u]Y. Let d be the number of gaps between x ∈ $\mathcal{\text{P}}$(X, S_{ i }) and y ∈ $\mathcal{\text{P}}$(Y, S_{ i }), given as d = y  x  1.
Then, in general, there are three cases to consider in the positional join algorithm:

d <l: Advance y to the next element in $\mathcal{\text{P}}$(Y, S_{ i }).

d > u: Advance x to the next element in $\mathcal{\text{P}}$(X, S_{ i }).

l ≤ d ≤ u: Save this occurrence in $\mathcal{\text{P}}$(X[l, u]Y, S_{ i }), and then advance x to the next element in $\mathcal{\text{P}}$(X, S_{ i }).
Given a longer motif $\mathcal{M}$, the positional joins start with the last two symbols, and proceed by successively joining the poslist of the current symbol with the intermediate poslist of the suffix. That is, the intermediate poslist for a (l+1)length pattern (with l ≥ 1) is obtained by doing a positional join of the poslist of the pattern's first symbol, called the head symbol, with the poslist of its llength suffix, called the tail. As the computation progresses the previous tail poslists are discarded. Combined with the fact that only start positions are kept in a poslist, this saves both time and space.
In order to enumerate all frequent motifs instances $\mathcal{M}$ in $\mathcal{S}$, EXMOTIF computes the poslist for each $\mathcal{M}$ and report $\mathcal{M}$ only if its support is no less than the quorum (q). A straightforward approach is to directly perform positional joins on the symbols from the end to the start for each $\mathcal{M}$. This approach leads to much redundant computation since simple motif components may be shared among several structured motifs. EXMOTIF, in contrast, performs two steps: it first computes the poslists for all simple motifs in $\mathcal{S}$ by doing positional joins on poslists of its symbols, and it then computes the poslist for each structured motif by doing positional joins on poslists of its simple motif components. EXMOTIF handles both simple and structured motifs uniformly, by adding the gap range [0, 0] between adjacent symbols within each simple motif M_{ i }. For our example in Table 1, the structured motif template $\mathcal{T}$ becomes: N[0,0]N[0,0]N[0,1]N[0,0]N[2,3]N[0,0]N[0,0]N[0,0]N. Also since we only report frequent motifs, we can prune the candidate patterns during the positional joins based on the closure property of support (note however that this cannot be done for weighted support).
Extraction of the simple motifs
Given a template motif $\mathcal{T}$, we know the lengths of the simple motif components desired. A naive approach is to directly do positional joins on the symbols from the end to the start of each simple motif. However, since some simple motifs are of the same length and the longer simple motifs can be obtained by doing positional joins on the shorter simple motifs/symbols, we can avoid some redundant computation. Note also that the gap range inside the simple motif is always [0,0].
Let $\mathcal{L}$ = {L_{ i }, 1 ≤ i ≤ m}, where L_{ i }is the length of each simple motif in $\mathcal{T}$ and assume $\mathcal{L}$ is sorted in the ascending order. For each L_{ i }, 1 ≤ i ≤ m, we need to enumerate ${\left{\Sigma}_{\text{DNA}}\right}^{{L}_{i}}$ possible simple motifs. Let $ma{x}_{\mathcal{L}}$ be the maximum length in $\mathcal{L}$. We can compute the poslists of simple motifs sequentially from length 1 to $ma{x}_{\mathcal{L}}$. But this may waste time in enumerating some simple motifs of lengths that are not in $\mathcal{L}$. Instead, EXMOTIF first computes the poslists for the simple motifs of lengths that are powers of 2. Formally, let J be an integer such that 2^{ J }≤ $ma{x}_{\mathcal{L}}$ < 2^{J+1}. We extract the patterns of length 2^{ j }by doing positional joins on the poslists of patterns of length 2^{j1}for all 1 ≤ j ≤ J. For example, when $ma{x}_{\mathcal{L}}$ = 11, EXMOTIF first computes the poslists for simple motifs of length 2^{0} = 1, 2^{1} = 2, 2^{2} = 4 and 2^{3} = 8.
EXMOTIF then computes the poslists for the simple motifs of L_{ i }∈ $\mathcal{L}$, by doing positional joins on simple motifs whose poslist(s) have already been computed and their lengths sum to L_{ i }. For example, when L_{ i }= 11, EXMOTIF has to join motifs of lengths 8, 2, and 1. It first obtains all motifs of length 8 + 2 = 10, and then joins the motifs of lengths 10 and 1, to get the poslists of all simple motifs of length 10 + 1 = 11. The poslists for the simple motifs of length L_{ i }∈ $\mathcal{L}$ are kept for further use in the structured motif extraction. At the end of the first phase, EXMOTIF has computed the poslists for all simple motif components that can satisfy the template.
Extraction of the structured motifs
We extract the structured motifs by doing positional joins on the poslists of the simple motifs from the end to the start in the structured motif $\mathcal{M}$. Formally, let H[l, u]T be an intermediate structured motif, with simple motif H as the head, and a suffix structured motif T as tail. Then $\mathcal{\text{P}}$(H[l, u]T) can be obtained by doing positional joins on $\mathcal{\text{P}}$(H) and $\mathcal{\text{P}}$(T). Since $\mathcal{\text{P}}$(H) keeps only the start positions, we need to compute the corresponding end positions for those occurrences of H, to check the gap constraints. Since only exact matches or substitutions are allowed for simple motifs, the end position is simply s + H  1 for a start position s.
Fullposition recovery
In our positional join approach, to save time and space we retain only the motif start positions, however, in some applications, we may need to know the full position of each occurrence, i.e., the set of matching positions for each symbol in the motif. EXMOTIF records some "indices" during the positional joins in order to facilitate full position recovery.
Length ranges for simple motifs
EXMOTIF also allows variation in the lengths of the simple motifs to be found. For example, a motif template may be specified as M_{1}[5,10] M_{2}, M_{1} ∈ [2,4], and M_{2} ∈ [6,7], which means that we have to consider NN, NNN, and NNNN as the possible templates for M_{1} and similarly for M_{2}. A straightforward way for handling length ranges is to enumerate exhaustively all the possible subtemplates of $\mathcal{T}$ with simple motifs of fixed lengths and then to extract each subtemplate separately. Instead, EXMOTIF does an optimized extraction. EXMOTIF reuses the partial poslists created when using a depth first search to enumerate and extract the subtemplates.
Handling substitutions
As mutations are a common phenomena in biological sequences, we allow substitutions in the extracted motifs. That is two motif instances may be considered to be the same if they are within the allowed substitution thresholds. EXMOTIF allows users to specify the number of substitutions allowed for the whole motif (ε), and also a per simple motif threshold (ε_{ i }, i ∈ [1, k]). There are two types of substitutions we consider.
Positionspecific substitutions
IUPAC alphabet (Σ_{IUPAC}).
Symbol  A  C  G  T 

Bases  A  C  G  T 
Symbol  U  R  Y  K 
Bases  U  A,G  C,T  G,T 
Symbol  M  S  W  B 
Bases  A,C  G,C  A,T  C,G,T 
Symbol  D  H  V  N 
Bases  A,G,T  A,C,T  A,C,G  A,C,G,T 
Arbitrary substitutions
Here we allow a DNA symbol in $\mathcal{M}$ to be substituted with other symbols across all positions (i.e., in a position independent manner), up to the allowed maximum errors per motif (or per component). To count the support for a motif, EXMOTIF has to consider all of its neighbors as well, which are defined as all the motifs (including itself) within Hamming distance, ε (or per motif e_{ i }). Then the support of an instance motif is calculated as the total number of sequences in which its neighbors (including itself) are present. As always, the motif is frequent if its support meets the quorum q, that is, its neighbors are present in at least q distinct sequences.
The main challenge is that when arbitrary, position independent substitutions are allowed, we cannot do support checking during each positional join, since the support of the current motif may be below quorum, but combined with its neighbors it may meet quorum. Thus EXMOTIF does support checking at two points. First, it checks for quorum after the poslists of all the simple motifs in $\mathcal{T}$ have been computed, provided the per motif error thresholds e_{ i }have been specified. In this case each simple motif must be frequent to be extended to a structured motif. Second, it checks for quorum after the poslists of all the structured motifs that satisfy $\mathcal{T}$ are computed.
Determining neighbors
In order to quickly find all the existing neighbors of a motif within the allowed error thresholds, EXMOTIF first computes all the exact structured motifs, and stores them into a hash table to facilitate fast lookup. Then for each extracted structured motif $\mathcal{M}$, EXMOTIF enumerates all its possible neighbors and checks whether they exist in the hash table. One problem is that the number of possible neighbors of $\mathcal{M}$ can be quite large. When we allow ε_{ i }substitutions for simple component M_{ i }in $\mathcal{M}$, for 1 ≤ i ≤ k, the number of $\mathcal{M}$'s neighbors is given as ${\prod}_{i=1}^{k}[{\displaystyle {\sum}_{j=0}^{{e}_{i}}\left(\begin{array}{c}\left{M}_{i}\right\\ j\end{array}\right)\cdot {3}^{j}}]$. For example, for $\mathcal{M}$ = AACGTT[1,5]AGTTCC, when we allow one substitution for each simple motif, the number of its neighbors is 361; when we allow two substitutions per component, the number of its neighbors is 23,716. Instead of enumerating the potentially large number of neighbors (many of which may not even occur in the sequence set $\mathcal{S}$) for each structured motif $\mathcal{M}$ individually, EXMOTIF utilizes the observation that many motifs have shared neighbors, and thus previously computed support information can be reused. EXMOTIF enumerates neighbors in two steps. In the first step, for each $\mathcal{M}$, it enumerates aggregate neighbor motifs, replacing the allowed number of errors e_{ i }with as many 'N' symbols (which stands for A,C,G, or T). The number of possible aggregate neighbors is given as ${\prod}_{i=1}^{k}\left(\begin{array}{c}\left{M}_{i}\right\\ {\epsilon}_{i}\end{array}\right)$. The second step, it computes the support for each aggregate neighbor by expanding each 'N' with each DNA symbol, looking up the hash table for the support of the corresponding motif, and adding the supports for all matching motifs. Since the motifs matching an aggregate are also neighbors of each other, the support of the aggregate can be reused to compute the support of other matching motifs as well. Once the supports for all aggregate neighbors have been computed, the final support of the structured motif $\mathcal{M}$ can be obtained. Thus for each $\mathcal{M}$, the number of "neighbors" to consider can be as low as ${\prod}_{i=1}^{k}\left(\begin{array}{c}\left{M}_{i}\right\\ {\epsilon}_{i}\end{array}\right)$!
Counting support
There are two methods to record the support for each motif. In the first method, we associate each motif with a bit vector, $\mathcal{V}$. Each bit, $\mathcal{V}$_{ i }for 1 ≤ i ≤ n (where n = $\mathcal{S}$) indicates whether the motif is present in the sequence S_{ i }∈ $\mathcal{S}$. The support of the motif is the number of set bits in $\mathcal{V}$. Thus to obtain the support for a motif, we can simply union the bit vectors of all its (aggregate) neighbors. Using one bit to represent a sequence saves space, and also saves time via the union operation. However, since we need n fixed bits for each motif to store its bit vector, this is not efficient if there are many sequences, and if a motif occurs only in a small number of sequences, which leads to a sparse bit vector. Thus in the second method, EXMOTIF associates each motif with an identifier array, $\mathcal{Q}$, to only store the sequence identifiers in which the motif occurs. EXMOTIF can then obtain the support for a motif by scanning the identifier arrays of its neighbors in linear time. For example consider again our motif (from Table 1), TAT[0,1]GG[2,3]CCAT, which occurs in S_{2} and S_{3}, Its bit vector is thus $\mathcal{V}$ = {0110} and its identifier array $\mathcal{Q}$ = {2, 3}.
Creating positional weight matrices
For any frequent structured motif $\mathcal{M}$, we can summarize the information about its neighbors (including $\mathcal{M}$) by computing a Positional Weight Matrix (PWM). The PWM for a structured motif $\mathcal{M}$ gives for each nongap position the likelihood of occurrence for each symbol in Σ_{DNA}. The PWM $\mathcal{W}$ for $\mathcal{M}$ is calculated as follows:
$\begin{array}{cc}{r}_{ij}=\frac{{f}_{ij}+{p}_{i}}{{\displaystyle {\sum}_{k=1}^{\left{\Sigma}_{\text{DNA}}\right}{f}_{kj}+{p}_{k}}},& {\mathcal{W}}_{ij}=\mathrm{ln}\left(\frac{{r}_{ij}}{{p}_{i}}\right)\end{array}\left(1\right)$
where, f_{ ij }and r_{ ij }represent the observed and relative frequency of symbol i at position j, respectively, p_{ i }is the prior probability of symbol i, and $\mathcal{W}$_{ ij }is the weight (loglikelihood) of observing symbol i at position j. Whereas $\mathcal{W}$ gives the likelihood of observing a given symbol in a given position in $\mathcal{M}$ it does not account for the degree to which some symbols are conserved at some positions. We can adjust the weights $\mathcal{W}$_{ ij }by considering the information content at each position. The information content for a PWM is given as:
$\begin{array}{ccc}{\mathcal{I}}_{ij}={r}_{ij}\mathrm{ln}({r}_{ij}){p}_{i}\mathrm{ln}({p}_{i}),& {\mathcal{I}}_{j}={\displaystyle \sum _{i=1}^{\left{\Sigma}_{\text{DNA}}\right}{\mathcal{I}}_{ij}},& {\mathcal{I}}_{\mathcal{W}}={\displaystyle \sum _{j=1}^{K}{\mathcal{I}}_{j}}\end{array}\left(2\right)$
where K is the number of symbols in $\mathcal{M}$; $\mathcal{I}$_{ ij }is the information content of symbol i at position j; $\mathcal{I}$_{ j }is the information content over all bases at position j; and ${\mathcal{I}}_{\mathcal{W}}$ is the information content of the entire matrix $\mathcal{W}$. To allow mismatches at less conserved positions to be more easily tolerated than those at highly conserved positions, we multiply each $\mathcal{W}$_{ ij }by $\mathcal{I}$_{ j }, which is larger for more conserved positions. As a result, the corrected weight of each element in the PWM $\mathcal{W}$ becomes:
${\mathcal{W}}_{ij}^{c}={\mathcal{I}}_{j}\mathrm{ln}\left(\frac{{r}_{ij}}{{p}_{i}}\right)\left(3\right)$
Then we can calculate the PWM score, $\mathcal{R}$, for a structured motif, $\mathcal{M}$, by summing up the positional weights for the bases in $\mathcal{M}$, given as $\mathcal{R}={\displaystyle {\sum}_{j=1}^{K}{\mathcal{W}}_{\mathcal{M}[j]j}^{c}}$. Thus for each $\mathcal{M}$, its PWM score and PWM information content can be further used to measure whether $\mathcal{M}$ is a significant motif.
Solving repeated structured motif identification problem
In repeated structured motif identification problem, the frequency closure property (that all the subsequences of a frequent sequence must be frequent), does not hold any more. For example, the sequence GCTTT, has three occurrences of pattern G[1,3]T, but it subpattern, G, has only one occurrence. Thus we cannot apply the closure property for pruning candidates. Nevertheless, a bound on the frequency of a subpattern can be established, which can be used for pruning.
Theorem 1. Let $\mathcal{M}$ = M_{1} ... M_{ k }be a structured motif and ${\mathcal{M}}^{\prime}$ = M_{ i }... M_{ k }be a suffix of $\mathcal{M}$, for 1 ≤ i ≤ k. If the weighted support of $\mathcal{M}$is π_{ w }($\mathcal{M}$), then ${\pi}_{w}({\mathcal{M}}^{\prime})\ge \frac{{\pi}_{w}(\mathcal{M})}{{\displaystyle {\prod}_{m=1}^{i1}{W}_{m}}}$, where W_{ m }= u_{ m } l_{ m }+ 1 is the span of the gap range for m ∈ [1, k  1].
Proof. Let $\mathcal{O}$($\mathcal{M}$) be the occurrence set of $\mathcal{M}$ and $\mathcal{O}$(${\mathcal{M}}^{\prime}$) be the occurrence set of ${\mathcal{M}}^{\prime}$. For each occurrence of ${\mathcal{M}}^{\prime}$ in $\mathcal{O}$(${\mathcal{M}}^{\prime}$), we can extend it to get occurrences of $\mathcal{M}$ in $\mathcal{O}$($\mathcal{M}$) by adding M_{1} ... M_{i1}before ${\mathcal{M}}^{\prime}$. This leads to at most $\prod}_{m=1}^{i1}{W}_{m$ occurrences of $\mathcal{M}$ for any occurrence of ${\mathcal{M}}^{\prime}$. Thus $\left\mathcal{O}({\mathcal{M}}^{\prime})\right\cdot {\displaystyle {\prod}_{m=1}^{i1}{W}_{m}}\ge \left\mathcal{O}(\mathcal{M})\right$, which immediately gives ${\pi}_{w}({\mathcal{M}}^{\prime})\ge \frac{{\pi}_{w}(\mathcal{M})}{{\displaystyle {\prod}_{m=1}^{i1}{W}_{m}}}$. □
With Theorem 1, EXMOTIF can calculate a support bound for any suffix ${\mathcal{M}}^{\prime}$ of $\mathcal{M}$, given the quorum requirement q. For example, assume that the motif template is NN[3,5]NNN[0,4]NNN and q = 100, with W_{1} = 5  3 + 1 = 3 and W_{2} = 4  0 + 1 = 5. When processing the suffix component ${\mathcal{M}}^{\prime}$ = NNN, we require that π_{ w }(${\mathcal{M}}^{\prime}$) ≥ $\frac{100}{3\times 5}$ = 6; when processing ${\mathcal{M}}^{\prime}$ = NNN[0,4]NNN, we require that π_{ w }(${\mathcal{M}}^{\prime}$) ≥ $\frac{100}{3}$ = 33. Thus even the weaker bounds can lead to some pruning.
The complete EXMOTIF algorithm: complexity analysis
EXMOTIF initially adjusts the support thresholds if the task is repeated motif identification (lines 1–2). The main approach for handling exact matches or positionspecific substitutions is the same. The main difference is that while enumerating the simple motifs, EXMOTIF uses the appropriate IUPAC alphabet (specified by v_{ i }for component M_{ i }; lines 6–7). The structured motifs are found via positional joins over the simple motifs (line 8). The positional joins are performed as described in Figure 1. For arbitrary substitutions, EXMOTIF first enumerates the simple motifs (line 9) and checks their aggregate support (i.e., including the supports of all neighbors within error ε_{ i }). From these, the structured motifs are enumerated and stored in a hashtable (ℍ; line 11). Lastly, the aggregate support of all these motifs is computed as described in Figure 5 (line 12). Those that meet the quorum will be output. Finally, if desired, EXMOTIF recovers the full positions for each occurrence, via the procedure outlined in Figure 2.
In terms of the computational complexity of EXMOTIF, let's first consider the complexity of extracting the simple motifs. Assume that m is the length of the longest simple motif component in the structured template $\mathcal{T}$. Note that there are potentially Σ^{ m }frequent simple motifs at that length, but due to the quorum requirement, many of these will not be frequent. Nevertheless, in the worst case O(Σ^{ m }) simple components may be extracted. For a simple motif of length m, EXMOTIF uses O(log(m)) positional joins to obtain its support, and each such join takes O(N) time, where $N={\displaystyle {\sum}_{i=1}^{n}\left{S}_{i}\right}$ is the sum of the lengths of all the sequences S_{ i }in the database $\mathcal{S}$. Thus, extracting the simple motifs takes time O(N log(m)Σ^{ m }) in the worst case.
With Σ^{ m }simple motifs, there are O(Σ^{ mk }) potential structured motifs, though a vast majority of these will not meet the quorum requirement. Extracting the structured motifs then takes time O(kNΣ^{ mk }) for the exact match and positionspecific substitution cases. For arbitrary substitutions there is additional cost of enumerating aggregate neighbors and computing their support. For each motif we have to consider ${\prod}_{i=1}^{k}\left(\begin{array}{c}\left{M}_{i}\right\\ {\epsilon}_{i}\end{array}\right)$ = km^{ e }aggregate neighbors, where e = max_{ i }{e_{ i }}. Furthermore, an aggregate neighbor can have kΣ^{ e }matching motifs. Thus the time complexity of extracting all the structured motifs is O(kNΣ^{ mk }+ k^{2}m^{ e }Σ^{ e }) for arbitrary substitutions. Since typically mk > e and N > m^{ e }, the time complexity is essentially O(kNΣ^{ mk }). Combined with the cost for simple motif extraction, the computational complexity of EXMOTIF is then given as O(log(m) N Σ^{ m }+ kNΣ^{ km }) = O(kNΣ^{ km }).
Experimental results
EXMOTIF has been implemented in C++, and compiled with g++ v4.0.0 at optimization level 3 (O3). We performed experiments on a Macintosh PowerPC G5 with dual 2.7GHz processors and 4GB memory running Mac OS X vl0.4.5. We compare our results with the latest version of RISO [15–17] (called RISOTTO [17]), the best previous algorithm for structured motif extraction problem.
EXMOTIF and RISO: comparison
Exact matching
In the first experiment, shown in Figure 7(a), we randomly generated 100 structured motif templates, with k ∈ [2,4] simple motifs of length l ∈ [4,7] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random subinterval of [0, 200]. The xaxis is sorted on the number of motifs extracted. For clarity we plot average times for the methods when the number of motifs extracted fall into the given range on the xaxis. For example, the time plotted for the range [10^{2}, 10^{3}) is the average time for all the random templates that produce between 100 and 1000 motifs. We find that the average running time for RISO across all extracted motifs is 120.7s, whereas for EXMOTIF it takes 88.4s for reporting only the supports, and 91.3s for also reporting all the occurrences. The median times were 26.3s, 8.5s, and 9.2s, respectively, indicating a 3 times speedup of EXMOTIF over RISO.
In the next set of experiments we varied one parameter while keeping the others fixed. We set the default quorum to 12% (q = 127), the default gap ranges to [0,100], the default simple motif length to l = 4 (NNNN), and the default number of components k = 3 (e.g., NNNN[0,100]NNNN[0,100]NNNN). In Figure 7(b), we plot the time as a function of the number of simple motifs k in the template. We find that as the number of components increases the time gap between EXMOTIF and RISO increases; for k = 4 simple motifs, EXMOTIF is around 5 times faster than RISO. Figure 7(c) shows the effect of increasing gap ranges, from [0,0] to [0,200]. We find that as the gap range increases the time for EXMOTIF increases at a slower rate compared to RISO. For [0,200], EXMOTIF is 3–4 times faster than RISO depending whether only frequency or full occurrences are reported. In Figure 7(d), as the quorum threshold increases, the running time goes down for both methods. For quorum 24%, EXMOTIF is 4–5 times faster than RISO. As support decreases, the gap narrows somewhat, but EXMOTIF remains 2–3 times faster. Finally, Figure 7(e) plots the effect of increasing simple motif lengths l ∈ [2,6]. We find that the time first increases and then decreases. This is because there are a large number of motif occurrences for length 3 and length 4, but relatively few occurrences for length 5 and length 6. Depending on the motif lengths, EXMOTIF can be 3–40 times faster than RISO for comparable output, i.e., reporting only the support. EXMOTIF remains up to 5 times faster when also reporting the actual occurrences.
To compare the performance for extracting structured motifs with length ranges, we used the template $\mathcal{T}$ = M_{1}[50, 100] M_{2}[1,50]M_{3}[20, 100]M_{4} with q = 12%, where M_{1} ∈ [2,4], M_{2} ∈ [3,4], M_{3} ∈ [5,6], M_{4} ∈ [4,5]. EXMOTIF took 78.4s, whereas RISO took 1640.9s to extract 14,174 motifs.
Approximate matching
In the first experiment, shown in Figure 8(a), we randomly generated 30 structured motif templates, with k ∈ [2,3] simple motifs of length l ∈ [3,6] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random subinterval of [10, 30]. The xaxis is sorted on the number of motifs extracted, and average times are plotted for the extracted number of motifs in the given range. We find that the average running time for RISO is 334.5s, whereas for EXMOTIF it takes 59.3s seconds for reporting only the support, and 176.7s for also reporting all the occurrences. Thus EXMOTIF is on average 5 times faster than RISO, with comparable output.
Figures 8(b)–(e) plot the time for approximate matching as a function of different parameters. We set the default quorum to 12% (q = 127, out of $\mathcal{S}$ = 1062 sequences), the default gap ranges to [12,22], the default simple motif length to l = 6 (NNNNNN), and the default number of components k = 2 (e.g., NNNNNN[12,22]NNNNNN). Figure 8(b) shows how increasing gap ranges effect the running time; for gap range [8,26] between the two motif components, EXMOTIF is 2–3 times faster than RISO. In Figure 8(c), we increase the numbers of arbitrary substitutions allowed for each simple motif; a pair (ε_{1}, ε_{2}) on the xaxis denotes that ε_{1} substitutions are allowed for motif component M_{1}, and ε_{2} for M_{2}. We can see that EXMOTIF is always faster than RISO. It is 9 times faster when only frequencies are reported, and it can be up to 5 times faster then full occurrences are reported, though for some cases the difference is slight.
Figure 8(d) plots the effect of the quorum threshold. Compared to RISO, EXMOTIF performs much better for low quorum, e.g., for q = 4% EXMOTIF is 4–5 times faster than RISO. Finally in Figure 8(e), as the simple motif lengths increase, the time for both EXMOTIF and RISO increases, and we find that EXMOTIF can be 2–3 times faster.
Comparison of EXMOTIF and RISO for different quorums and allowed substitutions.
Quorum  #Substitutions  RISO  EXMOTIF  EXMOTIF(#) 

5%  (0, 0)  1.82s  1.42s  1.52s 
30%  (1, 1)  63.01s  58.91s  64.52s 
60%  (2, 2)  2763.31s  328.43s  2317.35s 
90%  (3, 3)  13682.13s  707.56s  41464.93s 
Real applications
Discovery of single transcription factor binding sites
We evaluate our algorithm by extracting the conserved features of known transcription factor binding sites in yeast. In particular we used the binding sites for the Zinc (Zn) factors [11]. There are 11 binding sites listed for the Zn cluster, 3 of which are simple motifs. The remaining 8 are structured, as shown in Table 5. For the evaluation, we first form several structured motif templates according to the conserved features in the binding sites. Then we extract the frequent structured motifs satisfying these templates from the upstream regions of 68 genes regulated by zinc factors [11]. We used the 1000 to 1 upstream regions, truncating the region if and where it overlaps with an upstream openreading frame (ORF). After extraction, since binding sites cannot have many occurrences in the ORF regions, we drop some motifs if they also occur frequently in the ORF regions (i.e., within the genes). Finally, we calculate the Zscores for the remaining frequent motifs, and rank them by descending Zscores. In our experiments, we set the minimum quorum threshold to 7% within the upstream regions and the maximum support threshold to 30% in the ORF regions. We use the shuffling program from SMILE [14] to compute the Zscores. The shufffing program randomly shuffles the original input sequences to obtain a new shuffled set of sequences.
Regulons of Zn cluster proteins.
TF Name  Known Motif  Predicted Motifs  NumMotifs  Ranking 

GAL4 GAL4 chips  CGGRnnRCYnYnCnCCG  CGG[11,11]CCG  1634(3346)  1/1 
CAT8  CGGnnnnnnGGA  CGG[6,6]GGA  1621(3356)  147/13 
HAP1  CGGnnnTAnCGGCGGnnnTAnCGGnnnTA  CGG[6,6]CGG  1621(3356)  111/146 
LEU3  RCCGGnnCCGGY  CCG[4,4]CGG  1588(3366)  2/1 
LYS  WWWTCCRnYGGAWWW  TCC[3,3]GGA  1605(3360)  33/21 
PPR1  WYCGGnnWWYKCCGAW  CGG[6,6]CCG  1621(3356)  1/2 
PUT3  YCGGnAnGCGnAnnnCCGA CGGnAnGCnAnnnCCGA  CGG[10,11]CCG  727(4035)  1/1 
Discovery of composite regulatory patterns
UASH and URS1H binding sites.
Genes  UASH  URS1H  Gap  

Site  Pos  Site  Pos  
ZIP1  GATTCGGAAGTAAAA  42  ==TCGGCGGCTAAAT  22  5 
MEI4  TCTTTCGGAGTCATA  121  ==TGGGCGGCTAAAT  98  8 
DMC1  TTGTGTGGAGAGATA  175  AAATAGCCGCCCA==  143  17 
SPO13  TAATTAGGAGTATAT  119  AAATAGCCGCCGA==  100  4 
MER1  GGTTTTGTAGTTCTA  152  TTTTAGCCGCCGA==  115  22 
SPO16  CATTGTGATGTATTT  201  ==TGGGCGGCTAAAA  90  96 
REC104  CAATTTGGAGTAGGC  182  ==TTGGCGGCTATTT  93  74 
RED1  ATTTCTGGAGATATC  355  ==TCAGCGGCTAAAT  167  173 
REC114  GATTTTGTAGGAATA  288  ==TGGGCGGCTAACT  94  179 
MEK1  TCATTTGTAGTTTAT  233  ==ATGGCGGCTAAAT  150  68 
Consensus  taTTTtGGAGTaata  ==ttGGCGGCTAA==  [4,179] 
Conclusion and future work
In this paper, we introduced EXMOTIF, an efficient algorithm to extract structured motifs within one or multiple biological sequences. We showed its application in discovering single/composite regulatory binding sites. In the structured motif template, we assume the gap range between each pair of simple motifs is known. In the future, we plan to solve the motif discovery problem when even the gap ranges are unknown. Another potential direction is to directly extract structured profile (or position weight matrix) patterns.
Declarations
Acknowledgements
This work was supported in part by NSF CAREER Award IIS0092978, DOE Career Award DEFG0202ER25538, and NSF grants EIA0103708 & EMT0432098. We also thank the anonymous referees for their helpful suggestions.
Authors’ Affiliations
References
 Zhu J, Zhang M: SCPD: A Promoter Database of the Yeast Saccharomyces Cerevisiae. Bioinformatics. 1999, 15 (7–8): 60711.PubMedView ArticleGoogle Scholar
 Policriti A, Vitacolonna N, Morgante M, Zuccolo A: Structured Motifs Search. Symposium on Research in Computational Molecular Biology. 2004, 133139.Google Scholar
 Michailidis P, Margaritis K: Online Approximate String Searching Algorithms: Survey and Experimental Results. International Journal of Computer Mathematics. 2002, 79 (8): 867888.View ArticleGoogle Scholar
 Sinha S, Tompa M: Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation. Nucleic Acids Research. 2002, 30 (24): 554960.PubMedPubMed CentralView ArticleGoogle Scholar
 Sinha S, Tompa M: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research. 2003, 31 (13): 35863588.PubMedPubMed CentralView ArticleGoogle Scholar
 Pavesi G, Mauri G, Pesole G: A Consensus Based Algorithm for Finding Transcription Factor Binding Sites. Workshop on Genomes: Information Structure and Complexity. 2004Google Scholar
 Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001, 17 (Suppl 1): S20714.PubMedView ArticleGoogle Scholar
 Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. 3rd Int'l Conference on Intelligent Systems for Molecular Biology. 1995, 2129.Google Scholar
 Sagot MF: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. 3rd Latin American Symposium on Theoretical Informatics. 1998, 374390.View ArticleGoogle Scholar
 Friberg M, von Rohr P, Gonnet G: Scoring functions for transcription factor binding site prediction. BMC Bioinformatics. 2005, 6: 84 http://www.biomedcentral.com/14712105/6/84PubMedPubMed CentralView ArticleGoogle Scholar
 van Helden J, Rios A, ColladoVides J: Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000, 28 (8): 180818.PubMedView ArticleGoogle Scholar
 Eskin E, Pevzner P: Finding composite regulatory patterns in DNA sequences. Bioinformatics. 2002, 18 (Suppl 1): S35463.PubMedView ArticleGoogle Scholar
 Eskin E, Keich U, Gelfand M, Pevzner P: Genomewide analysis of bacterial promoter regions. Pac Symp Biocomput. 2003, 2940.Google Scholar
 Marsan L, Sagot M: Extracting Structured Motifs Using a suffix Tree – Algorithms and Application to Promoter Consensus Identification. Journal of Computational Biology. 2000, 7: 345354.PubMedView ArticleGoogle Scholar
 Carvalho A, Freitas A, Oliveira A, Sagot M: Efficient Extraction of Structured Motifs Using Boxlinks. String Processing and Information Retrieval Conference. 2004, 267278.View ArticleGoogle Scholar
 Carvalho A, Freitas A, Oliveira A, Sagot M: A highly scalable algorithm for the extraction of cisregulatory regions. AsiaPacific Bioinformatics Conference. 2005, 273283.View ArticleGoogle Scholar
 Pisanti N, Carvalho AM, Marsan L, Sagot MF: RISOTTO: Fast extraction of motifs with mismatches. 7th Latin American Theoretical Informatics Symposium. 2006Google Scholar
 Carvalho AM, Freitas AT, Oliveira AL, Sagot MF: A parallel algorithm for the extraction of structured motifs. 19th ACM Symposium on Applied Computing. 2004, 147153.Google Scholar
 Brazma A, Jonassen I, Vilo J, Ukkonen E: Pattern Discovery in Biosequences. International Colloquium on Grammatical Inference. 1998, 257270.View ArticleGoogle Scholar
 Apostolico A, Parida L: Incremental Paradigms of Motif Discovery. Journal of Computational Biology. 2004, 11: 1525.PubMedView ArticleGoogle Scholar
 Apostolico A, Comin M, Parida L: Conservative extraction of overrepresented extensible motifs. Bioinformatics. 2005, 21 (Suppl. 1): i9il8.PubMedView ArticleGoogle Scholar
 Zhang M, Kao B, Cheung DWL, Yip K: Mining Periodic Patterns with Gap Requirement from Sequences. ACM Int'l Conference on Management of Data. 2005Google Scholar
 Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research. 1999, 27 (2): 57380.PubMedPubMed CentralView ArticleGoogle Scholar
 Thakurta D, Stormo G: Identifying target sites for cooperatively binding factors. Bioinformatics. 2001, 17 (7): 608621.View ArticleGoogle Scholar
 Zaki MJ: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal. 2001, 42: 131.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.