SMOTIF: efficient structured pattern and profile motif search

Background A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. Results SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. Conclusion SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: .


Background
Searching biological sequence(s) for motifs is a fundamental task in bioinformatics. Motifs can be represented as either patterns over a specific alphabet, or profiles (also called positional weight matrix (PWM)), which give the probability of observing each symbol in each position. Motifs can be classified into two main types. If no variable gaps are allowed in the motif, it is called a simple motif. For example, in the genome of Saccharomyces cerevisiae, the binding sites of transcription factor, GAL4 [1], can be characterized by the simple motif shown in Table 1, which illustrates the pattern over the IUPAC alphabet (Σ IUPAC ; see Table 2), as well as its profile (which gives the frequency of each DNA base at each position). The motif in Table 1 only consists of one component and thus is a simple motif. Since the symbols in the first 3 positions (CGS) and in the last 3 positions (SCG) are well conserved, we can also represent this motif as CGS [11,11]SCG, where [11,11] means that there is a fixed "gap" of length 11 between the two components. If variable gaps are allowed in a motif, it is called a structured motif. A structured motif can be regarded as an ordered collection of simple motifs with gap constraints between each pair of adjacent simple motifs. For example, the LTR retrotransposons from the Copia group, corresponding to genes encoding reverse transcriptase, in A. thaliana can be characterized by the structured motif M 1 [2,5] M 2 [6,7] M 3 , as shown in Table 3 [2]. Here M 1 , M 2 and M 3 are three simple motifs; [2,5] and [6,7] are variable gap constraints ([minimum gap, maximum gap]) allowed between the adjacent simple motifs. Note that each simple motif M i (with 1 ≤ i ≤ 3) can either be a pattern over Σ IUPAC or a profile over Σ DNA . Searching for structured motifs is more complicated than searching for simple motifs, and is an ongoing research area [3][4][5][6][7]. The sequence to be searched can be very long, e.g., chromosome 1 of Homo Sapiens contains 245 million (245M) base pairs. The structured motif can also be as long as several kilobases. All these factors need to be considered when designing an efficient structured motif search algorithm.
More formally, a structured motif , is specified in the form: M1 [l1, u1] M2 [l2, u2] M3 ... Mk-1 [lk-1, uk-1] Mk, where Mi, 1 ≤ i ≤ k, is a simple motif component; and li and ui (with 0 ≤ li ≤ ui), 1 ≤ i <k, are the minimum and maximum length of the gap allowed between Mi and Mi+1, respectively. Note that a gap is defined to be the number of intervening positions after Mi but before Mi+1. In other words if si and ei represent the start and end positions of component Mi, then for i ∈ [1, k -1], the length of the gap between Mi and Mi+1 is given as gi = ei+1 -si -1, and we require that gi ∈ [li, ui]. We use |Mi| to denote the number of symbols/positions in component Mi, also called the length of the component, and we use to denote the total length of the structured motif (not counting gaps). The maximum span, L, of the motif is the maximum number of positions that can be occupied by the structured motif, which is given as . The structured motif can be either specified as a pattern or a profile. When denotes a pattern, we use the notation j to denote the symbol at position j in the motif, and when denotes a profile, we use the notation xj to denote the frequency of symbol x ∈ ΣDNA at position j, where j ∈ [1, Table 3 shows both the pattern and profile representation of an example structured motif with three components.
Depending on the application, the structured motif search problem can have several variations: • Missing Components: The matching motifs can consist of some, instead of all, the simple motifs in , allowing for at most q missing components.
• Approximate Matches: The matching motifs may consist of similar motifs (as measured by Hamming or Levenshtein distance [8]), instead of exact matches, to the simple motifs in , allowing for at most ε i errors for simple motif M i (when is expressed as a pattern).   allowing search for overlapping simple motifs. We allow two adjacent components M i and M i+1 to overlap, but we require that M i+1 does not precede M i . This condition can be satisfied by the following constraints on the gap range [l i , u i ]: -|M i | ≤ l i ≤ u i , for i ∈ [1, k). For example the search for motif ACG[-2,2]CGA, can discover the overlapped occurrence ACGA, as well as the non-overlapped occurrence ACG--CGA, at the two extremes of the gap range.
• Profile Search: The components of the motif can be specified as a pattern in either the DNA (Σ DNA ) or IUPAC (Σ IUPAC ) alphabets, or as a profile over Σ DNA .
In this paper, we focus on the problem of searching for a given structured motif in one or more sequences. We propose SMOTIF, an efficient algorithm for structured motif searches. It uses an inverted index of symbol positions, and it finds all occurrences by positional joins over this index. For structured pattern search problem, we propose two main variants of our approach: i) a direct search for simple motifs and the structured motif via positional joins, and ii) a two-step approach, where we use a suffix tree to search for simple motifs and then use positional joins for the structured motif. For structured profile search problem, we first search each simple motif by aligning its profile with the sequences, and then search structured motifs with positional joins. SMOTIF allows missing components, overlapping motifs, and also approximate matches (when using the two-step approach). SMOTIF also allows flexible matches using IUPAC symbols.
We apply SMOTIF for searching long DNA sequences for LTR retrotransposons, which constitute a substantial fraction of most Eukaryotic genomes and are believed to have a significant impact on genome structure and function [9,10]. We show that SMOTIF is effective in searching for composite regulatory patterns, and it can also suggest potentially new binding sites. We experimentally demonstrate the superiority of SMOTIF over SMARTFINDER, a state-of-the-art method for structured motif search, both in terms of time and space; SMOTIF can be up to 7 times faster and can consume 100 times less space.

Related work
Many existing pattern matching algorithms [8,[11][12][13][14][15][16][17][18] can be used to solve the simple pattern search problem. Given the sequence length, n, and the pattern length, m, exact matching algorithm can run in O (n + m) [11]; approximate matching algorithm can run in O (rn) [16,17] or O (nm/w) [18], where r is the error threshold and w is the size of a computer word. The space complexity is O (n) for both exact matching and approximate matching.
Several previous efforts have focused on the structured pattern search problem. Anrep [3,4] provides a unified biosequence pattern representation by using network expressions with spacers, where a network expression is a regular expression without Kleene closure. With network expressions, one can specify the scoring scheme and the threshold of approximate matching for each simple motif separately, the positional weights which express the relative importance of different parts of a motif, and whether a simple motif is optional. Anrep introduces a two-step approach: first it searches for simple motifs by a threshold-sensitive motif matching algorithm and then it finds the structured motif by an optimized backtracking matching algorithm. However, as compared by [6], Anrep is much slower than SMARTFINDER.
Two algorithms are proposed for CBG expression search: forward search and backward search. Bit parallelism speeds up the search, but is adequate only for a pattern whose maximum span is smaller than the length of the computer word. Also the implementation of CBG can only handle such pattern. This limits the application of CBG to searching for patterns with small number of symbols and gaps.
SMARTFINDER [6,7], which is currently the most efficient method for structured motif search, is also a two-step approach. In the first step each simple motif is searched separately by building a suffix tree for the sequence. This step outputs the ordered occurrence lists of all simple motifs. The second step solves a constraint satisfaction problem by considering constraints individually in three substeps. First it considers the gap constraints and builds a constraint graph whose nodes are the simple motif occurrences and edges connect all possible pairs of nodes that locally satisfy the gap constraints. It then considers the constraint for the maximum number of missing components and shrinks the graph to contain only the nodes that can be in the structured motif occurrences. Finally it enumerates all the valid occurrences by a depth first search (DFS). Notable differences in SMOTIF and SMART-FINDER are as follows: we search patterns directly by positional joins over an inverted index, we consider variable gap constraints during the positional joins as opposed to building a constraint graph, and we handle missing components more efficiently by considering them over patterns instead of over each occurrence as in SMARTFINDER. Note also that like Anrep and SMARTFINDER, SMOTIF can also mine approximate patterns, when using the twostep approach, which we describe later.
For profile search, MATCH [19], P-Match [20], and MatInspector [21,22] search DNA sequences against a position weight matrix library (such as TRANSFAC database [23]) and report the occurrences that satisfy given score thresholds. They compute the matrix score by multiplying the base frequency with the information content value at each position, in order to emphasize the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions. Besides the matrix score, they define a core region, which is usually the first 4-5 most conserved consecutive positions of the matrix, and perform the core score threshold check. Then they align the matrix to each position of the sequence and calculate the core score and matrix score. However, these algorithm don't consider the prior probability of each base when calculating the matrix (or core) score, and the core region is required to be consecutive. They need to check all positions of each subsequence (at least all the core positions) in order to calculate the matrix (core) score. Moreover, these algorithms only work on simple profile with one single matrix component. For structured profile search, only Anrep [3,4] provides the capability to model structured profiles, with its general network expressions. However, Anrep doesn't give a solution on score calculation and fast search for structured profiles. Moreover, its implementation doesn't support structured profile search. To our knowledge, SMOTIF is the only implemented method that can handle structured profile search.

Methods
We first introduce our basic approach for structured pattern search, and successively optimize it for various practical scenarios. We then present our approach for structured profile search.

Structured pattern search: basic approach
Let us assume that we are searching for a structured motif over a single sequence S ∈ . We assume that S is over Σ DNA , whereas, is over Σ IUPAC to allow for more flexible matches. SMOTIF first converts S into an equivalent inverted format [24,25], where we associate with each symbol in the sequence its pos-list, a sorted list of the positions where the symbol occurs in S. More formally, for a symbol X ∈ Σ IUPAC , its pos-list is given as (X, S) = {i | S is the symbol at position i in S, and |S| denotes the length of S. When S is obvious, we drop it, and denote the pos-list as (X). For our example sequence S in Table 4, the pos-lists for X ∈ Σ DNA are given in Table 5.
Depending on whether we compute the pos-lists for IUPAC symbols or not, SMOTIF uses two approaches:   • d > u: Advance x to the next element in (X).
• l ≤ d ≤ u: Save this occurrence in (X [l, u] Y), and then advance x.
The pos-list for X [l, u] Y can be computed in time linear in the lengths of (X) and (Y). In essence, each time we advance x ∈ (X), we check if there exists a y ∈ (Y) that satisfies the given gap constraint. Instead of searching for the matching y from the beginning of the pos-list each time, we search from the last position used to compare with x. This results in fast positional joins, and also allows overlaps among occurrences. For example, during the positional join for the motif T[0, 1]A, with l = 0 and u = 1, we scan the pos-lists of T and A in Table 5. Initially, x = 4 and y = 3. This gives d = 3 -4 -1 = -2 <l, thus we advance y to 10. Next, d = 10 -4 -1 = 5 > u, thus we advance x to 8.  Positional-Joins(P(X), P(Y ), l, u) tions (x and y) in adjacent columns if their difference (d = y -x -1) falls within the corresponding gap range. The joins begin with the last two symbols, with A as the head and T as the tail with a gap of Note also how position 2 that was in the tail's pos-list cannot be extended, since there is no position where A occurs within a gap of [1,4] before the tail CAT. The join process continues until the first symbol, and we finally get 5 as the only start occurrence for the full structured motif.

Full position recovery
In our positional join approach, to save time and space we retain only the motif start positions, however, in some applications, we may need to know the full position of each occurrence, i.e., the set of matching positions for each symbol in the motif. We describe two approaches to recover the full positions: recomputed or indexed fullposition recovery. As an example, let's assume we want to recover the full position for the motif = GC[1,2]T in our the sequence S from Table 4, starting from position s = 5. The recovery process is illustrated in

Recomputed full position recovery
Note that is never used. Also note that ( ) = 1 . Let s be a start position for the structured motif in sequence S, and let s be the j s -th entry in 1 , i.e., s = 1 [j s ]. Also let F store a full position starting from s, and let store the set of all full positions. Figure 5 shows the pseudo-code for recovering full positions starting from s. This recursive algorithm has two parameters: i denotes a (suffix) position in , and j gives the j-th entry in i . The algorithm is initially called with

Recomputed-Recovery(S, s)
In the next call we can follow the index 2 [j] to get the next position in F, namely F [3]. Thus in each call we keep following the indices from one pos-list to the next and finally we can get a full position starting from s when we reach the last pos-list, . Furthermore, at each suffix position i, since j only marks the first position in i+1 that satisfies the gap constraints, we also need to consider all the subsequent positions j' > j that may satisfy the corresponding gap range.

Sequence segmentation
The SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searches each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except for the last one). Handling each smaller segment S i (i ∈ [l, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition   Structured Motif first L -1 symbols from partition S i+1 . Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since it would be reported when we process segment S i+1 . For example, let S be the sequence in Table 6, and let the structured motif be = GC[1,2]T with maximum span L = 5. If p = 3, then we would have three segments of length 6 each. After adding the overlap region of L -1 = 4 positions at the end of each segment, we obtain the final three segments shown in Table 6. Two start positions of would be found in S 1 (namely 1 and 5), and one in S 2 (namely 11). Note that start positions 5 and 11 would have been missed if we had no overlap.

Indexed-Recovery
So far we have assumed that we are searching for the structured motif in a single sequence. SMOTIF can easily handle a collection of sequences. We simply search each sequence separately using segmentation when necessary.

Missing components
In some applications a partial match of the structured motif might still be of interest. SMOTIF allows up to q simple motif components to be missing during the search. Let be a structured motif with k components. SMOTIF first enumerates all possible sub-motifs having k' components, where k' ∈ [k -q, k]. Next, the gap ranges are adjusted in each sub-motif to account for skipping over the missing components. The new gap range, [l i,j , u i,j ], between components M i and M j (with 1 ≤ i <j ≤ k) in a sub-motif, is calculated as follows: , and .
For example, if we allow one (q = 1) missing component for our structured motif in Table 4 Instead of searching each sub-motif separately, we do an optimized search. We reuse the partial pos-lists created when using a depth first search to enumerate and search the sub-motifs. The idea is to re-use the pos-lists created for common suffixes when enumerating their sub-motif extensions.

Two-step approach for structured pattern search
So far we have described the direct method used by SMO-TIF to search for the structured motif by positional joins over the symbols. In fact, SMOTIF, can also follow a twostep approach like in Anrep [4] and SMARTFINDER [6]. In the first step, given and , we search for each simple motif in , i.e., M 1 , M 2 , ..., M k . This task can be solved by existing pattern matching algorithms. In the second step, we do positional joins on the pos-lists of the simple motifs. Let

Exact matching
Many algorithms [11][12][13][14] exist for exact pattern matching. Like in SMARTFINDER we use a lazy suffix tree [11] to extract the pos-lists for all simple motifs. The matching occurrences are sorted after extracting them from the suffix tree to obtain the pos-list in sorted order. For an inter-

Approximate matching
Several algorithms [8,12,[15][16][17][18] exist for approximate pattern matching. For consistency, we used Sellers' dynamic programming algorithm [26], as implemented in SMARTFINDER, to extract the pos-lists for all simple motifs with approximate matches. This algorithm is not optimal and it can be replaced by more efficient ones [16][17][18]. Since we allow a specific Levenshtein distance [8] (i.e., insertions, deletions and substitutions) between the occurrences and the motif, the length of the occurrences can be different from the component length |M i |. Thus we augment the pos-list to explicitly store the end position, in addition to the start position, for each occurrence. Figure  7(b) shows how the pos-list joins work for approximate matches of simple motifs. In the structured motif from Table 4, we consider the exact matches of GC and CAT, and the approximate matches of TTA within Levenshtein distance of 1. Each column in (b) shows the pos-list of a simple motif: the left sub-column is a list of its start positions and the right sub-column is a list of its end positions. We first join the pos-lists of TTA and CAT, checking for gap range [1,4]. We compare the end positions of TTA and the start positions of CAT and find that the pairs (9,12), (10,12), (10,15), and (11,15) all lie within the gap range (indicated by the links), and thus the pairs, (7, 10), (8,9), (8,10) and (8,11)

Structured profile search
Having outlined our approach for structured pattern search, here we tackle the problem of structured profile search. The profile (also called a position weight matrix) for a structured motif gives for each position in each component, the frequency of occurrence for each symbol in Σ DNA , i.e., xj gives how often symbol x ∈ Σ DNA occurs at position j. Note that, as in the case of pattern motifs, for a structured profile motif , we have to specify the gap constraints between its simple profile motif components. Profiles are able to better capture the variability in the  motifs, since they capture position specific statistics on the symbols, and as such can retain more valuable information than the pattern representation.
Given a profile structured motif , and a user-specified match/score threshold λ, the goal of structured profile search is to enumerate all structured patterns that match the profile motif above the threshold λ. Our approach to profile motif search consists of the following two steps: a) convert the raw frequency profile into a relative profile weighted by information content at each position, b) enumerate occurrences matching the profile.
Weighted profile creation Given the initial "raw" frequency profile for the structured motif , we first convert it into a weighted profile as follows: where xj gives the absolute frequency of symbol x ∈ Σ DNA at position j in the structured motif , for 1 ≤ j ≤ Whereas the weights computed above give the likelihood of observing a given symbol in a given position they do not account for the degree to which some symbols are conserved at some positions. We can adjust the weights by considering the information content at each position. The information content for a profile is given as: where xj is the information content of symbol x at position j; j is the information content over all bases at position j; and is the information content of the entire profile. To allow mismatches at less conserved positions to be more easily tolerated than those at highly conserved positions, we multiply each weight by j , which is larger for more conserved positions. As a result, the corrected weight of each element in the profile becomes: Given the initial profile motif we obtain its corresponding information content weighted profile for use in the enumeration step. Our weighting approach has some differences with respect to previous ones [19][20][21][22]. For instance, we consider the prior probability of each base for the calculation of the positional weight and information content. That is necessary because some bases (i.e. C and G) occur more frequently than others (i.e. A and T). Also we use relative frequency instead of the absolute one, so as to compare with the prior probability. However, note that, in general, SMOTIF is flexible enough to handle any user specified profile.

Profile scoring
Given the weighted profile for a structured motif , a sequence S ∈ , and any subsequence S' of S (not necessarily consecutive) of length | |, that satisfies the gap constraints, the profile score for subsequence S' is computed as the sum of weights of its symbols at each position in the profile, given as: (note that S' [j]j gives the weight of symbol S' [j] at position j in the profile motif). In order to check whether S' is a potential binding site, we compare its profile score with the userspecified threshold, λ ∈ [0,1]. To ensure that the profile score lies between 0 and 1, we have to normalize it. Let  When applying the score threshold for an occurrence S', we require that its normalized score is above the threshold λ. For example, for Equation 4(b), In other words, instead of normalizing the score for each match, we take λ n as the new normalized threshold for scoring the potential matches. Likewise we can get the new thresholds for Equations 4(a) and 4(c). For example, for 4(a) the normalized threshold would be λ n = λ · max .

Partial scores
For profile matching problem, we are only given the score threshold λ for the whole structured motif. We here develop a method to compute a partial score threshold for any sub-profile, which can lead to great pruning efficiency. For a sub-profile of profile , let λ n ( ) denote its minimum score threshold and let max be its maximum weight, i.e., the sum of the maximum weights (regardless of the symbol) across all positions in .
Then the minimum (partial) score threshold for is calculated as:  In another word, the minimum threshold for any subprofile is calculated as the difference of the minimum threshold for the whole structured motif and the maximum weight of the motif excluding the sub-profile .  [27] proposed a (permuted) lookahead profile scoring approach similar to our partial scoring method. However, their method is only for simple motif scoring, while our method is applied for both simple motif scoring and structured motif scoring.

Core scores
In many biologically relevant motifs, some positions are more conserved than others. We call them core positions, and these are precisely those positions with high information content. We choose the top h (usually 4 to 6) positions in the profile with highest information content as the core positions. For any potential match S', we can compute its core score c (S') to be the sum of the weights over only the core positions, and we require that c (S') satisfy a user-specified normalized core score threshold λ c . Just like the score threshold, we use the normalized core score threshold for pruning, and furthermore, we can compute the core threshold for any subprofile of the core positions. Note that we compute separate (partial) core scores for each component. For example, assuming we restrict our attention to only component M 1 in Figure 8, with λ c = 1, we have c (M 1 ) = 1.36 + 0.78 = 2.14.

Motif enumeration Simple motif scoring
Given a profile motif , score threshold λ, core score threshold λ c , and a sequence set , for each sequence S ∈ To prune matches S' that will eventually not meet the score threshold, we check the score threshold as each position in S' is being considered. If the score for any prefix of S' falls below the score threshold, we can discard S'. In fact, before applying scores over all positions, we first consider the scores for the prefixes of the core positions within the component. This continuous check for the core positions and regular positions leads to very effective pruning. Note that as opposed to previous methods [19][20][21][22], our approach does not require the core positions be consecutive so as to find the most conserved parts in the profile. A2 + G4 = 2.14 ≥ = 2.14. Thus we continue to check the whole score. Checking for position 1 gives A1 = -0.53 > -2.02; then a check for the prefix up to position 2 gives A1 + A2 = 0.83 > -0.66. By continuing this process, we see that AACG satisfies the core score threshold as well as the score threshold. Figure 9(a) shows all the matching positions, i.e., the pos-list, for each component in the profile.

Structured motif scoring and positional joins
After obtaining the pos-lists of simple motif components, we can enumerate structured motifs by doing positional joins on these pos-lists, as already outlined for structured pattern search. We compute the positional joins with M i as the head and M i+1 ʜ M k as the tail, as we start from M k as the head and end at M 1 as the head. During the positional joins we also check the partial structured score thresholds λ n (M i ʜ M k ). If the check fails at any stage we prune the match candidate. We keep a pattern (along with its score) only if it satisfies the full structured score threshold λ n .

Full position recovery
Once the pos-list for the profile has been computed, we can then recover the full positions from each start position, using the approach already outlined for the structured pattern search, i.e., we use an index i for each component to speed up the full position recovery. For  14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  30 31  33 34  36  2 3  29  32 35 1 example, in Figure 9(e), to recover the full position for the occurrence starting at position 1, we follow index 1 to get position 10 in M 2 's pos-list, to obtain = {(1, 10)}.

A C G G A C G T C A T G C T G A C C A T A C G C A T G C C T T A C G C A
Then we follow index 1 to position 20 in M 3 's pos-list, to get = {(1,10, 20)} as a full position and its corresponding sequence is AACG [5]CATGCT [4]ATACG. By continuing this process, we can get the other two full positions, as shown in Figure 9(f).

The complete SMOTIF algorithm: complexity analysis
The pseudo-code for the complete SMOTIF algorithm is shown in Figure 10. We distinguish three cases: the direct (or one-step) approach, and the two-step approach for structured pattern search are denoted as SMOTIF-1, and SMOTIF-2, respectively. The approach for structured pro- Overall, the total time for SMOTIF is O (| | · |Σ DNA | + n · (| | + k) + L · | |), except when we use a comparison based sorting in SMOTIF-2 exact matching case, in which case the second term becomes n · (| | + k + log n).

Performance comparison
We have implemented the SMOTIF algorithm in standard C++. We compare our results with SMARTFINDER, the best previous algorithm for structured motif search. For fair comparison, the suffix tree used in SMOTIF-2 for finding the pos-list of simple motifs is the same as the one used in SMARTFINDER, namely the lazy mode suffix tree [11], and we apply Sellers' dynamic programming algorithm [26] for approximate matching. The programs were compiled with g++ v3.2.2 at the optimization level 3 (-O3). Unless indicated otherwise, we did the experiments on an Apple G5 with dual 2.7Ghz processors and 4GB memory running Mac OS X; the timings reported are total times for all steps of the algorithms; all our experiments use exact matching for the simple motifs, and we report the full position for the occurrences.

SMOTIF: parameter settings
In the first set of experiments we used the 5 chromosomes of Arabidopsis thaliana as our sequence set. These five chromosomes have lengths 29M, 19M, 22.7M, 16.9M and 25.7M base pairs, respectively. The structured motif to search for includes a well conserved feature of a Copia retrotransposon [6,7], shown in Table 7. Figure 11 shows how SMOTIF performs on the A. thaliana chromosome 1, while searching for the Copia retrotransposon. We study the impact of allowing 0, 1 or 2 missing components out of the 6 simple motifs in the Copia structured motif. Figure 11   the effect of 0,1, or 2 missing components. Both figures suggest the best segment length is 10,000 base pairs. At that segment length, sequence segmentation can speed up the search around 2 to 3 times for SMOTIF-1 (depending on the number of missing components) and 4 times for SMOTIF-2. In the following experiments, we use 10,000 as the default segmentation length. Figure 11(c) and 11(d) compare the time and memory usage for SMOTIF-1 when we use the recomputed or indexed full position recovery. We find that the indexed approach is about 2 times faster than recomputing the full positions, and at the same time it can consume up to 20 times less memory! The effects are more pronounced for more missing components. Henceforth, we use the indexed approach to full position recovery. Figure 11(e) compares the two methods for handling IUPAC symbols in SMOTIF-1, namely DNA pos-lists or IUPAC pos-lists. We find that there is not much difference between the two when 0 or 1 missing components are allowed; but for 2 missing components, IUPAC pos-lists have a slight advantage. In terms of memory space, even though the IUPAC pos-lists approach stores the pos-lists for each distinct symbol that appears in the structured motif, since only one segment is processed at one time, the space utilization of the two approaches is comparable. For example, with no missing components, the IUPAC and DNA pos-lists consume 2.98MB and 2.82MB memory, respectively. Henceforth, we use the IUPAC pos-lists approach.
Figure 11(f) shows the time for SMOTIF-1 and SMOTIF-2 on each of the 5 chromosomes of A. thaliana, with no missing components. The chromosomes are arranged by increasing length on the x-axis. We find that the search time increases linearly with the lengths of the sequences. We also observe that the direct approach, SMOTIF-1, outperforms the two-step approach, SMOTIF-2, when searching for the Copia retrotransposon. Figure 12(a) and 12(b) compare time and memory usage, respectively, for SMOTIF-1, SMOTIF-2 and SMART-FINDER, when searching for the Copia retrotransposon in chromosome 1 of A. thaliana. We can see that SMOTIF-2 is around 5 times faster and takes around 8 times less memory than SMARTFINDER. Their running time and memory usage increase slowly with the number of missing components. This is because in both approaches, the simple motif search step is the same for different number of missing components and usually takes most of the time and memory. SMOTIF-1 always outperforms SMART-FINDER, both in time and memory usage. SMOTIF-1 is more than 7 times faster and takes 100 times less memory than SMARTFINDER! Its time increases linearly with the number of missing components. While SMOTIF-1 is faster when there are no missing components, SMOTIF-2 does better when there are two missing components. This is because SMOTIF-1 does the positional joins on the poslists of individual symbols, rather than simple motifs, so over all the sub-motifs of the structured motif, the pos-lists of simple motifs may be redundantly computed several times in SMOTIF-1, whereas these are computed only once in SMOTIF-2. Thus the time of SMOTIF-1 varies depending on the number of missing components. However, since we keep one pos-list per DNA/IUPAC symbol, the memory usage remains almost unchanged for SMO-TIF-1 algorithm.

SMOTIF and SMARTFINDER: comparison Comparison on A. thaliana
To directly evaluate SMARTFINDER's constraint graph approach with our positional join approach, in Figure  12(c), we compare the time for the second step of SMO-TIF-2 and SMARTFINDER, after the suffix tree is built in the first step. We observe that SMOTIF-2 is orders of magnitude faster than SMARTFINDER in finding all the occurrences that satisfy the structured gap constraints depending on the number of missing components allowed. We conclude that doing positional joins on the inverted index is an efficient way of enumerating all occurrences.
We also multiply aligned 36 A. thaliana LTR retrotransposons from Repbase Update [2] database, which belong to Copia repeat class, and have reverse transcriptase as the keyword. We then extracted four conserved features from the alignment results, shown as i (i ∈ [1, 4]) in Table  7. We searched chromosome 1 of A. thaliana for the four extracted motifs using SMARTFINDER, SMOTIF-1 and SMOTIF-2, respectively. Table 8 shows the number of occurrences | | found, and the search times. We can observe that since we allow no missing components, SMOTIF-1 can be 4 to 5 times faster than SMOTIF-2, and 4 to 18 times faster than SMARTFINDER. Note also that for the longest motif 4 , SMARTFINDER ran out of memory.

Searching motifs with long gaps
One application of SMOTIF is to find structured motifs with long gaps.

Comparison on random motifs
Here we used chromosome 20 of Homo Sapiens as the sequence; it has length 61M base pairs. We generate 100 random structured motifs in the Σ IUPAC alphabet, with k ∈ [3,8] simple motifs of length l ∈ [5,10] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [-5, 100]. Note that the negative minimum gap shows that SMOTIF can mine overlapping simple motifs. Here we also compare the structured profile search approach SMOTIF-P, as follows: for each random motif, we form a profile by first expanding the IUPAC symbols into their corresponding DNA symbols and assign them a random probability of occurrence which accounts for 90% of the share, whereas the other DNA symbols randomly share in the remaining 10%. We use SMOTIF-P to search for the profiles with 2 as the number of core positions in each simple motif, λ c = 0.1 as the core score threshold and λ = 0.6 as the total score threshold. We use random motifs mainly to demonstrate the effects of various parameters on SMOTIF and SMART-FINDER. Figure 13(a)-(d) show the results. Here we do not allow missing components. As noted before, we may find overlapping occurrences if a negative gap is present in a motif. Figure 13(a) shows how the running time varies with the sum of the number of occurrences of the simple motifs in each of the 100 random motifs. For clarity, each point   SMaRTFinder sMOTIF-2 sMOTIF-2 (non-segmented) reflects the average time for the number of occurrences in the given range on the x-axis. For example, the first point on the x-axis [0, 1) corresponds to the case when there are between 0 and 1 million occurrences found. The general trend is that it takes more time as the number of occurrences increases. Figure 13(b) shows the time with respect to the number of occurrences of the whole structured motif (again, for clarity, only average times are plotted for occurrences in the given ranges in the x-axis). We observe that the time increases slightly with increase in the occurrences. In general, the times are more sensitive to the number of intermediate (simple) occurrences. Figure  13(c) shows the effect of the number of simple components in the structured motif. Each point shows the average time over all motifs having the given number of simple motifs. Here again the time increases with increasing components. Finally Figure 13(d) shows the impact of the number of IUPAC symbols in the structured motifs; the trend being that the more the symbols the more time it takes to search. We also observe that the approaches scale linearly, on average, with respect to the different parameters. Also SMOTIF remains about 5-10 times faster than SMARTFINDER over all these experiments. Table 9 shows the mean and variance of the search times over all the 100 structured motifs. It also shows the time for finding only the start positions or the full positions for SMOTIF. Overall, for these random motifs, we find that on average SMOTIF-1 is the fastest, SMOTIF-2 and SMO-TIF-P are comparable, and all three outperform SMART-FINDER by a factor of 4 to 6.
It is interesting to note that SMOTIF is more stable than SMARTFINDER: SMOTIF-P has around 8 times less variance than SMARTFINDER, SMOTIF-2 has around 5 times less variance than SMARTFINDER, whereas SMOTIF-1 has around 17 times less variance than SMARTFINDER. Note also that the overhead in recovering the full positions from the start positions is negligible.

Application: composite regulatory patterns
The complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of SMOTIF is to search for such composite regulatory binding sites in DNA sequences. We took two such transcription factors, URS1H and UASH, that are known to cooperatively regulate 11 yeast genes [28]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes, and after their multiple alignment, we obtained the composite motif NNDTBNGDWGDNNDH [5,179]WBRGCSGCYVW, where we represent each column in the alignment with the IUPAC symbol corresponding to the bases that appear at that position. We also extracted the profile for these 10 binding sites. Table 10 shows the binding sites for the 10 genes, their alignment, and the start positions and the distances between the sites (the difference of start positions). The smallest distance is 20 and the largest is 194. Since these are start positions, the variable gap range is obtained by subtracting the length (15) of UASH to obtain l = 20 -15 = 5 and u = 194 -15 = 179. Notice also how the alignment preserves the highly conserved GCSGC region in URS1H.
We then searched for the structured motif in the upstream regions of all 5873 genes in the yeast genome. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream open reading frame (ORF). As a result, 5794 sequences with average length of 497 bases are left. By searching for the IUPAC pattern, we found 65 occurrences, including the 10 originally known sites, within 1 second. By searching for the profile with 5 as the number of core positions in each simple motif, λ c = 0.6 as the core score threshold and λ = 0.8 as the total score threshold, we found 56 occurrences in 0.18 seconds.
For each occurrence, we then extracted its actual sequence segment in the matching upstream regions. Since the structured motif represented by IUPAC symbols may be too general, for each matching segment, we calculated its hamming distance to one of the 10 known binding sites. We then selected an occurrence as a possibly new binding site if the minimum hamming distance to any of the 10 known sites is within a given maximum threshold value. Table 11 lists the 12 newly found occurrences in the entire yeast genome (upstream regions) using a hamming distance threshold of 5. The sites discovered using both pattern and profile search are listed. These new occurrences could be putative binding sites for the two transcription factors UASH and URS1H.
Upon further analysis, we found that in fact, the new occurrence in MEK1 (at positions -233,-136) that we found is also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this is not reported in SCPD.
To further analyze the remaining new occurrences, we consulted the SGD (Saccharomyces Genome Database) Gene Ontology Term Finder [29] to find the inter-relationships between the genes. The first three rows of Table  SMOTIF and SMARTFINDER Comparison: Random Motifs

#IUPAC Symbols
SMaRTFinder sMOTIF-1 sMOTIF-2 sMOTIF-P (c) (d) 12 show the significant GO terms (biological process or molecular function) that are common to the genes corresponding to a new occurrence and its closest (known) gene. The rest of the table shows the significant terms among the 18 genes. These results indicate that at least some of the new occurrences (such as SPO1, HSP60, MES1, and GNT1) have a potential to be binding sites since they share some significant processes with the known sites' genes. Out of these SPO1 has the highest potential to be a new binding site, since it is known that UASH and URS1H are involved in early meiotic expression, during sporulation [28]. Table 12 shows that SPO1 shares meiosis and M phase of meiotic cell cycle with the rest of the genes. After searching for SPO1 in SGD database, we found that SPO1 is a transcriptional regulator involved in sporulation, and required for middle and late meiotic expression. This increases our confidence that SPO1 has high potential to be a previously unknown binding site.
Finally, since we knew that in gene HOP1, the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We search for the original and the reversed motifs using a hamming threshold of 6. We found 34 new binding sites where UASH can appear either up-or down-stream from URS1H. Among these we found two possible potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201,-534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites.

Conclusion
We introduced SMOTIF, a fast and efficient algorithm to search structured pattern and profile motifs in biological sequences. We showed its applications in searching for composite regulatory patterns, long terminal repeat retrotransposons, and for searching long range motifs. SMOTIF is also computationally more efficient than previous stateof-the-art methods like SMARTFINDER [6].
In biosequence analysis there are four related structured motif problems depending on whether the simple motifs and gap ranges in the structured motif are known or not: