SMOTIF: efficient structured pattern and profile motif search
- Yongqiang Zhang^{1} and
- Mohammed J Zaki^{1}Email author
DOI: 10.1186/1748-7188-1-22
© Zhang and Zaki; licensee BioMed Central Ltd. 2006
Received: 21 May 2006
Accepted: 21 November 2006
Published: 21 November 2006
Abstract
Background
A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences.
Results
SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites.
Conclusion
SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/sMotif/.
Background
A Simple Motif
Symbols | Motif | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 0 | 0 | 0 | 4 | 1 | 1 | 7 | 0 | 5 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 0 |
C | 10 | 0 | 1 | 2 | 3 | 5 | 0 | 7 | 0 | 4 | 2 | 5 | 5 | 1 | 9 | 10 | 0 |
G | 0 | 10 | 9 | 4 | 5 | 3 | 2 | 3 | 0 | 3 | 1 | 1 | 4 | 1 | 1 | 0 | 10 |
T | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 5 | 2 | 7 | 2 | 1 | 6 | 0 | 0 | 0 |
IUPAC | C | G | S | V | N | N | D | S | W | N | B | N | B | N | S | C | G |
IUPAC Alphabet (Σ_{IUPAC})
Symbol | A | C | G | T | U | R | Y | K |
---|---|---|---|---|---|---|---|---|
Bases | A | C | G | T | U | A,G | C,T | G,T |
Symbol | M | S | W | B | D | H | V | N |
Bases | A,C | G,C | A,T | C,G,T | A,G,T | A,C,T | A,C,G | A,C,G,T |
A Structured Motif
Symbols | M _{1} | M _{2} | M _{3} | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 2 | 12 | 17 | 1 | 11 | 1 | 35 | 0 | 24 | 1 | 0 | 3 | 1 | 35 | ||||||
C | 0 | 10 | 8 | 5 | 2 | 0 | 0 | 19 | 0 | 0 | 25 | 5 | 35 | 1 | ||||||
G | 2 | 5 | 5 | 2 | 10 | 34 | 1 | 0 | 0 | 26 | 11 | 0 | 0 | 0 | ||||||
T | 32 | 9 | 6 | 28 | 13 | 1 | 0 | 17 | 12 | 9 | 0 | 28 | 0 | 0 | ||||||
IUPAC | D | N | N | N | N | D | R | Y | W | D | S | H | M | M |
More formally, a structured motif $\mathcal{M}$, is specified in the form: M_{1} [l_{1}, u_{1}] M_{2} [l_{2}, u_{2}] M_{3} ... M_{k-1}[l_{k-1}, u_{k-1}] M_{ k }, where M_{ i }, 1 ≤ i ≤ k, is a simple motif component; and l_{ i }and u_{ i }(with 0 ≤ l_{ i }≤ u_{ i }), 1 ≤ i <k, are the minimum and maximum length of the gap allowed between M_{ i }and M_{i+1}, respectively. Note that a gap is defined to be the number of intervening positions after M_{ i }but before M_{i+1}. In other words if s_{ i }and e_{ i }represent the start and end positions of component M_{ i }, then for i ∈ [1, k - 1], the length of the gap between M_{ i }and M_{i+1}is given as g_{ i }= e_{i+1}- s_{ i }- 1, and we require that g_{ i }∈ [l_{ i }, u_{ i }]. We use |M_{ i }| to denote the number of symbols/positions in component M_{ i }, also called the length of the component, and we use $\left|\mathcal{M}\right|={\displaystyle {\sum}_{i=1}^{k}\left|{\mathcal{M}}_{i}\right|}$ to denote the total length of the structured motif $\mathcal{M}$ (not counting gaps). The maximum span, L, of the motif $\mathcal{M}$ is the maximum number of positions that can be occupied by the structured motif, which is given as $L={\displaystyle {\sum}_{i=1}^{k}\left|{M}_{i}\right|+{\displaystyle {\sum}_{i=1}^{k-1}{u}_{i}}}$. The structured motif $\mathcal{M}$ can be either specified as a pattern or a profile. When $\mathcal{M}$ denotes a pattern, we use the notation $\mathcal{M}$_{ j }to denote the symbol at position j in the motif, and when $\mathcal{M}$ denotes a profile, we use the notation $\mathcal{M}$_{ xj }to denote the frequency of symbol x ∈ Σ_{DNA} at position j, where j ∈ [1, |$\mathcal{M}$|]. Table 3 shows both the pattern and profile representation of an example structured motif with three components.
Given a collection of sequences, $\mathcal{S}$, over the DNA alphabet Σ_{DNA} = {A,C,G,T}, and a structured motif, $\mathcal{M}$, the structured motif search problem is to report all the occurrences (or matches) of $\mathcal{M}$ in $\mathcal{S}$. The occurrence set of the structured motif, given as $\mathcal{O}$, can be reported in two forms: a) full positions: list of the positions for each symbol in $\mathcal{M}$, for all possible matches in $\mathcal{S}$, or b) start positions: list of the starting positions of $\mathcal{M}$ for each match in $\mathcal{S}$. Table 4 shows an example sequence $\mathcal{S}$ and a structured motif $\mathcal{M}$, where M_{1} = CG, M_{2} = TTA and M_{3} = CAT, and [0, 1] and [1,4] are the intervening gap ranges between M_{1} and M_{2}, and M_{2} and M_{3}, respectively. The motif has k = 3 components, it length is |$\mathcal{M}$| = 8 and its maximum span is L = 13. The occurrence set of full positions is $\mathcal{O}$ = {(5,6,8,9,10,12,13,14), (5,6,8,9,10,15,16,17)}, and of start positions is $\mathcal{O}$ = {5}.
Structured Motif Search
Sequence (S ∈ $\mathcal{S}$): | GCATGCGTTAGCATCATC |
---|---|
Structured Motif ($\mathcal{M}$): | GC[0,1]TTA[1,4]CAT |
•Missing Components: The matching motifs can consist of some, instead of all, the simple motifs in $\mathcal{M}$, allowing for at most q missing components.
•Approximate Matches: The matching motifs may consist of similar motifs (as measured by Hamming or Levenshtein distance [8]), instead of exact matches, to the simple motifs in $\mathcal{M}$, allowing for at most ε_{ i }errors for simple motif M_{ i }(when $\mathcal{M}$ is expressed as a pattern).
•Overlapping Components: The variable gap constraints (l_{ i }and u_{ i }) can take on a limited range of negative values, allowing search for overlapping simple motifs. We allow two adjacent components M_{ i }and M_{i+1}to overlap, but we require that M_{i+1}does not precede M_{ i }. This condition can be satisfied by the following constraints on the gap range [l_{ i }, u_{ i }]: - |M_{ i }| ≤ l_{ i }≤ u_{ i }, for i ∈ [1, k). For example the search for motif ACG[-2,2]CGA, can discover the overlapped occurrence ACGA, as well as the non-overlapped occurrence ACG- -CGA, at the two extremes of the gap range.
•Profile Search: The components of the motif $\mathcal{M}$ can be specified as a pattern in either the DNA (Σ_{DNA}) or IUPAC (Σ_{IUPAC}) alphabets, or as a profile over Σ_{DNA}.
In this paper, we focus on the problem of searching for a given structured motif in one or more sequences. We propose SMOTIF, an efficient algorithm for structured motif searches. It uses an inverted index of symbol positions, and it finds all occurrences by positional joins over this index. For structured pattern search problem, we propose two main variants of our approach: i) a direct search for simple motifs and the structured motif via positional joins, and ii) a two-step approach, where we use a suffix tree to search for simple motifs and then use positional joins for the structured motif. For structured profile search problem, we first search each simple motif by aligning its profile with the sequences, and then search structured motifs with positional joins. SMOTIF allows missing components, overlapping motifs, and also approximate matches (when using the two-step approach). SMOTIF also allows flexible matches using IUPAC symbols.
We apply SMOTIF for searching long DNA sequences for LTR retrotransposons, which constitute a substantial fraction of most Eukaryotic genomes and are believed to have a significant impact on genome structure and function [9, 10]. We show that SMOTIF is effective in searching for composite regulatory patterns, and it can also suggest potentially new binding sites. We experimentally demonstrate the superiority of SMOTIF over SMARTFINDER, a state-of-the-art method for structured motif search, both in terms of time and space; SMOTIF can be up to 7 times faster and can consume 100 times less space.
Related work
Many existing pattern matching algorithms [8, 11–18] can be used to solve the simple pattern search problem. Given the sequence length, n, and the pattern length, m, exact matching algorithm can run in O (n + m) [11]; approximate matching algorithm can run in O (rn) [16, 17] or O (nm/w) [18], where r is the error threshold and w is the size of a computer word. The space complexity is O (n) for both exact matching and approximate matching.
Several previous efforts have focused on the structured pattern search problem. Anrep [3, 4] provides a unified biosequence pattern representation by using network expressions with spacers, where a network expression is a regular expression without Kleene closure. With network expressions, one can specify the scoring scheme and the threshold of approximate matching for each simple motif separately, the positional weights which express the relative importance of different parts of a motif, and whether a simple motif is optional. Anrep introduces a two-step approach: first it searches for simple motifs by a threshold-sensitive motif matching algorithm and then it finds the structured motif by an optimized backtracking matching algorithm. However, as compared by [6], Anrep is much slower than SMARTFINDER.
In [5], the structured motif is called a Classes of Characters and Bounded Gaps (CBG) expression and is represented by a non-deterministic ε-automaton with bit parallelism. Two algorithms are proposed for CBG expression search: forward search and backward search. Bit parallelism speeds up the search, but is adequate only for a pattern whose maximum span is smaller than the length of the computer word. Also the implementation of CBG can only handle such pattern. This limits the application of CBG to searching for patterns with small number of symbols and gaps.
SMARTFINDER [6, 7], which is currently the most efficient method for structured motif search, is also a two-step approach. In the first step each simple motif is searched separately by building a suffix tree for the sequence. This step outputs the ordered occurrence lists of all simple motifs. The second step solves a constraint satisfaction problem by considering constraints individually in three sub-steps. First it considers the gap constraints and builds a constraint graph whose nodes are the simple motif occurrences and edges connect all possible pairs of nodes that locally satisfy the gap constraints. It then considers the constraint for the maximum number of missing components and shrinks the graph to contain only the nodes that can be in the structured motif occurrences. Finally it enumerates all the valid occurrences by a depth first search (DFS). Notable differences in SMOTIF and SMARTFINDER are as follows: we search patterns directly by positional joins over an inverted index, we consider variable gap constraints during the positional joins as opposed to building a constraint graph, and we handle missing components more efficiently by considering them over patterns instead of over each occurrence as in SMARTFINDER. Note also that like Anrep and SMARTFINDER, SMOTIF can also mine approximate patterns, when using the two-step approach, which we describe later.
For profile search, MATCH [19], P-Match [20], and MatInspector [21, 22] search DNA sequences against a position weight matrix library (such as TRANSFAC database [23]) and report the occurrences that satisfy given score thresholds. They compute the matrix score by multiplying the base frequency with the information content value at each position, in order to emphasize the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions. Besides the matrix score, they define a core region, which is usually the first 4–5 most conserved consecutive positions of the matrix, and perform the core score threshold check. Then they align the matrix to each position of the sequence and calculate the core score and matrix score. However, these algorithm don't consider the prior probability of each base when calculating the matrix (or core) score, and the core region is required to be consecutive. They need to check all positions of each subsequence (at least all the core positions) in order to calculate the matrix (core) score. Moreover, these algorithms only work on simple profile with one single matrix component. For structured profile search, only Anrep [3, 4] provides the capability to model structured profiles, with its general network expressions. However, Anrep doesn't give a solution on score calculation and fast search for structured profiles. Moreover, its implementation doesn't support structured profile search. To our knowledge, SMOTIF is the only implemented method that can handle structured profile search.
Methods
We first introduce our basic approach for structured pattern search, and successively optimize it for various practical scenarios. We then present our approach for structured profile search.
Structured pattern search: basic approach
Let us assume that we are searching for a structured motif $\mathcal{M}$ over a single sequence S ∈ $\mathcal{S}$. We assume that S is over Σ_{DNA}, whereas, $\mathcal{M}$ is over Σ_{IUPAC} to allow for more flexible matches. SMOTIF first converts S into an equivalent inverted format [24, 25], where we associate with each symbol in the sequence its pos-list, a sorted list of the positions where the symbol occurs in S. More formally, for a symbol X ∈ Σ_{IUPAC}, its pos-list is given as $\mathcal{\text{P}}$(X, S) = {i | S [i] = X, i ∈ [1, |S|]}, where S [i] is the symbol at position i in S, and |S| denotes the length of S. When S is obvious, we drop it, and denote the pos-list as $\mathcal{\text{P}}$(X). For our example sequence S in Table 4, the pos-lists for X ∈ Σ_{DNA} are given in Table 5.
Pos-lists
A | C | G | T |
---|---|---|---|
3 | 2 | 1 | 4 |
10 | 6 | 5 | 8 |
13 | 12 | 7 | 9 |
16 | 15 | 11 | 14 |
18 | 17 |
Positional joins
•d <l: Advance y to the next element in $\mathcal{\text{P}}$(Y).
•d > u: Advance x to the next element in $\mathcal{\text{P}}$(X).
•l ≤ d ≤ u: Save this occurrence in $\mathcal{\text{P}}$(X [l, u] Y), and then advance x.
The pos-list for X [l, u] Y can be computed in time linear in the lengths of $\mathcal{\text{P}}$(X) and $\mathcal{\text{P}}$(Y). In essence, each time we advance x ∈ $\mathcal{\text{P}}$(X), we check if there exists a y ∈ $\mathcal{\text{P}}$(Y) that satisfies the given gap constraint. Instead of searching for the matching y from the beginning of the pos-list each time, we search from the last position used to compare with x. This results in fast positional joins, and also allows overlaps among occurrences. For example, during the positional join for the motif T[0, 1]A, with l = 0 and u = 1, we scan the pos-lists of T and A in Table 5. Initially, x = 4 and y = 3. This gives d = 3 - 4 - 1 = -2 <l, thus we advance y to 10. Next, d = 10 - 4 - 1 = 5 > u, thus we advance x to 8. Next, d = 10 - 8 - 1 = 1 ∈ [0, 1]. So we store x = 8 in $\mathcal{\text{P}}$(T[0, 1] A) and advance x to 9. By continuing the process, we get the final pos-list as $\mathcal{\text{P}}$(T[0,l] A) = {8, 9, 14}.
Given a longer motif $\mathcal{M}$, the positional joins start with the last two symbols, and proceed by successively joining the pos-list of the current symbol with the intermediate pos-list of the suffix. Formally, let H [l, u] T be an intermediate pattern, with symbol H as the head symbol, and a suffix structured motif T as tail. The pos-list of H [l, u] T is obtained by doing positional join on the pos-list of H and the pos-list of T. As the computation progresses the previous tail pos-lists are discarded. Combined with the fact that only start positions are kept in a pos-list, this saves both time and space.
SMOTIF handles both simple and structured motifs uniformly, by adding the gap range [0, 0] between adjacent symbols within each simple motif M_{ i }. For our example in Table 4, the structured motif $\mathcal{M}$ becomes: G[0,0]C[0,1]T[0,0]T[0,0]A[1,4]C[0,0]A[0,0]T. Furthermore, SMOTIF treats the IUPAC symbol N (which stands for any of the four bases: A,C,G,T) as a gap, [1,1], and merges it with adjacent gaps in the motif. For example, the motif A[0,0]N[0,0]N[0,0]C will be first converted to A[0, 0][1,1][0,0][1,1][0,0]C, and then the adjacent gaps will be combined to obtain A[2,2]C as the final motif.
Full position recovery
In our positional join approach, to save time and space we retain only the motif start positions, however, in some applications, we may need to know the full position of each occurrence, i.e., the set of matching positions for each symbol in the motif. We describe two approaches to recover the full positions: recomputed or indexed full-position recovery.
Recomputed full position recovery
For recomputing the full positions SMOTIF needs access to only the sequence S and the post-list $\mathcal{\text{P}}$($\mathcal{M}$). Let s ∈ $\mathcal{\text{P}}$($\mathcal{M}$) be a start position for the structured motif in sequence S. Figure 3 shows how to recompute the full positions starting from s. Note that a structured motif with maximum span L must be found within position range [s, s + L - 1], so we can stop searching from s after the maximum span is reached. During the full position recovery, we maintain a list of intermediate position prefixes $\mathcal{F}$ that match the prefix of $\mathcal{M}$. For an intermediate prefix F ∈ $\mathcal{F}$, let |F| denote the number of positions in F. Initially $\mathcal{F}$ = {(s)}. For each symbol S [i] with i ∈ [s + 1, s + L - 1] in sequence S, we consider each candidate prefix F ∈ $\mathcal{F}$ and check whether $\mathcal{M}$[|F| + 1] = S [i] and d = i - F [|F|] ∈ [l, u]. If yes, S [i] is a valid occurrence of $\mathcal{M}$[|F| + 1], and we append position i to F. If there are multiple positions i that are valid occurrences for $\mathcal{M}$[|F| + 1], we add as many copies of F appended with the i positions to $\mathcal{F}$. A prefix F is removed if there are no symbols in S which match within the maximum gap. The algorithm stops once all F ∈ $\mathcal{F}$ are full positions or if the maximum span L has been reached.
Indexed full position recovery
Rather than recomputing the positions, we can "index" some information during the positional joins in order to facilitate full position recovery. For each suffix of $\mathcal{M}$ starting at position i with 1 ≤ i ≤ |$\mathcal{M}$|, we keep its pos-list, $\mathcal{\text{P}}$_{ i }, and an index list, $\mathcal{N}$_{ i }. For each entry, say $\mathcal{\text{P}}$_{ i }[j], in the pos-list $\mathcal{\text{P}}$_{ i }, the corresponding index entry $\mathcal{N}$_{ i }[j], points to the first entry, say l, in $\mathcal{\text{P}}$_{i+1}that satisfies the gap range with respect to $\mathcal{\text{P}}$_{ i }[j], i.e., $\mathcal{\text{P}}$_{i+1}[l] - $\mathcal{\text{P}}$_{ i }[j] - 1 ∈ [l_{ i }, u_{ i }]. Note that ${\mathcal{N}}_{\left|\mathcal{M}\right|}$ is never used. Also note that $\mathcal{\text{P}}$($\mathcal{M}$) = $\mathcal{\text{P}}$_{1}. Let s be a start position for the structured motif in sequence S, and let s be the j_{ s }-th entry in $\mathcal{\text{P}}$_{1}, i.e., s = $\mathcal{\text{P}}$_{1}[j_{ s }]. Also let F store a full position starting from s, and let $\mathcal{F}$ store the set of all full positions. Figure 5 shows the pseudo-code for recovering full positions starting from s. This recursive algorithm has two parameters: i denotes a (suffix) position in $\mathcal{M}$, and j gives the j-th entry in $\mathcal{\text{P}}$_{ i }. The algorithm is initially called with $\mathcal{F}$ = {F = {s}}, i = 2 and with j = $\mathcal{N}$_{1}[j_{ s }]. Since we have $\mathcal{\text{P}}$_{2}[j] - F [1] - 1 ∈ [l_{1}, u_{1}] we set F [2] = $\mathcal{\text{P}}$_{2}[j]. In the next call we can follow the index $\mathcal{N}$_{2}[j] to get the next position in F, namely F [3]. Thus in each call we keep following the indices from one pos-list to the next and finally we can get a full position starting from s when we reach the last pos-list, ${\mathcal{\text{P}}}_{\left|\mathcal{\text{M}}\right|}$. Furthermore, at each suffix position i, since j only marks the first position in $\mathcal{\text{P}}$_{i+1}that satisfies the gap constraints, we also need to consider all the subsequent positions j' > j that may satisfy the corresponding gap range.
Sequence segmentation
The SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searches each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except for the last one). Handling each smaller segment S_{ i }(i ∈ [l, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition S_{ i }, with i ∈ [l, p - 1], include the first L - 1 symbols from partition S_{i+1}. Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since it would be reported when we process segment S_{i+1}. For example, let S be the sequence in Table 6, and let the structured motif be $\mathcal{M}$ = GC[1,2]T with maximum span L = 5. If p = 3, then we would have three segments of length 6 each. After adding the overlap region of L - 1 = 4 positions at the end of each segment, we obtain the final three segments shown in Table 6. Two start positions of $\mathcal{M}$ would be found in S_{1} (namely 1 and 5), and one in S_{2} (namely 11). Note that start positions 5 and 11 would have been missed if we had no overlap.
Segmentation into p = 3
S | G | C | A | T | G | C | G | T | T | A | G | C | A | T | C | A | T | C | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S _{1} | G | C | A | T | G | C | G | T | T | A | |||||||||
S _{2} | G | T | T | A | G | C | A | T | C | A | |||||||||
S _{3} | A | T | C | A | T | C |
Missing components
In some applications a partial match of the structured motif might still be of interest. SMOTIF allows up to q simple motif components to be missing during the search. Let $\mathcal{M}$ be a structured motif with k components. SMOTIF first enumerates all possible sub-motifs having k' components, where k' ∈ [k - q, k]. Next, the gap ranges are adjusted in each sub-motif to account for skipping over the missing components. The new gap range, [l_{ i,j }, u_{ i,j }], between components M_{ i }and M_{ j }(with 1 ≤ i <j ≤ k) in a sub-motif, is calculated as follows: ${l}_{i,j}={\displaystyle {\sum}_{n=1}^{j-1}{l}_{n}}$, and ${u}_{i,j}={u}_{i}+{\displaystyle {\sum}_{n=i+1}^{j-1}({u}_{n}+|{M}_{n}|)}$.
For example, if we allow one (q = 1) missing component for our structured motif in Table 4, the set of sub-motifs that need to be searched for are: GC[0,1]TTA[1,4]CAT, GC[1,8]CAT, GC[0,1]TTA and TTA[1,4]CAT. Note that it is straightforward to incorporate other approaches to compute new ranges into SMOTIF since it would only change the gap constraints. For example, l_{ i,j }= min_{n ∈ [i,j-1]}{l_{ n }} and u_{ i,j }= max_{n ∈ [i,j-1]}{u_{ n }} is another possible way to compute the adjusted gap ranges.
Instead of searching each sub-motif separately, we do an optimized search. We reuse the partial pos-lists created when using a depth first search to enumerate and search the sub-motifs. The idea is to re-use the pos-lists created for common suffixes when enumerating their sub-motif extensions.
Two-step approach for structured pattern search
So far we have described the direct method used by SMOTIF to search for the structured motif by positional joins over the symbols. In fact, SMOTIF, can also follow a two-step approach like in Anrep [4] and SMARTFINDER [6]. In the first step, given $\mathcal{S}$ and $\mathcal{M}$, we search $\mathcal{S}$ for each simple motif in $\mathcal{M}$, i.e., M_{1}, M_{2}, ..., M_{ k }. This task can be solved by existing pattern matching algorithms. In the second step, we do positional joins on the pos-lists of the simple motifs. Let $\mathcal{\text{P}}$(M_{ i }[l_{ i }, u_{ i }] $\mathcal{T}$) be an intermediate pos-list, with simple motif M_{ i }as the head, and a suffix structured motif $\mathcal{T}$ = M_{i+1}⋯ M_{ k }as tail. Since $\mathcal{\text{P}}$(M_{ i }) stores only the start positions, we need convert them into end positions to check the gap constraints. There are two cases to consider.
Exact matching
Approximate matching
Several algorithms [8, 12, 15–18] exist for approximate pattern matching. For consistency, we used Sellers' dynamic programming algorithm [26], as implemented in SMARTFINDER, to extract the pos-lists for all simple motifs with approximate matches. This algorithm is not optimal and it can be replaced by more efficient ones [16–18]. Since we allow a specific Levenshtein distance [8] (i.e., insertions, deletions and substitutions) between the occurrences and the motif, the length of the occurrences can be different from the component length |M_{ i }|. Thus we augment the pos-list to explicitly store the end position, in addition to the start position, for each occurrence. Figure 7(b) shows how the pos-list joins work for approximate matches of simple motifs. In the structured motif from Table 4, we consider the exact matches of GC and CAT, and the approximate matches of TTA within Levenshtein distance of 1. Each column in (b) shows the pos-list of a simple motif: the left sub-column is a list of its start positions and the right sub-column is a list of its end positions. We first join the pos-lists of TTA and CAT, checking for gap range [1,4]. We compare the end positions of TTA and the start positions of CAT and find that the pairs (9,12), (10,12), (10,15), and (11,15) all lie within the gap range (indicated by the links), and thus the pairs, (7, 10), (8, 9), (8, 10) and (8, 11) are retained in the resulting pos-list. Likewise (5, 6) is in the final pos-list, since after comparing the end position of GC, 6, with the start position of TTA, 7 and 8, we find d = 7 - 6 - 1 = 0 ∈ [0,1] and d = 8 - 6 - 1 = 1 ∈ [0, 1].
Structured profile search
Having outlined our approach for structured pattern search, here we tackle the problem of structured profile search. The profile (also called a position weight matrix) for a structured motif $\mathcal{M}$ gives for each position in each component, the frequency of occurrence for each symbol in Σ_{DNA}, i.e., $\mathcal{M}$_{ xj }gives how often symbol x ∈ Σ_{DNA} occurs at position j. Note that, as in the case of pattern motifs, for a structured profile motif $\mathcal{M}$, we have to specify the gap constraints between its simple profile motif components. Profiles are able to better capture the variability in the motifs, since they capture position specific statistics on the symbols, and as such can retain more valuable information than the pattern representation.
Given a profile structured motif $\mathcal{M}$, and a user-specified match/score threshold λ, the goal of structured profile search is to enumerate all structured patterns that match the profile motif above the threshold λ. Our approach to profile motif search consists of the following two steps: a) convert the raw frequency profile into a relative profile weighted by information content at each position, b) enumerate occurrences matching the profile.
Weighted profile creation
Given the initial "raw" frequency profile for the structured motif $\mathcal{M}$, we first convert it into a weighted profile as follows:
$\begin{array}{cc}{f}_{xj}=\frac{{\mathcal{M}}_{xj}+{p}_{x}}{{\displaystyle {\sum}_{y\in {\Sigma}_{\text{DNA}}}({\mathcal{M}}_{yj}+{p}_{y})}},& {{\mathcal{W}}^{\prime}}_{xj}=\mathrm{ln}\left(\frac{{f}_{xj}}{{p}_{x}}\right)\end{array}\left(1\right)$
where $\mathcal{M}$_{ xj }gives the absolute frequency of symbol x ∈ Σ_{DNA} at position j in the structured motif $\mathcal{M}$, for 1 ≤ j ≤ |$\mathcal{M}$|; f_{ xj }represents the relative frequency of symbol x at position j in the motif; p_{ x }(or p_{ y }) denotes the prior (background) probability of symbol x (or y) ∈ Σ_{DNA}; ${{\mathcal{W}}^{\prime}}_{xj}$ is the weight (log-likelihood) of observing symbol x at position j.
Whereas the weights computed above give the likelihood of observing a given symbol in a given position they do not account for the degree to which some symbols are conserved at some positions. We can adjust the weights by considering the information content at each position. The information content for a profile is given as:
$\begin{array}{ccc}{\mathcal{I}}_{xj}={f}_{xj}\mathrm{ln}({f}_{xj})-{p}_{x}\mathrm{ln}({p}_{x}),& {\mathcal{I}}_{j}={\displaystyle \sum _{x\in {\Sigma}_{\text{DNA}}}{\mathcal{I}}_{xj},}& \mathcal{I}={\displaystyle \sum _{j=1}^{\left|\mathcal{M}\right|}{\mathcal{I}}_{j}}\end{array}\left(2\right)$
where $\mathcal{I}$_{ xj }is the information content of symbol x at position j; $\mathcal{I}$_{ j }is the information content over all bases at position j; and $\mathcal{I}$ is the information content of the entire profile. To allow mismatches at less conserved positions to be more easily tolerated than those at highly conserved positions, we multiply each weight ${{\mathcal{W}}^{\prime}}_{xj}$ by $\mathcal{I}$_{ j }, which is larger for more conserved positions. As a result, the corrected weight of each element in the profile becomes:
${\mathcal{W}}_{xj}={\mathcal{I}}_{j}{{\mathcal{W}}^{\prime}}_{xj}={\mathcal{I}}_{j}\mathrm{ln}\left(\frac{{f}_{xj}}{{p}_{x}}\right)\left(3\right)$
Given the initial profile motif $\mathcal{M}$ we obtain its corresponding information content weighted profile $\mathcal{W}$ for use in the enumeration step. Our weighting approach has some differences with respect to previous ones [19–22]. For instance, we consider the prior probability of each base for the calculation of the positional weight and information content. That is necessary because some bases (i.e. C and G) occur more frequently than others (i.e. A and T). Also we use relative frequency instead of the absolute one, so as to compare with the prior probability. However, note that, in general, SMOTIF is flexible enough to handle any user specified profile.
Profile scoring
Given the weighted profile $\mathcal{W}$ for a structured motif $\mathcal{M}$, a sequence S ∈ $\mathcal{S}$, and any subsequence S' of S (not necessarily consecutive) of length |$\mathcal{M}$|, that satisfies the gap constraints, the profile score for subsequence S' is computed as the sum of weights of its symbols at each position in the profile, given as: $\mathcal{R}({S}^{\prime})={\displaystyle {\sum}_{j=1}^{\left|\mathcal{M}\right|}{\mathcal{W}}_{{S}^{\prime}[j]j}}$ (note that $\mathcal{W}$_{S' [j]j}gives the weight of symbol S' [j] at position j in the profile motif). In order to check whether S' is a potential binding site, we compare its profile score with the user-specified threshold, λ ∈ [0,1]. To ensure that the profile score lies between 0 and 1, we have to normalize it. Let $\mathcal{W}$^{min} and $\mathcal{W}$^{max} be the minimum and maximum weights, and let μ and σ be the mean and standard deviation of the weights, across all the positions in the weighted profile. The normalized profile score $\mathcal{R}$^{ n }can then be calculated using any of the following formulas:
$\begin{array}{ccc}(a).{\mathcal{R}}^{n}({S}^{\prime})=\frac{\mathcal{R}({S}^{\prime})}{{\mathcal{W}}^{\mathrm{max}}}& (b).{\mathcal{R}}^{n}({S}^{\prime})=\frac{\mathcal{R}({S}^{\prime})-{\mathcal{W}}^{\mathrm{min}}}{{\mathcal{W}}^{\mathrm{max}}-{\mathcal{W}}^{\mathrm{min}}}& (c).{\mathcal{R}}^{n}({S}^{\prime})=\frac{\frac{\mathcal{R}({S}^{\prime})-\mu}{3\sigma}+1}{2}\end{array}\left(4\right)$
Note that whereas Equations 4(a) and 4(b) are strictly in the range [0, 1], for 4(c) $\mathcal{R}$^{ n }is in the range [0, 1] only 99.7% of the time (within a range ± 3σ of the mean μ).
When applying the score threshold for an occurrence S', we require that its normalized score is above the threshold λ. For example, for Equation 4(b),
${\mathcal{R}}^{n}({S}^{\prime})=\frac{\mathcal{R}({S}^{\prime})-{\mathcal{W}}^{\mathrm{min}}}{{\mathcal{W}}^{\mathrm{max}}-{\mathcal{W}}^{\mathrm{min}}}\ge \lambda \Rightarrow \mathcal{R}({S}^{\prime})\ge \lambda ({\mathcal{W}}^{\mathrm{max}}-{\mathcal{W}}^{\mathrm{min}})+{\mathcal{W}}^{\mathrm{min}}={\lambda}^{n}$
In other words, instead of normalizing the score for each match, we take λ^{ n }as the new normalized threshold for scoring the potential matches. Likewise we can get the new thresholds for Equations 4(a) and 4(c). For example, for 4(a) the normalized threshold would be λ^{ n }= λ · $\mathcal{W}$^{max}.
Partial scores
For profile matching problem, we are only given the score threshold λ for the whole structured motif. We here develop a method to compute a partial score threshold for any sub-profile, which can lead to great pruning efficiency. For a sub-profile ${\mathcal{M}}^{\prime}$ of profile $\mathcal{M}$, let λ^{ n }(${\mathcal{M}}^{\prime}$) denote its minimum score threshold and let ${\mathcal{M}}^{\prime}$^{max} be its maximum weight, i.e., the sum of the maximum weights (regardless of the symbol) across all positions in ${\mathcal{M}}^{\prime}$. Then the minimum (partial) score threshold for ${\mathcal{M}}^{\prime}$ is calculated as:
${\lambda}^{n}({\mathcal{M}}^{\prime})={\lambda}^{n}-({\mathcal{W}}^{\mathrm{max}}-{{\mathcal{M}}^{\prime}}^{\mathrm{max}})\left(5\right)$
In another word, the minimum threshold for any sub-profile ${\mathcal{M}}^{\prime}$ is calculated as the difference of the minimum threshold for the whole structured motif and the maximum weight of the motif excluding the sub-profile ${\mathcal{M}}^{\prime}$. For example, consider the weighted profile M_{1} [0, 5] M_{2} [0, 9] M_{3} shown in Figure 8. Assume that λ = 0.8, then using Equation 4(a), we have λ^{ n }= λ $\mathcal{W}$^{max} = 0.8 × 10.75 = 8.60. Note that $\mathcal{W}$^{max} = 10.75 is obtained by adding all the highest weights (in bold) at each position in the profile. Now consider the sub-profile of $\mathcal{M}$ consisting only of component M_{3}. Then the score threshold for ${\mathcal{M}}^{\prime}$' = M_{3} is given as λ^{ n }(M_{3}) = 8.6 - (10.75 – 3.75) = 1.60. Likewise, we can compute the minimum score threshold for any sub-profile (composed of any set of positions) of $\mathcal{M}$. Wu et al. [27] proposed a (permuted) lookahead profile scoring approach similar to our partial scoring method. However, their method is only for simple motif scoring, while our method is applied for both simple motif scoring and structured motif scoring.
Core scores
In many biologically relevant motifs, some positions are more conserved than others. We call them core positions, and these are precisely those positions with high information content. We choose the top h (usually 4 to 6) positions in the profile with highest information content as the core positions. For any potential match S', we can compute its core score $\mathcal{R}$^{ c }(S') to be the sum of the weights over only the core positions, and we require that $\mathcal{R}$^{ c }(S') satisfy a user-specified normalized core score threshold λ_{ c }. Just like the score threshold, we use the normalized core score threshold ${\lambda}_{c}^{n}$ for pruning, and furthermore, we can compute the core threshold for any sub-profile of the core positions. Note that we compute separate (partial) core scores for each component. For example, assuming we restrict our attention to only component M_{1} in Figure 8, with λ_{ c }= 1, we have $\mathcal{R}$^{ c }(M_{1}) = 1.36 + 0.78 = 2.14.
Motif enumeration
Simple motif scoring
Given a profile motif $\mathcal{M}$, score threshold λ, core score threshold λ_{ c }, and a sequence set $\mathcal{S}$, for each sequence S ∈ $\mathcal{S}$, we first compute the potential matches for each component M_{ i }separately. That is, for each component M_{ i }, for each consecutive subsequence S' of length |M_{ i }| starting at each position j in S (i.e., S' = S [j, j + |M_{ i }| - 1]) we compute its component core score $\mathcal{R}$^{ c }(S'), and partial score $\mathcal{R}$ (S'). If these scores are larger than the corresponding score thresholds ${\lambda}_{c}^{n}$(M_{ i }) and λ^{ n }(M_{ i }), respectively, we record this position j into the component's pos-list $\mathcal{\text{P}}$(M_{ i }).
To prune matches S' that will eventually not meet the score threshold, we check the score threshold as each position in S' is being considered. If the score for any prefix of S' falls below the score threshold, we can discard S'. In fact, before applying scores over all positions, we first consider the scores for the prefixes of the core positions within the component. This continuous check for the core positions and regular positions leads to very effective pruning. Note that as opposed to previous methods [19–22], our approach does not require the core positions be consecutive so as to find the most conserved parts in the profile.
Structured motif scoring and positional joins
After obtaining the pos-lists of simple motif components, we can enumerate structured motifs by doing positional joins on these pos-lists, as already outlined for structured pattern search. We compute the positional joins with M_{ i }as the head and M_{i+1}⋯ M_{ k }as the tail, as we start from M_{ k }as the head and end at M_{1} as the head. During the positional joins we also check the partial structured score thresholds λ^{ n }(M_{ i }⋯ M_{ k }). If the check fails at any stage we prune the match candidate. We keep a pattern (along with its score) only if it satisfies the full structured score threshold λ^{ n }.
Figure 9(a) shows how to enumerate structured motifs via positional joins. The pos-list of each component is simply the set of positions (1st element of the quadruples) under it. For example, $\mathcal{\text{P}}$(M_{1}) = {1, 5, 10, 21, 25, 32}. As before the joins proceed from M_{3} to M_{1}, i.e., first we obtain the pos-list for M_{2} [0,9] M_{3}, and then for M_{1} [0, 5] as head and M_{2} [0, 9] M_{3} as the tail. At any stage, the head motif's pos-list corresponds to the full list of positions shown, whereas the tail's pos-list consists only of the shaded positions. For illustration, we add a link between any two positions, x and y, in adjacent columns if their difference (d = y - x - 1) falls within the corresponding gap range. If the current partial motif starting at position x also satisfies the corresponding (partial) score threshold, the link is solid; otherwise, the link is dashed. When joining M_{2} as the head and M_{3} as the tail with a gap of [0, 9], the positions x ∈ $\mathcal{\text{P}}$(M_{2}) that satisfy the gap constraint are 10, 19 and 25 (marked in bold), which thus form the pos-list of M_{2} [0, 9] M_{3}. We then check whether each occurrence satisfies the corresponding structured score threshold. Figure 9(d) shows the minimum partial scores required for each component suffix. For example the threshold λ^{ n }(M_{2} [0, 9] M_{3}) = 5.87. Checking the score for CATACG[0,9]TTACG, we get 2.44 + 3.48 = 5.92 > 5.87, so we keep it. The other occurrences also satisfy the partial score threshold. Next we join $\mathcal{\text{P}}$(M_{1}) with $\mathcal{\text{P}}$(M_{2} [0, 9] M_{3}) to get the valid positions 1, 5, 10 and 21. When checking with the score threshold, we find that the score of CATG[0,5]CATACG[0,9]TTACG is 1.05 + 5.87 = 6.97 < 8.60 = λ^{ n }, so we discard this motif (as a result the corresponding link between the positions is dashed.) Finally, we get $\mathcal{\text{P}}$(M_{1} [0, 5] M_{2} [0, 9] M_{3}) = {1, 5, 21} as the pos-list for the full structured motif.
Full position recovery
Once the pos-list for the profile $\mathcal{M}$ has been computed, we can then recover the full positions from each start position, using the approach already outlined for the structured pattern search, i.e., we use an index $\mathcal{N}$_{ i }for each component to speed up the full position recovery. For example, in Figure 9(e), to recover the full position for the occurrence starting at position 1, we follow index 1 to get position 10 in M_{2}'s pos-list, to obtain $\mathcal{F}$ = {(1, 10)}. Then we follow index 1 to position 20 in M_{3}'s pos-list, to get $\mathcal{F}$ = {(1,10, 20)} as a full position and its corresponding sequence is AACG[5]CATGCT[4]ATACG. By continuing this process, we can get the other two full positions, as shown in Figure 9(f).
The complete SMOTIF algorithm: complexity analysis
Results and discussion
Performance comparison
We have implemented the SMOTIF algorithm in standard C++. We compare our results with SMARTFINDER, the best previous algorithm for structured motif search. For fair comparison, the suffix tree used in SMOTIF-2 for finding the pos-list of simple motifs is the same as the one used in SMARTFINDER, namely the lazy mode suffix tree [11], and we apply Sellers' dynamic programming algorithm [26] for approximate matching. The programs were compiled with g++ v3.2.2 at the optimization level 3 (-O3). Unless indicated otherwise, we did the experiments on an Apple G5 with dual 2.7Ghz processors and 4GB memory running Mac OS X; the timings reported are total times for all steps of the algorithms; all our experiments use exact matching for the simple motifs, and we report the full position for the occurrences.
SMOTIF: parameter settings
Real Motifs
Copia Motif | TNGA [12,14] TWNYTNNA [19,21] TNTMYRT [4,6] WNCCNNNNRG [72,95] TGNNA [100,125] TNTANRTNRAYGA |
---|---|
Motif $\mathcal{M}$_{1} | HNGTNYDNHDNBTNNDNA [0,3] YNHTNYRHGGNBTNAR [0,2] ARDBNBH |
Motif $\mathcal{M}$_{2} | TNVRNKAYNKNVVNDV [9,11] HNRR [6,8] YDNNVNNV [9,13] HB [4,5] TNNNNRBNYDBDNNRR |
Motif $\mathcal{M}$_{3} | DNNNNDRYW [2,5] DS [6,7] HMM [1,2] TNDB |
Motif $\mathcal{M}$_{4} | DBNNNND [48,102] KRRYMYNNNMRNHYNDVNYAYVH [7,10] VNNNYNNND [34,63] WD [2,8] KNNH [3,5] VNDDRNNNNNNHVNNNNNNNHHH |
Figure 11(c) and 11(d) compare the time and memory usage for SMOTIF-1 when we use the recomputed or indexed full position recovery. We find that the indexed approach is about 2 times faster than recomputing the full positions, and at the same time it can consume up to 20 times less memory! The effects are more pronounced for more missing components. Henceforth, we use the indexed approach to full position recovery.
Figure 11(e) compares the two methods for handling IUPAC symbols in SMOTIF-1, namely DNA pos-lists or IUPAC pos-lists. We find that there is not much difference between the two when 0 or 1 missing components are allowed; but for 2 missing components, IUPAC pos-lists have a slight advantage. In terms of memory space, even though the IUPAC pos-lists approach stores the pos-lists for each distinct symbol that appears in the structured motif, since only one segment is processed at one time, the space utilization of the two approaches is comparable. For example, with no missing components, the IUPAC and DNA pos-lists consume 2.98MB and 2.82MB memory, respectively. Henceforth, we use the IUPAC pos-lists approach.
Figure 11(f) shows the time for SMOTIF-1 and SMOTIF-2 on each of the 5 chromosomes of A. thaliana, with no missing components. The chromosomes are arranged by increasing length on the x-axis. We find that the search time increases linearly with the lengths of the sequences. We also observe that the direct approach, SMOTIF-1, outperforms the two-step approach, SMOTIF-2, when searching for the Copia retrotransposon.
SMOTIF and SMARTFINDER: comparison
Comparison on A. thaliana
To directly evaluate SMARTFINDER's constraint graph approach with our positional join approach, in Figure 12(c), we compare the time for the second step of SMOTIF-2 and SMARTFINDER, after the suffix tree is built in the first step. We observe that SMOTIF-2 is orders of magnitude faster than SMARTFINDER in finding all the occurrences that satisfy the structured gap constraints depending on the number of missing components allowed. We conclude that doing positional joins on the inverted index is an efficient way of enumerating all occurrences.
Search Time for Several Real Motifs
$\mathcal{M}$ | |$\mathcal{O}$| | SMARTFINDER | SMOTIF-1 | SMOTIF-2 |
---|---|---|---|---|
1 | 27 | 17.44s | 3.98s | 3.80s |
2 | 446 | 85.68s | 4.98s | 19.73s |
3 | 507911 | 84.11s | 4.69s | 21.98s |
4 | 283676 | Out of memory | 10.61s | 50.86s |
Searching motifs with long gaps
One application of SMOTIF is to find structured motifs with long gaps. For example, we extracted the motif DNNNNDRYW [2578, 4202]RNNGVHVY, from the same 36 LTRs of A. thaliana mentioned above. Here we retain only the first and last conserved components in the resulting alignment of the 36 sequences. Note the relatively long gap ranges, with minimum gap l = 2578 and maximum gap u = 4202. SMOTIF was able to extract the approximately 83 million full positions, corresponding to 3 million start positions, in just 35.75s (SMOTIF-1) and 15.72s (SMOTIF-2).
Comparison on random motifs
Here we used chromosome 20 of Homo Sapiens as the sequence; it has length 61M base pairs. We generate 100 random structured motifs in the Σ_{IUPAC} alphabet, with k ∈ [3,8] simple motifs of length l ∈ [5,10] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [-5, 100]. Note that the negative minimum gap shows that SMOTIF can mine overlapping simple motifs. Here we also compare the structured profile search approach SMOTIF-P, as follows: for each random motif, we form a profile by first expanding the IUPAC symbols into their corresponding DNA symbols and assign them a random probability of occurrence which accounts for 90% of the share, whereas the other DNA symbols randomly share in the remaining 10%. We use SMOTIF-P to search for the profiles with 2 as the number of core positions in each simple motif, λ^{ c }= 0.1 as the core score threshold and λ = 0.6 as the total score threshold. We use random motifs mainly to demonstrate the effects of various parameters on SMOTIF and SMARTFINDER.
Random Motifs: Mean and Variance
Algorithm | Mean(s) | Variance (s) |
---|---|---|
SMARTFINDER | 44.42 | 24.85 |
SMOTIF-1 (full) | 6.97 | 1.45 |
SMOTIF-1 (start) | 6.93 | 1.46 |
SMOTIF-2 (full) | 10.83 | 5.07 |
SMOTIF-2 (start) | 10.81 | 5.07 |
SMOTIF-P (full) | 9.67 | 2.95 |
SMOTIF-P (start) | 9.66 | 2.97 |
It is interesting to note that SMOTIF is more stable than SMARTFINDER: SMOTIF-P has around 8 times less variance than SMARTFINDER, SMOTIF-2 has around 5 times less variance than SMARTFINDER, whereas SMOTIF-1 has around 17 times less variance than SMARTFINDER. Note also that the overhead in recovering the full positions from the start positions is negligible.
Application: composite regulatory patterns
UASH and URS1H Binding Sites
Genes | UASH | URS1H | Distance | ||
---|---|---|---|---|---|
Site | Pos | Site | Pos | ||
ZIP1 | GATTCGGAAGTAAAA | -42 | ==TCGGCGGCTAAAT | -22 | 20 |
MEI4 | TCTTTCGGAGTCATA | -121 | ==TGGGCGGCTAAAT | -98 | 23 |
DMC1 | TTGTGTGGAGAGATA | -175 | AAATAGCCGCCCA== | -143 | 32 |
SPO13 | TAATTAGGAGTATAT | -119 | AAATAGCCGCCGA== | -100 | 19 |
MER1 | GGTTTTGTAGTTCTA | -152 | TTTTAGCCGCCGA== | -115 | 37 |
SPO16 | CATTGTGATGTATTT | -201 | ==TGGGCGGCTAAAA | -90 | 111 |
REC104 | CAATTTGGAGTAGGC | -182 | ==TTGGCGGCTATTT | -93 | 89 |
RED1 | ATTTCTGGAGATATC | -355 | ==TCAGCGGCTAAAT | -167 | 188 |
REC114 | GATTTTGTAGGAATA | -288 | ==TGGGCGGCTAACT | -94 | 194 |
MEK1 | TCATTTGTAGTTTAT | -233 | ==ATGGCGGCTAAAT | -150 | 83 |
Motif | NNDTBNGDWGDNNDH | ==WBRGCSGCYVW== | [5,179] |
Potential Binding Sites
Genes | UASH | URS1H | Hamming Distance | ||
---|---|---|---|---|---|
Site | Pos | Site | Pos | ||
MES1 | GATTTTGAAGTAGGA | -438 | TTAGCCGCCGA | -246 | 5:MER1 |
YJL045W | TTTTGTGAAGAGATA | -407 | TTAGCCGCTCA | -273 | 4:DMC1 |
HSP60 | GTTTTTGTAGGTATA | -329 | ATAGCCGCCCA | -252 | 5:MER1 |
SPO1 | ATTTTTGAAGTTAAC | -192 | TCAGCGGCTAT | -90 | 5:RED1 |
MEK1 | TCATTTGTAGTTTAT | -233 | TCGGCGGCTAT | -136 | 3:MEK1 |
YIG1 | ATTTCCGGAGTTTTC | -183 | TCGGCGGCTAT | -140 | 5:RED1 |
^{†}AGP1 | CCTTTTGATGACTTT | -786 | TCGGCGGCTAA | -699 | 5:SPO16 |
^{†}AGPl | CCTTTTGATGACTTT | -786 | TCGGCGGCTAA | -668 | 5:SPO16 |
^{†}REC114 | CATTTTGGTGGGTTC | -158 | TGGGCGGCTAA | -94 | 5:SPO16 |
^{†}GNTl | TCATTTGGAGAATAT | -340 | ATAGCCGCCAT | -299 | 5:SPO13 |
^{‡}MEK1 | TTATATGCAGTATAT | -276 | ATGGCGGCTAA | -150 | 4:MEK1 |
^{‡}MMS1 | AACTCTGTAGTTATA | -643 | TGGGCGGCTAA | -497 | 5:REC114 |
Upon further analysis, we found that in fact, the new occurrence in MEK1 (at positions -233,-136) that we found is also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this is not reported in SCPD.
Genes and Significant Gene Ontology (GO) Terms
Genes | Significant GO Terms | p-value |
---|---|---|
MES1, MER1 | RNA metabolism | 5.7e^{-3} |
HSP60, MER1 | nucleic acid binding | 7.9e^{-3} |
SPO1, RED1 | M phase-meiotic cell cycle, meiotic cell cycle, meiosis, M phase | 1.1e^{-3} |
^{‡} MMS1, REC114 | DNA recombination, DNA metabolism | 6.0e^{-3} |
MES1, REC114, ^{†} GNT1, MEK1, MEI4, DMC1, MER1, REC104 | biopolymer metabolism, macromolecule metabolism | 2.3e^{-4} |
REC114, SPO1, MEK1, ZIP1, MEI4, DMC1, SPO13, MER1, REC104, RED1, HOP1 | meiosis, M phase of meiotic cell cycle, meiotic cell cycle, M phase, cell cycle | 1.6e^{-14} |
HSP60, DMC1, RED1, HOP1 | DNA binding | 6.7e^{-7} |
HSP60, DMC1, RED1 | structure-specific DNA binding | 3.1e^{-8} |
HSP60, DMC1 | single-stranded DNA binding | 3.7e^{-6} |
MEI4, DMC1, REC104, REC114, ^{‡} MMS1 | DNA recombination, DNA metabolism | 2.8e^{-6} |
MEI4, DMC1, MER1, REC104, REC114, MEK1, MES1, ^{‡} MMS1 | biopolymer metabolism, macromolecule metabolism | 2.3e^{-4} |
Finally, since we knew that in gene HOP1, the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We search for the original and the reversed motifs using a hamming threshold of 6. We found 34 new binding sites where UASH can appear either up-or down-stream from URS1H. Among these we found two possible potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201,-534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites.
Conclusion
We introduced SMOTIF, a fast and efficient algorithm to search structured pattern and profile motifs in biological sequences. We showed its applications in searching for composite regulatory patterns, long terminal repeat retrotransposons, and for searching long range motifs. SMOTIF is also computationally more efficient than previous state-of-the-art methods like SMARTFINDER [6].
In biosequence analysis there are four related structured motif problems depending on whether the simple motifs and gap ranges in the structured motif are known or not: (i) Structured motif search [4–6]: all simple motifs and gap ranges are known; (ii) Structured motif extraction [30–32]: all simple motifs are unknown and all gap ranges are known; (iii) Extended structured motif search: all simple motifs are known and all gap ranges are unknown; and (iv) Extended structured motif extraction: all simple motifs and gap ranges are unknown. In this paper we tackled problem (i). A companion paper [33] tackles problem (ii). In the future, we plan to develop efficient algorithms for the other two motif problems as well.
Declarations
Acknowledgements
This work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, and NSF grants EIA-0103708 & EMT-0432098. We thank the anonymous reviewers for their many helpful suggestions.
Authors’ Affiliations
References
- Zhu J, Zhang M: SCPD: A Promoter Database of the Yeast Saccharomyces Cerevisiae. Bioinformatics. 1999, 15 (7–8): 607-11.PubMedView ArticleGoogle Scholar
- Jurka J, Kapitonov V, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110 (l–4): 462-467.PubMedView ArticleGoogle Scholar
- Mehldau G, Myers G: A System for Pattern Matching Applications on Biosequences. Computer Applications in the Biosciences. 1993, 9 (3): 299-314.PubMedGoogle Scholar
- Myers E: Approximate Matching of Network Expressions with Spacers. J Comput Biol. 1996, 3 (1): 33-51.PubMedView ArticleGoogle Scholar
- Navarro G, Raffinot M: Fast and Simple Character Classes and Bounded Gaps Pattern Matching, with Applications to Protein Searching. J Comput Biol. 2003, 10 (6): 903-23.PubMedView ArticleGoogle Scholar
- Policriti A, Vitacolonna N, Morgante M, Zuccolo A: Structured Motifs Search. Int'l Conf on Research in Computational Molecular Biology. 2004, 133-139.Google Scholar
- Morgante M, Policriti A, Vitacolonna N, Zuccolo A: Structured Motifs Search. Tech Rep UDIMI/15/2003/RR. 2003, University of Udine,Google Scholar
- Michailidis P, Margaritis K: On-line Approximate String Searching Algorithms: Survey and Experimental Results. International Journal of Computer Mathematics. 2002, 79 (8): 867-888.View ArticleGoogle Scholar
- McCarthy E, McDonald J: LTR_STRUC: A Novel Search and Identification Program for LTR Retrotransposons. Bioinformatics. 2003, 19 (3): 362-367.PubMedView ArticleGoogle Scholar
- Feschotte C, Jiang N, Wessler S: Plant transposable elements: where genetics meets genomics. Nature Reviews Genetics. 2002, 3 (5): 329-41.PubMedView ArticleGoogle Scholar
- Giegerich R, Kurtz S, Stoye J: Efficient Implementation of Lazy Suffix Trees. 3rd Workshop on Algorithmic Engineering. 1999, 30-42.View ArticleGoogle Scholar
- Gusfield D: Algorithm on Strings, Trees, and Sequences: Computer Science and Computational Biology. 1997, Cambridge University Press,View ArticleGoogle Scholar
- Inenaga S: String Processing Algorithms. PhD thesis. 2003, University of Zurich, Department of InformaticsGoogle Scholar
- Karp RM, Miller RE, Rosenberg AL: Rapid identification of repeated patterns in strings, trees and arrays. ACM symposium on Theory of computing. 1972, 125-136.Google Scholar
- Ukkonen E: Approximate String-Matching over Suffix Trees. Combinatorial Pattern Matching Conference. 1993, 228-242.View ArticleGoogle Scholar
- Ukkonen E: Finding Approximate Patterns in Strings. J Algorithms. 1985, 6: 132-137.View ArticleGoogle Scholar
- Landau GM, Vishkin U: Fast String Matching with k Differences. J Comput Syst Sci. 1988, 37: 63-78.View ArticleGoogle Scholar
- Myers G: A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM. 1999, 46 (3): 395-415.View ArticleGoogle Scholar
- Kel A, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis O, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research. 2003, 31 (13): 3576-3579.PubMedPubMed CentralView ArticleGoogle Scholar
- Chekmenev D, Haid C, Kel A: P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Research. 2005, W432-W437. 33 Web Server
- Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Research. 1995, 23 (23): 4878-4884.PubMedPubMed CentralView ArticleGoogle Scholar
- Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005, 21 (13): 2933-2942.PubMedView ArticleGoogle Scholar
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research. 2003, 31: 374-378.PubMedPubMed CentralView ArticleGoogle Scholar
- Zaki M: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal. 2001, 42 (1/2): 1-31.View ArticleGoogle Scholar
- Zaki M: Sequence Mining in Categorical Domains: Incorporating Constraints. ACM Int'l Conference on Information and Knowledge Management. 2000, 422-429.Google Scholar
- Sellers PH: On the theory and computation of evolutionary distances. SIAM J Appl Math. 1974, 26: 787-793.View ArticleGoogle Scholar
- Wu T, Nevill-Manning C, Brutlag D: Fast Probabilistic Analysis of Sequence Function Using Scoring Matrices. Bioinformatics. 2000, 16 (3): 233-244.PubMedView ArticleGoogle Scholar
- Thakurta D, Stormo G: Identifying target sites for cooperatively binding factors. Bioinformatics. 2001, 17 (7): 608-621.View ArticleGoogle Scholar
- Saccharomyces Genome Database Gene Ontology Term Finder.http://www.yeastgenome.org
- Marsan L, Sagot M: Extracting Structured Motifs Using a suffix Tree – Algorithms and Application to Promoter Consensus Identification. Journal of Computational Biology. 2000, 7: 345-354.PubMedView ArticleGoogle Scholar
- Carvalho A, Freitas A, Oliveira A, Sagot M: Efficient Extraction of Structured Motifs Using Box-links. String Processing and Information Retrieval Conference. 2004, 267-278.View ArticleGoogle Scholar
- Carvalho A, Freitas A, Oliveira A, Sagot M: A highly scalable algorithm for the extraction of cis-regulatory regions. Asia-Pacific Bioinformatics Conference. 2005, 273-283.View ArticleGoogle Scholar
- Zhang Y, Zaki MJ: EXMOTIF: Efficient Structured Motif Extraction. Algorithms for Molecular Biology. 2006, 1: 21-PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.