Direct vs 2stage approaches to structured motif finding
 Maria Federico^{1},
Affiliated with
 Mauro Leoncini^{1, 2}Email author,
Affiliated with
 Manuela Montangero^{1, 2} and
Affiliated with
 Paolo Valente^{1}
Affiliated with
DOI: 10.1186/17487188720
© Federico et al.; licensee BioMed Central Ltd. 2012
Received: 27 October 2011
Accepted: 25 July 2012
Published: 21 August 2012
Abstract
Background
The notion of DNA motif is a mathematical abstraction used to model regions of the DNA (known as Transcription Factor Binding Sites, or TFBSs) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, DNA structured motifs are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulations processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or simple) motifs, separated by a variable, but somewhat constrained number of “irrelevant” basepairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs.
Results
We describe a computational tool, named SISMA, for the denovo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2stage approach w.r.t. direct methods. In fact, no 2stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, exhibiting a good performance also in most of the cases in which it was inferior.
Conclusions
A reflection on the results obtained lead us to conclude that a 2stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2stage tools in parallel, halting the computations when the first halts.
Keywords
Structured motif TFBS discovery Combinatorial algorithmsBackground
Understanding the complex mechanisms that regulate gene expression is a pivotal problem in molecular biology. Gene transcription [1] starts when one or more regulatory proteins bind DNA regulatory elements, which are mostly located in the promoter region nearby the transcription start site (TSS) of genes, or also further apart in eukaryotic organisms (e.g. enhancers, silencers). In eukaryotes, DNA binding proteins are called transcription factors (TFs) and regulatory elements, to which they bind, are known as transcription factor binding sites (TFBSs).
In lower eukaryotes TFBSs are usually short DNA strings (525 base pairs long) bound by a single TF, that frequently appear, with possibly some mutations, upstream of the TSS in the proximal promoter region of coregulated genes.
In higher eukaryotic organisms, transcription regulation is more complex and TFBSs are more difficult to characterize [2]. There may be multiple binding sites for a single TF in a single gene’s promoter region; there can be great variability in the binding sites of a single TF; the regulatory elements may be located also several kilobases away from the TSS, either upstream or downstream or in the introns of the genes that they regulate [3], and in this case they are often organized in functional groups (called cisregulatory modules) [2] bound by several interacting TFs in a cooperative or antagonistic way.
Being able to identify TFBSs is crucial to our understanding of the mechanisms that regulate gene expression (e.g., chronology and cellspecificity of transcription [4]), and of the functions of individual genes regulated by newly discovered TFBSs [2]. Also, mutations in TFBS underlie several degenerative human diseases (e.g., all forms of cancer) and constitute a substantial component of the phenotypic variability within and across species [5].
Structural and functional information on mechanisms of interaction between TFs and their binding sites are provided by experimental techniques, which are costly and timeconsuming.
TFBS discovery as an algorithmic problem
The identification of (possible) functional sites can be formulated as an algorithmic problem, provided a mathematical abstraction is given to model TFBSs. Two of the most popular such models are Position Specific Score Matrices (PSSM) and Hamming distance (HD) models (see, e.g., [6, 7]). Here we will adopt the HD model, which we now briefly recall.
Computational motif discovery (or motif finding) can be defined as the task of inferring the mathematical abstractions subject to the identification of the occurrences (i.e., the potential binding sites) in the input sequences. The typical input to a motif discovery program under the HD model includes a pair (ℓ,e), which describes the length of the consensus and the maximum substitutions allowed in the occurrences, respectively.
Motif discovery is a very difficult problem [8], since the space of possible occurrences may be huge. The inverse problem, i.e., finding the occurrences given a motif definition, is called motif search and is instead comparatively much easier than discovery.
where, for all admissible j and k, ℓ _{ j }and e _{ j }constrain the motifs that can occur as box j according to the HD model, while d _{ k } and D _{ k }are lower and upper bounds on the number of nucleotides between box k and box k + 1. Figure 1b illustrates the concept in case of b = 3.
Focus of the paper
There is an already huge literature on motif discovery (see, e.g., [6, 8–19] and also the references contained in the survey papers [20, 21]). However, for our purposes the proposed algorithmic solutions fall into two classes: (1) optimization algorithms (either deterministic or probabilistic), and (2) enumerative, exact algorithms. Algorithms of the first class seek motifs that optimize a certain scoring function, usually exploring only a limited portion of the space of all possible motif candidates (see [6, 9, 10, 22] for influential works). On the other hand, enumerative algorithms exhaustively search the motif space^{a}[12, 14, 16, 23, 24].
A fundamental component of exact methods is what we can term the enumeration engine, i.e., the algorithm adopted to generate all the possible candidate motifs to be later evaluated on some statistical basis (for another example, see [25]). Actually, some exact motif finding tools have been proposed which are just enumeration engines, simply returning all the motifs that satisfy the input constraints [7, 26–28]. Clearly, an appraisal of any such engine depends on its computational efficiency only.
The availability of enumeration tools is useful both because they can be taken as building blocks for more sophisticated finders and because they inspired (and still inspire) research in the whole field of exact methods. In this respect, it is worth observing that the tool which best behaved in the now famous assessment by Tompa et el. [29], namely Weeder, borrows its enumeration engines from [7].
In this paper we concentrate our attention on enumerative algorithms for structured motif discovery in a set of input DNA fragments. In particular, we focus on enumeration engines and base our analysis on running time and (to a lesser extent) memory consumption. We pay no attention to the “quality” of the results, simply because the output only depends on the input constraints posed to the motifs being sought. Running time is instead especially critical since faster enumeration leaves more room for post processing (i.e., picking the motifs deemed more likely to represent functional sites).
Before describing the contribution of our work, we analyze in more details the results presented in the literature that deal with enumeration engines for structured motif discovery.
Related work
Existing algorithms are essentially based on one of two possible approaches: (1) directly explore the search space of structured motifs, or (2) first extract the simple motifs that may occur as boxes (using any available simple motif finder) and then “assemble” them into structured motifs that satisfy box order and distance constraints. We shall refer to the latter as to the 2stage approach.
A wellknown potential advantage of directly exploring the space of structured motifs is that the combined boxes, together with the distance constraints, may be strong enough to quickly emerge, possibly together with few others spurious structures, even though each single box is a weak signal (see, e.g., [15]). We point out that the most efficient direct approach algorithms makes use of the (generalized) suffix tree data structure [30, 31].
The 2stage approach was first mentioned by Marsan and Sagot [32], who nonetheless deemed it impractical due to the high resource consumption. Recently, however, it was reconsidered by Zhou et al. [28], who provided much tighter theoretical upper bounds on the runtime and space complexity. They designed the Ecomp algorithm and showed it to be more efficient than more sophisticated exact methods in their experimental settings.
Some available exact motif finders require that at least one instance of the motif be exact, i.e., that it actually appears in one of the input sequences. This leads to a reduction of the motif search space with ensuing time and space savings. This simplified version of the problem is called Frequent (Structured) Motif Discovery problem [26].
SMILE, RISO and RISOTTO
SMILE [32] is a family of algorithms designed to solve slightly different variants of the structured motif discovery problem on set of input sequences. SMILE extends to structured motifs the algorithmic ideas on simple motif enumeration presented in [7]^{b}. To explore the space of possible structured motifs, SMILE uses a generalized suffix tree of the input sequences together with a (virtual) lexicographic tree of all possible simple motifs. Improvements to SMILE are presented by Carvalho et al. [33]. Their Riso algorithm exhibits an exponential time and space gain over SMILE in the worst case. Riso works on a variation of the generalized suffix tree (called generalized factor tree) [34], built only up to the box length level, with some extra information used for fast update of the tree. Riso’s computational complexity is exponential with respect to the number of boxes and their lengths, but it does not depend on interbox distances, thanks to the use of box links.
A further improvement is achieved by Risotto[27], although only on the average, thanks to it’s ability to quickly detect dead ends (i.e., words that cannot possibly be extended to a valid motif). In practice, Risotto is more than twice faster than Riso and, to the best of our knowledge, also the most efficient algorithm for exact enumeration of structured motifs composed of any number of boxes. For this reason, we will take Risotto as our primary competitor in the experimental tests.
ECOMP
Ecomp is a general, 2stage algorithm that uses Mitracount [15] to find all simple motifs and the starting positions of all their occurrences^{c}. In the second step, the algorithm looks for dyads by checking all pairs of occurrences of simple motifs, keeping and counting only those satisfying the distance constraint. At the end, Ecomp outputs only the dyads satisfying the quorum constraint.
Unfortunately, Ecomp is fully described and tested only for dyads and source code is apparently not available.
ExMotif
The 2stage approach is also used by Ex Motif[26] to solve the frequent structured motif extraction problem. Ex Motif’s main data structures are lists that store the positions of patterns appearing in the sequences. These lists are repeatedly intersected in order to find motifs that satisfy the input constraints. The output produced by Ex Motif includes the structured motifs satisfying the quorum constraint and only the positions of their exact occurrences. The computational cost of Ex Motif, as reported in [26], is exponential with respect to the number and the length of boxes.
Results
Our contribution is twofold. We present a novel 2stage algorithm, called SISMA, that enumerates all the structured motifs conforming to input specifications. More precisely, we describe two different software tools that implement SISMA’s ideas. The first version, called SISMA_Smile, solve the “unconstrained” enumeration problem, while the second one, named SISMA_Speller, addresses the frequent structured motif discovery problem. We compare the performances of SISMA_Smile (resp., SISMA_Speller) against those of Risotto (resp. Ex Motif) on a comprehensive dataset composed of both synthetic and real biological data. The experimental results show that our tools are competitive in enumerating spaces of structured motif candidates.
We also try to go one step further and reflect on the relative merits of direct vs 2stage approaches for structured motifs finding. The latter enjoy some potential design advantages, such as modularity and ease of parallelization (see the concluding section for more on these aspects). However, the argument of computational inefficiency has often been used to discourage their active use. From our experiments here we can not devise strong arguments in favor of any of the two approaches. It is true that, in some circumstances, direct methods can explore spaces which are beyond the capabilities of a 2stage algorithm. However, in other cases our 2stage approach software results much faster than the competitor direct tool. In the concluding section we will give some guidelines (depending on parameter sets) to possibly assist the users to choose the most suitable tool for the problem instances at hand.
Methods
In this section we describe the implementation of SISMA (“Successive Intersection of Simple Motifs Apart”), a structured motif finder based on the 2stage approach.
 1.
the set of sequences, in Fasta format, where the motifs must be found;
 2.
the number b ≥ 2 of boxes (simple motifs) which the structured motifs will be made of;
 3.
an ordered set of b pairs of integers: (ℓ _{ i },e _{ i }), i = 1,…,b, such that ℓ _{ i }is the length of the i ^{th} box and e _{ i }the corresponding number of admissible errors;
 4.
for each pair of consecutive boxes, say the i ^{th} and i + 1^{th} ones, a pair of integers (d _{ i },D _{ i }) that specify the minimum and maximum number of bases, respectively, that may separate the two boxes, i = 1,…,b − 1.
 5.
a value q ∈ (0,1] (the socalled quorum) that specifies the minimum fraction of input sequences that must contain an instance for the structured motif to be considered valid.
The output of the algorithm is made of all the possible structured motifs that conform to input specifications.
SISMA is implemented in C++ and its source code is available for download from http://algo.ing.unimo.it/mf/.
Basic implementation
SISMA stores simple and structured motifs using vector data structures that make it possible to perform list intersections and filtering operations (i.e., the distance and quorum checks described below) in a very efficient way (technical details can be found in Additional file 1).
In current implementations, SISMA comes in two versions, which will be referred to as SISMA_Smile and SISMA_Speller, solving the structured and the frequent motif discovery problems, respectively.
Stage 1
SISMA_Smile first calls SMILE [32] for simple motif discovery. To the best of our knowledge, SMILE is the only tool available for download which is exact and that returns the positions of all the occurrences of found motifs^{d}.
SISMA_Speller uses our implementation of the Speller[7] algorithm, which returns simple motifs with at least one exact match in the input sequences, together with all their occurrences. We decided for a new Speller implementation because, apparently, there is no available tool with these characteristics^{e}.
Independently of the simple motif finder adopted, the output of the first stage is a set of simple motifs with associated position lists of their occurrences, each one ordered by increasing sequence indexes and increasing positions within the sequence. Logically, motifs returned by stage 1 are classified into b subsets, denoted by K _{ i },i = 1,…,b, such that m ∈ K _{ i } if and only if m can be the i ^{th} box of one of the structured motifs being sought. Each set K _{ i }is maintained as a vector data structure, each cell of which in turns stores a pointer to a vector containing all occurrences of exactly one m ∈ K _{ i }, ordered by increasing input sequence index and increasing position in each sequence.
Observe that there is no filtering process of simple motifs found in this stage, because there is apparently no relations between significance of simple motifs and significance (or mere existence) of structured motif, as clearly stated in [35]. In particular, structured motifs might exist (and reach quorum) only because they contain weak simple motifs.
Stage 2
For i = 0,…,b, we will use the term i prefix to denote any structured motif made of i boxes that could possibly be extended to a full structured motif conforming to the problem specification.
In the second stage, which is divided in b steps, SISMA builds prefixes of increasing length, starting from the empty prefix in step 1 and ending with bprefixes (i.e., full structured motifs) in step b. Basically, at generic step i, SISMA considers all (i − 1)prefixes p and all motifs m ∈ K _{ i } to assemble possible iprefixes r = p − (d _{ i },D _{ i }) − m. The computed iprefixes are stored in a vector data structure, analogously to what is done with simple motifs.
During a prefix assembly step, SISMA checks distance and quorum constraints, in order to discard, as early as possible, prefixes that could not possibly be extended to full structured motifs.
Distance check: for each potential iprefix r being built, r = p − (d _{ i },D _{ i }) − m, and any sequence s, SISMA performs a binary search on the sorted occurrences of m in s in order to find the first occurrence that satisfies the minimum distance constraint. Then, all the subsequent occurrences of m are considered, until one is found that violates the maximum distance constraint. In this way, SISMA builds all occurrences of r.
Quorum check: the iprefix r is discarded if its occurrences appear in less than qS input sequences. In fact, it is obvious that prefix extension can only reduce the eventual motif quorum.
Options
The basic implementation has been enhanced with some options that might be used to have even more efficient second stage runs, under some circumstances. Implementation details can be found in Additional file 1.
Box index selection option
When this option is selected, SISMA builds structured motifs by considering boxes of increasing total number of occurrences and not by increasing index order. In more pictorial terms, the full (final) structured motifs are not determined by assembling longer and longer prefixes but rather longer and longer structured motif “subsequences”.
where oc c _{ i,t } is the set of occurrences of motif m _{ t }∈ K _{ i }, t = 1,…,K _{ i }. SISMA then sorts the sets {B _{ i }}_{ i = 1,…,b } and forms a list L with the corresponding box indexes, i.e., if L _{ i }= j and L _{ i + 1} = k, then B _{ j }≤ B _{ k }, i = 1,…,b − 1. Then, it uses the list L in stage 2 to determine the order with which to add the boxes to the structured motifs being built.
This option allows to limit the number of useless intermediate structured motifs (iprefixes) that are generated (i.e., those that eventually would be discarded, because they either could not be extended to full bboxes or would not satisfy the quorum constraint). This is particularly effective when the expected number of structured motifs in the output is not large, which is likely to happen, e.g., when the box length is large and a small number of errors are admitted, or simply when the number of boxes is large. There is a trade off between the slowdown introduced to store and handle extra information needed to implement this option and the speedup obtained by reducing the number of useless intermediate structured motifs. If the output is large, the slowdown is predominant; in contrast, if the output is small, speedup is predominant.
In practice, this option helped SISMA to drastically reduce the outofmemory failures, especially on synthetic data.
Spacesaving option
Sometimes the output to be generated is very large. This happens, e.g., when the first stage has returned a huge number of simple motifs occurrences. In turns, this can be a consequence of the particular values of the input parameters, such as very short motif lengths and/or relatively small difference between length and number of available errors. Under these circumstances, SISMA basic implementation, which keeps all the intermediate motifs (including the simple ones) in main memory, may fail due to memory shortage.
To cope with this situation, especially on low memory PCs, SISMA can be run with a specific option that produces the output in distinct slices, and that requires less main memory to produce the output of each slice. Given an integer v as the value of the spacesaving option parameter, for i = 1,…,b, each set of motifs K _{ i } is partitioned into ⌈K _{ i }/v⌉ subsets , each one containing at most v distinct motifs. The second stage is then run once for each possible element in the Cartesian product . In this way SISMA drastically reduces the number of intermediate structured motifs generated and usually avoids outofmemory failures at the price of a moderate slowdown (see the Computational cost section).
The only cases that this variant of SISMA is not able to handle are those in which the memory shortage is due to stage 1, i.e., when the number of simple motifs (and their occurrences) is simply too large to fit in memory (we will see that this happens in few very difficult instances on synthetic data).
Occurrence print option
SISMA might be instructed to output the starting positions of all occurrences of the discovered structured motifs.
Not using this option allows a fair comparison with tools that do not print all motif occurrences, but just the motif definitions.
Computational cost
The time computational cost of SISMA is given by the cost of simple motif extraction plus that of occurrence list intersections. Here we will first refer to the basic version of SISMA, with only briefly mentioning the various options at the end of the section.
With the current implementation, the simple motif extraction tool must be run once for each different single box template (i.e., for all different (ℓ,e) pairs). Both SMILE and our modified version of Speller have worstcase time complexity in O(N t _{ ℓ } ν(e,ℓ)), where t _{ ℓ } is the number of suffix tree nodes at depth ℓ, N is the number of input sequences, and ν(e,ℓ) is the number of words of length ℓ that differ in at most e letters from a word m of length ℓ. It holds that ν(e,ℓ) ≤ ℓ ^{ e }Σ^{ e }. Hence, the time complexity is linear in the input size, but possibly exponential in the number e of substitutions. Thus, as we are working with the DNA alphabet, the first stage takes .
Equation (2) clearly shows that the running time of the second stage depends essentially on the number of occurrences of simple motifs and that of intermediate structured motifs. Note, however, that if there is a large number of simple motifs the cost of first stage is high as well. Low cost of the first stage and high cost of second stage are possible only if there are relatively few simple motifs but many intermediate structured motifs. This is in principle possible, but in practice it hardly happens due to order and distance constraints, as the computational experiments clearly indicate. In practice, thus, the cost of extracting structured motifs is comparable with that of simple motif finding, at least when the starting positions of all the occurrences are required. As for the memory space used, it is not difficult to see that this is the maximum between: (1) SMILE space complexity needed to generate all the motifs occurrences for each box, and (2) the space needed to generate jprefixes occurrences, i.e., O(max_{ j∈[2.b−1]}{S _{ j } + B _{ j + 1}}).
We now briefly consider options. As for index selection, it can be easily seen that the handling of more complex data structures in main memory introduces, in the worst case, time and memory penalties linear with the number b of boxes. In practice, however, nonworst case instances might run much faster with this option activated (see Options section). For what concerns the space saving option, it can be proved that the slowdown is constant, although the exact figures depend on low level implementation issues. On the machine used to perform the experiments, the running times with space saving activated were almost four times higher. We must point out, however, that the intended use of this option is just to avoid outofmemory failures, and these can be regarded as infinite time computations. Then, in these cases the option can be thought to provide (sometimes) “unbounded” speedups. In this case, space requirement is dominated only by SMILE space requirement. Finally, SISMA actually generates all occurrences during computation, and hence printing them takes only linear time in the number of occurrences.
Results
We have performed a series of computational experiments on both synthetic and real biological data with a twofold goal: (1) compare the direct and 2stage approaches using the best available algorithm (Risotto) for the former and our SISMA_Smile code for the latter; (2) compare SISMA_Speller with Ex Motif[26], the only exact tool for the frequent structured motif extraction problem which adopts the 2stage approach and whose code is available (see the Related work section)^{f}.
We performed all the experiments on an uniprocessor AMD Athlon 64 3200+ with 1GB of RAM, forcing a timeout of twelve hours for the execution of each tool.
Tests on synthetic data
In this section we report the results of tests performed on synthetic data, which are often used to validate the effectiveness of existing methods in a fully controlled experimental setting, and to experimentally evaluate their scalability properties. In particular, we generated synthetic data sets according to the so called Planted Motif Problem (PMP) [36] in the following way:
Sequence generation: we randomly generated the sequences of the input set S, assuming the characters of each sequence be i.i.d. and with equal probability (0.25) assigned to each symbol. According to [36], the data sets contain 20 sequences of 600 characters each.
Structured motif planting: we selected the number b of boxes and the b pairs (ℓ _{ i },e _{ i }) using different rules, which we will specify when describing the experiments. We generated distance constraints at random, making sure that the total maximum distance between the first and last box fit into the sequences. For each pair (ℓ _{ i },e _{ i }), defining the template for a simple motif m _{ i }, i = 1,…,b, we first selected a random word w _{ i }(the “exact” instance of m _{ i }) and then generated S occurrences of m _{ i }, at Hamming distance ≤ e _{ i }from w _{ i }, by substituting e _{ i }characters of w _{ i }with characters from Σ chosen uniformly at random. Finally, we planted the occurrences, one per sequence, by respecting box order and distance constraints (but otherwise at random). When generating a dataset to be tested using Ex Motif, we planted at least one exact structured motif occurrence.
The parameters used to built the dataset where then used to run the motif finding tools on that dataset. The quorum is set to q = 1.0 in all tests.
The data set generating process outlined above produces boxes that are instances of the PMP. There is a wide literature on the PMP, especially for single motif extraction (see, e.g. [12, 37, 38]). Preliminary results relative to the structured motif extraction settings can be found in [39] and, limited to dyads, in [15, 28]. Using a simple model [37], we can estimate the number E(ℓ e) of simple motifs that one expects to find in a randomly generated sequence, depending on the length of the motif and on the number of allowed errors. Expectation of pairs makes some instances easier to solve than others. When talking about “difficult” instances we will refer to ones in which the expected number of randomly found motifs is high.
Experimental settings: Here we present results concerning a set of tests in which we planted boxes with variable lengths and number of allowed substitutions, randomly chosen among those with expectation close to one (i.e. for which the planted motif and a little number of other random motifs can be expected), over the following pairs: (9, 2), (10, 2), (11,2), (11, 3), (12, 3), (13, 3), (14, 1), (14, 2), (14, 3), (14, 4), (15, 1), (15, 2), (15, 3), (15, 4), (15, 5).
We varied the number b of boxes between 2 and 10 and ran the algorithms 20 times on different datasets. We ran SISMA with the box index selection option, which resulted very effective in this set of experiments.
Further results on synthetic data are briefly reported at the end of this section and, in details, in Additional file 2.
Results and discussion: To compare pairs of tools, we used two different measures: (1) wincount, i.e., given a common value of b, the number of times one tool outperformed the other; (2) running times: we report best, worst and average running times, as well as standard deviations, for each value of b used^{g} (we computed means and standard deviations omitting the best and worst time runs). Moreover, we separately computed the above measures by considering either all runs, or only runs where both tools ended computations.
SISMA_Smile vs Risotto
From Figure 2a we can see that no tool is definitely better than the other. Risotto is usually more competitive for small number of boxes (up to six), but turns significantly less competitive as the number of boxes increases. Moreover, Risotto failed on few runs even with relatively few (i.e., six or more) boxes, usually when the first boxes are hard instances of the PMP.
On the other hand, SISMA_Smile was almost as good as Risotto for small number of boxes while it handled larger problem instances definitively better. In the majority of tests, SISMA_Smile ended computation before Risotto and it failed only once due to memory shortage in the first stage (because of too many simple motif occurrences).
The best case was almost always favorable to Risotto, and there was no difference for what concerns best case when considering all runs or only those without failures. In the best case, Risotto ended computation in less than 10 seconds (with the exception of ten boxes, where the best run took 24 secs), while Risotto ended computation in less than 6 seconds for b ≤ 5 and in about 30 seconds for b > 5, never exceeding 35 seconds.
 (a)
In the worst case, SISMA_Smile is much faster than Risotto: the longer run for SISMA_Smile took less than 74min, while for Risotto some runs took more than 10h, even for relatively small number of boxes and excluding failures. Hence, even when Risotto defeated SISMA_Smile, the latter was still relatively fast. The opposite was not always true.
 (b)
Although average running times should be analyzed with care, Risotto showed average running times much worse than SISMA_Smile’s, even without considering failures. For instance, in case of ten boxes Risotto’s average runtime was around 1hand a half, while SISMA_Smile took less than 25min.
 (c)
Risotto showed a greater variance across all these data.
We further investigated the structure of the instances in which one tool outperformed the other in order to better understand advantages and disadvantages that may be typical of the direct and 2stage approaches (see an example in Figure 4). We anticipate that the key factor is the first stage of SISMA_Smile.

SISMA_Smile outperforms Risotto when SISMA_Smile first stage is fast. This happens mainly for two reasons (that might happen simultaneously): (i) there is a small total number of simple motifs and SMILE running time is low. (ii) Boxes are characterized by the same pair (length, errors), and hence SMILE is run only once for each pair.

Risotto outperforms SISMA_ Smile when the first stage of simple motif extraction is slow due to boxes producing a large number of simple motifs. This situation might also affect SISMA_Smile second stage: a large number of intermediate structured motifs means a time consuming occurrence list intersection stage. In some cases the phenomenon is almost completely eliminated using the box index selection option.
Finally, a closer inspection on Risotto’s behavior shows that its running time may be highly affected by the positions of boxes with large search spaces, e.g., box (14,4). We performed a set of experiments in which we searched for planted long structured motifs characterized by the same boxes (number and positions), with just one “floating” (14,4) box, which we moved from first to last position. While the details can be found in Additional file 2, we observe here that Risotto’s performance degraded considerably, while SISMA_Smile’s behavior was essentially unaffected by the (14,4) box position.
SISMA_Speller vs ExMotif
We have no results for the comparison of SISMA_Speller and Ex Motif, because the latter never ended computation within this experimental settings.
We explain this negative behavior observing that the set of pairs (ℓ,e) among which we chose the boxes of planted structured motifs contained several pairs characterized by large values of ℓ and e, which Ex Motif is apparently not able to address.
On the other hand, SISMA_Speller never failed within this experimental setting, exhibiting low running times. The worst case run (even for ten boxes) was never above three minutes while the best case runs lasted around three seconds for two boxes and about 33 seconds in case of ten boxes.
Results for other synthetic experiments
we conclude this section by mentioning the results obtained for two other synthetic data sets (the details can be found in Additional file 2).

We tested the tools on “presumably” (i.e., à priori) easy instances for SISMA, where the structured motifs sought were composed by boxes of the same type (i.e., same length and number of errors). Indeed SISMA_Smile always outperformed Risotto when both tools ended computation, while Ex Motif outperformed SISMA_Speller on input instances with very small values of ℓ,e, and b. For larger values, however, Ex Motif did not end computations, while SISMA_Speller failed only when boxes were any of the known very hard PMP instances.

The other data set was composed of presumably very hard instances, according to the PMP classification. We observed a relatively large number of failures, both of SISMA_Smile and Risotto. The reasons were essentially those already observed, but interestingly enough, the two tools did not (usually) fail on the same instances, meaning that a difficult instance for one tool might not be so difficult for the other. Ex Motif did end computation only on a limited number of instances and only in very few cases it outperformed SISMA_Speller.
Tests on real biological data
In this section we report the results of experiments performed on three different datasets composed of upstream regions of coregulated genes of the Saccharomyces cerevisiae in order to extract motifs representing transcription factor binding sites.
UASHURS110 dataset
The dataset was drawn from [13]. It contains the upstream sequences of 11 meiotic genes of the Saccharomyces cerevisiae which are cooperatively regulated by the transcription factors URS1H and UASH involved in the meiotic expression during sporulation.
These 11 genes are listed in SCPD [40]. In 10 out of the 11 genes, the URS1H binding site appears downstream from UASH site and both sites are located within the upstream region 300 to 1. We included these ten regions in the dataset. We do not included a sequence for the remaining gene (HOP1) since there the binding sites are reversed and the URS1H site is placed much further upstream compared to all the other genes in the set.
We designed the same experimental settings as in [26], except for the distance gap between the two sites. We chose a larger gap range with respect to [26] in order to approach the problem in a more realistic way, in which information about the binding site being sought may not be known. We look for structured motifs of the form (3,1)[1,1](5,2)[1,200](9,1). We required that structured motifs occurred in at least 7 sequences (quorum q = 70%). SISMA was run with the spacesaving option.
UASHURS15 dataset
In the 10 genes dataset discussed above, the two binding sites occur within at most 200 bases. However, as GuhaThakurta and Stormo suggest in [13], that gene sequences can be equally divided in two groups based on the average distance between UASH and URS1H sites. According to this, we obtained a group of 5 genes in which the binding sites are within 50 bases of each other.
We reproduced the experimental setup defined in [28]: we analyzed the five sequences in UASHURS15 and looked for dyad motifs of the form (7,1)[1,50](10,2) (again, we actually used a larger distance gap with respect to [28] in order to approach the problem in a more realistic way). Quorum was set to 80%, i.e., at least 4 sequences.
KAR4P dataset
This dataset contains 23 genes of the Saccharomyces cerevisiae which are coregulated by the KAR4p transcription factor required for gene regulation in response to pheromones. We obtained the list of 23 coregulated genes from the YEASTRACT database [41] and the upstream regions of those genes using the RSAT [42] retrieve sequence tool.
We deduced the characteristics of the KAR4p binding site from the consensus given in YEASTRACT and we looked for structured motifs of the form (3,1)[2,2](4,1)[2,2](3,1)[1,1](2,1) occurring in at least 68% of the input sequences, i.e., in at least 16 sequences.
Discussion
Figure 5 reports the running times (in seconds) for all the four tools, while Table 1 reports the number of simple and structured motifs found for each dataset and the average number of occurrences. We can make the following general observations:

Ex Motif terminated computations in all the experiments, contrary to what happened on synthetic datasets. Here, structured motifs are composed of few boxes with few substitutions allowed, experimental conditions extremely favorable to Ex Motif.

Stage 2 is predominant on SISMA’s running time (see Table 1), in contrast with what we observed on the synthetic datasets, because here (except for UASHURS15) there is a large number of occurrences for each box, making occurrence lists intersection a demanding task.

SISMA’s stage 2 is characterized by running times which increase with the number of boxes and with the total number of occurrences, coherently with the theoretical bound (see the Methods section).
Number of simple and structured motifs on biological datasets
UASHURS1_5  

SISMA_smile  SISMA_speller  
Box (7,1)  1,452 (∼10)  420 (∼10)  
Box (10,2)  5,472 (∼5)  92 (∼5)  
Structured Motif  
(7,1)  [1,50]  (10,2)  16,662 (∼5)  14 (∼5)  
UASHURS1_10  
SISMA_smile  SISMA_speller  
Box (3,1)  64 (∼1000)  64 (∼1000)  
Box (5,2)  1,024 (∼1000)  942 (∼1000)  
Box (9,1)  103 (∼10)  55 (∼10)  
Structured Motif  
(3,1)  [ 1,1 ]  (5,2)  [ 1,200 ]  (9,1)  2,309,173 (∼ 70)  7,241 (∼ 70)  
KAR4P  
SISMA_smile  SISMA_speller  
Box (3,1)  64 (∼2000)  64 (∼2000)  
Box (4,1)  256 (∼1000)  256 (∼1000)  
Box (2,1)  16 (∼6000)  16 (∼6000)  
Structured Motif  
(3,1)  [2,2] (4,1)  [2,2] (3,1) [1,1]  (2,1)  101,750 (∼ 50)  858 (∼ 50) 
The tools under consideration exhibited very different behaviors on the three datasets, so that there is not a clear “overall” winner.
UASHURS15
SISMA_Smile outperformed Risotto and SISMA_Speller outperformed Ex Motif. Running times and differences in running times are really small, meaning that this instance of the problem is not really challenging for any of the tools: SISMA deals with a small number of simple/structured motifs; Risotto drastically reduces the search space starting form the second box on; Ex Motif deals with a small number of boxes and allowed substitutions.
UASHURS110
On this dataset SISMA_Speller outperformed Ex Motif, while SISMA_Smile without the spacesaving option ran out of memory. The reported results refer to SISMA_Smile with the option activated (each set is partitioned in at least five subsets). SISMA_Smile outperformed Risotto ^{i}.
This dataset results to be the worst for Risotto, being more affected by the search space size of boxes and the by total number of structured motifs, than by the number of simple motif occurrences.
KAR4P
Risotto and Ex Motif outperformed SISMA_Smile and SISMA_Speller, respectively.
SISMA’s running times on biological datasets
SISMA_Smile  SISMA_Speller  

1^{ st } stage  2^{ nd } stage  1^{ st } stage  2^{ nd } stage  
UASHURS1_5  0.60 sec  23.28 sec  3.00 sec  0.24 sec 
UASHURS1_10  0.61 sec  358.43 sec  5.83 sec  26.70 sec 
KAR4P  0.18 sec  663.04 sec  6.20 sec  114.64 sec 
Conclusions
Our conclusion is that the 2stage approach cannot be turned down without due reflection. In this section we present arguments in support of this thesis and some guidelines that may help the user to choose the most efficient approach, depending on the problem instance s/he has to solve. Finally, we discuss possible improvements of SISMA.
Direct vs 2stage approaches
While implementing and working on SISMA we had the opportunity to reason on the advantages and disadvantages of the two approaches, besides running time.

Modularity. The 2stage approach is clearly more modular, being made of two (possibly completely) distinct software components. This makes implementation and maintenance easier. Possible optimizations and new variants can be implemented on both stages independently. Stage 1 might be optimized with new, more efficient simple motif extraction tools at negligible costs. The tool might be also easily adapted to extract simple motifs using different algorithms (not only exact), obtaining versions of the tool that tackle slightly different problems. Even more, the tool might be enhanced with the possibility for the user to choose the particular software to run in stage 1.Optimizations and variants might not be equally easy to implement under the direct approach, even though any conclusion to this end strictly depends on the particular software under consideration.

Parallelization. SISMA might be easily adapted to efficiently run on a multiprocessor machine. For stage 1 there may be the availability of a parallel version of the tool adopted (such as PSmile [43]), but otherwise simple motif space enumeration could be easily partitioned and mapped on distinct processors. Stage 2 (lists intersection) might be performed simultaneously on distinct processors as well, by using a technique analogous to the space saving option (with distinct processors accessing to distinct portions of data structures to avoid memory collisions).As before, the design of a fast parallel version of a direct enumeration algorithm cannot be guaranteed without taking the algorithm itself (and its logic) into consideration.

Exploratory search of structured motifs. The 2stage approach seems to better adapt to an “exploratory search” utilization for structured motif finding (e.g., through a Web interface). The user might be given the possibility to independently execute the two stages (or even upload the simple motifs occurrences); given the results of the first stage, s/he then might run the second stage several times with different input parameters (say, box orders and distance constraints). This feature might be crucial in real applications, where input parameters are difficult to determine with care.This adaptive use of the tool seems really harder to make in case of direct approach without paying a high price in terms of execution times.

Search space reduction. As already pointed out in the paper, since the direct approach looks at the structured motifs as whole, it is able to better handle instances characterized by large size search spaces of some boxes.
Direct tool comparison
The following observations might be used to guide the user toward the use of one approach/tool or the other.

SISMA vs Risotto.
The tests performed show that, when Risotto is faster than SISMA, one or both of the following conditions occur: (1) boxes have small size search spaces and a small number of simple motifs, (2) boxes with large size search spaces occur near the end (large box index) of the structured motifs. Usually, in all the other cases SISMA is faster. This is particularly evident when the structured motifs being found are composed of just one type (or few types) of boxes.Risotto fails for timeout (in our tests, 12 hours), SISMA fails for outofmemory mainly because of the first stage. Hence, according to the available hardware appropriate decisions on which tool to use can be made.Risotto does not output occurrence positions, but only their number.

SISMA vs Ex Motif.
Ex Motif terminates computation only for a small number of boxes and, under this circumstance, it is faster than SISMA only when boxes have small length and/or small number of admitted substitutions. It fails in any other more complex situation, making SISMA_Speller the only available tool for the frequent structured motif discovery problem.
SISMA implementation and improvements
We designed and developed a tool for exact structured motif discovery, based on the 2stage approach. Incorporating simple algorithmic ideas and data structures, SISMA is accurately crafted software which proved to compete very well with other published tools for the same problem. On a comprehensive benchmark (composed of both synthetic and real biological datasets) SISMA exhibited more than acceptable performances, even on a very limited power and memory machine. Running times never exceeded the imposed deadline of 12 hours and altogether the tool failed only on few very difficult problem instances (always due to memory shortage).
We can improve SISMA in some respects. As the experiments clearly show, a crucial issue is the possibly high memory consumption during stage 1, which may cause SISMA to fail. The positive side is that memory critical inputs can in general be detected and appropriate actions be taken. One such action consists simply of automatically activating the space saving option. Another option amounts to interleaving simple motif extraction and list intersection. Also, we plan to implement some simple algorithmic improvements that will help to reduce (to some extent) the search space of simple motifs. For instance, we can eliminate proper prefixes of sequences when extracting specific boxes; more precisely, given box i, we may cut prefixes that are as long as the sum of the lengths of previous boxes, plus the sum of the minimum distances between previous boxes.
Endnotes
^{a}Search time is sometimes reduced by further constraining the motif definition, as in Weeder [14].
^{b}Here we will refer the tool presented in [7] as Speller.
^{c}Actually, Ecomp uses an implementation of MITRAcount made by the authors of Ecomp, since MITRAcount itself was, and still is, not available.
^{d}Actually, SMILE is a suffix treebased tool designed to find structured motifs; however it can be used also for simple motif extraction (as a structured motif finder SMILE is outperformed by Risotto).
^{e}We discarded the option of implementing a postprocessing filter of SMILE output for efficiency reasons, and the option of modifying SMILE code as a more complicated one.
^{f}We always have run SISMA with the print option off since Risotto and Ex Motif have no possibility to output the starting positions of all occurrences. To be fair, Ex Motif actually prints the positions of the exact occurrences.
^{g}Observe that direct comparison on average running times might be of little significance, as they might vary considerably on input the same value of b (even resulting in charts that might show a non monotonic behavior).
^{h}For Risotto failure always means “runs beyond the deadline.”
^{i}On machines with more main memory the gap between the running times would have been even more favorable to SISMASmile.
Declarations
Acknowledgements
The authors wish to thank the anonymous reviewers for the constructive criticisms that helped to improve the manuscript, and Roberto Cavicchioli for the work done on a preliminary version of this endeavor.
Authors’ Affiliations
References
 Watson JD, Baker TA, Bell SP, Gann A, Levine M, Losick R: Molecular Biology of the Gene. 6/e: Pearson International Edition; 2007
 Werner T: Models for prediction and recognition of eukaryotic promoters. Mammalian Genome 1999, 10:168–175.PubMedView Article
 Sinha S, Tompa M: Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 2002, 30:5549–5560.PubMedView Article
 Lemon B, Tjian R: Orchestrated response: a symphony of transcription factors for gene control. Genes & Dev 2000, 14:2551–2569.View Article
 Wray GA: The evolutionary significance of cisregulatory mutations. Nature Rev Genet 2007, 8:206–216.PubMedView Article
 Bailey TL, Elkan C: The Value of Prior Knowledge in Discovering Motifs with MEME. Proceedings of 3rd International Conference on Intelligent Systems for Molecular Biology (ISMB ’95). 1995, 21–29.
 Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. Lecture Notes Comput Sci 1998, 1380:111–127.
 Li M, Ma B, Wang L: Finding Similar Regions in Many Strings. Proceedings of the 31th Annual ACM Symposium on Theory of Computing (STOC ’99). 1999, 473–482.
 Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262:208–214.PubMedView Article
 Brazma A, Jonassen I, Eidhammer I, Gilbert D: Approaches to the Automatic Discovery of Patterns in Biosequences. J Comput Biol 1998,5(2):277–304. http://citeseer.ist.psu.edu/article/brazma97approaches.html View Article
 van Helden J, André B, ColladoVides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Mol Biol 1998, 281:827–842. http://citeseer.ist.psu.edu/biol02extracting.html View Article
 Pevzner PA, Sze SH: Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology (ISMB ’00). 2000, 269–278.
 GuhaThakurta D, Stormo GD: Identifying target sites for cooperatively binding factors. Bioinformatics 2001, 17:608–621.View Article
 Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17:207–214.View Article
 Eskin E, Pevzner P: Finding composite regulatory patterns in DNA sequences. Proceedings of the 10th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB ’02). 2002, S354S363.
 Sinha S, Tompa M: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 2003, 31:3586–3588.PubMedView Article
 Leung HCM, Chin FYL: Generalized Planted (l, d)Motif Problem with Negative Set. Proceedings of the Workshop on Algorithms in Bioinformatics (WABI). 2005, 264–275.View Article
 Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, Makeev VJ: A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics 2005, 21:2240–2245.PubMedView Article
 Mendes N, Casimiro A, Santos P, SáCorreia I, Oliveira A, Freitas A: MUSA: a parameter free algorithm for the identification of biologically significant motifs. Bioinformatics 2006, 22:2996–3002.PubMedView Article
 D’haeseleer P: How does DNA sequence motif discovery work? Nat Biotech 2006,24(8):959–961. http://dx.doi.org/10.1038/nbt0806959 View Article
 Das MK, Dai HK: A survey of dna motif finding algorithms. BMC Bioinformatics 2007, 8:S21.PubMedView Article
 Stormo GD, Hartzell GW III: Identifying protein binding sites from unaligned DNA fragments. PNAS 1989, 86:1183–1187.PubMedView Article
 Wolfertstetter F, Frech K, Herrmann G, Werner T: Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm. Comput Appl Biosci 1996, 12:71–80.PubMed
 Tompa M: An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. Proceedings of 7th International Conference on Intelligent Systems for Molecular Biology (ISMB ’99). 1999, 262–271.
 Linhart C, Halperin Y, Shamir R: Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets. Genome Res 2008,18(7):1180–1189.PubMedView Article
 Zhang Y, Zaki MJ: EXMOTIF: efficient structured motif extraction. Algorithms Mol Biol 2006, 1:21.PubMedView Article
 Pisanti N, Carvalho A, Marsan L, Sagot MF: RISOTTO: Fast extraction of motifs with mismatches. Proceedings of the 7th Latin American Theoretical Informatics Symposium. 2006.
 Zhou J, Sander J, Lin G: Efficient composite pattern finding from monad patterns. Int J Bioinf Res Appl 2007, 3:86–99.View Article
 Tompa M, Li N, Bailey TL, Church GM, et. al: Assessingcomputational tools for the discovery of transcription factor binding sites. Nature Biotechnol 2005, 23:137–144. http://www.ncbi.nlm.nih.gov/pubmed/15637633 View Article
 McCreight EM: A SpaceEconomical Suffix Tree Construction Algorithm. J ACM 1976,23(2):262–272.View Article
 Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press; 1997.View Article
 Marsan L, Sagot MF: Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification. J Comput Biol 2000,7(3–4):345–362.PubMedView Article
 Carvalho A, Freitas A, Oliveira A, Sagot MF: A highly scalable algorithm for the extraction of cisregulatory regions. Proceedings of the AsiaPacific Bioinformatics Conference. 2005, 273–282.View Article
 Allali J, Sagot MF: The at most kdeep factor tree. Tech. 2004.
 Carvalho A, Freitas A, Oliveira A Sagot: Efficient Extraction of Structured Motifs Using Boxlinks. Proceedings of 11th Conference on String Processing and Information Retrieval. 2004, 267–268. http://citeseer.ist.psu.edu/viewdoc/summary?doi:View Article
 Leung CM, Chin FYL: Algorithms for Challenging Motif Problems. J Bioinf Comput Biol 2006, 4:43–58.View Article
 Buhler J, Tompa M: Finding motifs using random projections. J Comput Biol 2002, 9:225–242.PubMedView Article
 Davila J, Balla S, Rajasekaran S: Fast and Practical Algorithms for Planted (l,d)Motif Search. IEEE/ACM Trans Comput Biol Bioinf (TCBB) 2007,4(4):544–552.View Article
 Federico M, Valente P, Leoncini M, Montangero M, Cavicchioli R: An Efficient Algorithm for Planted Structured Motif Extraction. CompBio ’09: Proceedings of the 1st ACM Workshop on Breaking Frontiers of Computational Biology. 2009, 1–6.View Article
 Zhu J, Zhang M: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 1999, 15:607–611.PubMedView Article
 Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, SáCorreia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res 2006, 34:D446D451.PubMedView Article
 ThomasChollier M, Sand O, Turatsinze JV, Janky R, Defrance M, Vervisch E, Brohee S van Helden: RSAT: regulatory sequence analysis tools. Nucleic Acids Res 2008, 36:W119W127.PubMedView Article
 Carvalho AM, Freitas AT, Oliveira AL, Sagot MF: A parallel algorithm for the extraction of structured motifs. Proceedings of the 19th ACM Symposium on Applied Computing (SAC’04). 2004, 147–153.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.