Direct vs 2-stage approaches to structured motif finding

Background
The notion of DNA motif is a mathematical abstraction used to model regions of the DNA (known as Transcription Factor Binding Sites, or TFBSs) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, DNA structured motifs are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulation processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or simple) motifs, separated by a variable, but somewhat constrained, number of "irrelevant" base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and subsequent combination of simple motifs.

Results
We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then it combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach with respect to direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, and it still exhibited good performance in most of the cases in which it was inferior.

Conclusions
A reflection on the results obtained led us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting both computations as soon as the first one completes.
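The two stages described in the abstract can be illustrated with a deliberately naive sketch. It is not SISMA's implementation: stage 1 below brute-forces every candidate string over the DNA alphabet, which is only feasible for very short boxes, and stage 2 handles two boxes (dyads) only. The function names, the quorum parameter (minimum number of sequences that must contain the motif), and the gap range are assumptions introduced for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurrences(motif, seq, e):
    """Start positions where motif matches seq with at most e substitutions."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1)
            if hamming(motif, seq[i:i + k]) <= e]

def simple_motifs(seqs, length, e, quorum):
    """Stage 1 (toy version): enumerate all strings of the given length and
    keep those occurring, with at most e errors, in at least quorum sequences.
    Returns {motif: [positions in seq 0, positions in seq 1, ...]}."""
    found = {}
    for cand in map("".join, product("ACGT", repeat=length)):
        occ = [occurrences(cand, s, e) for s in seqs]
        if sum(1 for o in occ if o) >= quorum:
            found[cand] = occ
    return found

def dyads(first, second, len1, gap_min, gap_max, quorum):
    """Stage 2 (toy version): pair simple motifs whose occurrences are
    separated by a gap in [gap_min, gap_max] in at least quorum sequences."""
    result = []
    for m1, occ1 in first.items():
        for m2, occ2 in second.items():
            support = 0
            for pos1, pos2 in zip(occ1, occ2):
                starts2 = set(pos2)
                if any(p + len1 + g in starts2
                       for p in pos1
                       for g in range(gap_min, gap_max + 1)):
                    support += 1
            if support >= quorum:
                result.append((m1, m2))
    return result

# Hypothetical toy input: box "ACGT", a gap of 2, then box "CAT".
seqs = ["TTACGTGGCATTT", "ACGTCCCATAAAA", "GGACGTTTCATGG"]
first = simple_motifs(seqs, length=4, e=0, quorum=3)
second = simple_motifs(seqs, length=3, e=0, quorum=3)
result = dyads(first, second, len1=4, gap_min=2, gap_max=2, quorum=3)
```

Keeping the two stages as separate functions mirrors the modularity argument made above: the simple-motif extractor can be swapped or run independently for each box without touching the combination step.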

• We tested the influence of difficult boxes, according to their position in the planted structured motif, on the tools' performance.
Planted structured motifs with boxes having the same length and errors
We generated datasets using, for each dataset, a fixed template (ℓ, e) for each box, where ℓ is the box length and e the number of allowed errors. The value of ℓ varied from 9 to 15 in different experiments, with e ∈ {1..ℓ/3}, and with the number of boxes ranging from 2 to 10.
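A minimal sketch of how a sequence with a planted structured motif could be generated is given below. The source does not describe the generator, so everything here is an assumption made for illustration: the uniform background over A, C, G, T, the sequence length, the gap ranges, and the names `mutate` and `plant`.

```python
import random

def mutate(box, max_errors, rng):
    """Return a copy of box with at most max_errors substitutions."""
    chars = list(box)
    n_err = rng.randint(0, max_errors)
    for i in rng.sample(range(len(chars)), n_err):
        # Substitute with a different base so each chosen position is an error.
        chars[i] = rng.choice([c for c in "ACGT" if c != chars[i]])
    return "".join(chars)

def plant(seq_len, boxes, max_errors, gaps, rng):
    """Build one random sequence containing a mutated structured motif.

    boxes      : list of box strings, in order
    max_errors : list of per-box error bounds
    gaps       : list of (min, max) gap lengths between consecutive boxes
    Returns (sequence, start position of the first box).
    Assumes seq_len is at least as long as the planted motif.
    """
    pieces = []
    for i, (box, e) in enumerate(zip(boxes, max_errors)):
        if i > 0:
            g = rng.randint(*gaps[i - 1])
            pieces.append("".join(rng.choice("ACGT") for _ in range(g)))
        pieces.append(mutate(box, e, rng))
    motif = "".join(pieces)
    start = rng.randint(0, seq_len - len(motif))
    background = [rng.choice("ACGT") for _ in range(seq_len)]
    background[start:start + len(motif)] = motif
    return "".join(background), start

# Hypothetical example: two boxes of length 10 with up to 3 errors each.
rng = random.Random(0)
boxes = ["ACGTACGTAC", "TTGACCATGA"]
seq, start = plant(200, boxes, [3, 3], [(5, 8)], rng)
```

Passing an explicit `random.Random` instance keeps each dataset reproducible, which matters when the same instances are fed to several tools.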
For each choice of the parameters we performed 20 runs for each tool, computing the average running time and the corresponding 95% confidence interval.
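The text does not say how the confidence interval was obtained. One common choice for 20 runs, assumed here, is the Student t interval around the sample mean; the constant below is the two-sided 95% critical value for 19 degrees of freedom, and the function name is hypothetical.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 19 degrees of freedom (n = 20 runs).
T_975_DF19 = 2.093

def mean_ci95(times, t_crit=T_975_DF19):
    """Return (mean, lower, upper) for a list of running times.

    Assumes a t-based interval; pass a different t_crit if the number
    of runs is not 20.
    """
    mean = statistics.mean(times)
    half_width = t_crit * statistics.stdev(times) / math.sqrt(len(times))
    return mean, mean - half_width, mean + half_width

# Hypothetical running times in seconds for 20 runs of one tool.
times = [10.0] * 10 + [12.0] * 10
mean, lower, upper = mean_ci95(times)
```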
As a general conclusion we may notice that, under these experimental conditions, SISMA tools outperformed the competitors in essentially all datasets on which tools ended computation within the imposed deadline. Figure 1 reports the results obtained for the pairs (10, 3), (11, 3), (12, 3), (13, 3), and (15, 5) (vertical bars represent the 95% confidence interval of the average running times). The situation gets even worse for Risotto when the search space to be explored is large, as for instances characterized by a large number of expected motifs. In particular, Risotto already times out with two boxes in the following cases: (9, 3), (10, 3), (12, 4), (15, 4) and (15, 5). This is consistent with the fact that Risotto's running time is exponential in the box length and in the number of errors. SISMA_Smile goes out of memory, already for two boxes, only for instances characterized by the pairs (9, 3), (10, 3) and (12, 4), even with the space-saving implementation.
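One rough way to get a feel for how the search space grows with box length and number of errors is to count the strings within a given Hamming distance of a box. This is an illustration added here, not the complexity analysis of Risotto or SISMA; the function names are hypothetical. For a box of a given length over a 4-letter alphabet, the number of strings differing in at most `errors` positions is the sum, for i from 0 to `errors`, of C(length, i) · 3^i.

```python
from math import comb

def neighborhood_size(length, errors, alphabet=4):
    """Number of strings within Hamming distance `errors` of a fixed string."""
    return sum(comb(length, i) * (alphabet - 1) ** i
               for i in range(errors + 1))

def neighborhood_fraction(length, errors, alphabet=4):
    """Share of all strings of that length lying in the neighborhood."""
    return neighborhood_size(length, errors, alphabet) / alphabet ** length

# Pairs (length, errors) taken from the experiments discussed above.
for length, errors in [(9, 3), (10, 3), (12, 4), (13, 3), (15, 5)]:
    size = neighborhood_size(length, errors)
    frac = neighborhood_fraction(length, errors)
    print(f"({length}, {errors}): {size} strings, fraction {frac:.2e}")
```

For the pair (9, 3) the neighborhood contains 2620 strings, about 1% of all 4^9 strings of length 9, so a large share of random 9-mers match some occurrence by chance.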
SISMA_Speller vs ExMotif. In Table 1 we give a broad summary of the comparison between these tools, reporting whether the tools terminate and which tool is the fastest. SISMA fails only for (9, 3). As can easily be seen, in the majority of runs SISMA_Speller outperforms ExMotif. Exceptions occur only in a few particular cases, with a small number of boxes and few substitutions allowed. ExMotif becomes significantly slower than SISMA_Speller when the structured motifs have three boxes or more, and in general when any of the input parameters (number of boxes, box length, and number of allowable substitutions) increases.
SISMA_Speller always ends its computation, with the exception of the instances characterized by the pair (9, 3). Its running time is generally below one minute for easy instances, and never above five minutes for hard instances (e.g., 10 boxes characterized by length 10 and 3 allowed errors). In the case of (9, 3) simple motifs, SISMA_Speller fails even with two boxes and even if the output is produced in slices, i.e., with the space-saving option on. This failure is due to the high number (more than 10^4) of simple motifs, which makes the program run out of memory when looking for dyads.

Planted structured motifs with boxes having different lengths and errors
To stress tools with particularly difficult instances, we ran another set of synthetic experiments in which we randomly planted simple motifs in the sequences with length varying from 7 to 18, and errors in the range [1..ℓ/3], where ℓ is the box length. SISMA_Smile was run with the box index selection option activated.
As in the experiments presented in the regular paper, we compared the tools' performance using running times and counting how many times one tool outperformed the other.
Risotto vs SISMA_Smile. In this experimental setting, running times varied considerably from instance to instance, even for the same number of boxes. Compared with the experiments shown in the paper (Section Test on synthetic data), this is much more evident here. This was to be expected since some instances here are much more challenging. In particular, we notice that Risotto starts failing already from three boxes, while SISMA_Smile exhibits a slightly better behavior (with the exception of five boxes, where we observe many failures for both tools, meaning that several hard instances were randomly selected).
[Figure: experiments with variable boxes.]
Moreover, SISMA_Smile fails for memory shortage only, and always because of the first stage (i.e., too many simple motifs were found and they could not fit into primary memory). This kind of failure usually happens much earlier than the established deadline of 12 hours.
On the other hand, Risotto always fails due to time-out, hence 12 hours were actually a lower bound on running time for those instances. There are two main reasons why Risotto fails: (1) in one of the first positions of the structured motif there is a box with a large search space (but not necessarily a large number of motifs); (2) the output is very large.
In the first case, if the actual number of simple motifs found is not large, SISMA_Smile is usually slow, but it nonetheless ends computation before time-out. In the latter case, the second stage of SISMA_Smile is usually long, but even in these cases it takes no more than 12 hours. As previously pointed out, the worst case for SISMA_Smile usually happens when the number of simple motifs found in the first stage is large, but not large enough to cause out-of-memory.
The typical case in which Risotto outperforms SISMA_Smile is when there is a box, in the last positions of the structured motif, with a large search space and a small number of simple motifs.
Risotto drastically reduces the search space, while the running time of SISMA_Smile is almost entirely dedicated to the extraction of simple motifs for such boxes. Figure 3 and Figure 4 show best, worst, and average running times and standard deviation when considering all runs and when considering only runs in which both tools end computation. In the first case (all runs), 12 hours are counted as a lower bound for Risotto's running time when it fails due to time-out; when SISMA_Smile fails for out-of-memory, we considered the actual run time (never exceeding 12 hours). Again, as Risotto is the tool failing the most, the differences in the charts are more evident for Risotto than for SISMA_Smile.
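The two ways of aggregating running times described above (all runs, with a time-out counted as a 12-hour lower bound, versus only the runs in which both tools ended) can be sketched as follows. The record layout, the function names, and the use of the sample standard deviation are assumptions made for illustration; the source does not specify them.

```python
import statistics

TIMEOUT_S = 12 * 3600  # the 12-hour deadline, in seconds

def summarize(times):
    """Best, worst, mean and standard deviation of running times (seconds)."""
    if not times:
        return None
    return {
        "best": min(times),
        "worst": max(times),
        "mean": statistics.mean(times),
        # Sample standard deviation (an assumption); undefined for one run.
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }

def compare(runs):
    """runs: list of dicts {"risotto": (seconds, ok), "sisma": (seconds, ok)},
    where ok is False if the tool failed (time-out or out-of-memory)."""
    all_runs = {"risotto": [], "sisma": []}
    both_ended = {"risotto": [], "sisma": []}
    for run in runs:
        r_time, r_ok = run["risotto"]
        s_time, s_ok = run["sisma"]
        # A Risotto time-out counts as the deadline, a lower bound on its time.
        all_runs["risotto"].append(r_time if r_ok else TIMEOUT_S)
        # A SISMA failure keeps the time elapsed before it ran out of memory.
        all_runs["sisma"].append(s_time)
        if r_ok and s_ok:
            both_ended["risotto"].append(r_time)
            both_ended["sisma"].append(s_time)
    return ({tool: summarize(t) for tool, t in all_runs.items()},
            {tool: summarize(t) for tool, t in both_ended.items()})

# Hypothetical runs: Risotto times out on the second instance.
runs = [
    {"risotto": (120.0, True), "sisma": (40.0, True)},
    {"risotto": (float(TIMEOUT_S), False), "sisma": (300.0, True)},
    {"risotto": (60.0, True), "sisma": (90.0, True)},
]
all_stats, both_stats = compare(runs)
```

Because time-outs enter the first summary only as lower bounds, the averages reported for the failing tool under "all runs" underestimate its true running time.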
Observe that, in Figure 3.a and Figure 4.a', for ten boxes, the best case for Risotto changes (it is worse in the second figure), meaning that the best case for Risotto is one in which SISMA fails. Analogously, for the worst case (Figure 3.b and Figure 4.b'), there is a difference for SISMA for five and six boxes. In these cases we have shorter runs, meaning that the instances on which Risotto failed are somehow difficult also for SISMA (which, nevertheless, ended computation before 12 hours). As in this set of experiments we have a larger number of failures, we also investigated the behavior of the tool that did not fail when the other did. Figure 5 shows best, worst, and average running times (in seconds) and standard deviation of Risotto and SISMA on the runs in which the other tool failed. When only the best running time is reported, we have only one run. As we have a small number of runs to analyze, averages and variances have been computed considering all runs.
search space) takes 12 min and 24 sec; on average, runs take around 100 sec when the number of boxes is five or more.
As we can see, all the datasets cause SISMA_Smile to spend the vast majority of the running time in stage 1. The latter lasts almost the same time independently of the box order, and altogether SISMA_Smile's running time varies by only 20 sec from the best to the worst run¹. On the contrary, the
¹ The differences depend on the fact that the planted motifs for a given box (ℓ, e) (box length ℓ, e allowed errors) were possibly not the same.