Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments

Background Motif analysis methods have long been central for studying biological function of nucleotide sequences. Functional genomics experiments extend their potential. They typically generate sequence lists ranked by an experimentally acquired functional property such as gene expression or protein binding affinity. Current motif discovery tools suffer from limitations in searching large motif spaces, and thus more complex motifs may not be included. There is thus a need for motif analysis methods that are tailored for analyzing specific complex motifs motivated by biological questions and hypotheses rather than acting as a screen based motif finding tool. Methods We present Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in ranked lists of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact p-values for motif observations in sequences. Biases in motif distributions across ranked sequence lists are evaluated using random walks, Brownian bridges, or modified rank based statistics. A modular setup and fast analytic p value evaluations make Regmex applicable to diverse and potentially large-scale motif analysis problems. Results We demonstrate use cases of combined motifs on simulated data and on expression data from micro RNA transfection experiments. We confirm previously obtained results and demonstrate the usability of Regmex to test a specific hypothesis about the relative location of microRNA seed sites and U-rich motifs. We further compare the tool with an existing motif discovery tool and show increased sensitivity. Conclusions Regmex is a useful and flexible tool to analyze motif hypotheses that relates to large data sets in functional genomics. The method is available as an R package (https://github.com/muhligs/regmex). Electronic supplementary material The online version of this article (10.1186/s13015-018-0135-2) contains supplementary material, which is available to authorized users.


Modified Rank Sum Statistic
Let s 1 s 2 , . . . , s N be a list of sequences ranked according to an experimental setting, and let n i denote the number of observed motifs in s i . Under the null model, we assume n i ∼ po(λ i ), with λ i = − ln(1 − p), where p is the probability of observing at least one motif in the sequence. This follows from the probability mass function of the the Poisson distribution If we think of motif occurrences as a Poisson process, where our "time axis" is composed of consecutive intervals of length λ i ordered according to the experimental rank, motif occurrences are now, under the null hypothesis, uniformly distributed on the interval [0, λ.], where λ. = N i=1 λ i . We now calculate a score rm, corresponding to the mid point of the interval (sequence) in which a motif was observed.
We associate the score with motif occurrences in the sequence list. Under the null hypothesis, the probability of observing a motif in a sequence is proportional to the interval length, and thus the expectation is that motif scores are uniformly distributed across the whole interval [0, λ.]. Under the null model, the score for motif occurrences is thus normally distributed with mean λ./2 and variance λ. 2 /12. We calculate the test statistic

Brownian Bridge Method
This method is a re-implementation of the method developed by Jacobsen et al. [1] and recently implemented in cWords [2]. Our implementation differs in the calculation of the sequence dependent motif p-values. The method calculates the max value D of a running sum of mean adjusted log scores of the ranked sequence dependent p-values p i (SSP i ) where ls i = − ln(p i + α) and α is a score dampening factor of 10 −5 .ls is the mean of the log scores.
The running sum has the form of a bridge (starting and ending in 0), and the maximum value is compared to the theoretical distribution of the absolute maximum M of a Brownian bridge under the null model [3] where n is the number of sequences in the sequence list.

Random Walk Statistics
The random walk (RW) method is inspired by the way RW theory is used in the BLAST algorithm to estimate significance of observed homologies between sequences [4]. In that case matches and mismatches become steps in a walk. Here, it is the list of sequence specific p-values (SSPs) for a motif that are transformed into steps in a walk. The maximum value of the walk is compared to the probability distribution of maxima under the null model. Under the null model, sequence specific p-values (SSPs) are assumed to be uniformly distributed between 0 and 1. This situation is expected for (indefinitely) long random sequences. This condition, however, does not hold in general. For shorter sequences and longer motifs, the SSPs tend to discretize to small intervals specific for the number of observed motifs (see Figure S2). In particular, the p-value for observing "0 or more" motif occurrences is always 1. However, for random sequences, the cumulative distribution of discrete SSPs approximates that of the uniform distribution in the SSP values [5] (see Figure S2). Because the discrete SSPs depend highly on both the sequence and the motif, we normalize for the discrete effect. This is done by drawing p-values from the interval between the probability of observing n obs or more and n obs + 1 or more motifs. E.g. for a 6-mer not observed in a 1000 bases random sequence, we draw between p(n obs ≥ 0) = 1 and p(n obs ≥ 1) = 0.2. In this way the modified SSPs for random motifs in random sequences will follow the uniform distribution as required under the null model.
The modified p-values are now transformed into scores in a walk according to a scoring scheme based on the expected motif density when the motif is enriched (foreground) relative to the background. The scoring scheme ensures a negative drift under the null model by taking steps defined by the likelihood ratio: Here, , f g and bg are expected foreground (default = 0.2) and background (default = 0.05) motif densities in the sequence list.
These default values can be changed by the user.
We are interested in the distribution of maximum values that a random walk visits before absorption, under the null model. The random walk is a particular case of a discrete-time Markov chain. We calculate the distribution of interest by building a Markov model with states −1, 0, 1, . . . , M , where M is the observed maximum height of the walk, and assign transition probabilities according to the scoring scheme. From the resulting probability distribution of maxima under the null model, we derive the p-value for the observed maximum height. Alternatively, we can calculate the p-value by using a geometric-like (Gumbel) distribution approximation for the random walk maxima as given in [6].  Figure S3: Quantile-quantile plots for observed and expected p-values of a 6-mer motif run using Regmex in the Brownian Bridge setting. The comparison employed a data set of mir-430 overexpression in zebrafish that was also used in the original presentation of Sylamer. The gene rank was randomly re-sampled and the 10-mer AATGCCCGGT was spiked into the re-sampled sequence rank: A single motif was inserted 100 times randomly among the top 500 ranked sequences and two motifs were inserted 50 times among the first 100 sequences.