Regmex
In this study, we introduce Regmex, a motif analysis tool available as an R package. Regmex is designed with flexibility in mind to study rank correlation or clustering of motifs in an ordered list of sequences.
Briefly, it takes as input a list of ranked DNA sequences, which could come from a genomics experiment, and one or more motifs, each defined as a regular expression (RE) (Fig. 1). The output, in its simplest form, contains the rank correlation or clustering p-values (RCPs) for the input motifs. Alternatively, it is possible to get the underlying sequence specific p-values (SSPs) for motifs as well as count statistics, etc.
To illustrate the power of REs in a biological sequence context, we consider the following examples:
-
1.
A stem loop structure, TTTCNNNGAAA, found in the 3′UTR of many key inflammatory and immune genes [17]. Although this is a simple RE, it captures 64 11-mers in one expression, and Regmex reports the rank correlation p-value of the combined set.
-
2.
A G-quadroplex structure, GGGLGGGLGGGLGGG, where L = (N|NN|NNN|NNNN).This is found e.g., in telomeric regions [18].
-
3.
Any size open reading frame: ATG(NNN)*?((TGA)|(TAA)|(TAG)). This RE is an example of an enormous set, which would be difficult to obtain without an RE.
An advantage of REs is that they can capture any set of simple motifs. For example, a set of experimentally verified binding sequences can be expressed as a single RE, with matching and p-value evaluation based on exactly this set.
Sequence specific motif p-value calculation
Regmex calculates a motif rank correlation p-value (RCP) based on sequence specific p-values for observing the motif the observed number of times (nobs) or more. Briefly, from a deterministic finite state automaton (DFA) associated with the regular expression motif, we derive a sequence specific transition probability matrix (TPM), which is used to build an embedded TPM (eTPM) specific for nobs (Fig. 2). The SSP is subsequently read from the eTPM raised to the power of the sequence length. These steps are explained in more detail below.
Deterministic finite state automaton
For any regular expression, the corresponding DFA can be built, which is the initial step in the SSP calculation (Fig. 2b). The DFA starts in an initial state, accepts symbols (i.e. nucleotides) on the edges and moves through the states. The end state corresponds to having observed the RE. The DFA used here recognizes an extended regular expression, as described in [19]. The routine used to build the DFA for a given regular expression is implemented in Java, using [20], and supports standard regular expression operations (concatenation, union and Kleene star) and overlaps.
Markov embedding
The DFA graph structure can also be thought of as a Markov model, where instead of accepting symbols, it generates symbols on the edges with probabilities corresponding to the base frequencies in a given sequence. The Markov model can be represented by a transition probability matrix (TPM), which holds the probabilities of moving between states of the DFA upon observing bases from the sequence (Fig. 2c). The TPM raised to the power of n, TPMn, holds the probability of moving between states after observing n bases.
We are interested in the SSP and thus need to have a probability model that takes nobs into account. Regmex does this by making a model expansion using the DFA as a template. We refer to this as an embedded DFA (eDFA) (Fig. 2d). Specifically, the template DFA is copied nobs times and outgoing edges of the end state(s) of the DFA template are moved to the corresponding states in the next template copy. This effectively allows the embedded model to count how many times the RE motif has been observed. The final state of the eDFA is absorbing, so no further motif observations are scored.
Again, the eDFA can be thought of both as an automaton accepting symbols or as a Markov model generating symbols on edges. As above, Regmex constructs a transition probability matrix (eTPM) based on the eDFA (Fig. 2e). The eTPMn holds probabilities of moving between states of the eDFA given a random sequence of length n with the observed base frequencies (Fig. 2f). We can now extract the probability distribution of the RE motif in a given sequence by reading the row corresponding to the initial state (0,1) in the eTPMn. In particular the probability of observing the motif nobs number of times or more (the sequence specific p-value, SSP) can be read in the final state column of the initial state row (red field in Fig. 2f).
Motif rank correlation p-value
In the downstream analysis, Regmex uses the calculated SSPs when calculating the RCP. In Regmex, we have implemented three methods for evaluating motif rank correlation or motif clustering, which have different strengths. These methods are based on Brownian bridge (BB), random walk (RW), and modified sum of rank (MSR) statistics. The concepts underlying these statistics are illustrated on a short list of 50 sequences with an enriched motif (Fig. 3). The bias in the distribution of motifs may vary depending on the analyzed problem and the choice of method used to evaluate the correlation may differ in detection power. E.g., one test may be well-powered for detecting long motifs occurring rarely in the sequence list and another for detecting frequent short motifs.
Brownian bridge
This method is a re-implementation of the method developed by Jacobsen et al. [21] and recently implemented in cWords [2]. Our implementation differs in the calculation of the sequence specific p-values (SSPs) and in how we calculate the rank correlation p-value. The method calculates the max value D of a running sum of mean adjusted log scores of the SSPs
$$r_{i} = r_{i - 1} + ls_{i} - \overline{ls} ,$$
where \(ls_{i} = - ln\left( {p_{i} + \alpha } \right),\)\(\alpha\) is a score dampening factor set to 10−5 and \(\overline{ls}\) is the mean of the log scores.
The running sum starts and ends in zero and hence is a Brownian bridge under the null model (see Fig. 3b). We identify the rank correlation p-value from the analytical distribution of max values of a Brownian bridge [22].
$$Pr\left( {M \ge m} \right) = 1 - 2\sum\limits_{k = 1}^{\infty } {\left( { - 1} \right)^{k} e^{{ - 2k^{2} m^{2} /n}} },$$
where \(n\) is the number of sequences in the sequence list.
Random walk statistics
The random walk (RW) method is similar to the use of random walks in the BLAST algorithm [23]. This method is sensitive to clustering of motifs anywhere in the sequence list. The sequence specific p-values (SSPs) for a motif are transformed into steps in a walk (Fig. 3c). Under the null model the motif is not enriched and the SSPs follow the uniform distribution. The SSPs are transformed into steps according to a scoring scheme where small p-values (SSPs) correspond to a positive step and large p-values correspond to a negative step. The exact scoring scheme is based on assumed motif densities in the foreground relative to the background, so that higher motif densities give rise to higher walk values in local regions of the sequence list. The RW starts over from zero every time it reaches the lower bound of − 1. This makes the RW method sensitive to local runs of enriched motifs in the sequence list.
For significance evaluation, we find the probability of a walk with at least as high a max value under the null distribution. We do this using a recursion on an analytic expression for the max value distribution of random walks (see Additional file 1: Methods for details). Alternatively, we can use a geometric-like distribution (Gumbel distribution) as an approximation for the max value distribution [24].
Modified sum of ranks statistics
The modified sum of ranks (MSR) method is based on the idea of using a rank sum test to determine a rank bias in motif containing sequences. Rather than summing ranks, MSR uses a sum of scores specific for the sequences and motif. The scores are based on the sequence specific p-values, which eliminates bias from sequence composition and length. All motif observations are associated with a score that reflect the probability of the motif being found one or more times in the sequence, as well as the rank of the sequence. The score can be considered as a rank normalized for the probability of observing motifs in the sequence. In detail, let \(s_{1} ,s_{2} , \ldots ,s_{N}\) be a list of sequences ranked according to an experimental setting, and let \(n_{i}\) denote the number of observed motifs in \(s_{i}.\) Under the null model, we assume \(n_{i}\) ~ \(po\left( {\lambda_{i} } \right)\) with \(\lambda_{i} = - ln\left( {1 - p_{i} } \right)\), where \(p_{i}\) is the probability of observing at least one motif in the sequence. This follows from the probability mass function of the Poisson distribution,
$$Pr\left( {X = k} \right) = \frac{{\lambda^{k} }}{k!}e^{ - \lambda } ;$$
since \(p_{i} = 1 - Pr\left( {X = 0} \right) = 1 - e^{{ - \lambda_{i} }}\) we have \(\lambda_{i} = - ln\left( {1 - p_{i} } \right)\).
If we think of motif occurrences as a Poisson process, where our “time axis” is composed of consecutive intervals of length \(\lambda_{i}\) ordered according to the experimental rank, motif occurrences are now, under the null hypothesis, uniformly distributed across the whole interval \(\left[ {0,\lambda .} \right]\) where \(\lambda . = \mathop \sum \nolimits_{i = 1}^{N} \lambda_{i}\).
We now calculate a score \(r_{m}\), corresponding to the mid point of the interval (sequence) in which a motif was observed.
$$r_{m} = \frac{{\mathop \sum \nolimits_{i = 1}^{m - 1} \lambda_{i} + \mathop \sum \nolimits_{i = 1}^{m} \lambda_{i} }}{2}.$$
We associate the score with motif occurrences in the sequence list. Under the null hypothesis, the probability of observing a motif in a sequence is proportional to the interval length, and thus the expectation is that motif scores are uniformly distributed across the whole interval \(\left[ {0,\lambda .} \right]\). Under the null model, the score for motif occurrences is thus normally distributed with mean \(\lambda ./2\) and variance \(\lambda .^{2} /12\).
We calculate the test statistic
$$W = \frac{\sqrt n }{\lambda .}\left( {\frac{{\mathop \sum \nolimits_{i = 1}^{N} n_{i} r_{i} }}{n.} - \frac{\lambda .}{2}} \right) \sim {\mathcal{N}}\left( {0,\frac{1}{12}} \right)$$
where \(n. = \mathop \sum \nolimits_{i = 1}^{N} n_{i}\). The motif correlation p-value is \(p = 2\left[ {1 - \varPhi \left( {\left| W \right|} \right)} \right]\).
The MSR method is faster than the others because we need only the probability of observing one or more motifs in the sequence, which can be read from the TPM of the DFA (Fig. 2c) modified so that the end state is absorbing, and thus we do not need to construct the larger embedded model.