Mutual enrichment in ranked lists and the statistical assessment of position weight matrix motifs

Background Statistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. State of the art methods study fixed motifs in ranked lists of sequences. More flexible models such as position weight matrix (PWM) motifs are more challenging in this context, partially because it is not clear how to avoid the use of arbitrary thresholds. Results To assess the enrichment of a PWM motif in a ranked list we use a second ranking on the same set of elements induced by the PWM. Possible orders of one ranked list relative to another can be modelled as permutations. Due to sample space complexity, it is difficult to accurately characterize tail distributions in the group of permutations. In this paper we develop tight upper bounds on tail distributions of the size of the intersection of the top parts of two uniformly and independently drawn permutations. We further demonstrate advantages of this approach using our software implementation, mmHG-Finder, which is publicly available, to study PWM motifs in several datasets. In addition to validating known motifs, we found GC-rich strings to be enriched amongst the promoter sequences of long non-coding RNAs that are specifically expressed in thyroid and prostate tissue samples and observed a statistical association with tissue specific CpG hypo-methylation. Conclusions We develop tight bounds that can be calculated in polynomial time. We demonstrate utility of mutual enrichment in motif search and assess performance for synthetic and biological datasets. We suggest that thyroid and prostate-specific long non-coding RNAs are regulated by transcription factors that bind GC-rich sequences, such as EGR1, SP1 and E2F3. We further suggest that this regulation is associated with DNA hypo-methylation.


Introduction
Modern data analysis often faces the task of extracting characteristic features from sets of elements singled out according to some measurement. In molecular biology, for example, an experiment may lead to measurement results pertaining to genes and then questions are asked about the properties of genes for which these were high or low. This is an example, of course, and the set of elements does not have to be genes. They can be genomic regions, proteins, structures, etc. A central technique for addressing the analysis of characteristic properties of sets of elements is statistical enrichment. More specifically -the experiment results are often representable as ranked lists of elements and we then seek enrichment of other properties of these elements at the top or bottom of the ranked list. GSEA [29], for example, is a tool that addresses characteristic features of genes that are found to be differentially expressed in a comparative transcriptomics study. GOrilla [6] addresses GO terms enriched in ranked lists of genes where the ranking can be, for example, the result of processing differential expression data or of correlations computed between genomic DNA copy number and expression [19], [2], [5]. FATIGO [21] is also a tool that is useful in the context of analyzing GO terms in ranked lists of genes. DRIMust [15], [16] searches for sequence motifs that are enriched, in a statistically significant manner, in the top of a ranked list of sequences, such as one produced by techniques like ChIP-seq.
All the aforementioned tools utilize a statistical approach that is based on assessing enrichment of an input set in an input ranked list by assessing the enrichment obtained at various cutoffs applied to the ranked list. It is often the case, however, that two quantitative properties need to be compared to each other. For example, when the elements are genes, we may have measured differential expression values, as well as measured ChIP-seq signals. We are therefore interested in assessing mutual enrichment in two ranked lists. Another example consists of one ranking according to differential expression and one according to prediction scores for miRNA targets. miTEA [25] addresses this latter case by statistically assessing the enrichment of miRNA targets in a ranked list of genes (also see [8]). To address mutual enrichment in two ranked lists over the same set of N elements, miTEA [25] performs analysis on permutations. Mutual enrichment in the top of two ranked lists can be simplified, from a mathematical point of view, by arbitrarily setting the indices of one list to the identity permutation (1, 2, ...,N ) and treating the other list as a permutation π over these numbers. For the purpose of assessing the intersection of the top of the two ranked lists in a data driven manner, miTEA asks which prefix [1, . . . , n 1 ] is enriched in the first n 2 elements of that permutation, π = π(1), ..., π(N). The statistics introduced by miTEA is called mmHG (min-min-Hyper-Geometric). A variant of mmHG is explained in detail in Section 2 of the current manuscript.
Statistics in the group of permutations S N is often difficult because the number of entities to be considered by any null model is N ! Direct exhaustive calculation of tail distributions over S N is therefore impractical for all but very small values of N. This difficulty is addressed by several heuristic techniques. Mapping into continuous spaces, such as in [18], has proven useful in certain cases but not for studying large deviations. In the case of enrichment statistics that focuses on the top of the permutation and seeks to assess extreme events, such as mmHG, we prefer to use bounds on tail probabilities. Tail probabilities are useful constructs when applied to analyzing molecular biology measurement data as they enable statistical assessment of observed results.
In this work we derive a tight bound on the tail probabilities of the mutual enrichment at the top of two random permutations uniformly drawn over S N and demonstrate the utility of this approach in the context of flexible motif discovery. Our bounds are computable in polynomial time and potentially add to the accuracy of reported position weight matrix (PWM) motifs for nucleic acid sequences.

Mutual Enrichment in Ranked Lists -the mmHG Statistics
The mmHG statistics [25] is a generalization of the mHG statistics [6], [7], [26], [28]. While the mHG statistics quantifies the enrichment level of a set of elements at the top of a ranked list of elements of the same type, the mmHG statistics quantifies the level of mutual enrichment in two ranked lists over the same set of elements. While any parametric or non-parametric correlation statistics (e.g. Spearman's correlation coefficient), that takes the same input, calculates the overall agreement between the two ranked lists, the mmHG statistic focuses only on agreement at the top of the two ranked lists. mmHG counts elements common to the top of both lists, without predefining what top is. Its intended output is the probability for observing an intersection at least as large in two randomly ranked lists (the enrichment mmHG P-value). In this section we describe the mmHG statistics and in later sections we suggest a tight bound for the p-value. Our definition of the mmHG statistic varies slightly from that of Steinfeld et al. [25].
Mutual enrichment in the top of two ranked lists can be simplified, from a mathematical point of view, by arbitrarily setting the indices of one list to the identity permutation (1, 2, ...,N ) and treating the other list as a permutation. Details of this transform are given in Section 2. 3. We now define mmHG for the simple case of one permutation. Consider a permutation π = π(1), ..., π(N) ∈ S N -the group of all permutations over the numbers 1, ..., N. mmHG is a function that takes π and calculates two numbers 1 ≤ n 1 , n 2 ≤ N such that the observed intersection between the numbers 1, ..., n 1 and the first n 2 elements of π − π(1), . .., π(n 2 ) -is the most surprising in terms of the hypergeometric pvalue. Additionally, mmHG further calculates this aforementioned p-value.
Formally, given π ∈ S N and for every 1 ≤ n 1 , n 2 ≤ N, let b π (n 1 , n 2 ) be the size of the intersection of 1, ..., n 1 with π(1), . .., π(n 2 ). Set mmHG score(π) = min 1≤n1≤N min 1≤n2≤N hgt (N, n 1 , n 2 , b π (n 1 , n 2 )) where hgt is the tail distribution of an hypergeometric random variable: The mmHG score cannot be considered as a significance measure, due to the multiple testing involved in finding n 1 and n 2 . A simple way to correct an mmHG score s for multiple testing and report a p-value bound would be to use the Bonferroni correction. That is done by multiplying s by the number of multiple tests conducted which is N 2 . Therefore: In Section 3 we present significantly tighter bounds.

PWM Motifs
Data produced by techniques such as ChIP-seq [14], ChIP-exo [20], CLIP [13], PAR-CLIP [9] and others are readily representable as ranked lists of sequences, where the ranking is according to measured binding affinity. Computational tools and approaches to motif discovery form part of the data analysis workflow that is used to extract knowledge and understanding from this type of studies. We are often interested in sequence motifs that are observed to be enriched in sequences where strong binding affinity is measured. A position weight matrix (PWM) is a commonly used representation of motifs in biological sequences [24], [27], [11]. This representation is more faithful to biology than representation by exact words. A PWM is a matrix of score values that gives a weighted match to any given substring of fixed length. It has one row for each symbol of the alphabet, and one column for each position in the pattern. The score assigned by a PWM to a substring S = S 1 ...S K is defined as K j=1 m sj ,j , where j represents position in the substring, S j is the symbol at position j in the substring, and m α,j is the score in row α, column j of the matrix. In other words, a PWM score is the sum of position-specific scores for each symbol in the substring. This definition can be generalized to yield a score for a sequence S = S 1 ...S M longer than the PWM by calculating max 1≤i≤M −K+1 K j=1 m si+j−1,j . Alternatively, an enhanced model that takes into account multiple occurrences of the PWM in the sequence can be applied by summing over sufficiently strong occurrences of the PWM or by other more sophisticated approaches [22].

mmHG Statistics for PWM Motifs
Given a set of sequences that were tested in a high throughput experiment such as ChIP-seq [14], CLIP [13] and others, they can be ranked according to the measured binding affinities, yielding a ranked list L 1 . Since usually we are interested in finding motifs amongst sequences having strong binding affinities, we actually search for motifs that are more prevalent at the top of this list. It is clear that any algorithm for de-novo flexible motif search would need to evaluate candidate PWMs. Given a PWM which we want to assess, the sequences can also be ranked according to their PWM scores, yielding another ranked list L 2 , different from L 1 . A significant PWM motif would yield significant scores for sequences having strong binding affinities. Therefore, the question of PWM motif discovery from ranked experimental data can be formulated as quantifying the mutual enrichment level for the two ranked lists L 1 and L 2 . Given two ranked lists L 1 and L 2 over the universe of N sequences, they can be transformed into two respective permutations, π 1 = (π 1 (1), ..., π 1 (N )) and π 2 = (π 2 (1), ..., π 2 (N )). The relative permutation π, of π 2 w.r.t. π 1 , is defined by π(π 1 (j)) = π 2 (j), for every j = 1, ..., N or simply, using the operations in the group S N : π = π 2 · π −1 1 . Using the relative permutation π, we can represent the mutual enrichment of the top parts of L 1 and L 2 as mmHG score(π), defined above.

Estimation of the mmHG p-value -Introducing First Upper Bound
Given an mmHG score s, observed in analyzing real measurement data, we would like to assess the statistical significance of this observation. Assuming endless computational power, we would enumerate all permutations and calculate the mmHG score for each, in order to characterize the distribution of mmHG as a random variable over S N . The p-value for s is then simply: The number of permutations having mmHG score ≤ s N ! Since the number of permutations is huge, the process described above is very far from feasible. Therefore, we seek a computationally tractable upper bound, preferably tight.
A trivial upper bound is the Bonferroni corrected mmHG score defined by s · N 2 . A more subtle upper bound was suggested by Steinfeld et al. [25] and is briefly described in Section 3. 3. In this work we introduce a tighter bound that is polynomially computable.
We will next describe an intuitive upper bound and later refine it to produce a tighter bound. Our input comprises an mmHG score s, and the total number of elements N. The output will be an upper bound for the p-value. The efficiency of our approach relies on enumerating all possible hgt scores rather than enumerating all permutations in S N . This approach is computationally efficient as hgt is a function of four input parameters: N, n 1 , n 2 , and b. Given N, there are O(N 3 ) possible combinations of n 1 , n 2 and b. Next, all is left to do is to determine how many permutations stand behind each hgt score. To this end, we will define the function Λ(N, n 1 , n 2 , b) to be the number of permutations for which it holds that out of the first n 2 entries, b of them are taken from the range [1, . . . , n 1 ]. This formulation is equivalent to counting permutations for which we attain, at some point, the value hgt(N, n 1 , n 2 , b), had we taken the exhaustive approach. Λ(N, n 1 , n 2 , b) can be represented as: A straightforward upper bound for the number of permutations in S N having mmHG score better than s follows: From which an upper bound is easily derived: By algebraic manipulations we get: This upper bound is simple and requires O(N 3 ) hgt calculations. An hgt calculation takes O(N ) time, assuming binomial coefficients can be calculated in O(1) time, for example by using Stirling's approximation [1]: √ 2πn( n e ) n 1 e 12n+1 ≤ n! ≤ √ 2πn( n e ) n 1 e 12n , which is tight for large factorials.

A Refined Upper Bound for the p-value
The upper bound introduced in the previous section counts the number of permutations for which the value hgt(N, n 1 , n 2 , b) is calculated when taking the non-practical exhaustive approach that enumerates over all N ! permutations. Ideally, we wish to count the number of permutations for which the value hgt(N, n 1 , n 2 , b) is also their mmHG score, as a permutation may have several hgt values that are better than s, so it can be counted more than once. This explains why the formula introduced earlier is an upper bound and not an exact p-value. A second observation that follows is that the smaller the mmHG score s is, the tighter the bound, because a permutation will have fewer combinations (N, n 1 , n 2 , b) having hgt score better than s. Therefore, if we can reduce the extent of multiple counting of the same permutation, we will get a tighter bound. We do this by looking one step backwards. If, for example, hgt(N, n 1 , n 2 , b) ≤ s, we can exclude from the counting permutations that contain b elements from the range [1, . . . , n 1 − 1] at their first n 2 entries because they are already taken into account in Λ(N, n 1 −1, n 2 , b) (because necessarily hgt(N, n 1 − 1, n 2 , b) ≤ s, as we will later explain).
Let Ψ (N, n 1 , n 2 , b) be the set of permutations for which it holds that out of the first n 2 entries, b of them are taken from the range [1, . . . , n 1 ] (note that Λ(N, n 1 , n 2 , b) introduced earlier is, therefore, the size of Ψ (N, n 1 , n 2 , b)).
Assuming hgt(N, n 1 , n 2 , b) ≤ s, we can partition the set Ψ (N, n 1 , n 2 , b) into five disjoint subsets ψ 1 , ..., ψ 5 such that ψ = ψ 1 ∪ ψ 2 ∪ ψ 3 ∪ ψ 4 ∪ ψ 5 , as follows: The properties of the hypergeometric distribution imply that ψ 1 , ψ 2 , ψ 4 can be disregarded, in the current counting stage. To explain why, we will demonstrate the argument on ψ 1 . The permutations in ψ 1 contain b elements from the range [1, . . . , n 1 −1] at the first n 2 entries. We also assume that hgt(N, n 1 , n 2 , b) ≤ s. Therefore hgt(N, n 1 − 1, n 2 , b) ≤ s also holds, as the same intersection is observed for even a smaller set. Thus, the permutations in ψ 1 should have been counted when handling the triplet n 1 − 1, n 2 and b and disregarded for the combination n 1 , n 2 and b. Similar arguments hold for ψ 2 and ψ 4 .
The permutations in ψ 5 contain b-2 elements taken from [1, . . . , n 1 − 1] located at the first n 2 − 1, where n 1 is positioned at one of the first n 2 − 1 entries, and also entry n 2 contains an element from [1, . . . , n 1 − 1]. Therefore: From the above we next conclude an upper bound. Denote Yielding the following upper bound for the p-value: Note that when n 1 or n 2 ≤ 1, Λ * (N, n 1 , n 2 , b) is defined as Λ (N, n 1 , n 2 , b). Also, given N, n 1 and n 2 , b can be any integer in [max(0, n 2 − N + n 1 ), min(n 1 , n 2 )].
This upper bound uses more delicate counting than the bound introduced in the previous section. In the following sections we assess the tightness of this bound. In later sections we demonstrate an application for PWM motif search.

Comparison to a Different Variant
We note that the bound described in Steinfeld et al. [25] addresses a slightly different variant of mmHG as a random variable over S N . The definition with which we work here is more amenable to deriving tight bounds as described above. Given a single permutation π ∈ S N and for every i = 1, . . . , N, a binary vector λ i is defined in which exactly i entries are 1 and N -i entries are 0, as follows: λ i (j) = 1 if π(j) ≤ i. The mmHG score of a permutation π is then defined by Steinfeld et al. [25] as: where mHG(λ) = min 1≤i≤N hgt (N, B, n, b n A possible upper bound is then given by: Computing the latter quantity requires O(N 2 ) hgt calculations and is therefore more computationally efficient than the two bounds described in Sections 3.1 and 3.2 of this current work, that require O(N 3 ) hgt calculations. We observed that our bound was tighter than the bound in (*), as later shown in Figure 1D.

Assessment of Tightness
In order to assess the quality of our bound, we compared it to the exact p-value, which can be calculated for small values of N (that is, in cases where N ! is not too large). Figure 1A compares the mmHG score (which also serves as a lower bound for the p-value), the exact p-value (calculated by exhaustive enumeration of all 10! permutations), our upper bound and the Bonferroni corrected p-value for N =10. Figure 1B shows the same for N =20, except that exact p-values cannot be calculated exhaustively, and therefore an empirical p-value is produced by randomly sampling 10 7 permutations. In both cases our upper bound is significantly tighter than the Bonferroni bound. We also observed that the smaller the mmHG scores are -the tighter is our bound, consistent with lesser over-counting for smaller scores, as explained in previous sections. Comparison between the first bound described in Section 3.1 and the bound described in Section 3.2 is shown in Figure 1C (for textit N=20). We observed that enumerating all hgt scores rather than enumerating all permutations in S N significantly improves the p-value estimation. Moreover, the refinement of this approach produced by reducing the extent of multiple counting of permutation further improves the upper bound. In Figure 1D the bounds, including the bound introduced in Section 3.3 (Steinfeld bound), are shown for N =100. An empirical p-value was not calculated here as even if we sample 10 7 permutations, a p-value smaller than 10 −7 cannot be obtained. The bound suggested in this paper was almost always observed to be tighter than the bound introduced in Section 3.3.

Application in PWM Motif Search
In this section we discuss mmHG as a framework for assessing the significance of PWM motifs in ranked lists. Given a ranked list of sequences and a PWM motif, by using the mmHG statistics and the bounds introduced earlier, we can assign a p-value to represent the significance of that PWM being enriched at the top of the list. To apply this approach for de-novo motif search, one needs to theoretically consider all possible PWMs. This is not feasible and as a heuristic approach we wrote mmHG-Finder which takes as input a ranked list of DNA or RNA sequences and returns significant motifs in PWM format. In cases where sequence ranking is not relevant or not available, it allows the use of positive and negative sets of sequences, searching for enriched motifs in the positive set using the negative set as the background.
We will now describe the methodology implemented in mmHG-Finder: Input: • a ranked list of sequences (or, alternatively, two sets of sequences representing target and background) • motif width, given as a range between k 1 and k 2 Algorithm: 4. Take the most significant fifty k -mers, to be used as starting points for the next step. This set of candidates is chosen such that the members are quite different. Note that this is a heuristic approach and the number 50 is somewhat arbitrary, chosen to succeed in catching the best performing PWMs without heavily paying in complexity. 5. For each starting point, we iteratively replace one position in the k -mers by considering all possible IUPAC replacements and taking the one that improves the enrichment the most. We repeat this process for all positions several times. Eventually we get a motif in the IUPAC alphabet which is then converted to a PWM. 6. The PWMs found in the previous step are assessed using the mmHG statistics and the best is returned as output, together with the p-value. The score assigned by a PWM to a string S = S1, . . . , SM is defined as max 1≤i≤M −K+1 K j=1 mS i+j−1 ,j (assuming M ≥ K, otherwise it is −∞), where mα,c is the score in row α, column c of the position weight matrix. In other words, the PWM score calculated for S is the maximal score obtained for a substring of S.
To evaluate the performance of mmHG-Finder in comparison to other state-ofthe-art methods we ran it on 18 example cases -3 synthetically generated cases and 15 generated from high throughput binding experiments (6 transcription factors and 9 RNA-binding proteins). We compared the results to those obtained by using three other methods: the standard MEME program [3], DREME [4], and XXmotif [17]. Some of the results of this comparison are summarized in Table 1. The synthetic examples were generated by randomizing 500 sequences of length 100. An IUPAC motif was generated and planted in all top 64 sequences. mmHG-Finder outperformed all the other three tools on the synthetic examples, which contained degenerate motifs. MEME and DREME did not find the motifs in any case, while XXmotif found a similar result in 1 out of the 3 tests. The other 15 examples were taken from DNA and RNA high-throughput experiments [23], [10], [12]. In 12 out of these 15 datasets, mmHG-Finder found the motifs which were compatible with the known literature motifs as the most significant result. In comparison, DREME found the known consensus in 11 cases; XXmotif detected the literature motif in 9 cases while MEME detected the known motif in only 7 cases. In several datasets, such as for GCN4 and Pin4, mmHG-Finder identified the consensus motif while other tools returned repetitive sequences as their top results. The mmHG statistics avoids such spurious results as they typically do not correlate with the measurement driven ranking.
Computing p-value bounds for the synthetic examples (N =500) took 7-17 seconds on a simple single-core laptop. The running time depends on both the number of elements N as well as the mmHG score. The computation is optimized such that it is quicker for smaller mmHG scores. It took 33 minutes for N =5000 where the mmHG score was 3.3 · 10 −69 , and 39 minutes for N =4000 and mmHG score = 5.9 · 10 −31 . Table 1. We evaluated the performance of mmHG-Finder in comparison to other state-of-the-art methods: MEME, DREME and XXmotif. Almost all input examples comprised ranked lists, except for p53 (comprising target and background sets). Since MEME, DREME, and XXmotif expect a target set as input, we converted the ranked lists into target sets by taking the top 100 sequences for MEME (restricted by MEME's limitation of 60,000 characters) and the top 20 % sequences for the other tools. In the synthetic examples the entire ranked lists were taken as they are sufficiently small. Data and consensus motifs for p53 were taken from [23]; for REB1, CBF1, UME6, TYE7, GCN4 from [10]; and for the RNA binding proteins from [12]. Selected results are shown below. tual enrichment at the top of two permutations uniformly and independently drawn over S N . We assess tightness using simulated data. We also demonstrate utility of the mmHG statistics in identifying motifs in experimental binding affinity data. For several representative datasets, including synthetically generated data, we note that our bound improves the p-value estimates by a factor of 50.
The full characterization of the distribution of mmHG as a random variable over S N remains an open question.