An improved Four-Russians method and sparsified Four-Russians algorithm for RNA folding

Background The basic RNA secondary structure prediction problem or single sequence folding problem (SSF) was solved 35 years ago by a now well-known \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n^3)$$\end{document}O(n3)-time dynamic programming method. Recently three methodologies—Valiant, Four-Russians, and Sparsification—have been applied to speedup RNA secondary structure prediction. The sparsification method exploits two properties of the input: the number of subsequence Z with the endpoints belonging to the optimal folding set and the maximum number base-pairs L. These sparsity properties satisfy \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0 \le L \le n / 2$$\end{document}0≤L≤n/2 and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n \le Z \le n^2 / 2$$\end{document}n≤Z≤n2/2, and the method reduces the algorithmic running time to O(LZ). While the Four-Russians method utilizes tabling partial results. Results In this paper, we explore three different algorithmic speedups. We first expand the reformulate the single sequence folding Four-Russians \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\Theta \left(\frac{n^3}{\log ^2 n}\right)$$\end{document}Θn3log2n-time algorithm, to utilize an on-demand lookup table. Second, we create a framework that combines the fastest Sparsification and new fastest on-demand Four-Russians methods. This combined method has worst-case running time of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(\tilde{L}\tilde{Z})$$\end{document}O(L~Z~), where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{{L}}{\log n} \le \tilde{L}\le min\left({L},\frac{n}{\log n}\right)$$\end{document}Llogn≤L~≤minL,nlogn and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{{Z}}{\log n}\le \tilde{Z} \le min\left({Z},\frac{n^2}{\log n}\right)$$\end{document}Zlogn≤Z~≤minZ,n2logn. Third we update the Four-Russians formulation to achieve an on-demand \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O( n^2/ \log ^2n )$$\end{document}O(n2/log2n)-time parallel algorithm. This then leads to an asymptotic speedup of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(\tilde{L}\tilde{Z_j})$$\end{document}O(L~Zj~) where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{{Z_j}}{\log n}\le \tilde{Z_j} \le min\left({Z_j},\frac{n}{\log n}\right)$$\end{document}Zjlogn≤Zj~≤minZj,nlogn and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$Z_j$$\end{document}Zj the number of subsequence with the endpoint j belonging to the optimal folding set. Conclusions The on-demand formulation not only removes all extraneous computation and allows us to incorporate more realistic scoring schemes, but leads us to take advantage of the sparsity properties. Through asymptotic analysis and empirical testing on the base-pair maximization variant and a more biologically informative scoring scheme, we show that this Sparse Four-Russians framework is able to achieve a speedup on every problem instance, that is asymptotically never worse, and empirically better than achieved by the minimum of the two methods alone.

thermodynamically optimal 1 secondary structure, secondary structure is a key component of structure prediction [14].
Efficient O(n 3 )-time dynamic programming algorithms were developed more than thirty years ago to find noncrossing secondary structure of a single RNA molecule with n bases [22,23,27,29,38,39]. We call this basic folding or single sequence folding (SSF) problem. In addition, McCaskill [19] created an O(n 3 )-time algorithm for the partition function for RNA secondary structure. Based on these algorithms, software has been developed and widely used [15,16,25,36,37]. Probabilistic methods, employing Stochastic context-free grammar (SFCG), were also developed to solve the basic folding problem [7,8].
The accuracy of all these methods is based on the parameters given by the scoring function. Thermodynamic parameters [17,18,28,33] and statistical parameters [6,7], or a combination of the two [2,13] are currently employed.
The Valiant [1,34], Sparsification [4,30], and the Four-Russians (FR) [9,24] methods where previously applied to improve on the computation time for secondary structure prediction. For SSF, the Valiant method achieves the asymptotic time bound of O n 3 2 �log(n) by incorporating the current fastest min/max-plus matrix multiplication algorithm [32,34]. The Four-Russians method was applied to single sequence [10,24], cofolding [11] and pseudoknotted [12] folding problems. The Sparsification method, was developed to improve computation time in practice for a family of RNA folding problems, while retaining the optimal solution matrix [4,20,21,26,30,35].

Methods
In this paper, we combine the Four-Russians method [24] and the Sparsification method [4]. While the former method reduces the algorithm's asymptotic running time to n 3 log 2 n , the latter eliminates many redundant computations. To combine these methods, we use an ondemand tabulation (instead of a preprocessing approach which is typically applied in FR algorithms), removing any redundant computation and guaranteeing the combined method is at least as fast as each individual method, and in certain cases even faster. First, we reformulate SSF Four-Russians n 3 log 2 n -time algorithm [24] to utilizes on-demand lookup table creation. Second, we combine the fastest Sparsification and Four-Russians SSF speedup methods. The Sparse Four Russians speedup presented here leads to a practical and asymptotically 1 Or close to optimal. fastest combinatorial algorithm (even in the worstcase). The new algorithm has an O(LZ) run time where LZ log 2 n ≤LZ ≤ min n 3 log 2 n , LZ . In practice, when accounting for every comparison operation the Sparse Four Russians outperforms both the Four-Russians and Sparsification methods. Third, we extended the Four-Russian SSF algorithm to be computed in O(n 2 / log 2 n)time. The simulated results for this formulation and O(n) processors achieve a practice speedup on the number of comparison operations performed.

Problem definition and basic algorithm
Let s = s 0 s 1 . . . s n−1 be an RNA string of length n over the four-letter alphabet = {A, U , C, G}, such that s i ∈ for 0 ≤ i < n. Let s i,j denote the substring s i s i+1 . . . s j−1 . We note that for simplicity of exposition substring s i,j does not contain the nucleotide j. A folding (or a secondary structure) of s is a set M of position pairs (k, l), such that: (1) 0 ≤ k < l < n; (2) and there are no two different pairs (k, l), (k ′ , l ′ ) ∈ M such that k ≤ k ′ ≤ l ≤ l ′ (i.e. each position participates in at most one pair, and the pairs are non-crossing).
Let β(i, j) return a score associated with position pair (i, j). Let L(s, M) be the score associated with a folding M of RNA string s, and let L(s) be the maximum score L(s, M) over all foldings M of s. The RNA Folding or SSF problem is: given an RNA string s, compute L(s), and find an optimal folding M such that L(s, M) = L(s). In this work, we assume the following simple scoring scheme: ), (C, G), (G, C)} , and β(i, j) = 0 otherwise. Richer scoring schemes allow more biologically significant information to be captured by the algorithm. However, the algorithms for solving the problem similar recurrences and other discrete scoring schemes may be accelerated in a similar way to what we present here.
For the folding M of s i,j , an index k ∈ (i, j) is called a split point in M if for every (x, y) ∈ M, either y < k or k ≤ x. A folding M is called a partitioned folding (with respect to s i,j ) if there exists at least one split point; otherwise M is called a co-terminus folding. Let the matrix L be a matrix such that L[i, j] = L(s i,j ). In addition, let For completeness, when The above recurrence may be efficiently implemented using a dynamic programming (DP) algorithm. Essentially, the DP algorithm computes and maintains values of the form L[i, j], L p [i, j] and L c [i, j] for every 0 ≤ i ≤ j ≤ n in three n + 1 × n + 1 matrices. The algorithm traverses the matrices in increasing column order index j from 1 to n. Within each column, the cell L[k, j] is computed in decreasing index order k from j − 1 to 0. Once Clearly, computing L p is the bottleneck of the computation, since for a given i, j, there may be �(n) split points to examine. We divide the solution matrix L in two ways: q × q submatrices ( Fig. 1) and size q sub column vectors (the value of q will be determined later). Let K g be the gth interval such that K g = {q · g, q · g + 1, . . . , q · g + q − 1}. We call these sets Kgroups, and use K g as the interval starting at index g · q. For an index i, define g i = i q . It is clear that

Extending the notation and moving towards a vector by vector computation of L
Similarly, we break up the row indices into groups of size q, denoted by I g where I g = {k = q · g, k + 1, ...k + q − 1} . (Clearly, row index set I g is equivalent to the Kgroup K g . We only introduce this extra notation for simplicity of the exposition).
for all i ∈ I Given this notation L P [i, j] can be rewritten as maximization L p K g [i, j] values for all K g index Kgroups between i and j. However, in some cases, the indices {i + 1, . . . q · g i+1 − 1} do not form a full Kgroup K g i . Similarly indices {qg j , qg j + 1, . . . j − 1} do not form a full Kgroup K g j . Therefore, L P [i, j] can be computed by maximizing the full and non full Kgroups K g . In Eq. 4 and the following sections we do not explicitly differentiate between full and non full groups.
We extend the notation further, to compute the matrix L p not cell by cell but instead by vectors of size q corresponding to the I g ′ row sets, as follows.
The DP algorithm can be updated to incorporate the extended notation. Within each column, compute the matrices in vectors of size q.

Sparsification of the SSF algorithm
The Sparsification method achieves a speedup by reducing the number of split points examined during the computation of L P [i, j]. As Fig. 2 shows the focus of Sparsified Four Russians algorithm will narrow down only on those submatrices whose split points are step-oct for a particular i, j [4,30].

OCT and STEP sub-instances of sequence s
Sub-instance s i,j is optimally co-terminus (OCT) if every optimal folding of s i,j is co-terminus. We introduce the extra notation below if Notation: For the index set K = {k, k + 1, . . . k ′ } and column j, let K oct j be the set of indices such that K oct j ⊂ K and ∀ k∈K oct j L[k, j] is OCT. Given the row interval I = {i, i + 1, . . . i ′ }, let I step k be the set of rows such that I step k ⊂ I, and for all i ∈ I step k L[i, k] is STEP. We further define operation ⊗ step−oct such that given is computed by the following procedure: Using the operation ⊗ step−oct and based on Fact 1. We reduce the time to compute Note Eq. 6 does not explicitly show that for L P K g ′ [I g ′ , j] the split-point i + 1 must be examined for every i ∈ I g ′.

Asymptotic time bound of sparsified SSF
Let L be the total number of STEP sub-instances in column k. More precisely L = |{0, 1, . . . k} step k | (Creating the list of splitpoints that correspond to STEP incidence requires no additional computation time [4]). The total time to compute SSF when examining only STEP, OCT combinations (Sparsification method), is O(LZ). As shown in Backofen et al. [4] Z is bounded by Z ≤ n 2 and L is bounded by L ≤ n 2 . The overall asymptotic time bound of the Sparsification method is O(LZ) remains O(n 3 ).

On-demand Four Russians speedup
Presented here is an on-demand version of the �(log 2 n) -time Four-Russians algorithm implied by Pinhas et al. [24].  Let We define the -encoding of 2-discrete vector x to be a pair of integers (x 0 , � x ) such that x 0 is the first element in x and x is the integer representation of the binary vector -encoding vector operations:

MUL lookup table
Based on Observation 1, any column vector in matrix L is 2-discrete. Given vector L[K g , j] and its -encod-  There are O n 2 q 2 submatrices within L. For each submatrix the maximum number of entries we compute for lookup table MUL is 2 q−1 . In total, the asymptotic time bound to populate lookup table MUL is

MAX lookup table
Let the max of two 2-discrete q-size vectors v and w , denoted max( v, w), result in a q-size vector z, where ∀ 0≤k<q z k = max(v k , w k ). Without loss of generality, let w 0 ≥ v 0 . Comparing the first element in each vector there are two possibilities either (1) In the second case, we make use of the following fact [24].

Fact 3 Given two vectors
Lets define lookup We summarize these results in the function max: Function max : : input: such that w 0 ≥ v 0 and = (v 0 , ∆ v ) and = (w 0 , ∆ w ) output: In Eq. 8, below, we integrate the vector comparison function max. Each vector L p [I g ′ , j] is computed by maximizing over O(n / q) vectors. We will compute the lookup table MAX on-demand for every entry that does not exist an O(q). Clearly the lookup table MAX will contain at most 2 (q−1) · 2 (q−1) · q for all entries. In worst case, the lookup table MAX computes in O(2 q 2 q) time.
The matrix L p and hence L is solved by a total of O n 2 q computations of Eq. 8. In total, given lookup table MUL

Extending to D-discrete vectors
We define the -encoding of D-discrete vector x to be a pair of integers (x 0 , � x ) such that x 0 is the first element in x and x is the integer representation in base 10 of the vector As a result for a more complicated scoring scheme B we could apply the Four-Russians speedup by augmenting the encode, and decode functions as well as the max algorithm. input: This would result in a total asymptotic time for Four-Russians SSF where |D| > 2 of Setting q = ǫ log D n, where ǫ ∈ (0, .5) [31], the total computation time is equal to

Sparse Four-Russian method
With the Four-Russians method, a speedup is gained by reducing q split point index comparisons for q subsequences to a single O(1) time lookup. The Sparsification method reduces the comparison to only those indices which correspond to STEP-OCT folds.

STEP-OCT condition for sets of split points
In this section, we achieve a Sparsified Four-Russian speedup for the computation of the L p matrix. As in the Four Russians method, we will conceptually break up the solution matrix L in two ways: in q × q size submatrices, and q size subcolumn vectors. The submatrices are indexed by g ′ and g such that the corresponding submatrix is L[I g ′ , K g ] . The subcolumn vectors are indexed by g and j , such that the corresponding subcolumn vector is L[K g , j].
We augment the Four-Russians SSF to reduce the number of entries, and lookups into the MUL table. If and only if, the matrix L[I g ′ , K g ] contains at least one cell L[i, k] that is STEP and within vector j] ] . If such an entry does not exist we will compute L[I g ′ , K g ] ⊗ (0, � L[K g ,j] ) and store the result into lookup table MUL.
The following notation will be used to help determine if a split point Kgroup should be examined in the computation. Observation 2 Suppose that, s i,k is STEP, and integer m ′ = sigStep(L[I g ′ , K g ]) such that i ∈ I g ′ (or I g ′ = I g i ) and k ∈ K g (or K g = K g k ). Then, the corresponding binary vector � m ′ must be set to 1 in position x where x is an index

OCT subcolumn vector
Observation 3 Suppose s k,j is OCT, and suppose integer m = sigOct(L[K g , j]) such that k ∈ K g . Then, the corresponding binary vector m must be set to 1 in position x, where x is an index such that k = qg + x. More precisely, if s k,j is OCT then m[x] = 1 by the definition of sigOct.
Given two binary vectors v and w the dot product of their integer representation is equal to a binary number Theorem 1 For any subinstance s i,j either i + 1 is the optimal split point, or there is an optimal split point k ∈ (i, j), such that sigStep(L[I g i , K g k ]) ⊙ sigOct(L[K g k , j]) equals 1.
Proof Based on Fact 1 for any sub-instance s i,j there is an optimal split point k such that either k = i + 1 or s i,k is STEP and s k,j is OCT. If s i,k is STEP and s k,j is OCT then L[i, k] is STEP and L[k, j] is OCT. The cell L[i, k] belongs to submatrix L[I g i , K g k ] and the cell L [k, j] belongs to the vector L[K g k , j]. Let x be an index such that k = qg k + x. Let � m ′ be a binary vector that corresponds to sigStep(L[I g i , K g k ]).
We can reformulate the computation of L p [I g ′ , j] by referencing the lookup table MUL only if g is STEP-OCT. This reduces the number of operations used in computing the bottleneck L P matrix.
We update the DP algorithm to only access the MUL lookup table for matrix and vector combinations that satisfy the property Let G be a lookup table, where give an index g ∈ [0, n/q] and integer m ∈ [0, 2 q ] the G[g][m] ⊂ {I 0 , I 1 , . . . , I g } is a set of row index intervals. Each index I g ′ within G[g][m] satisfies the following condition: Lookup table G (updated on-demand) allows us to implement Eq. 9. As L[K g , j] is computed, the corresponding SigOct is also computed. Let m = sigOct(L[K g , j]). By iterating through I g ′ ∈ G[g][m] set of row indices we access table MUL only when both of the following conditions hold at the same time: the submatrix L[I g ′ , K g ] contains at least one cell L[i, k] where s i,k is STEP and within vector L[K g , j] the cell L[k, j] contains s k,j that is OCT (where i ∈ I g ′ and k ∈ K g ).
The Sparsified Four-Russian algorithm implements Eq. 9. The complete function will tabulate STEP, and OCT instances as well as sigStep and sigOct values. The G, MUL and MAX lookup tables will be computed on-demand.
The empirical testing results are given not in seconds but in the number of operations including both lookup table creation and split-point comparisons. We do so to abstract from the effect compiler optimizations. Note that the testing does not account for memory access time, or extend the algorithm to D > 2 scoring schemes ( Table 1).
For N = 128 the Sparse Four-Russians(SFR) algorithm performs 25 % less comparisons than the Sparsified(SP) SSF algorithm and 80 % less comparison than the Four-Russians (FR) algorithm. In all test cases, the Sparse Four-Russians performed better than the minimum of either method alone.
An O(n 2 / log 2 (n)) simple parallel Four-Russians RNA folding algorithm Lets solve the recurrence relation (Eq. 1-3) in increasing index j order and then move up the column j computing one cell at a time in decreasing i order. Each cell L[i, j] is solved by calculating Eq. 1-3 for all i < k ≤ j. Given this j, i, k order, let us reformulate the computation by moving up each column in O(n/q) q-size subcolumn vectors instead of n cells.

Utilizing n processors
Lets create a new process for each column j, creating n process in total. We can synchronously move up the matrix computing each column subvector such that on iteration d we compute L[I g j −d , j] for all j ∈ (0, n). Invariant 1 Given g i and g j ∀ i∈I g i ∀ k∈K g L[i, k] = L(s i,k ) . In other words, submatrix L[I g i , K g ] is computed. Similarly L[K g , j] is computed or ∀ k∈K g L[k, j] = L(s k,j ).
Please note that the function complete assumes that L p Replacing the max(L p [I g i , j], L[I g i , K g ]) ⊗ L[K g , j]) computation with lookups into MUL and MAX tables would reduces the run-time for finding the solution matrix L to O(n 2 /log 2 n). As stated in "Extending to D-discrete vectors" section it is possible to create lookup tables on-demand and achieve a reduction in computation time of �(log 2 n) factor.
The preprocessing can also be achieve in parallel reducing the asymptotic cost form O(n 3 / log 2 n) to O(n 2 / log 2 n). If entry MUL L[ There are O n 2 q 2 submatrices within L. For each submatrix the maximum number of entries we compute for lookup table MUL is D q−1 . However, in each iteration at worse O(n) of the entries are computed simultaneously. In total, the asymptotic time bound to populate lookup table Based on Williams [31] O(D q ) is bound by O(n/ log 2 n) when setting q = ǫ log n. Hence, for the MUL lookup table the total asymptotic computation time is O(n · D q ) = O(n 2 / log 2 n), For the MAX table similarly the serial computation of O(D 2q q) total time is reduced by a factor of n in the parallel case. The total computation time for the MAX table is therefore O(n/ log 3 n).

Parallel sparisified Four-Russians single sequence folding algorithm
Let Z x be the number of OCT cells in column x. Let ∀ x∈[0,n] Z j ≥ Z x . The parallel algorithm would take as long as would take as a it takes for the last processor to complete.
In order to extend the parallel Four-Russians single sequence folding algorithm to utilize the Sparsification speedup we will limit the call to MUL table only if sigSTEP(L[I g i , K g ]) ⊙ sigOCT (L[K g , j]) > 0. As result given Z j the total time to compute for processor j is O(LZ j ) where Z j log n ≤Z j ≤ min Z j , n log n .

Conclusion
This work combines the asymptotic speedup of Four-Russians with the very practical speedup of Sparsification. The on-demand formulation of the Four-Russians not only removes all extraneous computation. This approach allows the Four-Russians SSF to achieve a speedup in practice for realistic scoring schemes. This also leads us to take advantage of the sparsity properties. Through asymptotic analysis and empirical testing on the base-pair maximization variant and a more biologically informative scoring scheme, we show that the Sparse Four-Russians framework is able to achieve a speedup on every problem instance, that is asymptotically never worse, and empirically better than achieved by the minimum of the two methods alone. We also showed that through some re-organization we could apply the Four-Russians speedup to parallel algorithm and achieve and asymptotic time of O(n 2 / log 2 n). The algorithm created here can be implemented in CUDA to compute on multiprocessor GPUs. Because the algorithm allows for memory cell independence one can apply memory and cache optimization without affecting the algorithm. The utility in this framework lies not only on its ability to speedup single sequence folding but its ability to speedup the family of RNA folding problems for which both Sparsification and Four-Russians have bene applied separately. Future work in this area would be to examine the ability to sparsify memory [3], as Four-Russians at worst case requires an additional factor of 2 log(n) in memory. Another open question is wether it is possible to apply the �(log 3 n) [5] speedup of boolean matrix multiplication to RNA folding.