Accelerating calculations of RNA secondary structure partition functions using GPUs
Harry A Stern^{1} and David H Mathews^{2}
DOI: 10.1186/1748-7188-8-29
© Stern and Mathews; licensee BioMed Central Ltd. 2013
Received: 31 January 2013
Accepted: 14 October 2013
Published: 1 November 2013
Abstract
Background
RNA performs many diverse functions in the cell in addition to its role as a messenger of genetic information. These functions depend on its ability to fold to a unique three-dimensional structure determined by the sequence. The conformation of RNA is in part determined by its secondary structure, or the particular set of contacts between pairs of complementary bases. Prediction of the secondary structure of RNA from its sequence is therefore of great interest, but can be computationally expensive. In this work we accelerate computations of base-pair probabilities using parallel graphics processing units (GPUs).
Results
Calculation of the probabilities of base pairs in RNA secondary structures using nearest-neighbor standard free energy change parameters has been implemented using CUDA to run on hardware with multiprocessor GPUs. A modified set of recursions was introduced, which reduces memory usage by about 25%. GPUs are fastest in single precision and, on some hardware, are restricted to single precision. This may introduce significant roundoff error. However, deviations in base-pair probabilities calculated using single precision were found to be negligible compared to those resulting from shifting the nearest-neighbor parameters by a random amount of magnitude similar to their experimental uncertainties. For large sequences running on our particular hardware, the GPU implementation reduces execution time by a factor of close to 60 compared with an optimized serial implementation, and by a factor of 116 compared with the original code.
Conclusions
Using GPUs can greatly accelerate computation of RNA secondary structure partition functions, allowing calculation of base-pair probabilities for large sequences in a reasonable amount of time, with a negligible compromise in accuracy due to working in single precision. The source code is integrated into the RNAstructure software package and available for download at http://rna.urmc.rochester.edu.
Background
RNA performs many diverse functions in the cell in addition to its role as a messenger of genetic information. It can form enzymes, for example for cleavage of itself or of other RNA, or to create peptide bonds as a fundamental constituent of the ribosome [1]. It can act as a signalling molecule for regulation of gene expression, for protein export, or for guiding post-translational modifications [2–5].
As for proteins, RNA function depends on its folding to a well-defined three-dimensional shape. In contrast to proteins, the folding of RNA is hierarchical [6]. Secondary structure, or the particular set of contacts between pairs of complementary bases mediated by hydrogen bonding and stacking of bases, provides a significant amount of information. This can be helpful in predicting function or accessibility to ligands [7–10]. Computational prediction of the secondary structure of RNA from its sequence is therefore of great interest. The most widely used automated prediction methods attempt to estimate the thermodynamic stability of RNA, using empirical parameters determined from experiments on oligonucleotides [11, 12].
CUDA is a programming interface developed by the company NVIDIA to facilitate general-purpose, parallel high-performance computing on multiprocessor graphics processing units (GPUs) [13]. In recent years, many scientific computing applications have been implemented on GPUs using CUDA, in many cases yielding speedups of several orders of magnitude [14, 15]. However, to our knowledge, only a handful of publications have appeared describing GPU implementations of codes for RNA secondary structure prediction. Rizk and Lavenier described a CUDA implementation of structure prediction by free energy minimization [16]. Their work was limited to relatively short sequences (up to 120 bases) and a simplified energy model that neglects coaxial stacking. The GPU implementation was faster than the serial implementation by factors of 10 and 17, depending on the particular hardware. More recently, Lei et al. [17] also reported a parallelized implementation of free energy minimization using CUDA. They used only a coarse-grained parallelization scheme where the minimum-free-energy structures for subsequences of a given length are calculated in parallel, but the search over structures for each subsequence is done serially. Their work was also limited to relatively short sequences (up to 221 bases). They reported speedups of up to a factor of 16. It should be noted that these parallelized implementations neither use the latest thermodynamic parameters for loops [18], nor include coaxial stacking interactions [19].
In addition, recent work demonstrates that base-pairing probabilities calculated with partition functions can provide additional useful information [20]. Structures composed of highly probable pairs are more accurate than lowest free energy structures [20, 21]. The base pair probabilities provide confidence intervals for prediction of individual pairs. Base pairing probabilities can also be used to predict pseudoknots [22]. We have additionally extended this work to predictions using multiple, homologous sequences, where the same principles hold true [23–26].
Biological RNA sequences examined
Sequence  # bases  Reference 

tRNA RQ2640  75  [28] 
tRNA RD0500  76  [28] 
tRNA RA7680  76  [28] 
tRNA RD0260  77  [28] 
tRNA RR1664  77  [28] 
Candida albicans 5S rRNA  114  [29] 
Escherichia coli 5S rRNA  120  [30] 
P546 folding domain of Tetrahymena thermophila group I intron  155  [31] 
Bacillus stearothermophilus SRP RNA  268  [32] 
3’ UTR of Bombyx mori R2 element with flanking vector sequence  300  [33] 
Tetrahymena thermophila group I intron  433  [30] 
Saccharomyces cerevisiae A5 group II intron  632  [34] 
Escherichia coli small subunit rRNA  1542  [30] 
Escherichia coli large subunit rRNA  2904  [29] 
human ICAM-1 mRNA  2986  [35] 
HIV-1 NL4-3 genome (GenBank: AF324493)  9709  [36] 
We employed a sophisticated and accurate energy calculation that includes coaxial stacking [19]. Before attempting a parallel implementation, we first wrote an optimized serial version of the original code (the “partition” program in RNAstructure), implementing only a subset of its functionality, while improving efficiency and reducing memory usage. Subsequently, we parallelized the optimized code for GPU hardware, using CUDA. Here we made use of a fine-grained parallelization scheme, in which the restricted partition functions for all subsequences of a given length are calculated in parallel, and the calculation of the restricted partition function for each individual subsequence is itself parallelized. We found that this fine-grained parallelization resulted in greater speedups than a simpler coarse-grained-only parallelization (up to factors of ∼60 compared to the optimized serial version and ∼116 compared to the original code).
Implementation
Calculating base pair probabilities
Here e^{−g/RT} is the Boltzmann weight corresponding to standard free energy change g at temperature T, where R is the gas constant.
These recursions are slightly different from—but equivalent to—those presented in reference [20] and used in the previous code. It should be noted that there was an error in equation 15 of reference [20]: in the second line, WMBL(k+1,j) should be replaced by [WMBL(k+1,j) + WL(k+1,j)].
1. W^{Q} is the standard free energy change corresponding to the sum of terms on the right hand side of equation 11 of reference [20] not including the scaling by W5(k).
2. Elements of Y are −RT times the logarithm of the sum of the Boltzmann weights of corresponding elements of W and W^{MB}.
3. Elements of Y^{L} are −RT times the logarithm of the sum of the Boltzmann weights of corresponding elements of W^{L} and W^{MBL}.
4. Elements of Z are −RT times the logarithm of the sum of the Boltzmann weights of corresponding elements of W^{L} (except for the term depending on the next-smallest fragment) and W^{coax}.
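Written out, the definitions in items 2 and 3 take the following form. The fragment indices (i,j) are an assumed notation for "corresponding elements" and are not taken from the original equations; Z follows the same pattern with the exclusion noted in item 4.

```latex
\begin{align}
Y(i,j)     &= -RT \ln\left[ e^{-W(i,j)/RT} + e^{-W^{MB}(i,j)/RT} \right] \\
Y^{L}(i,j) &= -RT \ln\left[ e^{-W^{L}(i,j)/RT} + e^{-W^{MBL}(i,j)/RT} \right]
\end{align}
```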
Reorganizing the recursions in this way might appear to use more memory because of the additional arrays. In fact, the modified version requires less memory, because several of the arrays do not need to be stored in their entirety. Specifically, using the modified recursions, storage is only required for two diagonals of W, W^{L}, and W^{MBL}; for five diagonals of W^{MB}; and for a half-triangle of W^{Q}. Reducing memory usage is important as the size of the full arrays scales as O(N^{2}) and the available GPU memory on our hardware was limited to ∼2.5 GB. The modified recursions use four full N×N arrays and one half-triangle, rather than the six full N×N arrays used in the original recursions, and therefore reduce memory usage by about 25%. In addition, the calculation of W^{5′} and W^{3′} is simplified (compare equations 20 and 21 above with equation 11 of reference [20]).
Here n is the number of unpaired nucleotides and h is the number of branching helices [21]. By convention, the size of internal loops is limited to thirty unpaired nucleotides, so the number of terms in equation 8 and the overall computational expense scale as O(N^{3}), where N is the size of the sequence.
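The 25% figure can be checked with a short calculation. The sketch below counts only the large arrays named in the text, treats the half-triangle as half of an N×N array, and assumes 4-byte single-precision elements. The function name dp_storage_bytes is hypothetical and not part of RNAstructure.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate storage, in bytes, for a given number of full N x N arrays.
   A half-triangle is counted as 0.5 of a full array (an assumption about
   how the text's "half-triangle" is stored). */
double dp_storage_bytes(size_t n, double n_full_arrays, size_t elem_size)
{
    assert(elem_size > 0);
    return n_full_arrays * (double)n * (double)n * (double)elem_size;
}

/* Fractional memory saving of the modified recursions (4 full arrays plus
   one half-triangle) relative to the original recursions (6 full arrays). */
double memory_saving_fraction(size_t n, size_t elem_size)
{
    double original = dp_storage_bytes(n, 6.0, elem_size);
    double modified = dp_storage_bytes(n, 4.5, elem_size);
    return 1.0 - modified / original;
}
```

Because both totals scale as N^{2}, the saving is independent of sequence length; what changes with N is whether the total fits within the available GPU memory.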
This was done for two reasons: it requires at most a single call to exp, rather than two; and it can make use of the log1p function from the standard math library, which calculates log(1+x) accurately even for small x. This is important because often e^{−a} and e^{−b} will differ by several orders of magnitude, and simply adding them and then taking the logarithm can lead to significant roundoff error.
In order to determine how much additional computational overhead was imposed by the calculation of exp and log1p we performed a comparison with an artificial reference calculation, which was identical except that calls to these functions were omitted. We found that for a 1,000-mer, the actual GPU calculation is only ∼20% more expensive than this reference calculation.
For a serial calculation on the CPU, there is a larger performance hit; the actual calculation is about a factor of two more expensive than the reference without exp or log1p. However, it should be noted that this is not the entire story, because overall, the new optimized serial code, which uses logarithms, is still faster than the original code, which does not. Running the calculation in log space results in simplifications such as not requiring checking for overflow and not having to multiply by scaling factors, which reduces computational expense.
Parallelization of the partition function calculation using CUDA
In the CUDA programming model, overall execution of the program is still governed by the CPU. Compute-intensive portions of the program are delegated to kernels, which are subroutines executed by the GPU. In general, the GPU has its own memory, so data must be copied to and from the GPU before and after kernel execution. Many copies of a kernel, or threads, run in parallel, each of which belongs to a block. During kernel execution, threads belonging to the same block can share data and synchronize, whereas threads belonging to different blocks cannot. A program can contain many kernels, which can execute either serially or in parallel [13].
The algorithm for calculating partition functions is recursive: partition functions for larger fragments depend on those for smaller fragments. As such, the overall calculation proceeds serially, in order of fragment size. We used two levels of parallelization, a block level and a thread level. Calculations for all fragments of a given size may be done in parallel, with no communication. This was implemented in CUDA at the level of blocks of threads. The partition function for a given fragment depends on sums, with the number of terms on the order of the fragment size (e.g., equation 9). These sums were parallelized at the level of threads within a block, since calculating a sum in parallel relies on communication between the threads. In our experience, a greater speedup was obtained from this “inner loop” parallelization, even though it requires more communication between threads. Most likely, this is because optimal efficiency on GPU hardware is obtained when identical mathematical operations are performed in lockstep on different data [13]. We stress that these two different levels of parallelization are not mutually exclusive and optimal performance was obtained from including both. A separate block of threads was run for each fragment, while 256 threads were run within a block. The number of threads per block was chosen by trial and error and was optimal for our hardware (the simple sum reduction scheme we chose requires it to be a power of two). In our code, this value is set at compile time (but this is not required by CUDA—it could be set at run time if desired).
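The thread-level sum can be illustrated with a serial C emulation of a power-of-two tree reduction. This is a sketch under stated assumptions rather than the actual kernel: block_sum and THREADS_PER_BLOCK are hypothetical names, plain addition stands in for the log-space sum used in the real calculation, and the loops over t stand in for threads that would run concurrently on the GPU, with a barrier (__syncthreads in CUDA) between reduction steps.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_BLOCK 256   /* must be a power of two */

/* Emulates one block of threads summing n terms. */
float block_sum(const float *terms, size_t n)
{
    float partial[THREADS_PER_BLOCK];   /* stands in for shared memory */

    /* The halving scheme below only works for a power-of-two count. */
    assert((THREADS_PER_BLOCK & (THREADS_PER_BLOCK - 1)) == 0);

    /* Phase 1: "thread" t accumulates terms t, t + 256, t + 512, ... */
    for (size_t t = 0; t < THREADS_PER_BLOCK; ++t) {
        float acc = 0.0f;
        for (size_t k = t; k < n; k += THREADS_PER_BLOCK)
            acc += terms[k];
        partial[t] = acc;
    }

    /* Phase 2: halve the number of active "threads" at each step; each
       active thread adds in the partial sum one stride away. */
    for (size_t stride = THREADS_PER_BLOCK / 2; stride > 0; stride /= 2)
        for (size_t t = 0; t < stride; ++t)
            partial[t] += partial[t + stride];

    return partial[0];
}
```

With 256 threads, the second phase takes eight steps (log2 of 256), and at every step all active threads perform the same operation on different data.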
Results and discussion
Accuracy
Peak floating-point performance of NVIDIA Tesla GPUs is higher by a factor of two in single precision than in double precision (http://www.nvidia.com), but single precision introduces greater roundoff error. In order to examine accuracy, we calculated base-pair probabilities for the same sequences using both the parallel CUDA/GPU implementation in single precision and the serial implementation in double precision. We also calculated probabilities in double precision using a set of nearest-neighbor parameters slightly modified by adding a random variate chosen from a Gaussian distribution with mean 0 and standard deviation 0.01 kcal/mol, which is comparable to or smaller than their experimental uncertainty (0.1 kcal/mol for parameters describing helical stacking and 0.5 kcal/mol for parameters describing loops [18, 37]).
We also used the calculated base pair probabilities to determine a single consensus structure for each sequence, using the ProbKnot algorithm [22]. In this case, working in single precision led to differences with double precision only for one sequence (a random sequence of 6000 bases) out of the 44 we examined, and these were very small (only three bases out of the 6000 were matched with a different partner). Working with the modified parameters led to larger discrepancies, although these were still fairly small (for sequences containing more than 100 bases, at most 6% of bases were matched with different partners). This is consistent with a previous report that predictions using base pair probabilities are significantly less sensitive to errors in thermodynamic parameters than those using only lowest free energy structures [39].
Computational expense
Conclusions
In this work, we introduced a modified set of recursions for calculating RNA secondary structure partition functions and base-pairing probabilities using a dynamic programming algorithm, and implemented these in parallel using the CUDA framework for multiprocessor GPUs. For large sequences, the GPU implementation reduces execution time by a factor of close to 60 compared with an optimized serial implementation, and by a factor of 116 compared with the original code. It is clear from our work that using GPUs can greatly accelerate computation of RNA secondary structure partition functions, allowing calculation of base-pair probabilities for large sequences in a reasonable amount of time, with a negligible compromise in accuracy due to working in single precision. It is expected that parallelization using CUDA should be applicable to other implementations of dynamic programming algorithms [12] besides ours, and result in similar speedups.
Two levels of parallelization were implemented. Calculations for all fragments of a given size were done in parallel, with no communication between threads. This was implemented in CUDA at the level of blocks of threads. In addition, the sums contributing to the partition function for a given fragment were calculated in parallel, with communication required between threads. These sums were parallelized at the level of threads within a block. We found that this “inner loop” parallelization resulted in a significantly greater speedup than the “outer loop” parallelization alone.
Availability and requirements

Project name: partitioncuda; part of RNAstructure, version 5.5 and later

Project home page: http://rna.urmc.rochester.edu/RNAstructure.html

Operating system(s): Unix

Programming languages: C and CUDA

Other requirements: CUDA compiler, available from

License: GNU GPL

Any restrictions to use by nonacademics: None.
Declarations
Acknowledgements
This work was supported by NIH grant R01GM076485 to D. H. M. and the Center for Integrated Research Computing at the University of Rochester.
References
1. Doudna JA, Cech TR: The chemical repertoire of natural ribozymes. Nature 2002, 418:222.
2. Bachellerie JP, Cavaille J, Huttenhofer A: The expanding snoRNA world. Biochimie 2002, 84:775.
3. Walter P, Blobel G: Signal recognition particle contains a 7S RNA essential for protein translocation across the endoplasmic reticulum. Nature 1982, 299:691.
4. Dykxhoorn DM, Novina CD, Sharp PA: Killing the messenger: Short RNAs that silence gene expression. Nat Rev Mol Cell Biol 2003, 4:457.
5. Wu L, Belasco JG: Let me count the ways: Mechanisms of gene regulation by miRNAs and siRNAs. Mol Cell 2008, 29:1.
6. Bustamante C, Tinoco I Jr: How RNA folds. J Mol Biol 1999, 293:271.
7. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: Inference of RNA alignments. Bioinformatics 2009, 25:1335.
8. Li X, Quon G, Lipshitz HD, Morris Q: Predicting in vivo binding sites of RNA-binding proteins using mRNA secondary structure. RNA 2010, 16:1096.
9. Lu ZJ, Mathews DH: Efficient siRNA selection using hybridization thermodynamics. Nucleic Acids Res 2007, 36:640.
10. Tafer H, Ameres SL, Obernosterer G, Gebeshuber CA, Schroeder R, Martinez J, Hofacker IL: The impact of target site accessibility on the design of effective siRNAs. Nat Biotechnol 2008, 26:578.
11. Turner DH, Mathews DH: NNDB: The nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res 2010, 38:D280.
12. Lorenz R, Bernhart SH, zu Siederdissen CH, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA package 2.0. Algorithms Mol Biol 2011, 6:26.
13. Sanders J, Kandrot E: CUDA by Example: An Introduction to General-Purpose GPU Programming. Boston: Addison-Wesley; 2011.
14. Farber RM: Topical perspective on massive threading and parallelism. J Mol Graph Model 2011, 30:82.
15. Gotz AW, Williamson MJ, Xu D, Poole D, Le Grand S, Walker RC: Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. Generalized Born. J Chem Theory Comput 2012, 8:1542.
16. Rizk G, Lavenier D: GPU accelerated RNA folding algorithm. Lect Notes Comput Sci 2009, 5544:1004.
17. Lei G, Dou Y, Wan W, Xia F, Li R, Ma M, Zou D: CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications. BMC Genomics 2012, 13(Suppl 1):S14.
18. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101:7287.
19. Kim J, Walter AE, Turner DH: Thermodynamics of coaxially stacked helices with GA and CC mismatches. Biochemistry 1996, 35:13753.
20. Mathews DH: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 2004, 10:1178.
21. Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15:1805.
22. Bellaousov S, Mathews DH: ProbKnot: Fast prediction of RNA secondary structure including pseudoknots. RNA 2010, 16:1870.
23. Harmanci AO, Sharma G, Mathews DH: PARTS: Probabilistic alignment for RNA joint secondary structure prediction. Nucleic Acids Res 2008, 36:2406.
24. Harmanci AO, Sharma G, Mathews DH: Stochastic sampling of the RNA structural alignment space. Nucleic Acids Res 2009, 37:4063.
25. Harmanci AO, Sharma G, Mathews DH: Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics 2011, 27:626.
26. Seetin MG, Mathews DH: TurboKnot: Rapid prediction of conserved RNA secondary structures including pseudoknots. Bioinformatics 2012, 28:792.
27. Reuter JS, Mathews DH: RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 2010, 11:129.
28. Sprinzl M, Vassilenko KS: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res 2005, 33:D139.
29. Szymanski M, Barciszewska MZ, Barciszewski J, Erdmann VA: 5S ribosomal RNA database Y2K. Nucleic Acids Res 2000, 28:166.
30. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Muller KM, Pande N, Shang Z, Yu N, Gutell RR: The comparative RNA web (CRW) site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002, 3:2.
31. Cate JH, Gooding AR, Podell E, Zhou K, Golden BL, Kundrot CE, Cech TR, Doudna JA: Crystal structure of a group I ribozyme domain: Principles of RNA packing. Science 1996, 273:1678.
32. Larsen N, Samuelsson T, Zwieb C: The signal recognition particle database (SRPDB). Nucleic Acids Res 1998, 26:177.
33. Mathews DH, Banerjee AR, Luan DD, Eickbush TH, Turner DH: Secondary structure model of the RNA recognized by the reverse transcriptase from the R2 retrotransposable element. RNA 1997, 3:1.
34. Michel F, Umesono K, Ozeki H: Comparative and functional anatomy of group II catalytic introns - a review. Gene 1989, 82:5.
35. Staunton DE, Marlin SD, Stratowa C, Dustin ML, Springer TA: Primary structure of ICAM-1 demonstrates interaction between members of the immunoglobulin and integrin supergene families. Cell 1988, 52:925.
36. Adachi A, Gendelman HE, Koenig S, Folks T, Willey R, Rabson A, Martin MA: Production of acquired immunodeficiency syndrome-associated retrovirus in human and nonhuman cells transfected with an infectious molecular clone. J Virol 1986, 59:284.
37. Xia T, Burkhard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH, SantaLucia J Jr: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry 1998, 37:14719.
38. Zuker M: On finding all suboptimal foldings of an RNA molecule. Science 1989, 244:48.
39. Layton DM, Bundschuh R: A statistical analysis of RNA folding algorithms through thermodynamic parameter perturbation. Nucleic Acids Res 2005, 33:519.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.