Graphdistance distribution of the Boltzmann ensemble of RNA secondary structures
 Jing Qin^{1},
 Markus Fricke^{3},
 Manja Marz^{3},
 Peter F Stadler^{2, 6, 7, 8, 9} and
 Rolf Backofen^{4, 5}Email author
https://doi.org/10.1186/17487188919
© Qin et al.; licensee BioMed Central Ltd. 2014
Received: 30 November 2013
Accepted: 30 June 2014
Published: 11 September 2014
Abstract
Background
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. PremRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Result
Spatial distance can be approximated by the graphdistance in RNA secondary structure. We show here that the equilibrium distribution of graphdistances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n^{6}D^{5}) for sequence length n and D distinct distance values, it is possible to reduce this to O(n^{4}) for practical applications in which predominantly small distances are of of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several reallife applications, in particular since these primarily concern longrange interactions in very large RNA molecules.
Conclusions
The graphdistance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graphdistance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.unijena.de/RNAgraphdist.html.
Keywords
Graphdistance Boltzmann distribution Partition function PremRNA splicing smFRETBackground
The distance distribution within an RNA molecule is of interest in various contexts. Most directly, the question arises whether panhandlelike structures (in which 3’ and 5’ ends of long RNA molecules are placed in close proximity) are the rule or an exception. Panhandles have been reported in particular for many RNA virus genomes. Several studies [1–4] agree based on different models that the two ends of singlestranded RNA molecules are typically not far apart. On a more technical level, the problem to compute the partition function over RNA secondary structures with given endtoend distance d, usually measured as the number of external bases (plus possibly the number of structural domains) arises for instance when predicting nucleic acid secondary structure in the presence of singlestranded binding proteins [5] or in models of RNA subjected to pulling forces (e.g. in atom force microscopy or export through a small pore) [6–8]. It also plays a role for the effect of loop energy parameters [9].
In contrast to the endtoend distance, the graphdistance between two arbitrarily prescribed nucleotides in a larger RNA structure does not seem to have been studied in any detail. However, this is of particular interest in the analysis of singlemolecule fluorescence resonance energy transfer (smFRET) experiments [10]. This technique allows to monitor the distance between two dyelabeled nucleotides and can reveal details of the kinetics of RNA folding in real time. It measures the nonradiative energy transfer between the dyelabeled donor and acceptor positions. The efficiency of this energy transfer, E_{ fret }, strongly depends on the spatial distance R according to ${E}_{\text{fret}}={R}_{0}^{6}/({R}_{0}^{6}+{R}^{6})$. The Förster radius R_{0} sets the length scale, e.g. R_{0}≈54 Å for the Cy3Cy5 dye pair. A major obstacle is that, at present, there is no general and efficient way to link smFRET measurements to interpretations in terms of explicit molecular structures. To solve this problem, a natural first step is to compute the distribution of spatial distances for an equilibrium ensemble of 3D structures. Since this is not feasible in practice despite major progress in the field of RNA 3D structure prediction [11], we can only resort to considering the graphdistances on the ensemble of RNA secondary structures instead. From a computer science point of view, furthermore, we show here that the distance distribution can be computed exactly using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graphdistance can be related to the smFRET data such as those reported by [12] and help to explain effects of RNA structures in premRNA splicing and viral subgenomic RNA species.
Theory
RNA secondary structures
 (i)
If {i,j} ∈ B then x _{ i } x _{ k }∈ {GC,CG,AU,UA,GU,UG};
 (ii)
If {i,j} ∈ B then j  i > 3;
 (iii)
If {i,j},{i,k} ∈ B then j = k;
 (iv)
If {i,j},{k,l} ∈ B and i < k < j then i < l < j.
The first condition allows base pairs only for WatsonCrick and GU base pairs. The second condition implements the minimal steric requirement for an RNA to bend back on itself. The third condition enforces that B forms a matching in the secondary structure. The last condition (nesting condition) forbids crossing base pairs, i.e. pseudoknots.
The nesting condition results in a natural partial order in the set of base pairs B defined as {i,j} ≺ {k,l} if k < i < j < l. In particular, given an arbitrary vertex k, the set B_{ k }= {{i,j} ∈ Bi ≤ k ≤ j} of base pairs enclosing k is totally ordered. Note that k is explicitly allowed to be incident to its enclosing base pairs. A vertex k is external if B_{ k }= ∅ . A base pair {k,l} is external if B_{ k }= B_{ l }= {{k,l}}.
Consider a fixed secondary structure G, for a given base pair {i,j} ∈ B, we say a vertex k is accessible from {i,j} if i < k < j and there is no other pair {i^{′},j^{′}} ∈ B such that i < i^{′} < k < j^{′} < j. The unique subgraph ${\mathcal{\mathcal{L}}}_{i,j}$ induced by i, j, and all the vertices accessible from {i,j} is known as the loop of {i,j}. The type of a loop ${\mathcal{\mathcal{L}}}_{i,j}$ is unique determined depending on whether {i,j} is external or not, and the numbers of unpaired vertices and base pairs. For details, see [13]. Each secondary structure G has a unique set of loops $\left\{{\mathcal{\mathcal{L}}}_{i,j}\right\{i,j\}\in B\}$, which is called the loop decomposition of G. The free energy f(G) of a given secondary structure, according to the standard energy model [14], is defined as the sum of the energies of all loops in its unique loop decomposition.
The relative location of two vertices v and w in G is determined by the base pairs B_{ v } and B_{ w } that enclose them. If B_{ v }∩ B_{ w }≠ ∅, there is a unique ≺minimal base pair {i_{v,w},j_{v,w}} that encloses both vertices and thus a uniquely defined loop ${\mathcal{\mathcal{L}}}_{\{{i}_{v,w},{j}_{v,w}\}}$ in the loop associated with v and w. If B_{ v }∖ B_{ w }= ∅ or B_{ w }∖ B_{ v }= ∅ then v or w is unpaired and part of ${\mathcal{\mathcal{L}}}_{\{{i}_{v,w},{j}_{v,w}\}}$. Otherwise, i.e. B_{ v }∩ B_{ w }= ∅, there are uniquely defined ≺maximal base pairs {k_{ v },l_{ v }} ∈ B_{ v }∖ B_{ w } and {k_{ w },l_{ w }} ∈ B_{ w }∖ B_{ v } that enclose v and w, respectively. We note that B_{ v }∖ B_{ w } (B_{ w }∖ B_{ v }) may be empty, in which case {k_{ v },l_{ v }} ({k_{ w },l_{ w }}) is also empty. This simple partition holds the key to computing distance distinguished partition functions below.
In the following, we assign the weights a for backbone edges and b for base pairs, respectively. Given a path p, we define the weight of the path d(p) as the sum of the weights of edges in the path. The (weighted) graphdistance${d}_{v,w}^{G}$ in G is defined as the weight of the path p connecting v and w with d(p) being minimal. For the weights, we require the following condition:

(W) If i and j are connected by an edge, then {i,j} ∈ E is the unique shortest path between i and j.
This condition ensures that single edges cannot be replaced by detours of shorter weight. Condition (W) and property (ii) of the secondary structure graphs implies b < 3a because the closing base pair must be shorter than a hairpin loop. Furthermore, considering a stacked pair we need b < b + 2 a, i.e. a > 0. We allow the degenerate case b = 0 that neglects the traversals of base pairs.
Before we continue with the calculations of the partition function, let us first consider the problem formulation in more detail. For the FRET application, it is wellknown that FRET efficiency is correlated with spatial distance. Furthermore, only a limited range of distance changes (e.g. 20 Å100 Å for Cy3Cy5) can be reported by the FRET experiments. Thus a more useful formulation of our problem is not to use the full expected quantity for all positions. Instead, we are interested in the average for all distancevalues within some threshold θ_{ d }. As the space and time complexity will depend on the number of distances we consider, we will parametrise our complexity by the number of nucleotides n and the number of distances considered D = θ_{ d }+ 1, as well. In the worst case, there is D = O (n). However, given that in practice only a limited range of distance changes are considered, we rather view D = O(1) as a small constant in our contribution.
Boltzmann distribution of graphdistances
For a fixed structure G, ${d}_{v,w}^{G}$ is easy to compute. Here, we are interested in the distribution $\mathit{\text{Pr}}\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]$ and its expected value ${d}_{v,w}=E\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]$ over the ensemble of all possible structures G for a given sequence x. Both quantities can be calculated from the Boltzmann distribution P r[ Gx] = e^{ f (G)/R T}/Q where $Q=\sum _{G}{e}^{f\left(G\right)/\mathit{\text{RT}}}$ denotes the partition function of the ensemble of structures. As first shown in [15], Q and related quantities can be computed in quartic time. A reduction to a cubic algorithm may be obtained if the free energy of long interior loops may be regarded as prohibitive. This restriction has been widely used for long sequences [16]. Cubic runtime can also be achieved for some but not all parametrizations of interior loop energies [17].
for a given pair v,w of positions in a given RNA sequence x. A simple computation (Appendix A in Additional file 1) verifies that the $\mathit{\text{Pr}}[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}dx]={Z}^{v,w}[\phantom{\rule{0.3em}{0ex}}d]/Q$ and ${d}_{v,w}=E\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]=\sum _{d}({Z}^{v,w}\left[\phantom{\rule{0.3em}{0ex}}d\right]/Q)d$. Hence it suffices to compute Z^{v,w}[d] for any 1 ≤ d ≤ n. In the following sections we show that this can be achieved by a variant of McCaskill’s approach [15].
For the ease of presentation we describe in the following only the recursion for the simplified energy model for the “circular maximum matching”, in which energy contributions are associated with individual base pairs rather than loops. Our approach can be easily extended to the full model by using separating the partition functions into distinct cases for the loop types.
We use the letters Z and Y to denote partition functions with distance constraints, while Q is used for quantities that appear in McCaskill’s algorithm and are considered as precomputed here. For instance, let ${Q}_{i,j}^{B}$ denote the partition function over all secondary structures on x[ i..j] that are enclosed by the base pair {i,j}. We will later also need the partition function Q_{i,j} over the subsequence x [ i..j], regardless of whether {i,j} is paired or not. In Additional file 1: Appendix C, we summarize the notations frequently used in our contribution.
Recursions of Z^{v,w}[ d]: The case when v and ware external
An important special case assumes that both v and w are external. This is the case e.g. when v and w are binded by proteins. In particular, the problem of computing endtoend distances, i.e., v = 1 and w = n, is of this type.
Assuming (W), the shortest path between two external vertices v,w consists of the external vertices and their backbone connections together with the external base pairs. We call this path the inside path of i,j since it does not involve any vertices “outside” the subsequence x [ i..j].
For efficiently calculating the internal distance between any two vertices v,w, we denote by ${Z}_{i,j}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ the partition function over all secondary structures on x [ i..j] with distance exactly d.
with the initialization ${Z}_{\mathit{\text{ii}}}^{I}\left[\phantom{\rule{0.3em}{0ex}}0\right]=1$ and ${Z}_{\mathit{\text{ii}}}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]=0$ for d > 0. For consecutive vertices, we have ${Z}_{i,i+1}^{I}\left[\phantom{\rule{0.3em}{0ex}}a\right]=1$ and ${Z}_{i,i+1}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]=0$ for d ≠ a. These recursions have been derived in several different contexts, e.g. force induced RNA denaturations [6], the investigate of loop entropy dependence [9], the analysis of FRET signals in the presence of singlestranded binding proteins [5], as well as in mathematical studies of RNA panhandlelike structures [3, 4].
This recursion requires O(n^{3}D) time and O(n^{2}D) space. It is possible to reduce the complexity of computing the expected distance in this special case by a linear factor. The trick is to use conditional probabilities for arcs starting at i or the conditional probability for i to be singlestranded, which can be determined from the partition function for RNA folding [3], see Additional file 1: Appendix B.
Recursions of Z^{v,w}[ d]: the general case
Figure 1 shows that the outside paths are important for the green region, i.e., the region that is covered by an arc. Hence, we have to consider the different cases that the two positions v and w are covered by arcs. The set Ω of all secondary structures on xcan be divided into two disjoint subclasses that have to be treated differently:

: v and w are not enclosed in a common base pair, i.e., B_{ v }∩ B_{ w }= ∅.

: there is a base pair enclosing both v and w, i.e., B_{ v }∩ B_{ w }≠ ∅.
Note that this bipartition explicitly depends on v and w. In the following, we will first introduce the recursions that are required in Ω_{0} structures to compute Z^{v,w}[ d].
Contribution of Ω_{0} structures to Z^{v,w}[ d]: ${Z}_{0}^{v,w}\left[\phantom{\rule{0.3em}{0ex}}d\right]$
One example of this case is given in Figure 1 with the red and blue region, where v (vertex in green region) is covered by an arc, and w (vertex in blue region) is external. Denote the ≺maximal base pair enclosing v by {i,j}. Since at most one of v and w is covered by an arc, we know that j < w. Hence, every path p from v to w, and hence also the shortest paths (not necessarily unique) must run through the right end j of the arc {i,j}. More precisely, there must subpaths p_{1} and p_{2} with d (p) = d(p_{1}) + d (p_{2}) + a such that $v\stackrel{p}{\u21dd}w\to v\stackrel{{p}_{1}}{\u21dd}j(j+1)\stackrel{{p}_{2}}{\u21dd}w$, where $i\stackrel{p}{\u21dd}j$ denotes that p is a shortest path from i to j and  denotes a single backbone edge. For the shortest path from v to j, it consists either of a shortest path $v\stackrel{{p}^{\prime}}{\u21dd}i$ and the arc {i,j}, or it goes directly to j without using the arc {i,j}.
How does this distinction translate to the partition function approach? If we want to calculate the contribution of this case to the partition function Z^{v,w}[ d], we have to split both the sequence x[ i,w] and distance d as follows
where ${Z}_{j,w}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}_{2}\right]$ is the partition function starting and ending with a singlestranded base as defined in Equation (2), and ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ is the partition function consisting of all structures of x[ i,j] containing the base pair {i,j} with the property that the shortest path from v to i has length d_{ ℓ } and the shortest path from v to j has length d_{ r }. In addition, d, d_{ r } and d_{2} must satisfy d = d_{ r }+ d_{2}.
To simplify, we extend the definition of ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ by setting ${Z}_{v,v}^{B,v}[\phantom{\rule{0.3em}{0ex}}0,0]=1$ and ${Z}_{v,v}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]=0$ for d_{ ℓ }+ d_{ r }> 0. This allows us to conveniently model all cases where either v or w are external, i.e., a.), b.), and d.), as special cases of c.).
In case c.), we have to split the distance d into five subdistances ${d}_{l},{d}_{r},{d}_{l}^{\prime},{d}_{r}^{\prime},{d}_{I}$, in which d_{ I } can be retrieved from the first four distances. Furthermore, we would require four splitting positions for the sequence for all possible combinations of i,j,k,l. A naïve implementation of this idea would result in an algorithm with time complexity O(n^{6}D^{5}) and space complexity O(n^{2}D^{2}).
Note that for splitting the distance, we reuse the same indices (e.g., the j in ${Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}{d}_{1}\right]\xb7{Z}_{j,k}^{{I}^{\prime}}[\phantom{\rule{0.3em}{0ex}}d({d}_{1}+{d}_{2}\left)\right]$, where as for the remaining partition function, we use successive indices (e.g.,the i in ${Q}_{1,i1}\xb7{Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}{d}_{1}\right]$). This difference comes from the fact that splitting a sequence into subsequences is done naturally between two successive indices, whereas splitting a distance is naturally done by splitting at an individual position. We have only to guarantee that the substructures which participate in the split do agree on the structural context of the split position. This is guaranteed by requiring that ${Z}^{{I}^{\prime}}$ starts and ends with a backbone edge. We note that the incorporation of the full dangling end parameters makes is more tedious to handle the splitting positions.
We thus arrive at O(n^{3}D^{2}) time and O(n^{2}D) space complexity for the contribution of Ω_{0} structures to Z^{v,w}[ d], excluding the complexity of computing ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$.
Contribution of Ω_{1} structures to Z^{v,w}[ d]
where ${\hat{Q}}_{p,q}^{b}$ is the external partition function over all structures on the union of the intervals x[ 1..p] ∪ x[ q..n] so that {p,q} is a base pair. Since the base pair probability can be written as $\mathit{\text{Pr}}\left(\right\{p,q\left\}\right)=\frac{{\hat{Q}}_{p,q}^{b}{Q}_{p,q}^{b}}{Q}$, this quantity can be calculated as ${\hat{Q}}_{p,q}^{b}=\frac{\mathit{\text{Pr}}\left(\right\{p,q\left\}\right)Q}{{Q}_{p,q}^{b}}$. The base pair probability P r ({p,q}), and the partition functions Q and ${Q}_{p,q}^{b}$ are computed by means of McCaskill’s algorithm.
This part now has a complexity of O(n^{2}D^{2}) space and O(n^{3}D^{4}) time. For practical applications, however, we do not need to consider all possible {p,q}. Instead, there are only few base pairs that are likely to form and that cover v,w, especially for v,w where the internal distance of v,w is large enough such that an outside path has to be considered at all. If we assume a constant number of such longrange basepairs, then the complexity is reduced by an n^{2}factor. For the complexity in terms of distance, recall that D is typically small.
Recursions for ${Z}_{i,j}^{B,v}[{d}_{\ell},{d}_{r}]$
The values that are chosen to split d_{ ℓ } and d_{add} are indicated in green and blue. When the arc {i,j} is colored violet, then there is a shortest path that does not use the distance marked in red but uses the other direction together with the arc {i,j}. If b<d_{ add }<+b, then we know that neither a shortest path $v\stackrel{p}{\u21dd}i$ nor $v\stackrel{p}{\u21dd}j$ uses the arc {i,j}. The left distance is thus given by ${d}_{\ell}{d}_{\ell}^{\prime}$. Using the shortcuts d_{ r }= d_{ ℓ }+ d_{add} and ${d}_{r}^{\prime}={d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}$, then the distance between l and j must be ${d}_{r}{d}_{r}^{\prime}=({d}_{\ell}+{d}_{\text{add}})\left({d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}\right)$. If, on the other hand, d_{add} = + b, then we know that there is at least one shortest path that can be composed by using a shortest path $v\u21ddi$, followed by the arc {i,j}. This of course implies that the shortest path $v\stackrel{p}{\u21dd}j$ is has exactly the length d_{ ℓ }+ b, or is larger. For a subpath $l+1\stackrel{{p}^{\prime}}{\u21dd}j$ this implies that the length is greater or equal $d={d}_{r}{d}_{r}^{\prime}=\left({d}_{\ell}+b\right)\left({d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}\right)$. Thus, we just have to add all partition functions ${Z}_{k,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$ with d^{′} > d. This can be done efficiently by using a precalculated matrix ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]$, which is defined as $\sum _{{d}^{\prime}\ge d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$. Note that ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ can also be defined if we restrict in all recursion the distance d to a threshold θ_{ d }, since ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]=\sum _{{d}^{\prime}\ge d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]={Q}_{i,j}^{\prime}\sum _{{d}^{\prime}<d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$. In which, where ${Q}_{i,j}^{\prime}$ is Q_{i + 1,j  1} if j > i + 1, 1 if j = i + 1 and 0 otherwise. Note, furthermore, that all ${Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$ for d^{′}< d ≤ θ_{ d } are calculated when we restrict the distance to θ_{ d }.
Discussion and applications
The theoretical analysis of the distance distribution problem shows that, while polynomialtime algorithms exist, they probably cannot be improved to space and time complexities that make them widely applicable to large RNA molecules. Due to the unfavorable time complexity of the current algorithm and the associated exact implementation in C, a rather simple and efficient sampling algorithm has been implemented. We resort to sampling Boltzmannweighted secondary structures with RNAsubopt p[16], which uses the same stochastic backtracing approach as sfold[18]. As the graphdistance for a pair of nucleotides in a given secondary structure can be computed in O(n logn) time by Dijkstra’s algorithm with Fibonacci heap [19], even large samples can be evaluated efficiently.
The smFRET data of [12] indicates the presence of three subpopulations, corresponding to three different structural states: folded molecules (state F), intermediate conformation (state I) and unfolded molecules (state U). In the absence of Mg ^{2+}, the I state dominates, and only small fractions are found in states U and F. Unfortunately, the salt dependence of RNA folding is complex [21, 22] and currently is not properly modeled in the available folding programs. We can, however, make use of the qualitative correspondence of low salt concentrations with high temperature. In Figure 2C we therefore recompute the graphdistance distribution in the ensemble at an elevated temperature of 50°C. Here, the real structure becomes the second best structure with free energy 10.82 kcal/mol and we observe a much larger fraction of (nearly) unfolded structures with longer distances between the two beacon positions. Qualitatively, this matches the smFRET data showed in Figure 2B.
We observe for DAse that the contributions from the backbone edges are larger than the base pairs at both temperatures. For T = 37°C, there are in total 14 edges with I_{6,42}(e) > 0.4. Only two of them, 5(C)–18(G) and 2(G) – 21(C) are base pairs. For T = 50°C, there is only the pair 5(C) – 18(G) is heavily used (I_{6,42}(5,18) = 0.636). Combining the analysis of data illustrated in Figure 2, it may indicate that the existences of two base pairs, 2(G) – 21(C) and 28(G) – 39(C) can affect the graphdistance distribution of RNA secondary structure ensemble and consequently affect smFRET measurements. Such constraints may become an interesting source of constraints for RNA structure prediction.
Graphdistance of intron CG16979RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3)
These examples indicate that the systematic analysis of the graphdistance distribution both for individual RNAs and their aggregation over ensembles of structures can provide useful insights into structural influences on RNA function. These may not be obvious directly from the structures due to the inherent difficulties of predicting longrange base pairs with sufficient accuracy and the many issues inherent in comparing RNA structures of very disparate lengths.
Due the complexity of algorithm we have refrained from attempting a direct implementation in an imperative programming language. Instead, we are aiming at an implementation in Haskell that allows us to make use of the framework of algebraic dynamic programming [28]. The graphdistance measure and the associated algorithm can be extended in principle to of RNA secondary structures with additional tertiary structural elements such as pseudoknots [29] and Gquadruples [30]. RNARNA interaction structures [31] also form a promising area for future extensions. We note finally, that the Fourier transition method introduced in [32] could be employed to achieve a further speedup.
Conclusion
The distribution of spatial distances in the equilibrium structure ensemble of an RNA molecule carries information about the overall structure of the molecule. These distance can be approximated by the graphdistance in RNA secondary structure. We introduced a polynomial time algorithm to compute the equilibrium distribution of graphdistances between a fixed pair of nucleotides. For practical applications, small distances are of main interest. Here, the time complexity of the proposed algorithm is O(n^{4}), compared to a naïve implementation with time complexity of O(n^{11}) for sequence length n and distances that can cover the whole sequence length. Since further reductions, however, seem to be difficult, we also introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several reallife applications, in particular since these primarily concern longrange interactions in very large RNA molecules.
Declarations
Acknowledgments
This work was supported in part by the Deutsche Forschungsgemeinschaft proj. nos. BA 2168/33, SFB 992, STA 850/102, SPP 1596 and MA 5082/11, the BMBF (grant 0316165A) and the MWK (grant 7533711.6.1).
Authors’ Affiliations
References
 Yoffe AM, Prinsen P, Gelbart WM, BenShaul A: The ends of a large RNA molecule are necessarily close. Nucl Acids Res. 2011, 39: 292299. 10.1093/nar/gkq642.View ArticlePubMedPubMed CentralGoogle Scholar
 Fang LT: The endtoend distance of RNA as a randomly selfpaired polymer. J Theor Biol. 2011, 280: 101107. 10.1016/j.jtbi.2011.04.010.View ArticlePubMedGoogle Scholar
 Clote P, Ponty Y, Steyaert JM: Expected distance between terminal nucleotides of RNA secondary structures. J Math Biol. 2012, 65: 581599. 10.1007/s0028501104678.View ArticlePubMedGoogle Scholar
 Han HS, Reidys CM: The 5’3’ distance of RNA secondary structures. J Comput Biol. 2012, 19: 867878. 10.1089/cmb.2011.0301.View ArticlePubMedGoogle Scholar
 Forties RA, Bundschuh R: Modeling the interplay of singlestranded binding proteins and nucleic acid secondary structure. Bioinformatics. 2010, 26: 6167. 10.1093/bioinformatics/btp627.View ArticlePubMedGoogle Scholar
 Gerland U, Bundschuh R, Hwa T: Forceinduced denaturation of RNA. Biophys J. 2001, 81: 13241332. 10.1016/S00063495(01)75789X.View ArticlePubMedPubMed CentralGoogle Scholar
 Müller M, Krzakala F, Mézard M: The secondary structure of RNA under tension. Eur Phys J E. 2002, 9: 6777.PubMedGoogle Scholar
 Gerland U, Bundschuh R, Hwa T: Translocation of structured polynucleotides through nanopores. Phys Biol. 2004, 1: 1926. 10.1088/14783967/1/1/002.View ArticlePubMedGoogle Scholar
 Einert TR, Näger P, Orland H, Netz R: Impact of loop statistics on the thermodynamics of RNA Folding. Phys Rev Lett. 2008, 101: 048103View ArticlePubMedGoogle Scholar
 Roy R, Hohng S, Ha T: A practical guide to singlemolecule FRET. Nat Methods. 2008, 5: 507516. 10.1038/nmeth.1208.View ArticlePubMedPubMed CentralGoogle Scholar
 Das R, Baker D: Automated de novo prediction of nativelike RNA tertiary structures. Proc Natl Acad Sci USA. 2007, 104: 1466414669. 10.1073/pnas.0703836104.View ArticlePubMedPubMed CentralGoogle Scholar
 Kobitski A, Nierth A, Helm M, Jaschke A, Nienhaus UG: Mg^{2+}dependent folding of a DielsAlderase ribozyme probed by singlemolecule FRET analysis. Nucleic Acids Res. 2007, 35 (6): 20472059. 10.1093/nar/gkm072.View ArticlePubMedPubMed CentralGoogle Scholar
 Schuster P, Fontana W, Stadler PF, Hofacker IL: From sequences to shapes and back: a case study in RNA secondary structures. Proc R Soc London B. 1994, 255 (1344): 27984. 10.1098/rspb.1994.0040.View ArticleGoogle Scholar
 Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA. 2004, 101: 72877292. 10.1073/pnas.0401799101.View ArticlePubMedPubMed CentralGoogle Scholar
 McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6–7): 11051119.View ArticlePubMedGoogle Scholar
 Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Alg Mol Biol. 2011, 6: 2610.1186/17487188626.View ArticleGoogle Scholar
 Lyngsø RB, Zuker M, Pedersen C: Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 1999, 15: 440445. 10.1093/bioinformatics/15.6.440.View ArticlePubMedGoogle Scholar
 Ding Y, Lawrence C: A statistical sampling algorithm for RNA secondary structure prediction. Nucl Acids Res. 2003, 31 (24): 72807301. 10.1093/nar/gkg938.View ArticlePubMedPubMed CentralGoogle Scholar
 Fredman M, Tarjan R: Fibonacci heaps and their uses in improved network optimization algorithms. J ACM. 1987, 34 (3): 596615. 10.1145/28869.28874.View ArticleGoogle Scholar
 Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25 (15): 19741975. 10.1093/bioinformatics/btp250.View ArticlePubMedPubMed CentralGoogle Scholar
 Leipply D, Lambert D, Draper DE: IonRNA interactions thermodynamic analysis of the effects of mono and divalent ions on RNA conformational equilibria. Methods Enzymol. 2009, 469: 433463.View ArticlePubMedPubMed CentralGoogle Scholar
 Mathews D, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999, 288: 911940. 10.1006/jmbi.1999.2700.View ArticlePubMedGoogle Scholar
 Baraniak AP, Lasda EL, Wagner EJ, GarciaBlanco MA: A stem structure in fibroblast growth factor receptor 2 transcripts mediates celltypespecific splicing by approximating intronic control elements. Mol Cell Biol. 2003, 23: 93279337. 10.1128/MCB.23.24.93279337.2003.View ArticlePubMedPubMed CentralGoogle Scholar
 McManus CJ, Graveley BR: RNA structure and the mechanisms of alternative splicing. Curr Opin Genet Dev. 2011, 21: 373379. 10.1016/j.gde.2011.04.001.View ArticlePubMedPubMed CentralGoogle Scholar
 Amman F, Bernhart S, Doose D, Hofacker I, Qin J, Stadler P, Will S: The Trouble with LongRange Base Pairs in RNA Folding. Lecture Notes in Computer Science: Advances in Bioinformatics and Computational Biology, Volume 8213. 2013, Berlin, Heidelberg, New York: SpringerVerlag, 111.Google Scholar
 Celotto A, Graveley B: Exonspecific RNAi: a tool for dissecting the functional relevance of alternative splicing. RNA. 2002, 8 (6): 718724. 10.1017/S1355838202021064.View ArticlePubMedPubMed CentralGoogle Scholar
 Dufour D, MateosGomez PA, Enjuanes L, Gallego J, Sola I: Structure and functional relevance of a transcriptionregulating sequence involved in coronavirus discontinuous RNA synthesis. J Virol. 2011, 85 (10): 49634973. 10.1128/JVI.0231710.View ArticlePubMedPubMed CentralGoogle Scholar
 Giegerich R, Meyer C: Algebraic dynamic programming. Algebraic Methodology And Software Technology. 2002, Berlin, Heidelberg, New York: SpringerVerlag, 349364.View ArticleGoogle Scholar
 Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27 (8): 10761085. 10.1093/bioinformatics/btr090.View ArticlePubMedGoogle Scholar
 Lorenz R, Bernhart S, Qin J, Honer zu, Siederdissen C, Tanzer A, Amman F, Hofacker I: 2D meets 4G: GQuadruplexes in RNA Secondary Structure Prediction. IEEE/ACM Trans Comput Biol Bioinformatics. doi:10.1109/TCBB.2013.7,Google Scholar
 Li AX, Marz M, Qin J, Reidys CM: RNARNA interaction prediction based on multiple sequence alignments. Bioinformatics. 2011, 27 (4): 456463. 10.1093/bioinformatics/btq659.View ArticlePubMedGoogle Scholar
 Senter E, Sheikh S, Dotu I, Ponty Y, Clote P: Using the fast fourier transform to accelerate the computational search for RNA conformational switches. PLoS ONE. 2012, 7 (12): e5050610.1371/journal.pone.0050506.View ArticlePubMedPubMed CentralGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.