 Research
 Open
 Published:
Graphdistance distribution of the Boltzmann ensemble of RNA secondary structures
Algorithms for Molecular Biologyvolume 9, Article number: 19 (2014)
Abstract
Background
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. PremRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Result
Spatial distance can be approximated by the graphdistance in RNA secondary structure. We show here that the equilibrium distribution of graphdistances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n^{6}D^{5}) for sequence length n and D distinct distance values, it is possible to reduce this to O(n^{4}) for practical applications in which predominantly small distances are of of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several reallife applications, in particular since these primarily concern longrange interactions in very large RNA molecules.
Conclusions
The graphdistance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graphdistance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.unijena.de/RNAgraphdist.html.
Background
The distance distribution within an RNA molecule is of interest in various contexts. Most directly, the question arises whether panhandlelike structures (in which 3’ and 5’ ends of long RNA molecules are placed in close proximity) are the rule or an exception. Panhandles have been reported in particular for many RNA virus genomes. Several studies [1–4] agree based on different models that the two ends of singlestranded RNA molecules are typically not far apart. On a more technical level, the problem to compute the partition function over RNA secondary structures with given endtoend distance d, usually measured as the number of external bases (plus possibly the number of structural domains) arises for instance when predicting nucleic acid secondary structure in the presence of singlestranded binding proteins [5] or in models of RNA subjected to pulling forces (e.g. in atom force microscopy or export through a small pore) [6–8]. It also plays a role for the effect of loop energy parameters [9].
In contrast to the endtoend distance, the graphdistance between two arbitrarily prescribed nucleotides in a larger RNA structure does not seem to have been studied in any detail. However, this is of particular interest in the analysis of singlemolecule fluorescence resonance energy transfer (smFRET) experiments [10]. This technique allows to monitor the distance between two dyelabeled nucleotides and can reveal details of the kinetics of RNA folding in real time. It measures the nonradiative energy transfer between the dyelabeled donor and acceptor positions. The efficiency of this energy transfer, E_{ fret }, strongly depends on the spatial distance R according to ${E}_{\text{fret}}={R}_{0}^{6}/({R}_{0}^{6}+{R}^{6})$. The Förster radius R_{0} sets the length scale, e.g. R_{0}≈54 Å for the Cy3Cy5 dye pair. A major obstacle is that, at present, there is no general and efficient way to link smFRET measurements to interpretations in terms of explicit molecular structures. To solve this problem, a natural first step is to compute the distribution of spatial distances for an equilibrium ensemble of 3D structures. Since this is not feasible in practice despite major progress in the field of RNA 3D structure prediction [11], we can only resort to considering the graphdistances on the ensemble of RNA secondary structures instead. From a computer science point of view, furthermore, we show here that the distance distribution can be computed exactly using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graphdistance can be related to the smFRET data such as those reported by [12] and help to explain effects of RNA structures in premRNA splicing and viral subgenomic RNA species.
Theory
RNA secondary structures
An RNA secondary structure is a vertex labeled outerplanar graph G(V,x,E), where V = {1,2,…,n} is a finite ordered set (of nucleotide positions) and x : {1,2,…,n} → {A,U,G,C},i ↦ x_{ i } assigns to each vertex at position i (along the RNA sequence from 5’ to 3’) the corresponding nucleotide x_{ i }. We write x = x_{1}…x_{ n } for the sequence underlying secondary structure and use x[ i…j] = x_{ i }…x_{ j } to denote the subsequence from i to j. The edge set E is subdivided into backbone edges of the form {i,i + 1} for 1 ≤ i<n and a set B of base pairs satisfying the following conditions:

(i)
If {i,j} ∈ B then x _{ i } x _{ k }∈ {GC,CG,AU,UA,GU,UG};

(ii)
If {i,j} ∈ B then j  i > 3;

(iii)
If {i,j},{i,k} ∈ B then j = k;

(iv)
If {i,j},{k,l} ∈ B and i < k < j then i < l < j.
The first condition allows base pairs only for WatsonCrick and GU base pairs. The second condition implements the minimal steric requirement for an RNA to bend back on itself. The third condition enforces that B forms a matching in the secondary structure. The last condition (nesting condition) forbids crossing base pairs, i.e. pseudoknots.
The nesting condition results in a natural partial order in the set of base pairs B defined as {i,j} ≺ {k,l} if k < i < j < l. In particular, given an arbitrary vertex k, the set B_{ k }= {{i,j} ∈ Bi ≤ k ≤ j} of base pairs enclosing k is totally ordered. Note that k is explicitly allowed to be incident to its enclosing base pairs. A vertex k is external if B_{ k }= ∅ . A base pair {k,l} is external if B_{ k }= B_{ l }= {{k,l}}.
Consider a fixed secondary structure G, for a given base pair {i,j} ∈ B, we say a vertex k is accessible from {i,j} if i < k < j and there is no other pair {i^{′},j^{′}} ∈ B such that i < i^{′} < k < j^{′} < j. The unique subgraph ${\mathcal{\mathcal{L}}}_{i,j}$ induced by i, j, and all the vertices accessible from {i,j} is known as the loop of {i,j}. The type of a loop ${\mathcal{\mathcal{L}}}_{i,j}$ is unique determined depending on whether {i,j} is external or not, and the numbers of unpaired vertices and base pairs. For details, see [13]. Each secondary structure G has a unique set of loops $\left\{{\mathcal{\mathcal{L}}}_{i,j}\right\{i,j\}\in B\}$, which is called the loop decomposition of G. The free energy f(G) of a given secondary structure, according to the standard energy model [14], is defined as the sum of the energies of all loops in its unique loop decomposition.
The relative location of two vertices v and w in G is determined by the base pairs B_{ v } and B_{ w } that enclose them. If B_{ v }∩ B_{ w }≠ ∅, there is a unique ≺minimal base pair {i_{v,w},j_{v,w}} that encloses both vertices and thus a uniquely defined loop ${\mathcal{\mathcal{L}}}_{\{{i}_{v,w},{j}_{v,w}\}}$ in the loop associated with v and w. If B_{ v }∖ B_{ w }= ∅ or B_{ w }∖ B_{ v }= ∅ then v or w is unpaired and part of ${\mathcal{\mathcal{L}}}_{\{{i}_{v,w},{j}_{v,w}\}}$. Otherwise, i.e. B_{ v }∩ B_{ w }= ∅, there are uniquely defined ≺maximal base pairs {k_{ v },l_{ v }} ∈ B_{ v }∖ B_{ w } and {k_{ w },l_{ w }} ∈ B_{ w }∖ B_{ v } that enclose v and w, respectively. We note that B_{ v }∖ B_{ w } (B_{ w }∖ B_{ v }) may be empty, in which case {k_{ v },l_{ v }} ({k_{ w },l_{ w }}) is also empty. This simple partition holds the key to computing distance distinguished partition functions below.
In the following, we assign the weights a for backbone edges and b for base pairs, respectively. Given a path p, we define the weight of the path d(p) as the sum of the weights of edges in the path. The (weighted) graphdistance${d}_{v,w}^{G}$ in G is defined as the weight of the path p connecting v and w with d(p) being minimal. For the weights, we require the following condition:

(W) If i and j are connected by an edge, then {i,j} ∈ E is the unique shortest path between i and j.
This condition ensures that single edges cannot be replaced by detours of shorter weight. Condition (W) and property (ii) of the secondary structure graphs implies b < 3a because the closing base pair must be shorter than a hairpin loop. Furthermore, considering a stacked pair we need b < b + 2 a, i.e. a > 0. We allow the degenerate case b = 0 that neglects the traversals of base pairs.
Before we continue with the calculations of the partition function, let us first consider the problem formulation in more detail. For the FRET application, it is wellknown that FRET efficiency is correlated with spatial distance. Furthermore, only a limited range of distance changes (e.g. 20 Å100 Å for Cy3Cy5) can be reported by the FRET experiments. Thus a more useful formulation of our problem is not to use the full expected quantity for all positions. Instead, we are interested in the average for all distancevalues within some threshold θ_{ d }. As the space and time complexity will depend on the number of distances we consider, we will parametrise our complexity by the number of nucleotides n and the number of distances considered D = θ_{ d }+ 1, as well. In the worst case, there is D = O (n). However, given that in practice only a limited range of distance changes are considered, we rather view D = O(1) as a small constant in our contribution.
Boltzmann distribution of graphdistances
For a fixed structure G, ${d}_{v,w}^{G}$ is easy to compute. Here, we are interested in the distribution $\mathit{\text{Pr}}\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]$ and its expected value ${d}_{v,w}=E\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]$ over the ensemble of all possible structures G for a given sequence x. Both quantities can be calculated from the Boltzmann distribution P r[ Gx] = e^{ f (G)/RT}/Q where $Q=\sum _{G}{e}^{f\left(G\right)/\mathit{\text{RT}}}$ denotes the partition function of the ensemble of structures. As first shown in [15], Q and related quantities can be computed in quartic time. A reduction to a cubic algorithm may be obtained if the free energy of long interior loops may be regarded as prohibitive. This restriction has been widely used for long sequences [16]. Cubic runtime can also be achieved for some but not all parametrizations of interior loop energies [17].
A crucial quantity for our task is the restricted partition function
for a given pair v,w of positions in a given RNA sequence x. A simple computation (Appendix A in Additional file 1) verifies that the $\mathit{\text{Pr}}[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}dx]={Z}^{v,w}[\phantom{\rule{0.3em}{0ex}}d]/Q$ and ${d}_{v,w}=E\left[\phantom{\rule{0.3em}{0ex}}{d}_{v,w}^{G}\rightx]=\sum _{d}({Z}^{v,w}\left[\phantom{\rule{0.3em}{0ex}}d\right]/Q)d$. Hence it suffices to compute Z^{v,w}[d] for any 1 ≤ d ≤ n. In the following sections we show that this can be achieved by a variant of McCaskill’s approach [15].
For the ease of presentation we describe in the following only the recursion for the simplified energy model for the “circular maximum matching”, in which energy contributions are associated with individual base pairs rather than loops. Our approach can be easily extended to the full model by using separating the partition functions into distinct cases for the loop types.
We use the letters Z and Y to denote partition functions with distance constraints, while Q is used for quantities that appear in McCaskill’s algorithm and are considered as precomputed here. For instance, let ${Q}_{i,j}^{B}$ denote the partition function over all secondary structures on x[ i..j] that are enclosed by the base pair {i,j}. We will later also need the partition function Q_{i,j} over the subsequence x [ i..j], regardless of whether {i,j} is paired or not. In Additional file 1: Appendix C, we summarize the notations frequently used in our contribution.
Recursions of Z^{v,w}[ d]: The case when v and ware external
An important special case assumes that both v and w are external. This is the case e.g. when v and w are binded by proteins. In particular, the problem of computing endtoend distances, i.e., v = 1 and w = n, is of this type.
Assuming (W), the shortest path between two external vertices v,w consists of the external vertices and their backbone connections together with the external base pairs. We call this path the inside path of i,j since it does not involve any vertices “outside” the subsequence x [ i..j].
For efficiently calculating the internal distance between any two vertices v,w, we denote by ${Z}_{i,j}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ the partition function over all secondary structures on x [ i..j] with distance exactly d.
Now note that any structure on x[ i..j] starts either with an unpaired base or with a base pair connecting i to some position k satisfying i < k ≤ j. In the first case, we have ${d}_{i,j}^{G}={d}_{i,i+1}^{G}+{d}_{i+1,j}^{G}$ where ${d}_{i,i+1}^{G}=a$. In the second case, there exists ${d}_{i,j}^{G}={d}_{i,k}^{G}+{d}_{k,k+1}^{G}+{d}_{k+1,j}^{G}$ with ${d}_{i,k}^{G}=b$ and ${d}_{k,k+1}^{G}=a$. Thus, ${Z}_{i,j}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ can be split as follows,
This gives the recursion
with the initialization ${Z}_{\mathit{\text{ii}}}^{I}\left[\phantom{\rule{0.3em}{0ex}}0\right]=1$ and ${Z}_{\mathit{\text{ii}}}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]=0$ for d > 0. For consecutive vertices, we have ${Z}_{i,i+1}^{I}\left[\phantom{\rule{0.3em}{0ex}}a\right]=1$ and ${Z}_{i,i+1}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]=0$ for d ≠ a. These recursions have been derived in several different contexts, e.g. force induced RNA denaturations [6], the investigate of loop entropy dependence [9], the analysis of FRET signals in the presence of singlestranded binding proteins [5], as well as in mathematical studies of RNA panhandlelike structures [3, 4].
In the following, it will be convenient to define also a special term for the empty structure. Setting ${Z}_{i,i1}^{I}[\phantom{\rule{0.3em}{0ex}}a]=1$ and ${Z}_{i,i1}^{I}\left[\phantom{\rule{0.3em}{0ex}}d\right]=0$ for d ≠ a allows us to formally write an individual backbone edge as two edges flanking the empty structure and hence to avoid the explicit treatment of special cases. This definition of Z^{I} also includes the case that i and j are base paired in the recursion (1). This is covered by the case k=j, where we evaluate ${Z}_{j+1,j}^{I}[\phantom{\rule{0.3em}{0ex}}dba]$. Since d = b is the only admissible value here, this refers to ${Z}_{j+1,j}^{I}[\phantom{\rule{0.3em}{0ex}}a]$, which has the correct value of 1 due to our definition. Later on, we will also need Z^{I} under the additional condition that the path starts and ends with a backbone edge. We therefore introduce ${Z}^{{I}^{\prime}}$ defined as by
Note that if ${Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ is called with j = i + 1, then we call ${Z}_{i+1,i}^{I}[\phantom{\rule{0.3em}{0ex}}d2a]$. The only admissible value again is the correct value d=a. In sum, we have the following
This recursion requires O(n^{3}D) time and O(n^{2}D) space. It is possible to reduce the complexity of computing the expected distance in this special case by a linear factor. The trick is to use conditional probabilities for arcs starting at i or the conditional probability for i to be singlestranded, which can be determined from the partition function for RNA folding [3], see Additional file 1: Appendix B.
Recursions of Z^{v,w}[ d]: the general case
The distance between two positions v and w that are covered by an arc can be realized by both inside paths and outside paths. Here, “outside” emphasizes that the shortest path between two positions v and w contains vertex does not belongs to x[ v,w]. This case complicates the algorithmic approach, since both types of paths must be controlled simultaneously. Consider Figure 1, the shortest path between the green and blue regions includes some vertices outside the interval between these two regions. The basic idea is to generalize Equation (1) to computing the partition function Z^{v,w}[ d]. The main question now becomes how to recurse over decompositions of both the inside and the outside paths.
Figure 1 shows that the outside paths are important for the green region, i.e., the region that is covered by an arc. Hence, we have to consider the different cases that the two positions v and w are covered by arcs. The set Ω of all secondary structures on xcan be divided into two disjoint subclasses that have to be treated differently:

: v and w are not enclosed in a common base pair, i.e., B_{ v }∩ B_{ w }= ∅.

: there is a base pair enclosing both v and w, i.e., B_{ v }∩ B_{ w }≠ ∅.
Note that this bipartition explicitly depends on v and w. In the following, we will first introduce the recursions that are required in Ω_{0} structures to compute Z^{v,w}[ d].
Contribution of Ω_{0} structures to Z^{v,w}[ d]: ${Z}_{0}^{v,w}\left[\phantom{\rule{0.3em}{0ex}}d\right]$
One example of this case is given in Figure 1 with the red and blue region, where v (vertex in green region) is covered by an arc, and w (vertex in blue region) is external. Denote the ≺maximal base pair enclosing v by {i,j}. Since at most one of v and w is covered by an arc, we know that j < w. Hence, every path p from v to w, and hence also the shortest paths (not necessarily unique) must run through the right end j of the arc {i,j}. More precisely, there must subpaths p_{1} and p_{2} with d (p) = d(p_{1}) + d (p_{2}) + a such that $v\stackrel{p}{\u21dd}w\to v\stackrel{{p}_{1}}{\u21dd}j(j+1)\stackrel{{p}_{2}}{\u21dd}w$, where $i\stackrel{p}{\u21dd}j$ denotes that p is a shortest path from i to j and  denotes a single backbone edge. For the shortest path from v to j, it consists either of a shortest path $v\stackrel{{p}^{\prime}}{\u21dd}i$ and the arc {i,j}, or it goes directly to j without using the arc {i,j}.
How does this distinction translate to the partition function approach? If we want to calculate the contribution of this case to the partition function Z^{v,w}[ d], we have to split both the sequence x[ i,w] and distance d as follows
a.)
where ${Z}_{j,w}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}_{2}\right]$ is the partition function starting and ending with a singlestranded base as defined in Equation (2), and ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ is the partition function consisting of all structures of x[ i,j] containing the base pair {i,j} with the property that the shortest path from v to i has length d_{ ℓ } and the shortest path from v to j has length d_{ r }. In addition, d, d_{ r } and d_{2} must satisfy d = d_{ r }+ d_{2}.
The remaining cases for the contribution of the class Ω_{0} to Z^{v,w}[ d] are given by all other possible combinations of v and w being singlestranded or being covered by an arc, i.e.,
To simplify, we extend the definition of ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ by setting ${Z}_{v,v}^{B,v}[\phantom{\rule{0.3em}{0ex}}0,0]=1$ and ${Z}_{v,v}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]=0$ for d_{ ℓ }+ d_{ r }> 0. This allows us to conveniently model all cases where either v or w are external, i.e., a.), b.), and d.), as special cases of c.).
In case c.), we have to split the distance d into five subdistances ${d}_{l},{d}_{r},{d}_{l}^{\prime},{d}_{r}^{\prime},{d}_{I}$, in which d_{ I } can be retrieved from the first four distances. Furthermore, we would require four splitting positions for the sequence for all possible combinations of i,j,k,l. A naïve implementation of this idea would result in an algorithm with time complexity O(n^{6}D^{5}) and space complexity O(n^{2}D^{2}).
A careful inspection shows, however, that the split of the distances for the arcs into d_{ ℓ } and d_{ r } is unnecessary. Since we want to know only distance to the left/right end, we can simply introduce two matrices ${Z}_{i,j}^{B,v,\ell}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ and ${Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ that store these values. These matrices can be generated from ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ as follows:
Analogously, we compute ${Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}d\right]$. In this way, we split the distance d into three contributions and we require four splitting positions for the sequence for all possible combinations of i,j,k,ℓ.
Therefore, the contribution to Z^{v,w}[ d] for structures in Ω^{0} is given by
Note that for splitting the distance, we reuse the same indices (e.g., the j in ${Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}{d}_{1}\right]\xb7{Z}_{j,k}^{{I}^{\prime}}[\phantom{\rule{0.3em}{0ex}}d({d}_{1}+{d}_{2}\left)\right]$, where as for the remaining partition function, we use successive indices (e.g.,the i in ${Q}_{1,i1}\xb7{Z}_{i,j}^{B,v,r}\left[\phantom{\rule{0.3em}{0ex}}{d}_{1}\right]$). This difference comes from the fact that splitting a sequence into subsequences is done naturally between two successive indices, whereas splitting a distance is naturally done by splitting at an individual position. We have only to guarantee that the substructures which participate in the split do agree on the structural context of the split position. This is guaranteed by requiring that ${Z}^{{I}^{\prime}}$ starts and ends with a backbone edge. We note that the incorporation of the full dangling end parameters makes is more tedious to handle the splitting positions.
This results in a complexity of O(n^{6}D^{3}) time and O(n^{2}D) space. However, we do not need to split in i,j,k,l simultaneously. Instead, we could split case (c) at position j and introduce for all v ≤ j and k ≤ w the auxiliary variables
Finally, we can replace recursion (3) by
We thus arrive at O(n^{3}D^{2}) time and O(n^{2}D) space complexity for the contribution of Ω_{0} structures to Z^{v,w}[ d], excluding the complexity of computing ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$.
Contribution of Ω_{1} structures to Z^{v,w}[ d]
Ω_{1} contains all cases where v and w are covered by a base pair. In the following, let {p,q} be the ≺minimal base pair covering v and w. In principle, this case looks similar to the case for Ω_{0}. However, we have to take into considerations the paths between v and w over the base pair {p,q}. Thus, we need to store the partition function for all inside and outside for each ≺minimal arc {p,q} that covers v and w, which we will call ${Z}_{p,q}^{v,w}[\phantom{\rule{0.3em}{0ex}}{d}_{O},{d}_{I}]$. In principle, a similar recursion as defined for ${Z}_{0}^{v,w}$ in equation (3) can be derived, with the additional complication since we have to take care of the additional outside distance due to the arc {p,q}. Thus, we obtain the following splitting:
Again we can avoid the complexity of simultaneously splitting at {i,j}and {k,l} by doing a major split after j. Thus, we get the following picture,
which leads to the following equivalent recursions:
Overall, we get the following recursion:
We can now define Z^{v,w}[ d] by
where ${\hat{Q}}_{p,q}^{b}$ is the external partition function over all structures on the union of the intervals x[ 1..p] ∪ x[ q..n] so that {p,q} is a base pair. Since the base pair probability can be written as $\mathit{\text{Pr}}\left(\right\{p,q\left\}\right)=\frac{{\hat{Q}}_{p,q}^{b}{Q}_{p,q}^{b}}{Q}$, this quantity can be calculated as ${\hat{Q}}_{p,q}^{b}=\frac{\mathit{\text{Pr}}\left(\right\{p,q\left\}\right)Q}{{Q}_{p,q}^{b}}$. The base pair probability P r ({p,q}), and the partition functions Q and ${Q}_{p,q}^{b}$ are computed by means of McCaskill’s algorithm.
This part now has a complexity of O(n^{2}D^{2}) space and O(n^{3}D^{4}) time. For practical applications, however, we do not need to consider all possible {p,q}. Instead, there are only few base pairs that are likely to form and that cover v,w, especially for v,w where the internal distance of v,w is large enough such that an outside path has to be considered at all. If we assume a constant number of such longrange basepairs, then the complexity is reduced by an n^{2}factor. For the complexity in terms of distance, recall that D is typically small.
Recursions for ${Z}_{i,j}^{B,v}[{d}_{\ell},{d}_{r}]$
So far, we have used ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$ as a black box. In order to compute these terms, we distinguish the limiting cases a.) v = i, b.) v = j, c.) is external from the generic case d.):
Starting from the limiting cases, we initialize ${Z}_{v,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}0,{d}_{r}]$ as follows:
and analogously for ${Z}_{i,v}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},0]$. Furthermore, ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}0,0]=0$ for i ≠ v ≠ j. These conventions allow us to model all cases as special cases of d). Our key observation here is that the dependency between d_{ ℓ } and d_{ r } can be used to reduce the time complexity. Instead of using the variables d_{ ℓ } and d_{ r } in ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{r}]$, we use the pair d_{ ℓ },d_{add} in ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{\ell}+{d}_{\text{add}}]$. Similarly, we use ${d}_{\ell}^{\prime},{d}_{\text{add}}^{\prime}$ instead of ${d}_{\ell}^{\prime},{d}_{r}^{\prime}$ for the inner base pair, which then determines completely the splitting the distances. This results in an recursion for ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{\ell}+{d}_{\text{add}}]$ with complexity $O\left({n}^{4}{D}^{2}{c}_{b}^{2}\right)$ time and O (n^{2}D c_{ b }) space. To be precise, there are three subcases as follows.
The values that are chosen to split d_{ ℓ } and d_{add} are indicated in green and blue. When the arc {i,j} is colored violet, then there is a shortest path that does not use the distance marked in red but uses the other direction together with the arc {i,j}. If b<d_{ add }<+b, then we know that neither a shortest path $v\stackrel{p}{\u21dd}i$ nor $v\stackrel{p}{\u21dd}j$ uses the arc {i,j}. The left distance is thus given by ${d}_{\ell}{d}_{\ell}^{\prime}$. Using the shortcuts d_{ r }= d_{ ℓ }+ d_{add} and ${d}_{r}^{\prime}={d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}$, then the distance between l and j must be ${d}_{r}{d}_{r}^{\prime}=({d}_{\ell}+{d}_{\text{add}})\left({d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}\right)$. If, on the other hand, d_{add} = + b, then we know that there is at least one shortest path that can be composed by using a shortest path $v\u21ddi$, followed by the arc {i,j}. This of course implies that the shortest path $v\stackrel{p}{\u21dd}j$ is has exactly the length d_{ ℓ }+ b, or is larger. For a subpath $l+1\stackrel{{p}^{\prime}}{\u21dd}j$ this implies that the length is greater or equal $d={d}_{r}{d}_{r}^{\prime}=\left({d}_{\ell}+b\right)\left({d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}\right)$. Thus, we just have to add all partition functions ${Z}_{k,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$ with d^{′} > d. This can be done efficiently by using a precalculated matrix ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]$, which is defined as $\sum _{{d}^{\prime}\ge d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$. Note that ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]$ can also be defined if we restrict in all recursion the distance d to a threshold θ_{ d }, since ${Z}_{i,j}^{{I}^{\prime}\ge}\left[\phantom{\rule{0.3em}{0ex}}d\right]=\sum _{{d}^{\prime}\ge d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]={Q}_{i,j}^{\prime}\sum _{{d}^{\prime}<d}{Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$. In which, where ${Q}_{i,j}^{\prime}$ is Q_{i + 1,j  1} if j > i + 1, 1 if j = i + 1 and 0 otherwise. Note, furthermore, that all ${Z}_{i,j}^{{I}^{\prime}}\left[\phantom{\rule{0.3em}{0ex}}{d}^{\prime}\right]$ for d^{′}< d ≤ θ_{ d } are calculated when we restrict the distance to θ_{ d }.
Finally, if d_{ad d}=  b, then the shortest path $l\stackrel{p}{\u21dd}j$ has distance $({d}_{\ell}b)\left({d}_{\ell}^{\prime}+{d}_{\text{add}}^{\prime}\right)$. For the shortest path $k\stackrel{p}{\u21dd}i$, we know that it has length ${d}_{\ell}{d}_{\ell}^{\prime}$ or greater, which can be resolved by again using ${Z}_{i,k1}^{{I}^{\prime}\ge}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell}{d}_{\ell}^{\prime}]$. Thus, we get the following optimized recursion for ${Z}_{i,j}^{B,v}[\phantom{\rule{0.3em}{0ex}}{d}_{\ell},{d}_{\ell}+{d}_{\text{add}}]$ with d_{ ℓ }≠ 0 and d_{ ℓ }+ d_{add} ≠ 0:
Discussion and applications
The theoretical analysis of the distance distribution problem shows that, while polynomialtime algorithms exist, they probably cannot be improved to space and time complexities that make them widely applicable to large RNA molecules. Due to the unfavorable time complexity of the current algorithm and the associated exact implementation in C, a rather simple and efficient sampling algorithm has been implemented. We resort to sampling Boltzmannweighted secondary structures with RNAsubopt p[16], which uses the same stochastic backtracing approach as sfold[18]. As the graphdistance for a pair of nucleotides in a given secondary structure can be computed in O(n logn) time by Dijkstra’s algorithm with Fibonacci heap [19], even large samples can be evaluated efficiently.
As we pointed out in the introduction, the graphdistance measure introduced in this paper can serve as a first step towards a structural interpretation of smFRET data. As an example, we consider the graph distance distribution of a DielsAlderase (DAse) ribozyme (Figure 2A). Histograms of smFRET efficiency (E_{ fret }) for this 49 nt long catalytic RNA are reported in [12] for a large number of surfaceimmobilized ribozyme molecules as a function of the Mg ^{2+} concentration in the buffer solution. A sketch of their histograms is displayed in Figure 2B. The dyes are attached to sequence positions 6 (Cy3) and 42 (Cy5) and hence do not simply reflect the endtoend distance, Figure 2A(c). In this example, we observe the expected correspondence small graphdistances with a strong smFRET signal. This is a particular interesting example, since the minimal free energy (mfe) structure (Figure 2A(a)) predicted with RNAfold is not identified with the real secondary structure (Figure 2A(c)). In fact, the ground state secondary structure is ranked as the 3rd best suboptimal structure derived via RNAsubopt e. The free energy difference between these two structures is only 0.1 kcal/mol. However, their graphdistances show a relatively larger difference. The 2nd best suboptimal structure (Figure 2A(b)) looks rather similar with the 3rd structure, in particular, they share the same graphdistance value.
The smFRET data of [12] indicates the presence of three subpopulations, corresponding to three different structural states: folded molecules (state F), intermediate conformation (state I) and unfolded molecules (state U). In the absence of Mg ^{2+}, the I state dominates, and only small fractions are found in states U and F. Unfortunately, the salt dependence of RNA folding is complex [21, 22] and currently is not properly modeled in the available folding programs. We can, however, make use of the qualitative correspondence of low salt concentrations with high temperature. In Figure 2C we therefore recompute the graphdistance distribution in the ensemble at an elevated temperature of 50°C. Here, the real structure becomes the second best structure with free energy 10.82 kcal/mol and we observe a much larger fraction of (nearly) unfolded structures with longer distances between the two beacon positions. Qualitatively, this matches the smFRET data showed in Figure 2B.
Furthermore, for a given pair v,w of positions in a given RNA sequence x, the importance I_{v,w}(e) of a backbone edge or base pair e in calculating the graphdistance distribution is evaluated by ${I}_{v,w}\left(e\right)=\sum _{e\in {\Pi}_{e}}\mathit{\text{Pr}}\left[\phantom{\rule{0.3em}{0ex}}G\rightx]$, where the set Π_{ e } comprises the secondary structures G with (at least) one shortest path between v and w that runs through e. Figure 3 compares dot plots of I_{v,w}(e) with the basepair probabilities in the RNA structure ensemble of the DAse ribozyme at temperatures 37°C and 50°C. Since RNAgraphdist computes only one of possible many shortest paths for each G, hence we obtain only a lower bound on I_{v,w}(e).
We observe for DAse that the contributions from the backbone edges are larger than the base pairs at both temperatures. For T = 37°C, there are in total 14 edges with I_{6,42}(e) > 0.4. Only two of them, 5(C)–18(G) and 2(G) – 21(C) are base pairs. For T = 50°C, there is only the pair 5(C) – 18(G) is heavily used (I_{6,42}(5,18) = 0.636). Combining the analysis of data illustrated in Figure 2, it may indicate that the existences of two base pairs, 2(G) – 21(C) and 28(G) – 39(C) can affect the graphdistance distribution of RNA secondary structure ensemble and consequently affect smFRET measurements. Such constraints may become an interesting source of constraints for RNA structure prediction.
In addition, we compute the distribution of paths which pass through positions outside sequence interval x[ 6h,42+h] of DAse ribozyme. As illustrated in Figure 4, this “outsidepath” distribution, as expected, drops fast to 0 with respect to h.
Longrange interactions play an important role in premRNA splicing and in the regulation of alternative splicing [23–25], bringing splice donor, acceptor, branching site into close spatial proximity. Figure 5A shows for D. melanogaster premRNAs that the distribution of graphdistances between donor and acceptor sites shifted towards smaller values compared to randomly selected pairs of positions with the same distance. Due to the insufficiency of the spacialdistance information of structural elements in the secondary structures, we artificially choose a = b = 1 in our experiments. Although the effect is small, it shows a clear difference between the real RNA sequences and artificial sequences that were randomized by dinucleotide shuffling. Furthermore, Table 1 displays for a specific intron CG16979RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3), the most probable secondary structures in the subensembles of secondary structures such that their graphdistances are 7, 6, and 14, respectively.
The Drosophila melanogaster Down syndrome cell adhesion molecules (DSCAM) encodes for 38.016 different mRNAs by alternative splicing. Among the 24 exons, exon 4 alone has 12 variants [26]. In Figure 6 we display the graphdistance from donor (exon 3) to any downstream position until acceptor (exon 5). Comparing the graphdistances of all twelve acceptors of exon 4, we see clearly local peaks. This suggests the acceptor being part of hairpin loops, three dimensionally poking out of the long transcript to interact easily with the spliceosome and donor. Four of the twelve acceptor sites show no local peak, however seem to be accessible as internal loops of longer hairpins.
The spatial organization of the genomic and subgenomic RNAs is important for the processing and functioning of many RNA viruses. This goes far beyond the wellknown panhandle structures. In Coronavirus the interactions of the 5’ TRSL cisacting element with body TRS elements has been proposed as an important determinant for the correct assembly of the Coronavirus genes in the host [27]. The mechanisms of interaction is unknown, and a small threedimensional distance is suspected. The matrix of expected graphdistances in Figure 5B shows that TRSL and TRSB are indeed placed close to each other. In Table 2, we show the most stable structures within the subensembles of secondary structures such that their graphdistances are 14, 5, and 35, respectively. All these RNA secondary structures brings the leader transcription regulation site (LTRS) in close spatial proximity with the body transcription regulation site (BTRS).
These examples indicate that the systematic analysis of the graphdistance distribution both for individual RNAs and their aggregation over ensembles of structures can provide useful insights into structural influences on RNA function. These may not be obvious directly from the structures due to the inherent difficulties of predicting longrange base pairs with sufficient accuracy and the many issues inherent in comparing RNA structures of very disparate lengths.
Due the complexity of algorithm we have refrained from attempting a direct implementation in an imperative programming language. Instead, we are aiming at an implementation in Haskell that allows us to make use of the framework of algebraic dynamic programming [28]. The graphdistance measure and the associated algorithm can be extended in principle to of RNA secondary structures with additional tertiary structural elements such as pseudoknots [29] and Gquadruples [30]. RNARNA interaction structures [31] also form a promising area for future extensions. We note finally, that the Fourier transition method introduced in [32] could be employed to achieve a further speedup.
Conclusion
The distribution of spatial distances in the equilibrium structure ensemble of an RNA molecule carries information about the overall structure of the molecule. These distance can be approximated by the graphdistance in RNA secondary structure. We introduced a polynomial time algorithm to compute the equilibrium distribution of graphdistances between a fixed pair of nucleotides. For practical applications, small distances are of main interest. Here, the time complexity of the proposed algorithm is O(n^{4}), compared to a naïve implementation with time complexity of O(n^{11}) for sequence length n and distances that can cover the whole sequence length. Since further reductions, however, seem to be difficult, we also introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several reallife applications, in particular since these primarily concern longrange interactions in very large RNA molecules.
References
 1.
Yoffe AM, Prinsen P, Gelbart WM, BenShaul A: The ends of a large RNA molecule are necessarily close. Nucl Acids Res. 2011, 39: 292299. 10.1093/nar/gkq642.
 2.
Fang LT: The endtoend distance of RNA as a randomly selfpaired polymer. J Theor Biol. 2011, 280: 101107. 10.1016/j.jtbi.2011.04.010.
 3.
Clote P, Ponty Y, Steyaert JM: Expected distance between terminal nucleotides of RNA secondary structures. J Math Biol. 2012, 65: 581599. 10.1007/s0028501104678.
 4.
Han HS, Reidys CM: The 5’3’ distance of RNA secondary structures. J Comput Biol. 2012, 19: 867878. 10.1089/cmb.2011.0301.
 5.
Forties RA, Bundschuh R: Modeling the interplay of singlestranded binding proteins and nucleic acid secondary structure. Bioinformatics. 2010, 26: 6167. 10.1093/bioinformatics/btp627.
 6.
Gerland U, Bundschuh R, Hwa T: Forceinduced denaturation of RNA. Biophys J. 2001, 81: 13241332. 10.1016/S00063495(01)75789X.
 7.
Müller M, Krzakala F, Mézard M: The secondary structure of RNA under tension. Eur Phys J E. 2002, 9: 6777.
 8.
Gerland U, Bundschuh R, Hwa T: Translocation of structured polynucleotides through nanopores. Phys Biol. 2004, 1: 1926. 10.1088/14783967/1/1/002.
 9.
Einert TR, Näger P, Orland H, Netz R: Impact of loop statistics on the thermodynamics of RNA Folding. Phys Rev Lett. 2008, 101: 048103
 10.
Roy R, Hohng S, Ha T: A practical guide to singlemolecule FRET. Nat Methods. 2008, 5: 507516. 10.1038/nmeth.1208.
 11.
Das R, Baker D: Automated de novo prediction of nativelike RNA tertiary structures. Proc Natl Acad Sci USA. 2007, 104: 1466414669. 10.1073/pnas.0703836104.
 12.
Kobitski A, Nierth A, Helm M, Jaschke A, Nienhaus UG: Mg^{2+}dependent folding of a DielsAlderase ribozyme probed by singlemolecule FRET analysis. Nucleic Acids Res. 2007, 35 (6): 20472059. 10.1093/nar/gkm072.
 13.
Schuster P, Fontana W, Stadler PF, Hofacker IL: From sequences to shapes and back: a case study in RNA secondary structures. Proc R Soc London B. 1994, 255 (1344): 27984. 10.1098/rspb.1994.0040.
 14.
Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA. 2004, 101: 72877292. 10.1073/pnas.0401799101.
 15.
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6–7): 11051119.
 16.
Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Alg Mol Biol. 2011, 6: 2610.1186/17487188626.
 17.
Lyngsø RB, Zuker M, Pedersen C: Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 1999, 15: 440445. 10.1093/bioinformatics/15.6.440.
 18.
Ding Y, Lawrence C: A statistical sampling algorithm for RNA secondary structure prediction. Nucl Acids Res. 2003, 31 (24): 72807301. 10.1093/nar/gkg938.
 19.
Fredman M, Tarjan R: Fibonacci heaps and their uses in improved network optimization algorithms. J ACM. 1987, 34 (3): 596615. 10.1145/28869.28874.
 20.
Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25 (15): 19741975. 10.1093/bioinformatics/btp250.
 21.
Leipply D, Lambert D, Draper DE: IonRNA interactions thermodynamic analysis of the effects of mono and divalent ions on RNA conformational equilibria. Methods Enzymol. 2009, 469: 433463.
 22.
Mathews D, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999, 288: 911940. 10.1006/jmbi.1999.2700.
 23.
Baraniak AP, Lasda EL, Wagner EJ, GarciaBlanco MA: A stem structure in fibroblast growth factor receptor 2 transcripts mediates celltypespecific splicing by approximating intronic control elements. Mol Cell Biol. 2003, 23: 93279337. 10.1128/MCB.23.24.93279337.2003.
 24.
McManus CJ, Graveley BR: RNA structure and the mechanisms of alternative splicing. Curr Opin Genet Dev. 2011, 21: 373379. 10.1016/j.gde.2011.04.001.
 25.
Amman F, Bernhart S, Doose D, Hofacker I, Qin J, Stadler P, Will S: The Trouble with LongRange Base Pairs in RNA Folding. Lecture Notes in Computer Science: Advances in Bioinformatics and Computational Biology, Volume 8213. 2013, Berlin, Heidelberg, New York: SpringerVerlag, 111.
 26.
Celotto A, Graveley B: Exonspecific RNAi: a tool for dissecting the functional relevance of alternative splicing. RNA. 2002, 8 (6): 718724. 10.1017/S1355838202021064.
 27.
Dufour D, MateosGomez PA, Enjuanes L, Gallego J, Sola I: Structure and functional relevance of a transcriptionregulating sequence involved in coronavirus discontinuous RNA synthesis. J Virol. 2011, 85 (10): 49634973. 10.1128/JVI.0231710.
 28.
Giegerich R, Meyer C: Algebraic dynamic programming. Algebraic Methodology And Software Technology. 2002, Berlin, Heidelberg, New York: SpringerVerlag, 349364.
 29.
Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27 (8): 10761085. 10.1093/bioinformatics/btr090.
 30.
Lorenz R, Bernhart S, Qin J, Honer zu, Siederdissen C, Tanzer A, Amman F, Hofacker I: 2D meets 4G: GQuadruplexes in RNA Secondary Structure Prediction. IEEE/ACM Trans Comput Biol Bioinformatics. doi:10.1109/TCBB.2013.7,
 31.
Li AX, Marz M, Qin J, Reidys CM: RNARNA interaction prediction based on multiple sequence alignments. Bioinformatics. 2011, 27 (4): 456463. 10.1093/bioinformatics/btq659.
 32.
Senter E, Sheikh S, Dotu I, Ponty Y, Clote P: Using the fast fourier transform to accelerate the computational search for RNA conformational switches. PLoS ONE. 2012, 7 (12): e5050610.1371/journal.pone.0050506.
Acknowledgments
This work was supported in part by the Deutsche Forschungsgemeinschaft proj. nos. BA 2168/33, SFB 992, STA 850/102, SPP 1596 and MA 5082/11, the BMBF (grant 0316165A) and the MWK (grant 7533711.6.1).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Conceived and designed the algorithms: JQ, PFS and RB. Implemented algorithms and performed experiments: JQ and MN. Analyzed DielsAlderase ribozyme data: JQ and PFS. Analyzed premRNA splicing data: MN and MM. Wrote the final manuscript: JQ, MM, PF and RB. All authors read and approved the final manuscript.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Graphdistance
 Boltzmann distribution
 Partition function
 PremRNA splicing
 smFRET