Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures

Qin, Jing; Fricke, Markus; Marz, Manja; Stadler, Peter F; Backofen, Rolf

doi:10.1186/1748-7188-9-19

Research
Open access
Published: 11 September 2014

Algorithms for Molecular Biology volume 9, Article number: 19 (2014)

Abstract

Background

Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.

Results

Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n⁶D⁵) for sequence length n and D distinct distance values, it is possible to reduce this to O(n⁴) for practical applications in which predominantly small distances are of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.

Conclusions

The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.uni-jena.de/RNAgraphdist.html.

Background

The distance distribution within an RNA molecule is of interest in various contexts. Most directly, the question arises whether panhandle-like structures (in which 3’ and 5’ ends of long RNA molecules are placed in close proximity) are the rule or an exception. Panhandles have been reported in particular for many RNA virus genomes. Several studies [1–4] agree, based on different models, that the two ends of single-stranded RNA molecules are typically not far apart. On a more technical level, the problem of computing the partition function over RNA secondary structures with given end-to-end distance d, usually measured as the number of external bases (plus possibly the number of structural domains), arises for instance when predicting nucleic acid secondary structure in the presence of single-stranded binding proteins [5] or in models of RNA subjected to pulling forces (e.g. in atomic force microscopy or export through a small pore) [6–8]. It also plays a role for the effect of loop energy parameters [9].

In contrast to the end-to-end distance, the graph-distance between two arbitrarily prescribed nucleotides in a larger RNA structure does not seem to have been studied in any detail. However, this is of particular interest in the analysis of single-molecule fluorescence resonance energy transfer (smFRET) experiments [10]. This technique allows one to monitor the distance between two dye-labeled nucleotides and can reveal details of the kinetics of RNA folding in real time. It measures the non-radiative energy transfer between the dye-labeled donor and acceptor positions. The efficiency of this energy transfer, E_fret, strongly depends on the spatial distance R according to $E_{fret} = R_{0}^{6} / (R_{0}^{6} + R^{6})$. The Förster radius R₀ sets the length scale, e.g. R₀≈54 Å for the Cy3-Cy5 dye pair. A major obstacle is that, at present, there is no general and efficient way to link smFRET measurements to interpretations in terms of explicit molecular structures. To solve this problem, a natural first step is to compute the distribution of spatial distances for an equilibrium ensemble of 3D structures. Since this is not feasible in practice despite major progress in the field of RNA 3D structure prediction [11], we can only resort to considering the graph-distances on the ensemble of RNA secondary structures instead. From a computer science point of view, furthermore, we show here that the distance distribution can be computed exactly using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to smFRET data such as those reported by [12] and can help to explain effects of RNA structures in pre-mRNA splicing and viral subgenomic RNA species.
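To make the distance dependence of the smFRET signal concrete, the efficiency formula above can be evaluated numerically. The following is a minimal sketch; the function name `fret_efficiency` is only illustrative, and the default R₀ = 54 Å is the Cy3-Cy5 value quoted in the text.

```python
def fret_efficiency(r, r0=54.0):
    """Foerster efficiency E = R0^6 / (R0^6 + R^6); r and r0 in Angstrom."""
    return r0**6 / (r0**6 + r**6)


# Evaluate the efficiency at the two ends of the reportable range and at R = R0.
for r in (20.0, 54.0, 100.0):
    print(f"R = {r:5.1f} A  ->  E_fret = {fret_efficiency(r):.3f}")
```

At R = R₀ the efficiency is exactly 1/2, and because of the sixth power it drops off steeply on either side, which is why only a limited window of distances is informative.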

Theory

RNA secondary structures

An RNA secondary structure is a vertex-labeled outerplanar graph G(V,x,E), where V = {1,2,…,n} is a finite ordered set (of nucleotide positions) and x : {1,2,…,n} → {A,U,G,C}, i ↦ x_i assigns to each vertex at position i (along the RNA sequence from 5’ to 3’) the corresponding nucleotide x_i. We write x = x₁…x_n for the sequence underlying the secondary structure and use x[ i…j] = x_i…x_j to denote the subsequence from i to j. The edge set E is subdivided into backbone edges of the form {i,i + 1} for 1 ≤ i<n and a set B of base pairs satisfying the following conditions:

(i)
If {i,j} ∈ B then x_i x_j ∈ {GC,CG,AU,UA,GU,UG};
(ii)
If {i,j} ∈ B then |j - i| > 3;
(iii)
If {i,j},{i,k} ∈ B then j = k;
(iv)
If {i,j},{k,l} ∈ B and i < k < j then i < l < j.

The first condition allows base pairs only for Watson-Crick and GU base pairs. The second condition implements the minimal steric requirement for an RNA to bend back on itself. The third condition enforces that B forms a matching in the secondary structure. The last condition (nesting condition) forbids crossing base pairs, i.e. pseudoknots.
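Conditions (i) to (iv) translate directly into a validity check. The sketch below assumes that structures are given in dot-bracket notation (a common convention that is not introduced in the text) and uses 1-based positions; the function names are illustrative.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def parse_dot_bracket(db):
    """Return the set of base pairs (i, j), 1-based with i < j, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return pairs


def is_secondary_structure(seq, pairs):
    """Check conditions (i)-(iv) for a set of pairs (i, j) with i < j on sequence seq."""
    paired = set()
    for i, j in pairs:
        if seq[i - 1] + seq[j - 1] not in ALLOWED:  # (i) allowed base pairs only
            return False
        if j - i <= 3:                              # (ii) minimal hairpin size
            return False
        if i in paired or j in paired:              # (iii) B is a matching
            return False
        paired.update((i, j))
    for i, j in pairs:                              # (iv) no crossing pairs
        for k, l in pairs:
            if i < k < j and not (i < l < j):
                return False
    return True


print(is_secondary_structure("GGGAAAACCC", parse_dot_bracket("(((....)))")))
```

The quadratic loop for condition (iv) is chosen for clarity; a dot-bracket string with a single bracket type is nested by construction, so the check matters mainly for pair sets obtained from other sources.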

The nesting condition results in a natural partial order in the set of base pairs B defined as {i,j} ≺ {k,l} if k < i < j < l. In particular, given an arbitrary vertex k, the set B_k= {{i,j} ∈ B|i ≤ k ≤ j} of base pairs enclosing k is totally ordered. Note that k is explicitly allowed to be incident to its enclosing base pairs. A vertex k is external if B_k= ∅ . A base pair {k,l} is external if B_k= B_l= {{k,l}}.

Consider a fixed secondary structure G. For a given base pair {i,j} ∈ B, we say that a vertex k is accessible from {i,j} if i < k < j and there is no other pair {i^′,j^′} ∈ B such that i < i^′ < k < j^′ < j. The unique subgraph $ℒ_{i, j}$ induced by i, j, and all the vertices accessible from {i,j} is known as the loop of {i,j}. The type of a loop $ℒ_{i, j}$ is uniquely determined by whether {i,j} is external or not and by the numbers of unpaired vertices and base pairs it contains. For details, see [13]. Each secondary structure G has a unique set of loops $\{ℒ_{i, j} | \{i, j\} \in B\}$, which is called the loop decomposition of G. The free energy f(G) of a given secondary structure, according to the standard energy model [14], is defined as the sum of the energies of all loops in its unique loop decomposition.

The relative location of two vertices v and w in G is determined by the base pairs B_v and B_w that enclose them. If B_v∩ B_w≠ ∅, there is a unique ≺-minimal base pair {i_v,w,j_v,w} that encloses both vertices and thus a uniquely defined loop $ℒ_{{i_{v, w}, j_{v, w}}}$ associated with v and w. If B_v∖ B_w= ∅ or B_w∖ B_v= ∅ then v or w is unpaired and part of $ℒ_{{i_{v, w}, j_{v, w}}}$. Otherwise, i.e. B_v∩ B_w= ∅, there are uniquely defined ≺-maximal base pairs {k_v,l_v} ∈ B_v∖ B_w and {k_w,l_w} ∈ B_w∖ B_v that enclose v and w, respectively. We note that B_v∖ B_w (B_w∖ B_v) may be empty, in which case the pair {k_v,l_v} ({k_w,l_w}) does not exist. This simple partition holds the key to computing distance-distinguished partition functions below.

In the following, we assign the weights a for backbone edges and b for base pairs, respectively. Given a path p, we define the weight of the path d(p) as the sum of the weights of edges in the path. The (weighted) graph-distance $d_{v, w}^{G}$ in G is defined as the weight of the path p connecting v and w with d(p) being minimal. For the weights, we require the following condition:

(W) If i and j are connected by an edge, then {i,j} ∈ E is the unique shortest path between i and j.

This condition ensures that single edges cannot be replaced by detours of shorter weight. Condition (W) and property (ii) of the secondary structure graphs imply b < 4a, because the closing base pair must be shorter than the backbone path around the smallest hairpin loop, which by (ii) consists of at least four backbone edges. Furthermore, considering a stacked pair we need b < b + 2 a, i.e. a > 0. We allow the degenerate case b = 0 that neglects the traversals of base pairs.
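For a single fixed structure, the weighted graph-distance defined above is an ordinary shortest-path problem. The following sketch uses Dijkstra's algorithm with Python's binary heap; the function name and the representation of B as a set of 1-based pairs are assumptions made for illustration.

```python
import heapq


def graph_distance(n, pairs, v, w, a=1.0, b=1.0):
    """Weighted graph-distance between v and w: backbone edges cost a, base pairs cost b."""
    adj = {i: [] for i in range(1, n + 1)}
    for i in range(1, n):                 # backbone edges {i, i+1}
        adj[i].append((i + 1, a))
        adj[i + 1].append((i, a))
    for i, j in pairs:                    # base pairs {i, j}
        adj[i].append((j, b))
        adj[j].append((i, b))

    dist = {v: 0.0}
    heap = [(0.0, v)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == w:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for nxt, weight in adj[u]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    raise ValueError("w is not reachable from v")


hairpin = {(1, 10), (2, 9), (3, 8)}
print(graph_distance(10, hairpin, 1, 10))   # the ends are joined by one base pair
print(graph_distance(10, set(), 1, 10))     # open chain: backbone only
```

Since every vertex has at most three neighbours, the graph has O(n) edges and the heap-based search runs in O(n log n) time per structure.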

Before we continue with the calculations of the partition function, let us first consider the problem formulation in more detail. For the FRET application, it is well-known that FRET efficiency is correlated with spatial distance. Furthermore, only a limited range of distances (e.g. 20 Å-100 Å for Cy3-Cy5) can be reported by FRET experiments. Thus a more useful formulation of our problem is not to use the full expected quantity for all positions. Instead, we are interested in the average over all distance values within some threshold θ_d. As the space and time complexity will depend on the number of distances we consider, we parametrise our complexity by the number of nucleotides n and by the number of distances considered, D = θ_d+ 1. In the worst case, we have D = O(n). However, given that in practice only a limited range of distances is considered, we rather view D = O(1) as a small constant in our contribution.

Boltzmann distribution of graph-distances

For a fixed structure G, $d_{v, w}^{G}$ is easy to compute. Here, we are interested in the distribution $Pr [d_{v, w}^{G} | x]$ and its expected value $d_{v, w} = E [d_{v, w}^{G} | x]$ over the ensemble of all possible structures G for a given sequence x. Both quantities can be calculated from the Boltzmann distribution P r[ G|x] = e^{- f (G)/RT}/Q where $Q = \sum_{G} e^{- f (G) / RT}$ denotes the partition function of the ensemble of structures. As first shown in [15], Q and related quantities can be computed in quartic time. A reduction to a cubic algorithm may be obtained if the free energy of long interior loops may be regarded as prohibitive. This restriction has been widely used for long sequences [16]. Cubic runtime can also be achieved for some but not all parametrizations of interior loop energies [17].

A crucial quantity for our task is the restricted partition function

Z^{v, w} [d] = \sum_{G : d_{v, w}^{G} = d} e^{- f (G) / RT}

for a given pair v,w of positions in a given RNA sequence x. A simple computation (Appendix A in Additional file 1) verifies that $Pr [d_{v, w}^{G} = d | x] = Z^{v, w} [d] / Q$ and $d_{v, w} = E [d_{v, w}^{G} | x] = \sum_{d} (Z^{v, w} [d] / Q) d$. Hence it suffices to compute Z^v,w[d] for any 1 ≤ d ≤ n. In the following sections we show that this can be achieved by a variant of McCaskill’s approach [15].
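The relation between the restricted partition function, the distance distribution and the expected distance can be illustrated on an explicitly enumerated toy ensemble. In the sketch below each structure is represented only by a pair (free energy in kcal/mol, graph-distance); the ensemble values are made up for illustration, and RT is taken at 37°C with R ≈ 0.0019872 kcal/(mol K), both of which are assumptions not fixed by the text.

```python
import math
from collections import defaultdict

RT = 0.0019872 * 310.15  # kcal/mol at 37 C (assumed temperature)


def distance_distribution(ensemble, rt=RT):
    """Return (Pr[d], E[d]) for an ensemble given as (free_energy, distance) pairs."""
    z = defaultdict(float)                      # z[d] plays the role of Z^{v,w}[d]
    for energy, d in ensemble:
        z[d] += math.exp(-energy / rt)
    q = sum(z.values())                         # partition function Q
    prob = {d: zd / q for d, zd in z.items()}   # Pr[d] = Z^{v,w}[d] / Q
    expected = sum(d * p for d, p in prob.items())
    return prob, expected


# Hypothetical toy ensemble: (free energy, graph-distance between v and w).
toy = [(-3.0, 4), (-2.9, 4), (-1.0, 7), (0.0, 12)]
prob, expected = distance_distribution(toy)
print(prob, expected)
```

Explicit enumeration is feasible only for very small ensembles; the recursions in the following sections compute the same quantities without listing structures.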

For ease of presentation, we describe in the following only the recursion for the simplified energy model for the “circular maximum matching”, in which energy contributions are associated with individual base pairs rather than loops. Our approach can be easily extended to the full model by separating the partition functions into distinct cases for the loop types.

We use the letters Z and Y to denote partition functions with distance constraints, while Q is used for quantities that appear in McCaskill’s algorithm and are considered as pre-computed here. For instance, let $Q_{i, j}^{B}$ denote the partition function over all secondary structures on x[ i..j] that are enclosed by the base pair {i,j}. We will later also need the partition function Q_i,j over the sub-sequence x [ i..j], regardless of whether {i,j} is paired or not. In Additional file 1: Appendix C, we summarize the notations frequently used in our contribution.

Recursions of Z^v,w[ d]: the case when v and w are external

An important special case assumes that both v and w are external. This is the case e.g. when v and w are bound by proteins. In particular, the problem of computing end-to-end distances, i.e., v = 1 and w = n, is of this type.

Assuming (W), the shortest path between two external vertices v,w consists of the external vertices and their backbone connections together with the external base pairs. We call this path the inside path of v,w since it does not involve any vertices “outside” the subsequence x [ v..w].

To calculate the inside distance between any two vertices efficiently, we denote by $Z_{i, j}^{I} [d]$ the partition function over all secondary structures on x [ i..j] in which the inside distance between i and j is exactly d.

Now note that any structure on x[ i..j] starts either with an unpaired base or with a base pair connecting i to some position k satisfying i < k ≤ j. In the first case, we have $d_{i, j}^{G} = d_{i, i + 1}^{G} + d_{i + 1, j}^{G}$ where $d_{i, i + 1}^{G} = a$. In the second case, we have $d_{i, j}^{G} = d_{i, k}^{G} + d_{k, k + 1}^{G} + d_{k + 1, j}^{G}$ with $d_{i, k}^{G} = b$ and $d_{k, k + 1}^{G} = a$. Splitting $Z_{i, j}^{I} [d]$ according to these two cases gives the recursion

Z_{i, j}^{I} [d] = Z_{i + 1, j}^{I} [d - a] + \sum_{i < k \leq j} Q_{i, k}^{B} Z_{k + 1, j}^{I} [d - b - a]

(1)

with the initialization $Z_{ii}^{I} [0] = 1$ and $Z_{ii}^{I} [d] = 0$ for d > 0. For consecutive vertices, we have $Z_{i, i + 1}^{I} [a] = 1$ and $Z_{i, i + 1}^{I} [d] = 0$ for d ≠ a. These recursions have been derived in several different contexts, e.g. force-induced RNA denaturation [6], the investigation of loop entropy dependence [9], the analysis of FRET signals in the presence of single-stranded binding proteins [5], as well as in mathematical studies of RNA panhandle-like structures [3, 4].

In the following, it will be convenient to define also a special term for the empty structure. Setting $Z_{i, i - 1}^{I} [- a] = 1$ and $Z_{i, i - 1}^{I} [d] = 0$ for d ≠ -a allows us to formally write an individual backbone edge as two edges flanking the empty structure and hence to avoid the explicit treatment of special cases. This definition of Z^I also includes the case that i and j are base paired in the recursion (1). This is covered by the case k=j, where we evaluate $Z_{j + 1, j}^{I} [d - b - a]$. Since d = b is the only admissible value here, this refers to $Z_{j + 1, j}^{I} [- a]$, which has the correct value of 1 due to our definition. Later on, we will also need Z^I under the additional condition that the path starts and ends with a backbone edge. We therefore introduce $Z^{I^{'}}$ defined by

Z_{i, j}^{I^{'}} [d] = Z_{i + 1, j - 1}^{I} [d - 2 a]

(2)

Note that if $Z_{i, j}^{I^{'}} [d]$ is called with j = i + 1, then we call $Z_{i + 1, i}^{I} [d - 2 a]$. The only admissible value is again d = a, which is the correct distance for a single backbone edge.

This recursion requires O(n³D) time and O(n²D) space. It is possible to reduce the complexity of computing the expected distance in this special case by a linear factor. The trick is to use conditional probabilities for arcs starting at i or the conditional probability for i to be single-stranded, which can be determined from the partition function for RNA folding [3], see Additional file 1: Appendix B.
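Recursion (1), together with the empty-structure convention, can be sketched in a few lines for the simplified model in which every base pair contributes a constant Boltzmann factor. The sketch below is an illustration, not the authors' implementation: `pair_weight` is a hypothetical per-pair factor, the weights a and b are taken as integers, and distances are stored in dictionaries without a threshold θ_d.

```python
from collections import defaultdict

ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def end_to_end_partition(seq, a=1, b=1, pair_weight=2.0):
    """Return (Z, Q): Z[d] is Z^I_{1,n}[d] from recursion (1), Q the partition function."""
    n = len(seq)

    def can_pair(i, j):  # conditions (i) and (ii), 1-based positions
        return j - i > 3 and seq[i - 1] + seq[j - 1] in ALLOWED

    # Q[i][j]: partition function on x[i..j]; entries with j < i stay 1 (empty structure).
    Q = [[1.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i, n + 1):
            total = Q[i + 1][j]                       # i unpaired
            for k in range(i + 4, j + 1):             # i paired with k
                if can_pair(i, k):
                    total += pair_weight * Q[i + 1][k - 1] * Q[k + 1][j]
            Q[i][j] = total

    # ZI[(i, j)][d] = Z^I_{i,j}[d]; the empty interval (i, i-1) carries distance -a.
    ZI = {}
    for i in range(n + 1, 0, -1):
        ZI[(i, i - 1)] = {-a: 1.0}
        if i > n:
            continue
        ZI[(i, i)] = {0: 1.0}
        for j in range(i + 1, n + 1):
            cur = defaultdict(float)
            for d, val in ZI[(i + 1, j)].items():     # i unpaired: one backbone edge
                cur[d + a] += val
            for k in range(i + 4, j + 1):             # pair {i,k}, then edge k-(k+1)
                if can_pair(i, k):
                    qb = pair_weight * Q[i + 1][k - 1]  # Q^B_{i,k} in this toy model
                    for d, val in ZI[(k + 1, j)].items():
                        cur[d + b + a] += qb * val
            ZI[(i, j)] = dict(cur)
    return ZI[(1, n)], Q[1][n]


Z, Q = end_to_end_partition("GGGAAAACCC")
print({d: z / Q for d, z in sorted(Z.items())})  # end-to-end distance distribution
```

A useful sanity check is that the values Z[d] sum to Q, since every structure has exactly one end-to-end distance.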

Recursions of Z^v,w[ d]: the general case

The distance between two positions v and w that are covered by an arc can be realized by both inside paths and outside paths. Here, “outside” emphasizes that the shortest path between two positions v and w contains vertices that do not belong to x[ v..w]. This case complicates the algorithmic approach, since both types of paths must be controlled simultaneously. In Figure 1, for example, the shortest path between the green and blue regions includes some vertices outside the interval between these two regions. The basic idea is to generalize Equation (1) to computing the partition function Z^v,w[ d]. The main question now becomes how to recurse over decompositions of both the inside and the outside paths.

Figure 1 shows that the outside paths are important for the green region, i.e., the region that is covered by an arc. Hence, we have to consider the different cases in which the two positions v and w are covered by arcs. The set Ω of all secondary structures on x can be divided into two disjoint subclasses that have to be treated differently:

Ω₀: v and w are not enclosed in a common base pair, i.e., B_v∩ B_w= ∅.
Ω₁: there is a base pair enclosing both v and w, i.e., B_v∩ B_w≠ ∅.

Note that this bipartition explicitly depends on v and w. In the following, we will first introduce the recursions that are required in Ω₀ structures to compute Z^v,w[ d].
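For a single structure, deciding between the two classes only requires the sets B_v and B_w. A minimal sketch, assuming B is given as a set of 1-based pairs (i, j) with i < j and using illustrative function names:

```python
def enclosing_pairs(pairs, k):
    """B_k: all base pairs {i,j} with i <= k <= j (k may be incident to the pair)."""
    return {(i, j) for (i, j) in pairs if i <= k <= j}


def classify(pairs, v, w):
    """Return ('Omega0', None) or ('Omega1', minimal common enclosing pair)."""
    common = enclosing_pairs(pairs, v) & enclosing_pairs(pairs, w)
    if not common:
        return "Omega0", None
    # Common enclosing pairs are nested, so the innermost one has the largest 5' end.
    return "Omega1", max(common, key=lambda pair: pair[0])


pairs = {(1, 10), (2, 9), (3, 8), (12, 20)}
print(classify(pairs, 4, 7))    # both inside the same helix
print(classify(pairs, 4, 15))   # enclosed by different external pairs
```

The same structure can fall into Ω₀ for one choice of v,w and into Ω₁ for another, which is why the bipartition has to be recomputed for each pair of positions.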

Contribution of Ω₀ structures to Z^v,w[ d]: $Z_{0}^{v, w} [d]$

One example of this case is given in Figure 1 with the red and blue region, where v (vertex in green region) is covered by an arc, and w (vertex in blue region) is external. Denote the ≺-maximal base pair enclosing v by {i,j}. Since at most one of v and w is covered by an arc, we know that j < w. Hence, every path p from v to w, and hence also the shortest paths (not necessarily unique), must run through the right end j of the arc {i,j}. More precisely, there must be sub-paths p₁ and p₂ with d (p) = d(p₁) + d (p₂) + a such that $v \overset{p}{⇝} w \to v \overset{p_{1}}{⇝} j - (j + 1) \overset{p_{2}}{⇝} w$, where $i \overset{p}{⇝} j$ denotes that p is a shortest path from i to j and - denotes a single backbone edge. The shortest path from v to j either consists of a shortest path $v \overset{p^{'}}{⇝} i$ followed by the arc {i,j}, or it goes directly to j without using the arc {i,j}.

How does this distinction translate to the partition function approach? If we want to calculate the contribution of this case to the partition function Z^v,w[ d], we have to split both the sequence x[ i,w] and distance d as follows

a.)

where $Z_{j, w}^{I^{'}} [d_{2}]$ is the partition function starting and ending with a single-stranded base as defined in Equation (2), and $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$ is the partition function consisting of all structures of x[ i,j] containing the base pair {i,j} with the property that the shortest path from v to i has length d_ℓ and the shortest path from v to j has length d_r. In addition, d, d_r and d₂ must satisfy d = d_r+ d₂.

The remaining cases b.), c.) and d.) for the contribution of the class Ω₀ to Z^v,w[ d] are given by all other possible combinations of v and w being single-stranded or being covered by an arc.

To simplify, we extend the definition of $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$ by setting $Z_{v, v}^{B, v} [0, 0] = 1$ and $Z_{v, v}^{B, v} [d_{ℓ}, d_{r}] = 0$ for d_ℓ+ d_r> 0. This allows us to conveniently model all cases where either v or w are external, i.e., a.), b.), and d.), as special cases of c.).

In case c.), we have to split the distance d into five sub-distances $d_{l}, d_{r}, d_{l}^{'}, d_{r}^{'}, d_{I}$ , in which d_I can be retrieved from the first four distances. Furthermore, we would require four splitting positions for the sequence for all possible combinations of i,j,k,l. A naïve implementation of this idea would result in an algorithm with time complexity O(n⁶D⁵) and space complexity O(n²D²).

A careful inspection shows, however, that the split of the distances for the arcs into d_ℓ and d_r is unnecessary. Since we only want to know the distance to the left or right end, we can simply introduce two matrices $Z_{i, j}^{B, v, ℓ} [d]$ and $Z_{i, j}^{B, v, r} [d]$ that store these values. These matrices can be generated from $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$ as follows:

Z_{i, j}^{B, v, ℓ} [d] = \sum_{\begin{matrix} d_{r} \\ d_{r} + b \geq d \end{matrix}} Z_{i, j}^{B, v} [d, d_{r}] + \sum_{\begin{matrix} d_{ℓ} \\ d_{ℓ} > d \end{matrix}} Z_{i, j}^{B, v} [d_{ℓ}, d - b]

Analogously, we compute $Z_{i, j}^{B, v, r} [d]$ . In this way, we split the distance d into three contributions and we require four splitting positions for the sequence for all possible combinations of i,j,k,ℓ.
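The collapse of the two-index matrix into the left and right matrices can be sketched as follows. The sketch reads d_ℓ and d_r in the formula above as distances that do not yet use the arc {i,j}, so that the distance to the left end is min(d_ℓ, d_r + b); this reading, the dictionary representation and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def collapse_left(z_b, b):
    """Z^{B,v,l}[d] from z_b[(d_l, d_r)]: distance from v to the left end i.

    Entries with d_l <= d_r + b feed the first sum of the formula (d = d_l),
    entries with d_l > d_r + b feed the second sum (d = d_r + b).
    """
    left = defaultdict(float)
    for (d_l, d_r), weight in z_b.items():
        left[min(d_l, d_r + b)] += weight
    return dict(left)


def collapse_right(z_b, b):
    """Z^{B,v,r}[d]: the mirror image, distance from v to the right end j."""
    right = defaultdict(float)
    for (d_l, d_r), weight in z_b.items():
        right[min(d_r, d_l + b)] += weight
    return dict(right)


# Hypothetical weights for three (d_l, d_r) combinations.
z_b = {(2, 5): 1.0, (6, 1): 2.0, (3, 2): 0.5}
print(collapse_left(z_b, b=1))
print(collapse_right(z_b, b=1))
```

Both collapsed matrices preserve the total weight, because every entry of the two-index matrix is assigned to exactly one distance value.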

Therefore, the contribution to Z^v,w[ d] for structures in Ω₀ is given by

\begin{array}{lcr} Z_{0}^{v, w} [d] = \sum_{\begin{matrix} d_{1}, d_{2} \\ d_{1} + d_{2} \leq d \end{matrix}} \sum_{\begin{matrix} i, j, k, l \\ i \leq v \leq j < k \leq w \leq l \end{matrix}} (\begin{array}{l} Q_{1, i - 1} \cdot Z_{i, j}^{B, v, r} [d_{1}] \\ \cdot Z_{j, k}^{I^{'}} [d - (d_{1} + d_{2})] \\ \cdot Z_{k, l}^{B, w, ℓ} [d_{2}] \cdot Q_{l + 1, n} \end{array}) \end{array}

(3)

Note that for splitting the distance, we reuse the same indices (e.g., the j in $Z_{i, j}^{B, v, r} [d_{1}] \cdot Z_{j, k}^{I^{'}} [d - (d_{1} + d_{2})]$), whereas for the remaining partition functions, we use successive indices (e.g., the i in $Q_{1, i - 1} \cdot Z_{i, j}^{B, v, r} [d_{1}]$). This difference comes from the fact that splitting a sequence into subsequences is done naturally between two successive indices, whereas splitting a distance is naturally done at an individual position. We only have to guarantee that the substructures which participate in the split agree on the structural context of the split position. This is guaranteed by requiring that $Z^{I^{'}}$ starts and ends with a backbone edge. We note that the incorporation of the full dangling end parameters makes it more tedious to handle the splitting positions.

This results in a complexity of O(n⁶D³) time and O(n²D) space. However, we do not need to split in i,j,k,l simultaneously. Instead, we could split case (c) at position j and introduce for all v ≤ j and k ≤ w the auxiliary variables

\begin{array}{rl} Z_{1, j}^{B, v, r} [d_{1}] & = \sum_{i \leq v} Q_{1, i - 1} \cdot Z_{i, j}^{B, v, r} [d_{1}] \\ Z_{k, n}^{B, w, ℓ} [d_{2}] & = \sum_{w \leq l} Z_{k, l}^{B, w, ℓ} [d_{2}] \cdot Q_{l + 1, n} \\ Z_{j, n}^{I B, w, ℓ} [d^{'}] & = \sum_{k > j} \sum_{d_{2} \leq d^{'}} Z_{j, k}^{I^{'}} [d^{'} - d_{2}] \cdot Z_{k, n}^{B, w, ℓ} [d_{2}] . \end{array}

Finally, we can replace recursion (3) by

Z_{0}^{v, w} [d] = \sum_{v \leq j} \sum_{d_{1} \leq d} Z_{1, j}^{B, v, r} [d_{1}] \cdot Z_{j, n}^{I B, w, ℓ} [d - d_{1}]

(4)

We thus arrive at O(n³D²) time and O(n²D) space complexity for the contribution of Ω₀ structures to Z^v,w[ d], excluding the complexity of computing $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$ .

Contribution of Ω₁ structures to Z^v,w[ d]

Ω₁ contains all cases where v and w are covered by a base pair. In the following, let {p,q} be the ≺-minimal base pair covering v and w. In principle, this case looks similar to the case for Ω₀. However, we have to take into consideration the paths between v and w over the base pair {p,q}. Thus, we need to store the partition function for all inside and outside distances for each ≺-minimal arc {p,q} that covers v and w, which we will call $Z_{p, q}^{v, w} [d_{O}, d_{I}]$. In principle, a recursion similar to the one defined for $Z_{0}^{v, w}$ in Equation (3) can be derived, with the additional complication that we have to take care of the additional outside distance due to the arc {p,q}.

Again we can avoid the complexity of simultaneously splitting at {i,j} and {k,l} by doing a major split after j, which leads to the following equivalent recursions:

Y_{p, j}^{B, v, r} [d, d_{r}] = \sum_{p < i \leq v} \sum_{d_{O}^{'} \leq d} Z_{p, i}^{I^{'}} [d_{O}^{'}] \cdot Z_{i, j}^{B, v} [d - d_{O}^{'}, d_{r}]

(5)

Y_{k, q}^{B, w, ℓ} [d_{ℓ}^{'}, d^{'}] = \sum_{w \leq l < q} \sum_{d_{O}^{′′} \leq d^{'}} Z_{k, l}^{B, w} [d_{ℓ}^{'}, d^{'} - d_{O}^{′′}] \cdot Z_{l, q}^{I^{'}} [d_{O}^{′′}]

(6)

Y_{j, q}^{I B, w, ℓ} [d_{I ℓ}, d^{'}] = \sum_{j < k < q} \sum_{\begin{matrix} d_{ℓ}^{'} \leq d_{I ℓ} \end{matrix}} Z_{j, k}^{I^{'}} [d_{I ℓ} - d_{ℓ}^{'}] \cdot Y_{k, q}^{B, w, ℓ} [d_{ℓ}^{'}, d^{'}]

(7)

Overall, we get the following recursion:

Z_{p, q}^{v, w} [d_{O}, d_{I}] = \sum_{v \leq j} \sum_{\begin{matrix} d_{r} \leq d_{I} \\ d \leq d_{O} \end{matrix}} Y_{p, j}^{B, v, r} [d, d_{r}] \cdot Y_{j, q}^{I B, w, ℓ} [d_{I} - d_{r}, d_{O} - d]

(8)

We can now define Z^v,w[ d] by

Z^{v, w} [d] = Z_{0}^{v, w} [d] + {\hat{Q}}_{p, q}^{b} \cdot \left\{ \sum_{\begin{matrix} \{p, q\} \neq \{v, w\} \\ d_{I} \geq d + b \end{matrix}} Z_{p, q}^{v, w} [d, d_{I}] + \sum_{\begin{matrix} \{p, q\} \neq \{v, w\} \\ d < d_{O} + b \end{matrix}} Z_{p, q}^{v, w} [d_{O}, d] \right\}

(9)

where ${\hat{Q}}_{p, q}^{b}$ is the external partition function over all structures on the union of the intervals x[ 1..p] ∪ x[ q..n] so that {p,q} is a base pair. Since the base pair probability can be written as $Pr ({p, q}) = \frac{{\hat{Q}}_{p, q}^{b} Q_{p, q}^{b}}{Q}$ , this quantity can be calculated as ${\hat{Q}}_{p, q}^{b} = \frac{Pr ({p, q}) Q}{Q_{p, q}^{b}}$ . The base pair probability P r ({p,q}), and the partition functions Q and $Q_{p, q}^{b}$ are computed by means of McCaskill’s algorithm.

This part now has a complexity of O(n²D²) space and O(n³D⁴) time. For practical applications, however, we do not need to consider all possible {p,q}. Instead, there are only a few base pairs that are likely to form and that cover v,w, especially for v,w where the internal distance of v,w is large enough that an outside path has to be considered at all. If we assume a constant number of such long-range base pairs, then the complexity is reduced by a factor of n². For the complexity in terms of distance, recall that D is typically small.

Recursions for $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$

So far, we have used $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$ as a black box. In order to compute these terms, we distinguish the limiting cases a.) v = i, b.) v = j, c.) v is external from the generic case d.):

Starting from the limiting cases, we initialize $Z_{v, j}^{B, v} [0, d_{r}]$ as follows:

Z_{v, j}^{B, v} [0, d_{r}] = \begin{cases} Z_{v, j}^{I^{'}} [d_{r}] & \text{for } a \leq d_{r} < b \\ \sum_{d^{'} \geq b} Z_{v, j}^{I^{'}} [d^{'}] & \text{for } d_{r} = b \\ 0 & \text{otherwise} \end{cases}

and analogously for $Z_{i, v}^{B, v} [d_{ℓ}, 0]$. Furthermore, $Z_{i, j}^{B, v} [0, 0] = 0$ for i ≠ v ≠ j. These conventions allow us to model all cases as special cases of d.). Our key observation here is that the dependency between d_ℓ and d_r can be used to reduce the time complexity. Instead of using the variables d_ℓ and d_r in $Z_{i, j}^{B, v} [d_{ℓ}, d_{r}]$, we use the pair d_ℓ,d_add in $Z_{i, j}^{B, v} [d_{ℓ}, d_{ℓ} + d_{add}]$. Similarly, we use $d_{ℓ}^{'}, d_{add}^{'}$ instead of $d_{ℓ}^{'}, d_{r}^{'}$ for the inner base pair, which then completely determines the splitting of the distances. This results in a recursion for $Z_{i, j}^{B, v} [d_{ℓ}, d_{ℓ} + d_{add}]$ with complexity $O (n^{4} D^{2} c_{b}^{2})$ time and O (n²D c_b) space. To be precise, there are three sub-cases as follows.

The values that are chosen to split d_ℓ and d_add are indicated in green and blue. When the arc {i,j} is colored violet, then there is a shortest path that does not use the distance marked in red but uses the other direction together with the arc {i,j}. If -b<d_add<+b, then we know that neither a shortest path $v \overset{p}{⇝} i$ nor $v \overset{p}{⇝} j$ uses the arc {i,j}. The left distance is thus given by $d_{ℓ} - d_{ℓ}^{'}$. Using the shortcuts d_r= d_ℓ+ d_add and $d_{r}^{'} = d_{ℓ}^{'} + d_{add}^{'}$, the distance between l and j must be $d_{r} - d_{r}^{'} = (d_{ℓ} + d_{add}) - (d_{ℓ}^{'} + d_{add}^{'})$. If, on the other hand, d_add = + b, then we know that there is at least one shortest path that can be composed by using a shortest path $v ⇝ i$, followed by the arc {i,j}. This of course implies that the inside path $v \overset{p}{⇝} j$ has exactly the length d_ℓ+ b, or is longer. For a sub-path $l + 1 \overset{p^{'}}{⇝} j$ this implies that the length is greater than or equal to $d = d_{r} - d_{r}^{'} = (d_{ℓ} + b) - (d_{ℓ}^{'} + d_{add}^{'})$. Thus, we just have to add all partition functions $Z_{k, j}^{I^{'}} [d^{'}]$ with d^′ ≥ d. This can be done efficiently by using a precalculated matrix $Z_{i, j}^{I^{'} \geq} [d]$, which is defined as $\sum_{d^{'} \geq d} Z_{i, j}^{I^{'}} [d^{'}]$. Note that $Z_{i, j}^{I^{'} \geq} [d]$ can also be defined if we restrict the distance d in all recursions to a threshold θ_d, since $Z_{i, j}^{I^{'} \geq} [d] = \sum_{d^{'} \geq d} Z_{i, j}^{I^{'}} [d^{'}] = Q_{i, j}^{'} - \sum_{d^{'} < d} Z_{i, j}^{I^{'}} [d^{'}]$. Here, $Q_{i, j}^{'}$ is Q_{i + 1,j - 1} if j > i + 1, 1 if j = i + 1 and 0 otherwise. Note, furthermore, that all $Z_{i, j}^{I^{'}} [d^{'}]$ for d^′< d ≤ θ_d are calculated when we restrict the distance to θ_d.

Finally, if d_add = - b, then the shortest path $l \overset{p}{⇝} j$ has distance $(d_{ℓ} - b) - (d_{ℓ}^{'} + d_{add}^{'})$. For the shortest path $k \overset{p}{⇝} i$, we know that it has length $d_{ℓ} - d_{ℓ}^{'}$ or greater, which can be resolved by again using $Z_{i, k - 1}^{I^{'} \geq} [d_{ℓ} - d_{ℓ}^{'}]$. Thus, we get the following optimized recursion for $Z_{i, j}^{B, v} [d_{ℓ}, d_{ℓ} + d_{add}]$ with d_ℓ≠ 0 and d_ℓ+ d_add ≠ 0:

Z_{i, j}^{B, v} [d_{ℓ}, d_{ℓ} + d_{add}] = \begin{cases} \sum_{\substack{i < k \leq v \\ v \leq l < j \\ k \neq l}} \sum_{d_{ℓ}^{'} \leq d_{ℓ}} \sum_{- b \leq d_{add}^{'} \leq b} Z_{i, k}^{I^{'}} [d_{ℓ} - d_{ℓ}^{'}] \cdot Z_{k, l}^{B, v} [d_{ℓ}^{'}, d_{ℓ}^{'} + d_{add}^{'}] \cdot Z_{l, j}^{I^{'}} [(d_{ℓ} + d_{add}) - (d_{ℓ}^{'} + d_{add}^{'})] & \text{if } - b < d_{add} < b \\ \sum_{\substack{i < k \leq v \\ v \leq l < j \\ k \neq l}} \sum_{d_{ℓ}^{'} \leq d_{ℓ}} \sum_{- b \leq d_{add}^{'} \leq b} Z_{i, k}^{I^{'}} [d_{ℓ} - d_{ℓ}^{'}] \cdot Z_{k, l}^{B, v} [d_{ℓ}^{'}, d_{ℓ}^{'} + d_{add}^{'}] \cdot Z_{l, j}^{I^{'} \geq} [(d_{ℓ} + b) - (d_{ℓ}^{'} + d_{add}^{'})] & \text{if } d_{add} = b \\ \sum_{\substack{i < k \leq v \\ v \leq l < j \\ k \neq l}} \sum_{d_{ℓ}^{'} \leq d_{ℓ}} \sum_{- b \leq d_{add}^{'} \leq b} Z_{i, k}^{I^{'} \geq} [d_{ℓ} - d_{ℓ}^{'}] \cdot Z_{k, l}^{B, v} [d_{ℓ}^{'}, d_{ℓ}^{'} + d_{add}^{'}] \cdot Z_{l, j}^{I^{'}} [(d_{ℓ} - b) - (d_{ℓ}^{'} + d_{add}^{'})] & \text{if } d_{add} = - b \end{cases}

(10)
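The precalculated tail-sum matrix used in the cases d_add = ±b can be obtained from the thresholded values alone, via the identity that expresses the tail as the total weight minus a prefix sum. A minimal sketch for a single interval, where `z_truncated[d]` stands for the stored values with d ≤ θ_d and `q_total` for the corresponding total weight Q′ (both names are illustrative):

```python
from itertools import accumulate


def tail_sums(z_truncated, q_total):
    """Return tail[d] = sum over d' >= d of z[d'], using only z[0..theta_d] and the total.

    The tail is computed as q_total minus the prefix sum over d' < d, so the
    values beyond the threshold never have to be stored.
    """
    prefix = [0.0] + list(accumulate(z_truncated))   # prefix[d] = sum_{d' < d} z[d']
    return [q_total - prefix[d] for d in range(len(z_truncated))]


# Hypothetical values: distances 0..3 are stored, the total weight is 10.
print(tail_sums([0.0, 1.0, 2.0, 3.0], q_total=10.0))
```

Computing the prefix sums once per interval costs O(D), after which each tail lookup in the recursion is a constant-time table access.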

Discussion and applications

The theoretical analysis of the distance distribution problem shows that, while polynomial-time algorithms exist, they probably cannot be improved to space and time complexities that make them widely applicable to large RNA molecules. Due to the unfavorable time complexity of the current algorithm and the associated exact implementation in C, a rather simple and efficient sampling algorithm has been implemented. We resort to sampling Boltzmann-weighted secondary structures with RNAsubopt -p [16], which uses the same stochastic backtracing approach as sfold [18]. As the graph-distance for a pair of nucleotides in a given secondary structure can be computed in O(n log n) time by Dijkstra’s algorithm with a Fibonacci heap [19], even large samples can be evaluated efficiently.
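The sampling approach can be sketched as follows, assuming that the Boltzmann sample is already available as a list of dot-bracket strings (how the sample is produced is outside the scope of this sketch). For unit weights a = b = 1, which is an assumption of this example, breadth-first search suffices and replaces Dijkstra's algorithm; the function names are illustrative.

```python
from collections import Counter, deque


def pair_table(db):
    """Map each paired position (1-based) to its partner for a dot-bracket string."""
    stack, partner = [], {}
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            i = stack.pop()
            partner[i], partner[pos] = pos, i
    return partner


def bfs_distance(db, v, w):
    """Graph-distance between v and w with unit weights for backbone edges and pairs."""
    n, partner = len(db), pair_table(db)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if u == w:
            return dist[u]
        for x in (u - 1, u + 1, partner.get(u)):
            if x is not None and 1 <= x <= n and x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    raise ValueError("w is not reachable from v")


def sampled_distribution(sample, v, w):
    """Relative frequency of each graph-distance between v and w in the sample."""
    counts = Counter(bfs_distance(db, v, w) for db in sample)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


sample = ["(((....)))", "..........", "(((....)))", ".((....))."]
print(sampled_distribution(sample, 1, 10))
```

Because the structures are drawn with their Boltzmann probabilities, the relative frequencies estimate Pr[d | x] directly, without any reweighting by energy.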

As we pointed out in the introduction, the graph-distance measure introduced in this paper can serve as a first step towards a structural interpretation of smFRET data. As an example, we consider the graph-distance distribution of a Diels-Alderase (DAse) ribozyme (Figure 2A). Histograms of smFRET efficiency (E_fret) for this 49 nt long catalytic RNA are reported in [12] for a large number of surface-immobilized ribozyme molecules as a function of the Mg²⁺ concentration in the buffer solution. A sketch of their histograms is displayed in Figure 2B. The dyes are attached to sequence positions 6 (Cy3) and 42 (Cy5) and hence do not simply reflect the end-to-end distance, Figure 2A(c). In this example, we observe the expected correspondence of small graph-distances with a strong smFRET signal. This is a particularly interesting example, since the minimum free energy (mfe) structure (Figure 2A(a)) predicted with RNAfold does not coincide with the real secondary structure (Figure 2A(c)). In fact, the ground state secondary structure is ranked as the 3rd best sub-optimal structure derived via RNAsubopt -e. The free energy difference between these two structures is only 0.1 kcal/mol. However, their graph-distances differ considerably more. The 2nd best sub-optimal structure (Figure 2A(b)) looks rather similar to the 3rd structure; in particular, they share the same graph-distance value.

The smFRET data of [12] indicate the presence of three sub-populations, corresponding to three different structural states: folded molecules (state F), an intermediate conformation (state I) and unfolded molecules (state U). In the absence of Mg²⁺, the I state dominates, and only small fractions are found in states U and F. Unfortunately, the salt dependence of RNA folding is complex [21, 22] and is currently not properly modeled in the available folding programs. We can, however, make use of the qualitative correspondence of low salt concentrations with high temperature. In Figure 2C we therefore re-compute the graph-distance distribution in the ensemble at an elevated temperature of 50°C. Here, the real structure becomes the second best structure with free energy -10.82 kcal/mol, and we observe a much larger fraction of (nearly) unfolded structures with longer distances between the two beacon positions. Qualitatively, this matches the smFRET data shown in Figure 2B.

Furthermore, for a given pair v,w of positions in a given RNA sequence x, the importance I_v,w(e) of a backbone edge or base pair e in calculating the graph-distance distribution is evaluated by $I_{v,w}(e) = \sum_{G \in \Pi_{e}} \Pr[G \mid x]$, where the set Π_e comprises the secondary structures G with (at least) one shortest path between v and w that runs through e. Figure 3 compares dot plots of I_v,w(e) with the base-pair probabilities in the RNA structure ensemble of the DAse ribozyme at temperatures 37°C and 50°C. Since RNAgraphdist computes only one of possibly many shortest paths for each G, we obtain only a lower bound on I_v,w(e).
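A sample-based estimate of this importance measure can be sketched as follows. The sketch is an illustration under stated assumptions and not the RNAgraphdist implementation: structures are given as dot-bracket strings, positions are 0-based, all edges have unit weight (a = b = 1), and a single shortest path per structure is found by breadth-first search. The names `one_shortest_path` and `edge_importance` are hypothetical. Because only one shortest path per structure is inspected, the result is again only a lower-bound estimate.

```python
from collections import Counter, deque


def pair_table(db):
    """Map every paired position (0-based) to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def one_shortest_path(db, v, w):
    """Return one shortest path from v to w as a list of positions (unit weights)."""
    n = len(db)
    partner = pair_table(db)
    parent = {v: None}
    queue = deque([v])
    while queue:
        i = queue.popleft()
        if i == w:
            break
        neighbours = [i - 1, i + 1]
        if i in partner:
            neighbours.append(partner[i])
        for k in neighbours:
            if 0 <= k < n and k not in parent:
                parent[k] = i
                queue.append(k)
    # Walk back from w to v along the parent pointers.
    path = [w]
    while path[-1] != v:
        path.append(parent[path[-1]])
    return path[::-1]


def edge_importance(structures, v, w):
    """Fraction of sampled structures whose chosen shortest path uses each edge.

    Edges are reported as tuples (i, j) with i < j; backbone edges have j == i + 1.
    """
    counts = Counter()
    for db in structures:
        path = one_shortest_path(db, v, w)
        for i, j in zip(path, path[1:]):
            counts[(min(i, j), max(i, j))] += 1
    return {edge: c / len(structures) for edge, c in counts.items()}
```

For a Boltzmann sample, the relative frequency of an edge approximates the sum of Pr[G | x] over those structures whose selected shortest path contains the edge.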

We observe for DAse that the contributions from the backbone edges are larger than those from the base pairs at both temperatures. For T = 37°C, there are in total 14 edges with I_6,42(e) > 0.4. Only two of them, 5(C)–18(G) and 2(G)–21(C), are base pairs. For T = 50°C, only the base pair 5(C)–18(G) is heavily used (I_6,42(5,18) = 0.636). Combined with the analysis of the data illustrated in Figure 2, this may indicate that the existence of two base pairs, 2(G)–21(C) and 28(G)–39(C), can affect the graph-distance distribution of the RNA secondary structure ensemble and consequently affect smFRET measurements. Such observations may become an interesting source of constraints for RNA structure prediction.

In addition, we compute the distribution of paths that pass through positions outside the sequence interval x[6-h, 42+h] of the DAse ribozyme. As illustrated in Figure 4, this “outside-path” distribution, as expected, drops quickly to 0 as h increases.
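One possible reading of this “outside-path” statistic is sketched below in Python. It assumes that one shortest path per sampled structure has already been computed and is given as a list of sequence positions; the names `outside_fraction` and `outside_profile` are hypothetical, and the precise definition used for Figure 4 may differ.

```python
def outside_fraction(paths, v, w, h):
    """Fraction of paths that visit a position outside the interval [v - h, w + h]."""
    lo, hi = v - h, w + h
    outside = sum(1 for path in paths if any(k < lo or k > hi for k in path))
    return outside / len(paths)


def outside_profile(paths, v, w, h_max):
    """Outside-path fraction for every margin h = 0, 1, ..., h_max."""
    return [outside_fraction(paths, v, w, h) for h in range(h_max + 1)]
```

Since the interval only grows with h, the profile is non-increasing by construction; the informative part is how quickly it approaches 0.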

Long-range interactions play an important role in pre-mRNA splicing and in the regulation of alternative splicing [23–25], bringing the splice donor, the acceptor, and the branching site into close spatial proximity. Figure 5A shows for D. melanogaster pre-mRNAs that the distribution of graph-distances between donor and acceptor sites is shifted towards smaller values compared to randomly selected pairs of positions with the same distance. Because information on the spatial distances spanned by structural elements in the secondary structures is insufficient, we artificially choose a = b = 1 in our experiments. Although the effect is small, it shows a clear difference between the real RNA sequences and artificial sequences that were randomized by di-nucleotide shuffling. Furthermore, Table 1 displays, for the specific intron CG16979-RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3), the most probable secondary structures in the sub-ensembles of secondary structures with graph-distances 7, 6, and 14, respectively.

Table 1 Graph-distance of intron CG16979-RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3)

The Drosophila melanogaster Down syndrome cell adhesion molecule (DSCAM) gene encodes 38,016 different mRNAs by alternative splicing. Among the 24 exons, exon 4 alone has 12 variants [26]. In Figure 6 we display the graph-distance from the donor (exon 3) to any downstream position up to the acceptor (exon 5). Comparing the graph-distances of all twelve acceptors of exon 4, we clearly see local peaks. This suggests that the acceptors are part of hairpin loops that protrude three-dimensionally from the long transcript and can thus interact easily with the spliceosome and the donor. Four of the twelve acceptor sites show no local peak but seem to be accessible as internal loops of longer hairpins.

The spatial organization of the genomic and sub-genomic RNAs is important for the processing and functioning of many RNA viruses. This goes far beyond the well-known panhandle structures. In Coronavirus, the interaction of the 5’ TRS-L cis-acting element with body TRS elements has been proposed as an important determinant for the correct assembly of the Coronavirus genes in the host [27]. The mechanism of interaction is unknown, and a small three-dimensional distance is suspected. The matrix of expected graph-distances in Figure 5B shows that TRS-L and TRS-B are indeed placed close to each other. In Table 2, we show the most stable structures within the sub-ensembles of secondary structures with graph-distances 14, 5, and 35, respectively. All these RNA secondary structures bring the leader transcription regulation site (L-TRS) into close spatial proximity with the body transcription regulation site (B-TRS).

Table 2 Graph-distance of the genomic RNA of human Coronavirus 229E computed from a concatenation of positions 1-576 and 25188-25688

These examples indicate that the systematic analysis of the graph-distance distribution both for individual RNAs and their aggregation over ensembles of structures can provide useful insights into structural influences on RNA function. These may not be obvious directly from the structures due to the inherent difficulties of predicting long-range base pairs with sufficient accuracy and the many issues inherent in comparing RNA structures of very disparate lengths.

Due to the complexity of the algorithm we have refrained from attempting a direct implementation in an imperative programming language. Instead, we are aiming at an implementation in Haskell that allows us to make use of the framework of algebraic dynamic programming [28]. The graph-distance measure and the associated algorithm can be extended in principle to RNA secondary structures with additional tertiary structural elements such as pseudoknots [29] and G-quadruplexes [30]. RNA-RNA interaction structures [31] also form a promising area for future extensions. We note, finally, that the Fourier transform method introduced in [32] could be employed to achieve a further speedup.

Conclusion

The distribution of spatial distances in the equilibrium structure ensemble of an RNA molecule carries information about the overall structure of the molecule. These distances can be approximated by the graph-distance in RNA secondary structure. We introduced a polynomial-time algorithm to compute the equilibrium distribution of graph-distances between a fixed pair of nucleotides. For practical applications, small distances are of main interest. Here, the time complexity of the proposed algorithm is O(n⁴), compared to a naïve implementation with time complexity of O(n¹¹) for sequence length n and distances that can cover the whole sequence length. Since further reductions seem to be difficult, we also introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.

References

1. Yoffe AM, Prinsen P, Gelbart WM, Ben-Shaul A: The ends of a large RNA molecule are necessarily close. Nucl Acids Res. 2011, 39: 292-299. 10.1093/nar/gkq642.
2. Fang LT: The end-to-end distance of RNA as a randomly self-paired polymer. J Theor Biol. 2011, 280: 101-107. 10.1016/j.jtbi.2011.04.010.
3. Clote P, Ponty Y, Steyaert JM: Expected distance between terminal nucleotides of RNA secondary structures. J Math Biol. 2012, 65: 581-599. 10.1007/s00285-011-0467-8.
4. Han HS, Reidys CM: The 5’-3’ distance of RNA secondary structures. J Comput Biol. 2012, 19: 867-878. 10.1089/cmb.2011.0301.
5. Forties RA, Bundschuh R: Modeling the interplay of single-stranded binding proteins and nucleic acid secondary structure. Bioinformatics. 2010, 26: 61-67. 10.1093/bioinformatics/btp627.
6. Gerland U, Bundschuh R, Hwa T: Force-induced denaturation of RNA. Biophys J. 2001, 81: 1324-1332. 10.1016/S0006-3495(01)75789-X.
7. Müller M, Krzakala F, Mézard M: The secondary structure of RNA under tension. Eur Phys J E. 2002, 9: 67-77.
8. Gerland U, Bundschuh R, Hwa T: Translocation of structured polynucleotides through nanopores. Phys Biol. 2004, 1: 19-26. 10.1088/1478-3967/1/1/002.
9. Einert TR, Näger P, Orland H, Netz R: Impact of loop statistics on the thermodynamics of RNA Folding. Phys Rev Lett. 2008, 101: 048103.
10. Roy R, Hohng S, Ha T: A practical guide to single-molecule FRET. Nat Methods. 2008, 5: 507-516. 10.1038/nmeth.1208.
11. Das R, Baker D: Automated de novo prediction of native-like RNA tertiary structures. Proc Natl Acad Sci USA. 2007, 104: 14664-14669. 10.1073/pnas.0703836104.
12. Kobitski A, Nierth A, Helm M, Jaschke A, Nienhaus UG: Mg²⁺-dependent folding of a Diels-Alderase ribozyme probed by single-molecule FRET analysis. Nucleic Acids Res. 2007, 35 (6): 2047-2059. 10.1093/nar/gkm072.
13. Schuster P, Fontana W, Stadler PF, Hofacker IL: From sequences to shapes and back: a case study in RNA secondary structures. Proc R Soc London B. 1994, 255 (1344): 279-84. 10.1098/rspb.1994.0040.
14. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA. 2004, 101: 7287-7292. 10.1073/pnas.0401799101.
15. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6–7): 1105-1119.
16. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Alg Mol Biol. 2011, 6: 26. 10.1186/1748-7188-6-26.
17. Lyngsø RB, Zuker M, Pedersen C: Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 1999, 15: 440-445. 10.1093/bioinformatics/15.6.440.
18. Ding Y, Lawrence C: A statistical sampling algorithm for RNA secondary structure prediction. Nucl Acids Res. 2003, 31 (24): 7280-7301. 10.1093/nar/gkg938.
19. Fredman M, Tarjan R: Fibonacci heaps and their uses in improved network optimization algorithms. J ACM. 1987, 34 (3): 596-615. 10.1145/28869.28874.
20. Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25 (15): 1974-1975. 10.1093/bioinformatics/btp250.
21. Leipply D, Lambert D, Draper DE: Ion-RNA interactions: thermodynamic analysis of the effects of mono- and divalent ions on RNA conformational equilibria. Methods Enzymol. 2009, 469: 433-463.
22. Mathews D, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 1999, 288: 911-940. 10.1006/jmbi.1999.2700.
23. Baraniak AP, Lasda EL, Wagner EJ, Garcia-Blanco MA: A stem structure in fibroblast growth factor receptor 2 transcripts mediates cell-type-specific splicing by approximating intronic control elements. Mol Cell Biol. 2003, 23: 9327-9337. 10.1128/MCB.23.24.9327-9337.2003.
24. McManus CJ, Graveley BR: RNA structure and the mechanisms of alternative splicing. Curr Opin Genet Dev. 2011, 21: 373-379. 10.1016/j.gde.2011.04.001.
25. Amman F, Bernhart S, Doose D, Hofacker I, Qin J, Stadler P, Will S: The Trouble with Long-Range Base Pairs in RNA Folding. Lecture Notes in Computer Science: Advances in Bioinformatics and Computational Biology, Volume 8213. 2013, Berlin, Heidelberg, New York: Springer-Verlag, 1-11.
26. Celotto A, Graveley B: Exon-specific RNAi: a tool for dissecting the functional relevance of alternative splicing. RNA. 2002, 8 (6): 718-724. 10.1017/S1355838202021064.
27. Dufour D, Mateos-Gomez PA, Enjuanes L, Gallego J, Sola I: Structure and functional relevance of a transcription-regulating sequence involved in coronavirus discontinuous RNA synthesis. J Virol. 2011, 85 (10): 4963-4973. 10.1128/JVI.02317-10.
28. Giegerich R, Meyer C: Algebraic dynamic programming. Algebraic Methodology And Software Technology. 2002, Berlin, Heidelberg, New York: Springer-Verlag, 349-364.
29. Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27 (8): 1076-1085. 10.1093/bioinformatics/btr090.
30. Lorenz R, Bernhart S, Qin J, Höner zu Siederdissen C, Tanzer A, Amman F, Hofacker I: 2D meets 4G: G-Quadruplexes in RNA Secondary Structure Prediction. IEEE/ACM Trans Comput Biol Bioinformatics. doi:10.1109/TCBB.2013.7.
31. Li AX, Marz M, Qin J, Reidys CM: RNA-RNA interaction prediction based on multiple sequence alignments. Bioinformatics. 2011, 27 (4): 456-463. 10.1093/bioinformatics/btq659.
32. Senter E, Sheikh S, Dotu I, Ponty Y, Clote P: Using the fast fourier transform to accelerate the computational search for RNA conformational switches. PLoS ONE. 2012, 7 (12): e50506. 10.1371/journal.pone.0050506.

Acknowledgments

This work was supported in part by the Deutsche Forschungsgemeinschaft proj. nos. BA 2168/3-3, SFB 992, STA 850/10-2, SPP 1596 and MA 5082/1-1, the BMBF (grant 0316165A) and the MWK (grant 7533-7-11.6.1).

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Campusvej 55, DK-5230, Odense M, Denmark
Jing Qin
Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103, Leipzig, Germany
Peter F Stadler
Bioinformatics/High Throughput Analysis Faculty of Mathematics und Computer Science Friedrich-Schiller-University, Leutragraben 1, D-07743, Jena, Germany
Markus Fricke & Manja Marz
Department of Computer Science, Chair for Bioinformatics, University of Freiburg, Georges-Koehler-Allee 106, D-79110, Freiburg, Germany
Rolf Backofen
Center for Biological Signaling Studies (BIOSS), Albert-Ludwigs-Universität, Freiburg, Germany
Rolf Backofen
Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107, Leipzig, Germany
Peter F Stadler
Fraunhofer Institut for Cell Therapy and Immunology, Perlickstraße 1, D-04103, Leipzig, Germany
Peter F Stadler
Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090, Vienna, Austria
Peter F Stadler
Santa Fe Institute, 1399 Hyde Park Rd., NM87501, Santa Fe, USA
Peter F Stadler

Corresponding author

Correspondence to Rolf Backofen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceived and designed the algorithms: JQ, PFS and RB. Implemented algorithms and performed experiments: JQ and MN. Analyzed Diels-Alderase ribozyme data: JQ and PFS. Analyzed pre-mRNA splicing data: MN and MM. Wrote the final manuscript: JQ, MM, PF and RB. All authors read and approved the final manuscript.

Electronic supplementary material

13015_2013_221_MOESM1_ESM.pdf

Additional file 1: Appendix A: Proof of $E[d_{G}(v,w)] = \sum_{d} d \times \frac{Z^{v,w}[d]}{Z}$. Appendix B: The conditional probability for i to be single-stranded can be determined from the partition function for RNA folding. Appendix C: Tables of notations. (PDF 171 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Qin, J., Fricke, M., Marz, M. et al. Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures. Algorithms Mol Biol 9, 19 (2014). https://doi.org/10.1186/1748-7188-9-19

Received: 30 November 2013
Accepted: 30 June 2014
Published: 11 September 2014
DOI: https://doi.org/10.1186/1748-7188-9-19
