On the combinatorics of sparsification

  Fenix WD Huang and Christian M Reidys

      Algorithms for Molecular Biology 2012, 7:28

      DOI: 10.1186/1748-7188-7-28

      Received: 31 December 2011

      Accepted: 11 October 2012

      Published: 22 October 2012

      Abstract

      Background

      We study the sparsification of dynamic programming based on folding algorithms of RNA structures. Sparsification is a method that improves significantly the computation of minimum free energy (mfe) RNA structures.

      Results

      We provide a quantitative analysis of the sparsification of a particular decomposition rule, Λ. This rule splits an interval of RNA secondary and pseudoknot structures of fixed topological genus. Key for quantifying sparsifications is the size of the so-called candidate sets. Here we assume mfe-structures to be specifically distributed (see Assumption 1) within arbitrary and irreducible RNA secondary and pseudoknot structures of fixed topological genus. We then present a combinatorial framework which allows us, by means of probabilities of irreducible sub-structures, to obtain the expected size of the Λ-candidate set w.r.t. a uniformly random input sequence. We compute these expectations for arc-based energy models via energy-filtered generating functions (GF) in the case of RNA secondary structures as well as RNA pseudoknot structures. Furthermore, for RNA secondary structures we also analyze a simplified loop-based energy model. Our combinatorial analysis is then compared to the expected number of Λ-candidates obtained from folding mfe-structures. In the case of the mfe-folding of RNA secondary structures with a simplified loop-based energy model our results imply that sparsification provides a significant, constant improvement of 91% (theory), to be compared to a 96% (experimental, simplified arc-based model) reduction. However, we do not observe a linear factor improvement. Finally, in the case of the “full” loop-energy model we can report a reduction of 98% (experiment).

      Conclusions

      Sparsification was initially attributed a linear factor improvement. This conclusion was based on the so-called polymer-zeta property, which stems from interpreting polymer chains as self-avoiding walks. Subsequent findings, however, reveal that the O(n) improvement is not correct. The combinatorial analysis presented here shows that, assuming a specific distribution (see Assumption 1) of mfe-structures within irreducible and arbitrary structures, the expected number of Λ-candidates is Θ(n^2). However, the constant reduction is quite significant, being in the range of 96%. We furthermore show an analogous result for the sparsification of the Λ-decomposition rule for RNA pseudoknotted structures of genus one. Finally we observe that the effect of sparsification is sensitive to the employed energy model.

      Keywords

      Sparsification; Generating function; Dynamic programming

      Background

      RNA structures, diagrams and genus filtration

      An RNA sequence is a linear, oriented sequence of the nucleotides (bases) A, U, G, C. These sequences “fold” by establishing bonds between pairs of nucleotides. In this paper, we only consider the Watson-Crick base pairs A-U and G-C and the wobble base pair U-G. The global conformation of an RNA molecule is determined by topological constraints encoded at the level of secondary structure, i.e., by the mutual arrangements of the base pairs [1].

      Secondary structures can be interpreted as (partial) matchings in a graph of permissible base pairs [2]. They can be represented as diagrams, i.e. graphs over the vertices 1,…,n, drawn on a horizontal line with bonds (arcs) in the upper half-plane. The length of an arc (i,j) is j − i. Furthermore, we say that two arcs (i,j) and (r,s) with i < r cross if i < r < j < s holds. In this representation a secondary structure without crossing arcs is referred to as a simple secondary structure, and one with crossing arcs as a pseudoknot structure, see Figure 1.
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Fig1_HTML.jpg
      Figure 1

      RNA structures as planar graphs and diagrams. (A) an RNA secondary structure and (B) an RNA pseudoknot structure.

      A diagram is a labeled graph over the vertex set [n] = {1,…,n} in which each vertex has degree ≤ 3, represented by drawing its vertices in a horizontal line. The backbone of a diagram is the sequence of consecutive integers (1,…,n) together with the edges {{i,i + 1} ∣ 1 ≤ i ≤ n − 1}. The arcs of a diagram, (i,j), where i < j, are drawn in the upper half-plane. We shall distinguish the backbone edge {i,i + 1} from the arc (i,i + 1), which we refer to as a 1-arc. A stack of length ℓ is a maximal sequence of “parallel” arcs, ((i,j),(i + 1,j − 1),…,(i + (ℓ − 1),j − (ℓ − 1))), and is also referred to as an ℓ-stack, see Figure 2.
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Fig2_HTML.jpg
      Figure 2

      Diagram representation and irreducibility. A diagram over {1,…,55}. The arcs (1,21) and (11,33) are crossing and the dashed arc (9,10) is a 1-arc, which is not allowed. This structure contains 4 stacks of length 7, 4, 6 and 4, from left to right respectively. Irreducibility is relative to a decomposition rule. For the rule Λ splitting $S_{i,j}$ into $S_{i,k}$ and $S_{k+1,j}$, $S_{1,55}$ is not Λ-irreducible, while $S_{2,40}$ and $S_{43,55}$ are. However, for a different decomposition rule, which removes the outermost arc, $S_{43,55}$ is not irreducible while $S_{2,40}$ is.

      We shall consider diagrams as fatgraphs, $\mathbb{G}$, that is graphs G together with a collection of cyclic orderings, called fattenings. Each fatgraph $\mathbb{G}$ determines an oriented surface $F(\mathbb{G})$ [3, 4] which is connected if G is and has some associated genus g(G) ≥ 0 and number r(G) ≥ 1 of boundary components. Clearly, $F(\mathbb{G})$ contains G as a deformation retract [5]. Fatgraphs were first applied to RNA secondary structures in [6, 7].

      A diagram $\mathbb{G}$ hence determines a unique surface $F(\mathbb{G})$ (with boundary). Filling the boundary components with discs we can pass from $F(\mathbb{G})$ to a surface without boundary. The Euler characteristic, χ, and genus, g, of this surface are given by $\chi = v - e + r$ and $g = 1 - \frac{1}{2}\chi$, respectively, where v, e, r are the numbers of discs, ribbons and boundary components in $\mathbb{G}$ [5]. The genus of a diagram is that of its associated surface without boundary and a diagram of genus g is referred to as a g-diagram.
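
      The genus formula can be evaluated directly from the arcs of a diagram. The following is a minimal sketch, not the authors' implementation: it assumes the common convention of discarding unpaired vertices and collapsing the backbone to a single disc (so v = 1 and e equals the number of arcs), and counts boundary components by tracing them. The function name `genus` is hypothetical.

```python
def genus(arcs):
    """Topological genus of a diagram given as a list of arcs (i, j), i < j.

    Assumption: unpaired vertices are dropped and the backbone is collapsed
    to one vertex, so that 2 - 2g = 1 - k + r for k arcs and r boundaries.
    """
    k = len(arcs)
    if k == 0:
        return 0
    ends = sorted(p for arc in arcs for p in arc)
    pos = {p: idx for idx, p in enumerate(ends)}   # relabel endpoints 0..2k-1
    partner = {}
    for i, j in arcs:
        partner[pos[i]] = pos[j]
        partner[pos[j]] = pos[i]
    m = 2 * k
    # A boundary component is a cycle of the map
    # "move to the next endpoint on the backbone, then cross its arc".
    seen, r = set(), 0
    for start in range(m):
        if start in seen:
            continue
        r += 1
        x = start
        while x not in seen:
            seen.add(x)
            x = partner[(x + 1) % m]
    return (1 + k - r) // 2


print(genus([(1, 4), (2, 3)]))    # two nested arcs
print(genus([(1, 21), (11, 33)])) # the two crossing arcs mentioned for Figure 2
```

      Nested or parallel arcs leave the genus at zero, while a single crossing already yields genus one, in line with the statement that secondary structures are exactly the structures of genus zero.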

      A g-diagram without arcs of the form (i,i + 1) (1-arcs) is called a g-structure. A g-diagram that contains only vertices of degree three, i.e. does not contain any vertices not incident to arcs in the upper half-plane, is called a g-matching. A diagram is called irreducible, if and only if it cannot be split into two by cutting the backbone without cutting an arc, see Figure2.

      Folding algorithms

      Folded configurations are energetically somewhat optimal. Here energy is obtained by adding contributions of loops [8] contained in RNA secondary and pseudoknot structures. Any RNA structure has a unique and disjoint decomposition into such loops, which stems from the fatgraph [9, 10] interpretation of such structures in which loops correspond to boundary components [11]. Additional constraints imply further properties, for instance certain minimum arc-length conditions [12] and the nonexistence of isolated bonds. An mfe-RNA structure can be predicted in polynomial time by means of dynamic programming (DP) routines [12, 13].

      The most commonly used tools for predicting simple RNA secondary structures, mfold [13] and the Vienna RNA Package [14], require O(n^2) space and O(n^3) time. In the following we omit “simple” and refer to secondary structures containing crossing arcs as pseudoknot structures.

      Generalizing the matrices of the DP-routines of secondary structure folding [13, 14] to gap-matrices [15] leads to a DP-folding of pseudoknotted structures [15] (pknot‐R&E) with O(n^4) space and O(n^6) time complexity. The following references provide a certainly incomplete list of DP-approaches to RNA pseudoknot structure prediction using various structure classes characterized in terms of recursion equations and/or stochastic grammars: [9, 15–26]. The most efficient algorithm for pseudoknot structures is [22] (pknotsRG), having O(n^2) space and O(n^4) time complexity. This algorithm, however, considers only a restricted class of pseudoknots.

      Note that RNA secondary structures are exactly structures of topological genus zero[27]. The topological classification of RNA structures[10, 11, 28] has recently been translated into an efficient DP-algorithm[9]. Fixing the topological genus of RNA structures implies that there are only finitely many types, the so called irreducible shadows[11].

      Sparsification

      Let us have a closer look at sparsification and the results of [29–31]. Sparsification is a method tailored to speed up DP-algorithms predicting mfe-secondary structures [29, 31]. The idea is to prune certain computation paths encountered in the DP-recursions, see Figure 3A. Let us consider the case of RNA secondary structure folding. Here sparsification reduces the DP-recursion paths to be based on so-called candidates. A candidate is in this case an interval for which the optimal solution cannot be written as a sum of optimal solutions of sub-intervals. This implies that the structure over a candidate is an “irreducible” structure when tracing back from the optimal solution. Considering only these candidates gives the same optimal solution as considering all possible intervals. The crucial observation here is that if these irreducibles appear only at a low rate we have a significant reduction in time and space complexity.
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Fig3_HTML.jpg
      Figure 3

      (A) Sparsification of secondary structure folding. Suppose the optimal solution $L_{i,j}$ is obtained from the optimal solutions $L_{i,k}$, $L_{k+1,q}$ and $L_{q+1,j}$. Based on the recursions of the secondary structures, $L_{i,k}$ and $L_{k+1,q}$ produce an optimal solution of $L_{i,q}$. Similarly, $L_{k+1,q}$ and $L_{q+1,j}$ produce an optimal solution of $L_{k+1,j}$. Now, in order to obtain an optimal solution of $L_{i,j}$ it is sufficient to consider either the grouping $L_{i,q}$ and $L_{q+1,j}$ or $L_{i,k}$ and $L_{k+1,j}$. (B) General idea of sparsification: $L_v$ is alternatively realized via $L_{v_1}$ and $L_{v'_2}$, or $L_{v'_1}$ and $L_{v_3}$, where $v'_1 = v_1 \mathbin{\dot\cup} v_2$ and $v'_2 = v_2 \mathbin{\dot\cup} v_3$. Thus it is sufficient to only consider one of the computation paths.

      Sparsification has also been applied in the context of RNA-RNA interaction structures [30] as well as RNA pseudoknot structures [32]. In contrast to RNA secondary structures, however, not every decomposition rule in the DP-folding of RNA pseudoknot structures is amenable to sparsification. By construction, sparsification can only be applied for calculating mfe-structures. Since the computation of the partition function [20, 33] needs to take into account all sub-structures, sparsification does not work there.

      Sparsification [29, 31, 32] can be described as follows: let $V = \{v_1, v_2, \dots\}$ be a set whose elements $v_i$ are unions of pairwise disjoint intervals. Let furthermore $L_v$ denote an optimal solution (here optimal means to maximize the scores) of the DP-routine over v. By assumption $L_v$ is recursively obtained. Suppose we are given a decomposition rule $\Lambda_1$, for which the optimal solution $L_v$ is $L_v = L_{v_1} + L_{v_2} + L_{v_3}$, where $v = v_1 \mathbin{\dot\cup} v_2 \mathbin{\dot\cup} v_3$. Then, under certain circumstances, the DP-routine may interpret $L_v$ either as $(L_{v_1} + L_{v_2}) + L_{v_3}$ or as $L_{v_1} + (L_{v_2} + L_{v_3})$, see Figure 3B. To be precise, this situation is encountered iff

      there exists an optimal solution $L_{v'_1}$ for a sub-structure over $v'_1$, where $v'_1 = v_1 \mathbin{\dot\cup} v_2$ via $\Lambda_2$, and $L_v$ is obtained from $L_{v'_1}$ and $L_{v_3}$ via $\Lambda_1$,

      there exists an optimal solution $L_{v'_2}$ for a sub-structure over $v'_2$, where $v'_2 = v_2 \mathbin{\dot\cup} v_3$ via $\Lambda_3$, and $L_v$ is obtained from $L_{v_1}$ and $L_{v'_2}$ via $\Lambda_1$.

      Given a decomposition
      $$L_v = \underbrace{\underbrace{L_{v_1} + L_{v_2}}_{\Lambda_2} + L_{v_3}}_{\Lambda_1},$$
      we call $\Lambda_2$ s-compatible to $\Lambda_1$ if there exists a decomposition rule $\Lambda_3$ such that
      $$L_v = \underbrace{L_{v_1} + \underbrace{L_{v_2} + L_{v_3}}_{\Lambda_3}}_{\Lambda_1}.$$

      Note that if $\Lambda_2$ is s-compatible to $\Lambda_1$ then $\Lambda_3$ is s-compatible to $\Lambda_1$. To summarize:

      Definition 1

      (s-compatible) Suppose $L_v$ is the optimal solution for $S_v$ over v, with $L_v = L_{v'_1} + L_{v_3}$ under decomposition rule $\Lambda_1$, and $L_{v'_1}$ is obtained from two optimal solutions $L_{v_1}$ and $L_{v_2}$ under rule $\Lambda_2$. Then $\Lambda_2$ is called s-compatible to $\Lambda_1$ if there exists some rule $\Lambda_3$ such that $L_{v'_2} = L_{v_2} + L_{v_3}$ and $L_v = L_{v_1} + L_{v'_2}$.

      Figure 3B depicts two such ways that realize the same optimal solution $L_v$. Sparsification prunes any such multiple computations of the same optimal value. Note that by symmetry, $\Lambda_2$ and $\Lambda_3$ are both s-compatible to $\Lambda_1$.

      We next come to the important concept of candidates. The latter mark the essential computation paths for the DP-routine.

      Definition 2

      (Candidates) Suppose $L_v$ is an optimal solution in the sense of maximizing. We call v a Λ-candidate if for any $v_1 \subsetneq v$ obtained by Λ, with $v = v_1 \mathbin{\dot\cup} v_2$, we have
      $$L_v > L_{v_1} + L_{v_2},$$

      and we shall denote the set of Λ-candidates by $Q_\Lambda$.

      By construction a Λ-candidate v is a union of disjoint intervals such that its optimal solution $L_v$ cannot be obtained via a Λ-splitting. This optimal solution allows one to construct a non-unique arc-configuration (sub-structure) over v [13, 14], and the above Λ-splitting consequently translates into a splitting of this sub-structure. This connects the notion of Λ-candidates with that of sub-structures and shows that a Λ-candidate implies a sub-structure that is Λ-irreducible.

      Lemma 1

      [29, 32] Suppose $L_v$ is obtained by selecting the optimal solution from the decomposition rules $\Lambda_1, \Lambda_2, \dots, \Lambda_n$. If Λ is s-compatible to all $\Lambda_i$, $1 \le i \le n$, then $L_v$ can be obtained via Λ-candidates.

      In summary, as for the impact of sparsification, [29] claims that sparsification reduces the time complexity by a linear factor. This claim is based on the assumption that RNA molecules satisfy the polymer-zeta property [29]. Subsequent studies draw a slightly different picture [31], concluding that sparsification requires O(nZ) time, where n denotes the length of the input sequence and Z is a sparsity parameter satisfying n ≤ Z < n^2. Recently, it has been shown in [34] that the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains O(n^3) under a wide variety of conditions.

      Sparsification of RNA secondary structures

      Here we recall some results of [29, 31] on the sparsification of RNA secondary structures. Secondary structures satisfy a simple recursion which gives the optimal (maximum) solution over [i,j] by $L_{i,j} = \max\{V_{i,j}, W_{i,j}\}$, where $V_{i,j}$ denotes the optimal solution in which (i,j) is a base pair, and $W_{i,j}$ denotes the optimal solution obtained by adding the optimal solutions of two subsequent intervals. Note that the optimal solution over a single vertex is denoted by $L_{i,i}$. We have the recursion equations for $V_{i,j}$ and $W_{i,j}$:
      $$(\Lambda_1)\quad V_{i,j} = L_{i+1,j-1} + w(i,j), \qquad (\Lambda_2)\quad W_{i,j} = \max_{i<k<j}\{L_{i,k} + L_{k+1,j}\},$$
      where w(i,j) is the energy contribution of (i,j) forming a base pair, see Figure 4. In case two positions i, j in the sequence are incompatible we have w(i,j) = −∞.
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Fig4_HTML.jpg
      Figure 4

      The recursion solving the optimal solution for secondary structures.

      An interval [i,j] is a Λ-candidate if the optimal solution over [i,j] is given by $L_{i,j} = V_{i,j} > W_{i,j}$. Indeed, [i,j] is a candidate iff [i,j] is in the candidate set of Λ, and we denote the set $Q_\Lambda$ by Q. Suppose the optimal solution $W_{i,j}$ is given by $W_{i,j} = L_{i,q} + L_{q+1,j}$ and suppose we have $L_{i,q} = L_{i,k} + L_{k+1,q}$. Then, since [i,q] is not a candidate, Lemma 1 shows that we can compute $W_{i,j} = L_{i,k} + L_{k+1,j}$, where [i,k] is a candidate.
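
      The effect of restricting the split rule to candidates can be checked on a toy model. The sketch below is an illustration under stated assumptions, not the algorithm of [29, 31]: it uses base-pair maximization with w(i,j) = 1 for compatible positions at distance at least 2, sets the score of single vertices to 0, handles an unpaired left end as a separate case, and lets the split point range over i ≤ k < j. The names `arc_score`, `fold_full` and `fold_sparse` are hypothetical.

```python
import random

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
NEG_INF = float("-inf")


def arc_score(seq, i, j):
    """w(i, j): 1 for a compatible pair that is not a 1-arc, else -infinity."""
    return 1 if j - i >= 2 and (seq[i], seq[j]) in PAIRS else NEG_INF


def fold_full(seq):
    """Unsparsified recursion: every split point k is tried (cubic time)."""
    n = len(seq)
    if n == 0:
        return 0
    L = [[0] * n for _ in range(n)]          # entries with j <= i stay 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            V = L[i + 1][j - 1] + arc_score(seq, i, j)
            W = max(L[i][k] + L[k + 1][j] for k in range(i, j))
            L[i][j] = max(V, W)
    return L[0][n - 1]


def fold_sparse(seq):
    """Same optimum, but [i, j] is split only at candidates [i, k]."""
    n = len(seq)
    if n == 0:
        return 0, 0
    L = [[0] * n for _ in range(n)]
    n_candidates = 0
    for i in range(n - 1, -1, -1):
        candidates = []                      # pairs (k, V[i][k]) with V > W
        for j in range(i + 1, n):
            V = L[i + 1][j - 1] + arc_score(seq, i, j)
            W = L[i + 1][j]                  # position i stays unpaired
            for k, v_ik in candidates:
                W = max(W, v_ik + L[k + 1][j])
            L[i][j] = max(V, W)
            if V > W:                        # [i, j] is a candidate
                candidates.append((j, V))
                n_candidates += 1
    return L[0][n - 1], n_candidates


random.seed(0)
seq = "".join(random.choice("ACGU") for _ in range(60))
score, n_cand = fold_sparse(seq)
print(score == fold_full(seq), n_cand, "candidates among", 60 * 59 // 2, "intervals")
```

      Both routines return the same optimal score; the sparsified version merely iterates over the stored candidates instead of all split points, which is where the size of the candidate set enters the running time.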

      Sparsification on RNA pseudoknot structures

      Sparsification can also be applied to the DP-algorithm folding RNA structures with pseudoknots [32]. In contrast to the decomposition rule Λ that splits an interval into two subsequent intervals, we encounter in the grammar for pseudoknotted structures additional, more complex decomposition rules [15]. As shown in [32] there exist some decomposition rules which are not s-compatible and which can accordingly not be sparsified at all, see Figure 5B. For instance, given a decomposition rule Λ in pknot‐R&E, subsequent decomposition rules which are s-compatible to Λ are referred to as the split type of Λ [32].
      http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Fig5_HTML.jpg
      Figure 5

      Decomposition rules for pseudoknot structures of fixed genus (decomposed into three colors). (A) three decompositions via the rule Λ, which is s-compatible to itself. (B) three decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$, where $\Lambda_2, \Lambda_3$ are s-compatible to $\Lambda_1$. (C) three decomposition rules $\Lambda_1, \Lambda_2, \Lambda_3$, where $\Lambda_2, \Lambda_3$ are not s-compatible to $\Lambda_1$.

      In the following we will study RNA pseudoknot structures of fixed topological genus, see Section “RNA structures, diagrams and genus filtration” for details. An algorithm folding such pseudoknot structures, gfold, has been presented in [9]. The decomposition rules that appear in gfold are reminiscent of those of pknot‐R&E but, as they restrict the genus of sub-structures, the iteration of gap-matrices is severely restricted and the effect of sparsification of these decompositions is significantly smaller.

      In the following, we restrict our analysis of pseudoknotted structures to the decomposition rule Λ, which splits an interval into two subsequent intervals. Put differently, Λ cuts the backbone of an RNA pseudoknot structure of fixed genus g over one interval without cutting a bond.

      Efficiency of sparsification

      By construction, the fewer candidates the DP-routine encounters, the more efficient the sparsification. Thus it is of utmost importance to analyze the number of candidates. In the case of sparsification of RNA secondary structures we have one basic decomposition rule Λ acting on intervals, namely Λ splits an interval into two disjoint, subsequent intervals. The implied notion of a Λ-irreducible sub-structure is that of a sub-structure nested in a maximal arc, where maximal refers to the partial order on arcs given by (i,j) ≤ (i′,j′) iff i′ ≤ i and j ≤ j′. This observation relates irreducibility to nesting of arcs and, following this line of thought, [29] identifies a specific property of polymer-chains introduced in [35, 36] to be of relevance for the size of candidate sets:

      Definition 3

      (Polymer-zeta property) Let $P(i,j)$ denote the probability of a structure over an interval [i,j] under some decomposition rule Λ. Then we say Λ follows the polymer-zeta property if $P(i,j) = b\,m^{-c}$ for some constants b, c > 0 and m = j − i.

      Polymer-zeta comes from modeling the 2D-folding of a polymer chain as a self-avoiding walk (SAW) in a 2D lattice [37]. It implies that the probability of a base pair (i,j) depends only on the length of the arc, i.e. $P(i,j) = P(m)$, where m = j − i. The authors of [29] stipulate that RNA molecules satisfy the polymer-zeta property and approximate $P(i,j)$ by $P(m) = b\,m^{-c}$ using 50,000 mRNA sequences of an average length of 1992 nucleotides [38]. They find b ≈ 2.11 and c ≈ 1.47. The average probability $P(m)$ is displayed in Figure 4, page 865 of [29] for increasing m. Furthermore, it is implied via Figure 6, page 867 of [29] that the average number of candidates converges to a constant, implying that the sparsified DP-routine folding secondary structures has O(n^2) time complexity.
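
      Why a power law with exponent c > 1 would lead to a bounded number of candidates can be seen numerically. The following sketch rests on a simplifying assumption that is not spelled out in the text: the expected number of candidates starting at a fixed position is taken to be the sum of $P(m) = b\,m^{-c}$ over all arc lengths m. The function name is hypothetical; b and c are the fitted values quoted above.

```python
def expected_candidates_per_position(n, b=2.11, c=1.47):
    """Partial sum of P(m) = b * m**(-c) over arc lengths m = 1..n."""
    return sum(b * m ** (-c) for m in range(1, n + 1))


for n in (10**2, 10**3, 10**4):
    print(n, round(expected_candidates_per_position(n), 3))

# For c > 1 the integral test gives sum_{m >= 1} m**(-c) <= 1 + 1 / (c - 1),
# so the partial sums stay below this bound for every n.
bound = 2.11 * (1 + 1 / (1.47 - 1))
print("upper bound:", round(bound, 3))
```

      Under this assumption the partial sums increase with n but never exceed the bound, which is the sense in which a polymer-zeta behaviour would yield a constant number of candidates per position and hence a quadratic total.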

      These findings have been questioned by [34], where it has been observed that the time complexity of a sparsified RNA folding algorithm based on energy minimization remains O(n^3) independently of the energy function used and the base composition of the RNA sequence. [34] argues that the significant effect of sparsification on the DP-routine is largely a finite-size effect: when the sequence length is below some threshold, the algorithm is dominated by the quadratic time factor. In this context, it may be worth pointing out that [31] notes that the improvement of a sparsified base-pairing maximization algorithm depends heavily on the base composition of the input. Backofen parameterizes explicitly the cardinality of candidate sets in [31].

      Contribution

      In this paper we study the sparsification of the decomposition rule Λ[31, 32] for RNA secondary and RNA pseudoknot structures of fixed topological genus. Based on Assumption 1 below our paper provides a combinatorial framework for quantifying the effects of sparsification of the Λ rule.

      We shall prove that the candidate set[29, 31, 32] is indeed small. We compute the probability of an interval being a candidate for two different energy models. For both models, this is facilitated via computing the generating function (GF) of structures and the generating function of irreducible structures. By studying the asymptotics of coefficients in these generating functions, we can compute the expected number of candidates of a uniformly random input sequence for large n. We show similar results for RNA pseudoknot structures of fixed topological genus. This provides new insights into the improvements of the sparsification of the concatenation-rule Λ in the presence of cross serial interactions. Our observations complement the detailed analysis of Backofen[31, 32]. We show that although for pseudoknot structures of fixed topological genus[10, 11] the effect of sparsification on the global time complexity is still unclear, the decomposition rule that splits an interval can be sped up significantly.

      Methods

      Suppose w is an energy function for RNA structures. Let $w_\delta(\sigma)$ denote the energy of an RNA structure σ over a sequence δ. The partition function of δ is given by
      $$Q(\delta) = \sum_{\sigma} e^{\frac{w_\delta(\sigma)}{RT}},$$
      where R is the universal gas constant and T is the temperature. (Here we consider $w_\delta(\sigma)$ as a positive score.) The partition function induces a probability space in which the probability of a structure σ is
      $$P_\delta(\sigma) = \frac{e^{\frac{w_\delta(\sigma)}{RT}}}{Q(\delta)}.$$

      The concept of a partition function is close to that of a generating function. In the case $e^{w_\delta(\sigma)/RT} = 1$, i.e., when each structure contributes equally regardless of the underlying sequence, the partition function equals $[z^n]G(z)$, where G is the generating function of the structures and $[z^n]G(z)$ is the coefficient of the term $z^n$.
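
      A small numerical illustration of these definitions, as a sketch only: the dot-bracket strings and their scores below are made-up values, the temperature is an assumed 37 °C, and the function name `boltzmann` is hypothetical.

```python
import math

R = 1.987e-3   # universal gas constant in kcal / (mol K)
T = 310.15     # assumed temperature in K


def boltzmann(scores, rt=R * T):
    """Return (Q, probabilities) for a dict mapping structures to positive scores."""
    weights = {s: math.exp(w / rt) for s, w in scores.items()}
    q = sum(weights.values())
    return q, {s: x / q for s, x in weights.items()}


# Hypothetical scores for three structures over the same sequence.
q, prob = boltzmann({"......": 0.0, "(....)": 1.2, "((..))": 2.5})
print(max(prob, key=prob.get))

# With all scores equal to zero every weight is 1, so Q just counts structures.
q0, prob0 = boltzmann({"......": 0.0, "(....)": 0.0, "((..))": 0.0})
print(q0, prob0)
```

      The second call shows the degenerate case discussed above: when every Boltzmann factor equals 1, the partition function reduces to the number of structures, i.e. to a coefficient of the generating function.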

      Two important energy models are arc-based [39] and loop-based [8], respectively. The loop-based energy-filtration is different from the notion of “stickiness” [40]. When folding random sequences, the probability that two positions are compatible is taken to be 6/16, reflecting that 6 of the 16 ordered nucleotide pairs form Watson-Crick or wobble base pairs.

      Assumption 1

      Let
      $$W(\sigma) = \left(\frac{6}{16}\right)^{\ell} \eta^{\,w(\sigma)},$$
      where η > 1 is a constant, w(σ) is the energy value assigned to σ based on a given energy model and ℓ is the number of arcs contained in σ. Then the probability of a particular structure σ to be the mfe-structure of a uniformly random input sequence is
      $$P(\sigma) = \frac{W(\sigma)}{\sum_{\sigma'} W(\sigma')}. \tag{1}$$
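
      For very short sequences the distribution of Assumption 1 can be tabulated by brute force. The sketch below makes two assumptions that go beyond the text: the energy is arc-based with w(σ) equal to the number of arcs, and η = 2 is an arbitrary choice satisfying η > 1. The helper names are hypothetical.

```python
from fractions import Fraction


def structures(i, j):
    """Yield every secondary structure on [i, j] as a tuple of arcs (no 1-arcs)."""
    if i > j:
        yield ()
        return
    for rest in structures(i + 1, j):          # vertex i is unpaired
        yield rest
    for k in range(i + 2, j + 1):              # vertex i is paired with k
        for inner in structures(i + 1, k - 1):
            for outer in structures(k + 1, j):
                yield ((i, k),) + inner + outer


def mfe_distribution(n, eta=Fraction(2)):
    """P(sigma) as in Eq. (1), with the assumed energy w(sigma) = number of arcs."""
    weight = {}
    for s in structures(1, n):
        arcs = len(s)
        weight[s] = Fraction(6, 16) ** arcs * eta ** arcs
    total = sum(weight.values())
    return {s: w / total for s, w in weight.items()}


dist = mfe_distribution(6)
# Probability that the structure over [1, 6] is irreducible, i.e. contains (1, 6).
p_irreducible = sum(p for s, p in dist.items() if (1, 6) in s)
print(len(dist), float(p_irreducible))
```

      Summing P(σ) over the irreducible structures of an interval is the kind of quantity the subsequent generating-function analysis evaluates asymptotically instead of by enumeration.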

      Asymptotics

      In this section we compute two generating functions and their singular expansions [11]. Let $c_g(n)$ and $d_g(n)$ denote the number of g-matchings having n arcs and of g-structures having n vertices, respectively, with GF
      $$C_g(z) = \sum_{n \ge 0} c_g(n)\, z^n, \qquad D_g(z) = \sum_{n \ge 0} d_g(n)\, z^n.$$
      The GF $C_g(z)$ has been computed in the context of the virtual Euler characteristic of the moduli-space of curves in [41] and $D_g(z)$ can be derived from $C_g(z)$ by means of symbolic enumeration [11]. The GF of genus zero matchings, $C_0(z)$, is well-known to be the GF of the Catalan numbers, i.e., the numbers of triangulations of a polygon with (n + 2) sides,
      $$C_0(z) = \frac{1 - \sqrt{1 - 4z}}{2z}.$$

      As for g ≥ 1 we have the following situation[11]

      Theorem 1

      Suppose g ≥ 1. Then the following assertions hold:
      (a) $D_g(z)$ is algebraic and
      $$D_g(z) = \frac{1}{z^2-z+1}\; C_g\!\left(\frac{z^2}{(z^2-z+1)^2}\right). \tag{2}$$
      In particular, the solution of minimal modulus of $z^2/(z^2-z+1)^2 = 1/4$ is the only dominant singularity of $D_g(z)$, and we have for some constant $a_g$ depending only on g and γ≈2.618:
      $$[z^n]\, D_g(z) \sim a_g\, n^{3(g-\frac{1}{2})}\, \gamma^n. \tag{3}$$
      (b) The bivariate GF of g-structures over n vertices, containing exactly m arcs, $E_g(z,t)$, is given by
      $$E_g(z,t) = \frac{1}{t z^2 - z + 1}\; C_g\!\left(\frac{t z^2}{(t z^2 - z + 1)^2}\right). \tag{4}$$
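      Theorem 1 is stated for g ≥ 1, but the same composition with $C_0$ gives $D_0(z)$ (this is the expression for $D_0(z)$ quoted in Section Proofs). The sketch below expands that composition as a truncated power series and compares it with the classical recurrence for secondary structures in which every arc encloses at least one vertex. Treating that recurrence as the reference is an assumption made here for illustration, and all helper names are hypothetical.

```python
def series_mul(a, b, n):
    """Product of two power series (coefficient lists), truncated to n terms."""
    out = [0] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[: n - i]):
                out[i + j] += ai * bj
    return out


def d0_from_composition(n):
    """First n coefficients of 1/(z^2 - z + 1) * C_0(z^2 / (z^2 - z + 1)^2)."""
    inv_w = [1, 1]  # 1/(1 - z + z^2) satisfies a(k) = a(k-1) - a(k-2)
    while len(inv_w) < n:
        inv_w.append(inv_w[-1] - inv_w[-2])
    inv_w = inv_w[:n]
    u = series_mul([0, 0, 1], series_mul(inv_w, inv_w, n), n)
    catalan = [1]
    for k in range(1, n):
        catalan.append(sum(catalan[i] * catalan[k - 1 - i] for i in range(k)))
    composed = [0] * n
    power = [1] + [0] * (n - 1)  # u^0
    for k in range(n):
        composed = [c + catalan[k] * p for c, p in zip(composed, power)]
        power = series_mul(power, u, n)
    return series_mul(inv_w, composed, n)


def d0_from_recurrence(n):
    """S(m) = S(m-1) + sum_{k=1}^{m-2} S(k) S(m-2-k), with S(0)=S(1)=S(2)=1."""
    s = [1, 1, 1]
    while len(s) < n:
        m = len(s)
        s.append(s[m - 1] + sum(s[k] * s[m - 2 - k] for k in range(1, m - 1)))
    return s[:n]


# Growth rate: z^2/(z^2 - z + 1)^2 = 1/4 reduces to z^2 - 3z + 1 = 0 for real z > 0.
gamma = (3 + 5 ** 0.5) / 2
d0 = d0_from_composition(40)
print(d0[:10])
print(d0[-1] / d0[-2], gamma)  # the ratio approaches gamma only slowly
```

      The ratio of consecutive coefficients converges slowly because of the sub-exponential factor, so the last line is only a rough consistency check.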
         

      Irreducible g-structures

      In the context of Λ-candidates we observed that irreducible sub-structures are of key importance. It is accordingly of relevance to understand the combinatorics of these structures. To this end let $D^*_g(z) = \sum_{n=0}^{\infty} d^*_g(n)\, z^n$ denote the GF of irreducible g-structures.

      Lemma 2

      For g ≥ 0, the GF $D^*_g(z)$ satisfies the recursion
      $$D^*_0(z) = 1 - \frac{1}{D_0(z)}, \qquad D^*_g(z) = -\,\frac{(D^*_0(z)-1)\,D_g(z) + \sum_{g_1=1}^{g-1} D^*_{g_1}(z)\, D_{g-g_1}(z)}{D_0(z)}, \quad g \ge 1.$$

      For a proof of Lemma 2, see Section Proofs.
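      For g = 0 the recursion of Lemma 2 is just a power series inversion, which can be carried out directly. The sketch below assumes that $d_0(n)$ is given by the classical secondary structure recurrence (arcs enclosing at least one vertex) and reads off $d^*_0(n)$ from $D^*_0(z) = 1 - 1/D_0(z)$. Helper names are hypothetical.

```python
def secondary_structure_counts(n):
    """d_0(0..n-1) via S(m) = S(m-1) + sum_{k=1}^{m-2} S(k) S(m-2-k)."""
    s = [1, 1, 1]
    while len(s) < n:
        m = len(s)
        s.append(s[m - 1] + sum(s[k] * s[m - 2 - k] for k in range(1, m - 1)))
    return s[:n]


def series_inverse(a, n):
    """Coefficients of 1/a(z) up to z^(n-1); requires a[0] == 1."""
    inv = [1]
    for k in range(1, n):
        inv.append(-sum(a[i] * inv[k - i] for i in range(1, k + 1)))
    return inv


def irreducible_counts(d0):
    """Coefficients of D*_0(z) = 1 - 1/D_0(z)."""
    inv = series_inverse(d0, len(d0))
    return [0] + [-c for c in inv[1:]]


d0 = secondary_structure_counts(15)
d0_star = irreducible_counts(d0)
print(d0_star)
```

      In this convention a single isolated vertex counts as irreducible, and for m ≥ 3 the irreducible structures over m vertices are exactly those in which (1,m) is an arc, so $d^*_0(m) = d_0(m-2)$.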

      Theorem 2

      For g ≥ 1 we have
      (a) the GF of irreducible g-structures over n vertices is given by
      $$D^*_g(z) = (z^2-z+1)\left(\frac{U_g(u)}{(1-4u)^{3g-\frac{1}{2}}} + \frac{V_g(u)}{(1-4u)^{3g-1}}\right), \tag{5}$$
      where $u = \frac{z^2}{(z^2-z+1)^2}$, $U_g(z)$ and $V_g(z)$ are both polynomials with lowest degree at least 2g, and $U_g(1/4), V_g(1/4) \neq 0$. In particular, for some constant $a^*_g > 0$ and γ≈2.618:
      $$d^*_g(n) \sim a^*_g\, n^{3(g-\frac{1}{2})}\, \gamma^n. \tag{6}$$
      (b) the bivariate GF of irreducible g-structures over n vertices, containing exactly m arcs, $E^*_g(z,t)$, is given by
      $$E^*_g(z,t) = (t z^2-z+1)\left(\frac{U_g(v)}{(1-4v)^{3g-\frac{1}{2}}} + \frac{V_g(v)}{(1-4v)^{3g-1}}\right), \tag{7}$$
      where $v = \frac{t z^2}{(t z^2-z+1)^2}$.

      We shall postpone the proof of Theorem 2 to Section Proofs.

      The main result

      Nussinov-like energy model

      In the following we mimic some form of mfe-g-structures: inspired by the Nussinov energy model [39] we consider the weight of a g-structure over n vertices, $\sigma_{g,n}$, to be given by $w(\sigma_{g,n}) = c\,\ell$, where c is the constant contribution of a single arc and ℓ is the number of arcs in $\sigma_{g,n}$ [40]. Then by Assumption 1, we have the weight function $W(\sigma_{g,n}) = (6/16)^{\ell}\,\eta^{c\ell} = ((6/16)\,\eta^{c})^{\ell}$. Note that the case $(6/16)\,\eta^{c} = 1$ corresponds to the uniform distribution, i.e. all g-structures have identical weight.

      This approach requires keeping track of the number of arcs, i.e. we need to employ bivariate GF. In Theorem 1(b) we computed this bivariate GF, and in Theorem 2(b) we derived from it $E^*_g(z,t)$, the GF of irreducible g-structures over n vertices containing ℓ arcs.

      The idea now is to substitute for the second indeterminate, t, some fixed $\tau = (6/16)\,\eta^{c} \in \mathbb{R}$. This substitution induces the formal power series
      $$D_{g,\tau}(z) = E_g(z,\tau),$$
      which we regard as being parameterized by τ. Obviously, setting τ = 1 we recover $D_g(z)$, i.e. we have $D_g(z) = D_{g,1}(z) = E_g(z,1)$. Note that for τ > 1/4, the polynomial $\tau z^2 - z + 1$ has no real root. Thus we have for τ > 1/4 the asymptotics
      $$d_{g,\tau}(n) \sim a_{g,\tau}\, n^{3(g-\frac{1}{2})}\, \gamma_\tau^{\,n} \quad\text{and}\quad d^*_{g,\tau}(n) \sim a^*_{g,\tau}\, n^{3(g-\frac{1}{2})}\, \gamma_\tau^{\,n}, \tag{8}$$
      with identical exponential growth rates as long as the supercritical paradigm [42] applies, i.e. as long as $\gamma_\tau^{-1}$, the real root of minimal modulus of
      $$\frac{\tau z^2}{(\tau z^2 - z + 1)^2} = \frac{1}{4},$$
      is smaller than the modulus of any singularity of $\frac{1}{\tau z^2 - z + 1}$. In this situation τ affects the constant $a_{g,\tau}$ and the exponential growth rate $\gamma_\tau$ but not the sub-exponential factor $n^{3(g-\frac{1}{2})}$. The latter stems from the singular expansion of $C_g(z)$. Analogously, we derive the τ-parameterized family of GF $D^*_{g,\tau}(z) = E^*_g(z,\tau)$. We set the contribution of a single arc c = 1 and the constant η = e, where e is the Euler number. Then we have the parameter $\tau = (6/16)\,e^{1}$ ≈ 1.0125. By abuse of notation we will omit the subscript τ, assuming $\tau = (6/16)\,e^{1}$.
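      The dependence of the growth rate on τ can be made explicit. For real z and τ > 1/4 the denominator $\tau z^2 - z + 1$ is positive, so taking square roots in the singularity equation gives the quadratic $\tau z^2 - (1 + 2\sqrt{\tau})\,z + 1 = 0$. The following sketch solves it and checks the supercritical condition; the function names are illustrative and the routine is only meant for τ > 1/4.

```python
import math


def rho_tau(tau: float) -> float:
    """Smallest positive root of tau*z^2 / (tau*z^2 - z + 1)^2 = 1/4 (tau > 1/4)."""
    s = math.sqrt(tau)
    b = 1 + 2 * s
    # Roots of tau*z^2 - b*z + 1 = 0; the discriminant equals 1 + 4*sqrt(tau) > 0.
    return (b - math.sqrt(b * b - 4 * tau)) / (2 * tau)


def is_supercritical(tau: float) -> bool:
    """For tau > 1/4 the roots of tau*z^2 - z + 1 are complex with modulus 1/sqrt(tau)."""
    return rho_tau(tau) < 1 / math.sqrt(tau)


for tau in (1.0, 1.0125):
    rho = rho_tau(tau)
    print(tau, rho, 1 / rho, is_supercritical(tau))
```

      For τ = 1 the quadratic becomes $z^2 - 3z + 1 = 0$, which recovers γ ≈ 2.618 from Theorem 1.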

      The main result of this section is that the set of Λ-candidates is a small proportion of all entries. To put this size into context we note that the total number of entries considered for the Λ-decomposition rule is given by
      $$M(n) = \sum_{m=1}^{n} (n-m+1) = \frac{n(n+1)}{2}.$$
      Theorem 3
      Suppose an mfe-g-structure over an interval of length m is irreducible with probability $d^*_g(m)/d_g(m)$. Then the expected number of candidates of g-structures for sequences of length n satisfies
      $$\mathbb{E}_g(n) = \Theta(n^2)$$
      and furthermore, setting $\bar{\mathbb{E}}_g(n) = \mathbb{E}_g(n)/M(n)$, we have
      $$\bar{\mathbb{E}}_g(n) \sim d^*_g(n)/d_g(n) \sim b_g,$$

      where $b_g > 0$ is a constant.

      We provide an illustration of Theorem 3 in Figure 6.
      Figure 6

      The expected number of candidates for secondary and 1-structures from a random input with a simplified arc-based energy model, $\bar{\mathbb{E}}_0(n)$ and $\bar{\mathbb{E}}_1(n)$: we compute the expected number of candidates obtained by folding 100 random sequences for secondary structures (A)(solid) and 1-structures (B)(solid). We also display the theoretical expectations implied by Theorem 3 (A)(dashed) and (B)(dashed).

      Proof

      We prove the theorem by quantifying the probability of [i,j] being a Λ-candidate. In this case any (not necessarily unique) sub-structure realizing the optimal solution $L_{i,j}$ is Λ-irreducible, and therefore an irreducible structure over [i,j].

      Let m = (j − i + 1). By assumption, the probability that [i,j] is a candidate, conditional on the existence of a sub-structure over [i,j], is given by
      $$\mathbb{P}\bigl([i,j] \mid [i,j] \text{ is a candidate}\bigr) = \frac{d^*_g(m)}{d_g(m)}. \tag{9}$$
      Note that $\mathbb{P}([i,j] \mid [i,j] \text{ is a candidate})$ does not depend on the relative location of the interval but only on the interval-length. Let $P_g(m) = d^*_g(m)/d_g(m)$. Then, according to Theorems 1 and 2,
      $$(1-\varepsilon_1)\, a_g\, m^{3(g-\frac{1}{2})}\gamma^m \le d_g(m) \le (1+\varepsilon_1)\, a_g\, m^{3(g-\frac{1}{2})}\gamma^m, \qquad (1-\varepsilon_2)\, a^*_g\, m^{3(g-\frac{1}{2})}\gamma^m \le d^*_g(m) \le (1+\varepsilon_2)\, a^*_g\, m^{3(g-\frac{1}{2})}\gamma^m,$$
      for $m \ge m_0$, where $m_0 > 0$ and $0 < \varepsilon_1, \varepsilon_2 < 1$ are constants. On the one hand
      $$P_g(m) = \frac{d^*_g(m)}{d_g(m)} \le \frac{(1+\varepsilon_2)\, a^*_g\, m^{3(g-\frac{1}{2})}\gamma^m}{(1-\varepsilon_1)\, a_g\, m^{3(g-\frac{1}{2})}\gamma^m} = (1+\varepsilon')\,\frac{a^*_g}{a_g} = (1+\varepsilon')\, b_g, \tag{10}$$
      where $1+\varepsilon' = (1+\varepsilon_2)/(1-\varepsilon_1)$ and $b_g = a^*_g/a_g > 0$ is a constant. On the other hand, we have
      $$P_g(m) = \frac{d^*_g(m)}{d_g(m)} \ge \frac{(1-\varepsilon_2)\, a^*_g\, m^{3(g-\frac{1}{2})}\gamma^m}{(1+\varepsilon_1)\, a_g\, m^{3(g-\frac{1}{2})}\gamma^m} = (1-\varepsilon'')\,\frac{a^*_g}{a_g} = (1-\varepsilon'')\, b_g, \tag{11}$$
      where $1-\varepsilon'' = (1-\varepsilon_2)/(1+\varepsilon_1)$. Setting $\varepsilon = \max\{\varepsilon', \varepsilon''\}$, we can conclude that $P_g(m) = d^*_g(m)/d_g(m) \sim b_g$, see Figure 7.
      Figure 7

      The probability distribution of $P_0(m)$ (A) and $P_1(m)$ (B) on a simplified arc-based energy model.

      We next study the expected number of candidates over an interval of length m. To this end let
      $$X_m = \bigl|\{[i,j] \mid [i,j] \text{ is a } \Lambda\text{-candidate of length } m\}\bigr|.$$
      The expected cardinality of the set of Λ-candidates of length m = (j − i + 1) encountered in the DP-algorithm is given by
      $$\mathbb{E}_g(X_m) \sim (n-(m-1))\, P_g(m),$$
      since there are n−(m−1) starting points for such an interval [i,j]. For sufficiently large $m > m_0$ we have $P_g(m) \le (1+\varepsilon)\, b_g$ with ε being a small constant. Thus, by linearity of expectation, we have
      $$\mathbb{E}_g(n) = \mathbb{E}_g\Bigl(\sum_{m} X_m\Bigr) \le \sum_{m=1}^{m_0} (n-m+1)\, P_g(m) + (1+\varepsilon)\, b_g \sum_{m=m_0}^{n} (n-m+1). \tag{12}$$
      Consequently, the expected size of the Λ-candidate set is $\Theta(n^2)$. We proceed by comparing the expected number of candidates of a sequence of length n with $M(n)$,
      $$\frac{\mathbb{E}_g(n)}{M(n)} \le \frac{\sum_{m=1}^{m_0} (n-m+1)\, P_g(m) + (1+\varepsilon)\, b_g \sum_{m=m_0}^{n} (n-m+1)}{\sum_{m=1}^{n} (n-m+1)} \le (1+\varepsilon)\, b_g + \frac{\sum_{m=1}^{m_0} \bigl(P_g(m) - (1+\varepsilon)\, b_g\bigr)(n-m+1)}{\sum_{m=1}^{n} (n-m+1)} \le (1+\varepsilon)\, b_g + \frac{k \cdot n}{n^2},$$
      where k is a constant. For sufficiently large $n \ge n_0$, $\mathbb{E}_g(n)/M(n) \le (1+\varepsilon)\, b_g$. Furthermore
      $$\frac{\mathbb{E}_g(n)}{M(n)} \ge \frac{\sum_{m=1}^{m_0} (n-m+1)\, P_g(m) + (1-\varepsilon)\, b_g \sum_{m=m_0}^{n} (n-m+1)}{\sum_{m=1}^{n} (n-m+1)} \ge (1-\varepsilon)\, b_g,$$

      from which we can conclude $\mathbb{E}_g(n)/M(n) \sim d^*_g(m)/d_g(m) \sim b_g$ and the theorem is proved. □
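      The mechanism behind Theorem 3 can be watched numerically in the simplest case g = 0. The sketch below is an illustration under explicit assumptions, not a reproduction of Figure 6: it takes $d_{0,\tau}(m)$ to be the total weight of secondary structures over m vertices in which every arc encloses at least one vertex and carries weight τ, obtains $d^*_{0,\tau}(m)$ from Lemma 2 by series inversion, and then evaluates $P_0(m)$ and $\sum_m (n-m+1)\,P_0(m) / M(n)$. The function names and the default τ = 1 (the uniform case) are choices made here.

```python
from fractions import Fraction


def weighted_counts(n_max, tau):
    """d[m]: total weight of structures over m vertices, each arc weighted tau."""
    d = [Fraction(1), Fraction(1), Fraction(1)]
    for m in range(3, n_max + 1):
        d.append(d[m - 1] + tau * sum(d[k] * d[m - 2 - k] for k in range(1, m - 1)))
    return d[: n_max + 1]


def irreducible_weights(d):
    """Coefficients of 1 - 1/D(z), the g = 0 case of Lemma 2."""
    inv = [Fraction(1)]
    for k in range(1, len(d)):
        inv.append(-sum(d[i] * inv[k - i] for i in range(1, k + 1)))
    return [Fraction(0)] + [-c for c in inv[1:]]


def candidate_ratio(n, tau=Fraction(1)):
    """Return P_0(1..n) (index 0 unused) and sum_m (n-m+1) P_0(m) / M(n)."""
    d = weighted_counts(n, tau)
    d_star = irreducible_weights(d)
    p = [None] + [d_star[m] / d[m] for m in range(1, n + 1)]
    total_entries = n * (n + 1) // 2  # M(n)
    expected = sum((n - m + 1) * p[m] for m in range(1, n + 1))
    return p, float(expected / total_entries)


p, ratio = candidate_ratio(120)
print(float(p[120]), ratio)
```

      Exact rational arithmetic is used so that the subtraction in the series inversion does not lose precision. Both printed numbers drift towards the same constant as n grows, which is the content of the theorem; the speed of convergence is limited by the polynomial correction terms.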

      Loop-based energy model

      In this section we discuss the loop-based energy model of RNA secondary structure folding. To be precise, we invoke here trivariate GFs $F(z,t,v)$ and $F^*(z,t,v)$ whose coefficients count the numbers of secondary structures and irreducible secondary structures over n vertices having ℓ arcs and energy j, respectively. This becomes necessary since the loop-based model distinguishes between arcs and energy. The “cancellation” effect or reparameterization of stickiness [40] to which we referred before does not appear in this context. Thus we need both an arc- as well as an energy-filtration.

      A further complication emerges. In contrast to the GFs $E_g(z,t)$ and $E^*_g(z,t)$, the new GFs are not simply obtained by formally substituting $t z^2/(t z^2 - z + 1)^2$ into the power series $D_g(z)$ and $D^*_g(z)$ as bivariate terms. The more complicated energy model requires a specific recursion for irreducible secondary structures.

      The energy model used in the prediction of secondary structures is more complicated than the simple arc-based energy model. Loops, which are formed by arcs as well as isolated vertices between the arcs, are considered to give energy contributions. Loops are categorized as hairpin loops (no nested arcs), interior loops (including bulge loops and stacks) and multi-loops (more than two arcs nested), see Figure 8. An arbitrary secondary structure can be uniquely decomposed into a collection of mutually disjoint loops. A result of the particular energy parameters [8] is that the energy model prefers interior loops, in particular stacks (no isolated vertex between two parallel arcs), and disfavors multi-loops. Based on this observation, we give a simplified energy model for a loop λ contained in a secondary structure, which only depends on the loop type:

      • w(λ)=0.5 if λ is a hairpin loop,

      • w(λ)=1 if λ is an interior loop,

      • w(λ)=−5 if λ is a multi-loop.

      Figure 8

      Diagram representation of loop types in secondary structures: (A) hairpin loop, (B) interior loop, (C) multi-loop.

      The energy for a secondary structure σ accordingly is given by
      $$w(\sigma) = \sum_{\lambda \in \sigma} w(\lambda). \tag{13}$$
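      A small sketch of how eq. (13) can be evaluated for a concrete structure. It assumes the structure is given in dot-bracket notation (an input format chosen here, not prescribed by the paper), attaches one loop to every arc, and classifies that loop by the number of arcs nested directly inside it: none for a hairpin loop, one for an interior loop, two or more for a multi-loop, following the recursion used for multi-loops in the proof of Lemma 3. Unpaired vertices outside all arcs are assumed to contribute nothing.

```python
LOOP_WEIGHT = {"hairpin": 0.5, "interior": 1.0, "multi": -5.0}


def loop_energy(dot_bracket: str) -> float:
    """Sum of w(lambda) over the loops closed by the arcs of the structure."""
    open_arcs = []  # for each open arc: number of arcs found directly inside it
    total = 0.0
    for ch in dot_bracket:
        if ch == "(":
            open_arcs.append(0)
        elif ch == ")":
            if not open_arcs:
                raise ValueError("unbalanced structure")
            children = open_arcs.pop()
            if children == 0:
                kind = "hairpin"
            elif children == 1:
                kind = "interior"
            else:
                kind = "multi"
            total += LOOP_WEIGHT[kind]
            if open_arcs:
                open_arcs[-1] += 1  # this arc is nested directly in its parent
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if open_arcs:
        raise ValueError("unbalanced structure")
    return total


print(loop_energy("((...))"))       # one stack (interior loop) plus one hairpin loop
print(loop_energy("((...)(...))"))  # one multi-loop plus two hairpin loops
```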
      Let $F^*_0(z)$ and $F_0(z)$ be the energy-filtered GFs obtained by setting t = 6/16 and v = η = e in $F^*(z,t,v)$ and $F(z,t,v)$, where e is the Euler number. Then
      $$f_n = \sum_{\sigma} \Bigl(\frac{6}{16}\Bigr)^{\ell} e^{w(\sigma)} = \sum_{\sigma} W(\sigma), \qquad f^*_n = \sum_{\sigma^*} \Bigl(\frac{6}{16}\Bigr)^{\ell^*} e^{w(\sigma^*)} = \sum_{\sigma^*} W(\sigma^*),$$

      where σ is an arbitrary and σ* is an irreducible secondary structure. Along these lines, ℓ and ℓ* denote the number of arcs in σ and σ*. In other words, what happens here is that we find a suitable parameterization which brings us back to a simple univariate GF whose coefficients count the sum of weights of structures over n vertices.

      Lemma 3
      The energy-filtered generating function of irreducible RNA secondary structures, $F^*_0(z)$, satisfies the recursion
      $$F^*_0(z) = \frac{6}{16}\, e^{0.5} z^2\, \frac{z}{1-z} + \frac{6}{16}\, e^{1} z^2 \Bigl(\frac{1}{1-z}\Bigr)^{2} F^*_0(z) + \frac{6}{16}\, e^{-5} z^2\, \frac{\bigl(F^*_0(z)\frac{1}{1-z}\bigr)^{2}}{1 - F^*_0(z)\frac{1}{1-z}} \cdot \frac{1}{1-z}, \tag{14}$$
      and $F^*_0(z)$ is uniquely determined by the above equation. Furthermore
      $$F_0(z) = \frac{1}{1-z} \cdot \frac{1}{1 - F^*_0(z)\frac{1}{1-z}}. \tag{15}$$
      Proof
      We first consider the GF $F^*_0(z)$, whose coefficient of $z^n$ denotes the total weight of irreducible secondary structures over n vertices, where (1,n) is an arc. This arc gives a term $(6/16)\,z^2$. Isolated vertices lead to the term
      $$z^p \sum_{i=0}^{\infty} z^i = z^p\, \frac{1}{1-z},$$

      where p denotes the minimum number of isolated vertices to be inserted. Depending on the types of loops formed by (1,n), we have

      • hairpin loops: $\frac{z}{1-z}$,

      • interior loops: $F^*_0(z) \bigl(\frac{1}{1-z}\bigr)^{2}$,

      • multi-loops: there are at least two irreducible sub-structures, as well as isolated vertices, thus
        $$\frac{1}{1-z} \sum_{i=2}^{\infty} \Bigl(F^*_0(z)\,\frac{1}{1-z}\Bigr)^{i} = \frac{\bigl(F^*_0(z)\frac{1}{1-z}\bigr)^{2}}{1 - F^*_0(z)\frac{1}{1-z}} \cdot \frac{1}{1-z}.$$
      Considering the contributions from the energy model we compute
      $$F^*_0(z) = \frac{6}{16} \left[ e^{0.5} z^2\, \frac{z}{1-z} + e^{1} z^2 \Bigl(\frac{1}{1-z}\Bigr)^{2} F^*_0(z) + e^{-5} z^2\, \frac{\bigl(F^*_0(z)\frac{1}{1-z}\bigr)^{2}}{1 - F^*_0(z)\frac{1}{1-z}} \cdot \frac{1}{1-z} \right],$$

      which establishes the recursion. The uniqueness of the solution as a power series follows from the fact that each coefficient can evidently be recursively computed.

      An arbitrary secondary structure can be considered as a sequence of irreducible sub-structures with certain intervals of isolated vertices. Thus
      $$F_0(z) = \frac{1}{1-z} \sum_{i=0}^{\infty} \Bigl(\frac{1}{1-z}\, F^*_0(z)\Bigr)^{i} = \frac{1}{1-z} \cdot \frac{1}{1 - F^*_0(z)\frac{1}{1-z}}.$$
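      The remark that every coefficient can be computed recursively can be turned into a short computation. The sketch below reads recursion (14) as $F^*_0 = A\,z^3 q + B\,z^2 q^2 F^*_0 + C\,z^2 q\,(qF^*_0)^2/(1-qF^*_0)$ with $q = 1/(1-z)$, $A = \frac{6}{16}e^{0.5}$, $B = \frac{6}{16}e^{1}$, $C = \frac{6}{16}e^{-5}$, and iterates it on truncated power series; this reading, the floating point arithmetic and all names are assumptions of the sketch. $F_0$ is then obtained from eq. (15).

```python
import math


def mul(a, b, n):
    """Product of two power series, truncated to n coefficients."""
    out = [0.0] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[: n - i]):
                out[i + j] += ai * bj
    return out


def inv(a, n):
    """Coefficients of 1/a(z) up to z^(n-1); requires a[0] != 0."""
    out = [1.0 / a[0]]
    for k in range(1, n):
        out.append(-sum(a[i] * out[k - i] for i in range(1, k + 1)) / a[0])
    return out


def loop_model_coefficients(n):
    """First n coefficients of F*_0 (recursion (14)) and F_0 (eq. (15))."""
    A = 6 / 16 * math.exp(0.5)   # hairpin loop
    B = 6 / 16 * math.exp(1.0)   # interior loop
    C = 6 / 16 * math.exp(-5.0)  # multi-loop
    q = [1.0] * n                # 1/(1 - z)
    q2 = mul(q, q, n)
    hairpin = mul([0.0, 0.0, 0.0, A], q, n)
    fstar = [0.0] * n
    for _ in range(n):  # every pass fixes at least two further coefficients
        g = mul(q, fstar, n)
        one_minus_g = [1.0 - g[0]] + [-x for x in g[1:]]
        multi = mul(mul(mul(g, g, n), inv(one_minus_g, n), n), q, n)
        interior = mul(q2, fstar, n)
        rest = [B * i + C * m for i, m in zip(interior, multi)]
        fstar = [h + r for h, r in zip(hairpin, mul([0.0, 0.0, 1.0], rest, n))]
    g = mul(q, fstar, n)
    f = mul(q, inv([1.0 - g[0]] + [-x for x in g[1:]], n), n)
    return fstar, f


fstar, f = loop_model_coefficients(40)
print(fstar[:8])
print(f[-1] / f[-2], fstar[-1] / fstar[-2])  # rough growth-rate estimates
```

      The ratios in the last line only give a coarse impression of the exponential growth rate at this truncation order, since the sub-exponential factor still contributes noticeably.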

      Lemma 4
      $F^*_0(z)$ and $F_0(z)$ have the same singular expansion up to a constant factor:
      $$f^*_0(n) \sim \alpha\, n^{-\frac{3}{2}}\, \gamma^n \quad\text{and}\quad f_0(n) \sim \beta\, n^{-\frac{3}{2}}\, \gamma^n, \tag{16}$$

      where α≈0.24 and β≈2.88 are constants and γ≈2.1673.

      Proof

      Solving eq. (14) we obtain a unique solution for $F^*_0(z)$ whose coefficients are all positive. We observe that the dominant singularity of $F^*_0(z)$ is ρ≈0.4614. $F_0(z)$ is a function of $F^*_0(z)$, and we verify that the real root of minimal modulus of $1 - F^*_0(z)\frac{1}{1-z} = 0$ is bigger than ρ. Then the supercritical paradigm [42] applies, and $F_0(z)$ and $F^*_0(z)$ have identical exponential growth rates. Furthermore, $F^*_0(z)$ and $F_0(z)$ have the same sub-exponential factor $n^{-\frac{3}{2}}$, hence the lemma. □

      Theorem 4
      Suppose an mfe-secondary structure over an interval of length m is irreducible with probability $P_0(m) = \frac{f^*_0(m)}{f_0(m)}$. Then the expected number of candidates from a random sequence of length n with a simplified loop-based energy model is
      $$\mathbb{E}_0(n) = \Theta(n^2)$$
      and furthermore, setting $\bar{\mathbb{E}}_0(n) = \mathbb{E}_0(n)/M(n)$, we have
      $$\bar{\mathbb{E}}_0(n) \sim f^*_0(n)/f_0(n) \sim b,$$

      where b=α/β≈0.08.

      Proof

      By Lemma 4 we have $f^*_0(m)/f_0(m) \sim b$, where b is a constant. The proof is completely analogous to that of Theorem 3. □

      We show the distribution of $P_0(m)$ and $\bar{\mathbb{E}}_0(n)$ in Figure 9.
      Figure 9

      The distribution of $P_0(m)$ (A) and $\bar{\mathbb{E}}_0(n)$ obtained by folding 100 random sequences on the loop-based model (B)(solid), as well as the theoretical expectation implied by Theorem 4 (B)(dashed).

      Conclusion

      In this paper we quantify the effect of sparsification of the rule Λ. This rule splits intervals and separates concatenated sub-structures. The sparsification of Λ alone is claimed to provide a speed-up of up to a linear factor of the DP-folding of RNA secondary structures [29]. A similar conclusion is drawn in [30], where the sparsification of RNA-RNA interaction structures is shown to also yield a linear reduction in time complexity. Both papers [29, 30] base their conclusion on the validity of the polymer-zeta property. However, [34] comes to a different conclusion, reporting a mere constant reduction in time complexity. While Λ is the key for the time complexity reduction of secondary structure folding, it is conceivable that for pseudoknot structures there may exist non-sparsifiable rules, in which case the overall time complexity is not reduced.

      In any case, the key is the set of candidates and we provide an analysis of Λ-candidates by combinatorial means. In general, the connection between candidates, i.e. unions of disjoint intervals and the combinatorics of structures is actually established by the algorithm itself via backtracking: at the end of the DP-algorithm a structure is being generated that realizes the previously computed energy as mfe-structure. This connects intervals and sub-structures.

      So, does the condition c>1 in polymer-zeta apply in the context of RNA structures? In fact this condition would follow if the intervals in question are distributed as in uniformly sampled structures. This however, is far from reasonable, due to the fact that the mfe-algorithm deliberately designs some mfe-structure over the given interval. What the algorithm produces is in fact antagonistic to uniform sampling. We here wish to acknowledge the help of one anonymous referee in clarifying this point.

      Our results imply that polymer-zeta does not hold. Our framework critically depends on a specific distribution of mfe structures within irreducible and arbitrary structures, explicated in Assumption 1. We have cross-checked Assumption 1 with the number of candidates in DP-programs (using the same energy model), see Figure 7 and Figure 9. With this conclusion we are in accord with [31, 34] but provide an entirely different approach.

      The non-validity of polymer-zeta has also been observed in the context of the limit distribution of the 5’-3’ distances of RNA secondary structures [43]. Here it is observed that long arcs, to be precise arcs of length O(n), always exist. This of course contradicts the polymer-zeta property in case of c>1.

      The key to the quantification of the expected number of candidates is the singularity analysis of a pair of energy-filtered GF, namely that of a class of structures and that of the subclass of all such structures that are irreducible. We show that for various energy models the singular expansions of both these functions are essentially equal, modulo some constant. This implies that the expected number of candidates is Θ(n²) and all constants can explicitly be computed from a detailed singularity analysis. The good news is that, depending on the energy model, a significant constant reduction of around 96% can be obtained. This is in accordance with data produced in [31] for the mfe-folding of random sequences. There a reduction by 98% is reported for sequences of length ≥ 500.

      Our findings are of relevance for numerous results that are formulated in terms of sizes of candidate sets [32]. These can now be quantified. It is certainly of interest to devise a full-fledged analysis of the loop-based energy model. While these computations are far from easy, our framework shows how to perform such an analysis.

      Using the paradigm of gap-matrices, Backofen has shown [32] that the sparsification of the DP-folding of RNA pseudoknot structures exhibits additional instances where sparsification can be applied, see Figure 5B. Our results show that the expected number of candidates is Θ(n²), where the constant reduction is around 90%. This is in fact very good news, since the sequence length in the context of RNA pseudoknot structure folding is in the order of hundreds of nucleotides. So sparsification of further instances does have a significant impact on the time complexity of the folding.

      Proofs

      In this section, we prove Lemma 2 and Theorem 2.

      Proof for Lemma 2: let $D(z,u)$ and $D^*(z,u)$ be the bivariate GF $D(z,u) = \sum_{n \ge 0} \sum_{g=0}^{\lfloor n/2 \rfloor} d_g(n)\, z^n u^g$ and $D^*(z,u) = \sum_{n \ge 1} \sum_{g=0}^{\lfloor n/2 \rfloor} d^*_g(n)\, z^n u^g$. Suppose a structure contains exactly j irreducible structures, then
      $$D(z,u) = \sum_{j \ge 0} D^*(z,u)^{j} = \frac{1}{1 - D^*(z,u)} \tag{17}$$
      and
      $$D^*_g(z) = [u^g]\, D^*(z,u) = -[u^g]\, \frac{1}{D(z,u)}, \quad g \ge 1, \tag{18}$$
      as well as $D^*_0(z) = 1 - [u^0]\, \frac{1}{D(z,u)}$. Let $F(z,u) = \sum_{n \ge 0} \sum_{g \ge 0} f_g(n)\, z^n u^g = \frac{1}{D(z,u)}$. Then $F(z,u)\,D(z,u) = 1$, whence for g ≥ 1,
      $$\sum_{g_1=0}^{g} F_{g_1}(z)\, D_{g-g_1}(z) = [u^g]\, F(z,u)\, D(z,u) = 0, \tag{19}$$
      and $F_0(z)\, D_0(z) = 1$, where $F_g(z) = \sum_{n \ge 0} f_g(n)\, z^n = [u^g]\, F(z,u) = [u^g]\, \frac{1}{D(z,u)}$. Furthermore, we have $F_0(z) = \frac{1}{D_0(z)}$ and
      $$F_g(z) = -\,\frac{\sum_{g_1=0}^{g-1} F_{g_1}(z)\, D_{g-g_1}(z)}{D_0(z)}, \quad g \ge 1, \tag{20}$$
      which implies $D^*_0(z) = 1 - F_0(z) = 1 - \frac{1}{D_0(z)}$ and
      $$D^*_g(z) = -F_g(z) = -\,\frac{(D^*_0(z) - 1)\, D_g(z) + \sum_{g_1=1}^{g-1} D^*_{g_1}(z)\, D_{g-g_1}(z)}{D_0(z)}. \tag{21}$$
      Proof for Theorem 2 Let [n] k denote the set of compositions of n having k parts, i.e. for σ∈[n] k we have σ=(σ1,…,σ k ) and i = 1 k σ i = n http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_IEq103_HTML.gif.Claim.
      D g + 1 ( z ) = D g + 1 ( z ) D 0 ( z ) 2 + j = 0 g 1 ( 1 ) g + 2 j D 0 ( z ) g + 2 j × σ [ g + 1 ] g + 1 j i = 1 g + 1 j D σ i ( z ) . http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Equ22_HTML.gif
      (22)
      We shall prove the claim by induction on g. For g=1 we have
      D 1 ( x ) = D 1 ( z ) D 0 ( z ) 2 , http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Equ23_HTML.gif
      (23)
      whence eq. (22) holds for g=1. By induction hypothesis, we may now assume that for jg, eq. (22) holds. According to Lemma 2, we have
      D g + 1 ( z ) = ( D 0 ( z ) 1 ) D g + 1 ( z ) + g 1 = 1 g D g 1 ( z ) D g + 1 g 1 ( z ) D 0 ( z ) = D g + 1 ( z ) D 0 ( z ) 2 g 1 = 1 g D g 1 ( z ) D 0 ( z ) 3 + j = 0 g 1 2 ( 1 ) g 1 + 1 j D 0 ( z ) g 1 + 2 j × σ [ g 1 ] g 1 j i = 1 g 1 j D σ i ( z ) D g + 1 g 1 ( z ) . http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-7-28/MediaObjects/13015_2011_Article_166_Equab_HTML.gif
      We next observe
      $$-\sum_{g_1=1}^{g}\frac{D_{g_1}(z)}{D_0(z)^{3}}\,D_{g+1-g_1}(z) = \frac{(-1)^{g+2-(g-1)}}{D_0(z)^{g+2-(g-1)}} \sum_{\sigma\in[g+1]_{g+1-(g-1)}}\prod_{i=1}^{g+1-(g-1)} D_{\sigma_i}(z),$$
      (24)
      and setting $h=g_1-j$ we obtain
      $$\begin{aligned}
      &-\sum_{g_1=1}^{g}\sum_{j=0}^{g_1-2}\frac{(-1)^{g_1+1-j}}{D_0(z)^{g_1+2-j}} \sum_{\sigma\in[g_1]_{g_1-j}}\prod_{i=1}^{g_1-j} D_{\sigma_i}(z)\,D_{g+1-g_1}(z) \\
      &\quad= \sum_{g_1=1}^{g}\sum_{h=2}^{g_1}\frac{(-1)^{h+2}}{D_0(z)^{h+2}} \sum_{\sigma\in[g_1]_{h}}\prod_{i=1}^{h} D_{\sigma_i}(z)\,D_{g+1-g_1}(z) \\
      &\quad= \sum_{h=2}^{g}\frac{(-1)^{h+2}}{D_0(z)^{h+2}} \sum_{g_1=h}^{g}\sum_{\sigma\in[g_1]_{h}}\prod_{i=1}^{h} D_{\sigma_i}(z)\,D_{g+1-g_1}(z) \\
      &\quad= \sum_{h=2}^{g}\frac{(-1)^{h+2}}{D_0(z)^{h+2}} \sum_{\sigma\in[g+1]_{h+1}}\prod_{i=1}^{h+1} D_{\sigma_i}(z)
      \end{aligned}$$
      and setting $j=g-h$
      $$= \sum_{j=0}^{g-2}\frac{(-1)^{g+2-j}}{D_0(z)^{g+2-j}} \sum_{\sigma\in[g+1]_{g+1-j}}\prod_{i=1}^{g+1-j} D_{\sigma_i}(z).$$

      Consequently, the Claim holds for any g ≥ 1.
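The Claim can also be checked numerically. The sketch below is only a sanity check, not part of the proof: it treats $D_0(z),\dots,D_5(z)$ at a fixed $z$ as arbitrary rational placeholder values (the list `d` is made up for illustration), evaluates the irreducible terms once through the recursion of Lemma 2 and once through the right-hand side of eq. (22), and compares the two. The function names are hypothetical.

```python
from fractions import Fraction
from math import prod


def compositions(n, k):
    """Yield every composition of n into k positive parts, i.e. the set [n]_k."""
    if k == 1:
        if n >= 1:
            yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def dstar_by_recursion(d, gmax):
    """Return [D*_0, ..., D*_gmax] computed from the recursion of Lemma 2."""
    dstar = [1 - 1 / d[0]]
    for g in range(1, gmax + 1):
        acc = (dstar[0] - 1) * d[g]
        for g1 in range(1, g):
            acc += dstar[g1] * d[g - g1]
        dstar.append(-acc / d[0])
    return dstar


def dstar_by_claim(d, genus):
    """Evaluate the right-hand side of eq. (22) for D*_genus, genus = g + 1 >= 1."""
    g = genus - 1
    total = d[genus] / d[0] ** 2
    for j in range(g):  # j = 0, ..., g - 1
        parts = genus - j
        comp_sum = sum(prod(d[s] for s in sigma) for sigma in compositions(genus, parts))
        total += Fraction(-1) ** (g + 2 - j) * comp_sum / d[0] ** (g + 2 - j)
    return total


# Arbitrary placeholder values standing in for D_0(z), ..., D_5(z) at a fixed z.
d = [Fraction(3), Fraction(2), Fraction(5), Fraction(7), Fraction(11), Fraction(13)]
recursive = dstar_by_recursion(d, 5)
closed = [dstar_by_claim(d, genus) for genus in range(1, 6)]
print(recursive[1:] == closed)
```

Exact rational arithmetic is used so that the comparison is an equality test rather than a floating-point tolerance check.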

      For any $g\ge 1$, we have [11]
      $$D_g(z) = \frac{1}{z^2-z+1}\,\frac{P_g(u)}{(1-4u)^{3g-1/2}}, \qquad D_0(z) = \frac{1}{z^2-z+1}\,\frac{2}{1+\sqrt{1-4u}},$$
      where $P_g(u)$ is a polynomial with integral coefficients of degree at most $3g-1$, $P_g(1/4)\neq 0$, $[u^{2g}]P_g(u)\neq 0$ and $[u^{h}]P_g(u)=0$ for $0\le h\le 2g-1$. Here $u=\frac{z^2}{(z^2-z+1)^2}$. In this context the Claim provides the following interpretation of $D^{*}_g(z)$:
      $$\frac{1}{z^2-z+1}\,D^{*}_g(z) = \frac{P_g(u)}{(1-4u)^{3g-1/2}}\left(\frac{1+\sqrt{1-4u}}{2}\right)^{2} + \sum_{j=0}^{g-2}\left(-\frac{1+\sqrt{1-4u}}{2}\right)^{g+1-j} \sum_{\sigma\in[g]_{g-j}} \frac{\prod_{i=1}^{g-j}P_{\sigma_i}(u)}{(1-4u)^{3g-\frac{g-j}{2}}},$$
      (25)
      and
      $$\begin{aligned}
      &\sum_{j=0}^{g-2}\left(-\frac{1+\sqrt{1-4u}}{2}\right)^{g+1-j} \sum_{\sigma\in[g]_{g-j}} \frac{\prod_{i=1}^{g-j}P_{\sigma_i}(u)}{(1-4u)^{3g-\frac{g-j}{2}}} \\
      &\quad= \sum_{j=0}^{g-2}\sum_{k=0}^{g+1-j}\left(-\frac{1}{2}\right)^{g+1-j}\binom{g+1-j}{k} \sum_{\sigma\in[g]_{g-j}} \frac{\prod_{i=1}^{g-j}P_{\sigma_i}(u)}{(1-4u)^{3g-\frac{g-j+k}{2}}} \\
      &\quad= \sum_{j=0}^{g-2}\sum_{s=g-j}^{2g+1-2j}\left(-\frac{1}{2}\right)^{g+1-j}\binom{g+1-j}{s-g+j} \sum_{\sigma\in[g]_{g-j}} \frac{\prod_{i=1}^{g-j}P_{\sigma_i}(u)}{(1-4u)^{3g-\frac{s}{2}}}.
      \end{aligned}$$
      As $0\le j\le g-2$ and $g-j\le s\le 2g+1-2j$, we have $s\ge 2$. Consequently we arrive at
      $$\frac{1}{z^2-z+1}\,D^{*}_g(z) = \frac{U_g(u)}{(1-4u)^{3g-1/2}} + \frac{V_g(u)}{(1-4u)^{3g-1}},$$
      (26)
      where
      $$U_g(u) = \frac{P_g(u)}{4} + \frac{P_g(u)\,(1-4u)}{4} + \sum_{j=0}^{g-2}\ \sum_{\substack{g-j\le s\le 2g+1-2j \\ s \text{ is odd}}} \left(-\frac{1}{2}\right)^{g+1-j}\binom{g+1-j}{s-g+j} \sum_{\sigma\in[g]_{g-j}}\prod_{i=1}^{g-j}P_{\sigma_i}(u)\,(1-4u)^{\frac{s-1}{2}},$$
      and
      $$\begin{aligned}
      V_g(u) ={}& \frac{P_g(u)}{2} + \left(-\frac{1}{2}\right)^{3}\sum_{\sigma\in[g]_2}\prod_{i=1}^{2}P_{\sigma_i}(u) + 3\left(-\frac{1}{2}\right)^{3}\sum_{\sigma\in[g]_2}\prod_{i=1}^{2}P_{\sigma_i}(u)\,(1-4u) \\
      &+ \sum_{j=0}^{g-3}\ \sum_{\substack{g-j\le s\le 2g+1-2j \\ s \text{ is even}}} \left(-\frac{1}{2}\right)^{g+1-j}\binom{g+1-j}{s-g+j} \sum_{\sigma\in[g]_{g-j}}\prod_{i=1}^{g-j}P_{\sigma_i}(u)\,(1-4u)^{\frac{s-2}{2}}.
      \end{aligned}$$
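The split of eq. (25) into the two parts of eq. (26) is a purely algebraic regrouping by the parity of $s$, so it can be checked with placeholder numbers. The following sketch makes several assumptions: the values in `p` stand in for $P_1(u),\dots,P_5(u)$ at some fixed $u$ and are invented for illustration, `x` stands for $1-4u$, and the $j=g-2$ contribution to $V_g$ is folded into the general sum instead of being written out separately. All function names are hypothetical.

```python
from math import comb, isclose, prod, sqrt


def compositions(n, k):
    """Yield every composition of n into k positive parts, i.e. the set [n]_k."""
    if k == 1:
        if n >= 1:
            yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def comp_sum(p, g, k):
    """Sum over sigma in [g]_k of the products p[sigma_1] * ... * p[sigma_k]."""
    return sum(prod(p[s] for s in sigma) for sigma in compositions(g, k))


def rhs_eq25(p, g, x):
    """Right-hand side of eq. (25), written in terms of x = 1 - 4u."""
    w = sqrt(x)
    total = p[g] / x ** (3 * g - 0.5) * ((1 + w) / 2) ** 2
    for j in range(g - 1):  # j = 0, ..., g - 2
        factor = (-(1 + w) / 2) ** (g + 1 - j)
        total += factor * comp_sum(p, g, g - j) / x ** (3 * g - (g - j) / 2)
    return total


def u_and_v(p, g, x):
    """Return (U_g, V_g): binomial terms sorted by the parity of s."""
    u_val = p[g] / 4 + p[g] * x / 4
    v_val = p[g] / 2
    for j in range(g - 1):
        n = g + 1 - j
        c = comp_sum(p, g, g - j)
        for s in range(g - j, 2 * g + 2 - 2 * j):  # s = g-j, ..., 2g+1-2j
            term = (-0.5) ** n * comb(n, s - g + j) * c
            if s % 2 == 1:
                u_val += term * x ** ((s - 1) // 2)
            else:
                v_val += term * x ** ((s - 2) // 2)
    return u_val, v_val


def rhs_eq26(p, g, x):
    """Right-hand side of eq. (26)."""
    u_val, v_val = u_and_v(p, g, x)
    return u_val / x ** (3 * g - 0.5) + v_val / x ** (3 * g - 1)


# Placeholder values for P_1(u), ..., P_5(u); index 0 is unused.
p = [None, 0.7, 1.3, 2.1, 0.4, 3.2]
x = 0.6
checks = [isclose(rhs_eq25(p, g, x), rhs_eq26(p, g, x), rel_tol=1e-9) for g in range(1, 6)]
print(checks)
```

For $g=1$ both sums over $j$ are empty, and the check reduces to $U_1(u)=P_1(u)(1-2u)/2$ and $V_1(u)=P_1(u)/2$.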
      We have for $k\ge 1$
      $$[u^{h}]\sum_{\sigma\in[g]_k}\prod_{i=1}^{k}P_{\sigma_i}(u) = \sum_{\sigma\in[g]_k}\ \sum_{h_1+\cdots+h_k=h}\ \prod_{i=1}^{k}[u^{h_i}]P_{\sigma_i}(u),$$
      where the inner sum ranges over all $h_i\ge 0$ with $\sum_{i=1}^{k}h_i=h$. Then we obtain that
      $$[u^{h}]\sum_{\sigma\in[g]_k}\prod_{i=1}^{k}P_{\sigma_i}(u) = 0, \qquad 0\le h\le 2g-1.$$
      (27)
      This holds since $[u^{h_i}]P_{\sigma_i}(u)=0$ for $h_i\le 2\sigma_i-1$, $[u^{2\sigma_i}]P_{\sigma_i}(u)\neq 0$ and $\sum_{i=1}^{k}\sigma_i=g$, so that every product on the right-hand side contains a vanishing factor whenever $h\le 2g-1$. Thus for $0\le h\le 2g-1$,
      $$[u^{h}]U_g(u)=0 \quad\text{and}\quad [u^{h}]V_g(u)=0.$$
      (28)
      As shown in [11] we have
      $$P_g(1/4) = \frac{\Gamma(g-1/6)\,\Gamma(g+1/2)\,\Gamma(g+1/6)\,9^{g}}{4^{g}\,6\,\pi^{3/2}\,\Gamma(g+1)}$$
      (29)
      and we obtain $U_g(1/4)=P_g(1/4)/4$. Furthermore,
      $$V_g(1/4) = \frac{P_g(1/4)}{2} + \left(-\frac{1}{2}\right)^{3}\sum_{\sigma\in[g]_2}\prod_{i=1}^{2}P_{\sigma_i}(1/4) = \frac{1}{8}\left(4P_g(1/4) - \sum_{j=1}^{g-1}P_j(1/4)\,P_{g-j}(1/4)\right) \neq 0.$$

      We can use the computation of [11] to observe that $4P_g(1/4)-\sum_{j=1}^{g-1}P_j(1/4)\,P_{g-j}(1/4)\neq 0$. In order to compute the bivariate GF, $E^{*}_g(z,t)$, we only need to replace in eq. (22) $D_g(z)$ by $E_g(z,t)$ and the proof is completely analogous.
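Eq. (29) and the nonvanishing of $V_g(1/4)$ can be illustrated numerically for small genus. The sketch below is a floating-point sanity check only and proves nothing for general $g$. It assumes the low-genus polynomials $P_1(u)=u^2$ and $P_2(u)=21u^4(1+u)$, which are not stated in the text above and are taken here as known input, and it uses the hypothetical helper names `p_quarter` and `v_quarter`.

```python
from math import gamma, isclose, pi


def p_quarter(g):
    """P_g(1/4) as given by the closed formula of eq. (29)."""
    numerator = (9 / 4) ** g * gamma(g - 1 / 6) * gamma(g + 1 / 2) * gamma(g + 1 / 6)
    return numerator / (6 * pi ** 1.5 * gamma(g + 1))


def v_quarter(g):
    """V_g(1/4) = (4 P_g(1/4) - sum_{j=1}^{g-1} P_j(1/4) P_{g-j}(1/4)) / 8."""
    conv = sum(p_quarter(j) * p_quarter(g - j) for j in range(1, g))
    return (4 * p_quarter(g) - conv) / 8


# Assumed low-genus polynomials (not taken from the text above):
# P_1(u) = u^2 and P_2(u) = 21 u^4 (1 + u), evaluated at u = 1/4.
p1_direct = 0.25 ** 2
p2_direct = 21 * 0.25 ** 4 * (1 + 0.25)

print(isclose(p_quarter(1), p1_direct), isclose(p_quarter(2), p2_direct))
print([v_quarter(g) > 0 for g in range(1, 11)])
```

Under these assumptions the genus 2 case gives $V_2(1/4) = \frac{1}{8}\left(4\cdot\frac{105}{1024} - \frac{1}{256}\right) = \frac{13}{256}$.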

      Declarations

      Acknowledgements

      We thank Thomas J.X. Li for discussions and comments, and an anonymous referee for pointing out an incorrect assumption in the first version of this paper.

      Authors’ Affiliations

      (1)
      Department of Mathematics and Computer Science, University of Southern Denmark

      References

      1. Bailor MH, Sun X, Al-Hashimi HM: Topology Links RNA Secondary Structure with Global Conformation, Dynamics, and Adaptation. Science. 2010, 327: 202-206. 10.1126/science.1181085
      2. Tabaska JE, Cary RB, Gabow HN, Stormo GD: An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics. 1998, 14: 691-699. 10.1093/bioinformatics/14.8.691
      3. Loebl M, Moffatt I: The chromatic polynomial of fatgraphs and its categorification. Adv. Math. 2008, 217: 1558-1587. 10.1016/j.aim.2007.11.016
      4. Penner RC, Knudsen M, Wiuf C, Andersen JE: Fatgraph models of proteins. Comm Pure Appl Math. 2010, 63: 1249-1297. 10.1002/cpa.20340
      5. Massey WS: Algebraic Topology: An Introduction. Springer-Verlag, New York, 1967.
      6. Penner RC, Waterman MS: Spaces of RNA secondary structures. Adv. Math. 1993, 101: 31-49. 10.1006/aima.1993.1039
      7. Penner RC: Cell decomposition and compactification of Riemann’s moduli space in decorated Teichmüller theory. Woods Hole Mathematics-perspectives in math and physics. Edited by: Tongring N, Penner RC. World Scientific, Singapore, 2004, 263-301. [ArXiv: math.GT/0306190]
      8. Mathews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999, 288: 911-940. 10.1006/jmbi.1999.2700
      9. Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27: 1076-1085. 10.1093/bioinformatics/btr090
      10. Bon M, Vernizzi G, Orland H, Zee A: Topological Classification of RNA Structures. J Mol Biol. 2008, 379: 900-911. 10.1016/j.jmb.2008.04.033
      11. Andersen JE, Penner RC, Reidys CM, Waterman MS: Topological classification and enumeration of RNA structures by genus. J. Math. Biol. 2011, 10.1007/s00285-012-0594-x. [Preprint].
      12. Smith T, Waterman M: RNA secondary structure. Math. Biol. 1978, 42: 31-49.
      13. Zuker M: On finding all suboptimal foldings of an RNA molecule. Science. 1989, 244: 48-52. 10.1126/science.2468181
      14. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163
      15. Rivas E, Eddy SR: A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots. J Mol Biol. 1999, 285: 2053-2068. 10.1006/jmbi.1998.2436
      16. Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theor Comp Sci. 1999, 210: 277-303. 10.1016/S0304-3975(98)00090-5
      17. Akutsu T: Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr Appl Math. 2000, 104: 45-62. 10.1016/S0166-218X(00)00186-4
      18. Lyngsø RB, Pedersen CN: RNA pseudoknot prediction in energy-based models. J Comp Biol. 2000, 7: 409-427. 10.1089/106652700750050862
      19. Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics. 2003, 19 S1: i66-i73.
      20. Dirks RM, Pierce NA: A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem. 2003, 24: 1664-1677. 10.1002/jcc.10296
      21. Deogun JS, Donis R, Komina O, Ma F: RNA secondary structure prediction with simple pseudoknots. Proceedings of the second conference on Asia-Pacific bioinformatics (APBC 2004), Australian Computer Society. 2004, 239-246.
      22. Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics. 2004, 5: 104. 10.1186/1471-2105-5-104
      23. Li H, Zhu D: A New Pseudoknots Folding Algorithm for RNA Structure Prediction. COCOON 2005, Volume 3595. Edited by: Wang L. Springer, Berlin, 2005, 94-103.
      24. Matsui H, Sato K, Sakakibara Y: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics. 2005, 21: 2611-2617. 10.1093/bioinformatics/bti385
      25. Kato Y, Seki H, Kasami T: RNA Pseudoknotted Structure Prediction Using Stochastic Multiple Context-Free Grammar. IPSJ Digital Courier. 2006, 2: 655-664.
      26. Chen HL, Condon A, Jabbari H: An O(n^5) Algorithm for MFE Prediction of Kissing Hairpins and 4-Chains in Nucleic Acids. J Comp Biol. 2009, 16: 803-815. 10.1089/cmb.2008.0219
      27. Waterman MS: Secondary structure of single-stranded nucleic acids. Adv Math (Suppl Studies). 1978, 1: 167-212.
      28. Orland H, Zee A: RNA folding and large N matrix theory. Nuclear Physics B. 2002, 620: 456-476. 10.1016/S0550-3213(01)00522-3
      29. Wexler Y, Zilberstein C, Ziv-Ukelson M: A study of accessible motifs and RNA complexity. J Comput Biol. 2007, 14 (6): 856-872. 10.1089/cmb.2007.R020
      30. Salari R, Möhl M, Will S, Sahinalp C, Backofen R: Time and space efficient RNA-RNA interaction prediction via sparse folding. Proc of RECOMB. 2010, 6044: 473-490.
      31. Backofen R, Tsur D, Zakov S, Ziv-Ukelson M: Sparse RNA folding: Time and space efficient algorithms. J Disc Algor. 2011, 9 (1): 12-31. 10.1016/j.jda.2010.09.001
      32. Möhl M, Salari R, Will S, Backofen R, Sahinalp SC: Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol Biol. 2010, 5: 39. 10.1186/1748-7188-5-39
      33. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 1105-1119. 10.1002/bip.360290621
      34. Dimitrieva S, Bucher P: Practicality and time complexity of a sparsified RNA folding algorithm. J Bioinfo Comput Biol. 2012, 10 (2): 1241007. 10.1142/S0219720012410077
      35. Kafri Y, Mukamel D, Peliti L: Why is the DNA Denaturation Transition First Order?. Phys Rev Lett. 2000, 85: 4988-4991. 10.1103/PhysRevLett.85.4988
      36. Kabakcioglu A, Stella AL: A scale-free network hidden in the collapsing polymer. Phys Rev E. 2005, 72: 055102.
      37. Vanderzande C: Lattice models of polymers. Cambridge University Press, New York, 1998.
      38. NCBI database. [http://www.ncbi.nlm.nih.gov/guide/dna-rna/#downloads_]
      39. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for Loop Matching. SIAM J Appl Math. 1978, 35: 68-82. 10.1137/0135006
      40. Nebel ME: Investigation of the Bernoulli model for RNA secondary structures. Bull math biol. 2003, 66 (5): 925-964.
      41. Zagier D: On the distribution of the number of cycles of elements in symmetric groups. Nieuw Arch Wisk IV. 1995, 13: 489-495.
      42. Flajolet P, Sedgewick R: Analytic Combinatorics. Cambridge University Press, New York, 2009.
      43. Han HSW, Reidys CM: The 5’-3’ distance of RNA secondary structures. J Comput Biol. 2012, 19 (7): 867-878. 10.1089/cmb.2011.0301

      Copyright

      © Huang and Reidys; licensee BioMed Central Ltd. 2012

      This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
