On the combinatorics of sparsification
 Fenix WD Huang^{1} and
 Christian M Reidys^{1}
DOI: 10.1186/1748-7188-7-28
© Huang and Reidys; licensee BioMed Central Ltd. 2012
Received: 31 December 2011
Accepted: 11 October 2012
Published: 22 October 2012
Abstract
Background
We study the sparsification of dynamic-programming-based folding algorithms for RNA structures. Sparsification is a method that significantly speeds up the computation of minimum free energy (mfe) RNA structures.
Results
We provide a quantitative analysis of the sparsification of a particular decomposition rule, Λ^{∗}. This rule splits an interval of RNA secondary and pseudoknot structures of fixed topological genus. Key for quantifying sparsification is the size of the so-called candidate sets. Here we assume mfe-structures to be specifically distributed (see Assumption 1) within arbitrary and irreducible RNA secondary and pseudoknot structures of fixed topological genus. We then present a combinatorial framework which allows us, by means of probabilities of irreducible substructures, to obtain the expectation of the Λ^{∗}-candidate set w.r.t. a uniformly random input sequence. We compute these expectations for arc-based energy models via energy-filtered generating functions (GF) in the case of RNA secondary structures as well as RNA pseudoknot structures. Furthermore, for RNA secondary structures we also analyze a simplified loop-based energy model. Our combinatorial analysis is then compared to the expected number of Λ^{∗}-candidates obtained from folding mfe-structures. In the case of the mfe-folding of RNA secondary structures with a simplified loop-based energy model our results imply that sparsification provides a significant, constant improvement of 91% (theory), to be compared to a 96% (experimental, simplified arc-based model) reduction. However, we do not observe a linear factor improvement. Finally, in the case of the “full” loop-energy model we can report a reduction of 98% (experiment).
Conclusions
Sparsification was initially credited with a linear factor improvement. This conclusion was based on the so-called polymer-zeta property, which stems from interpreting polymer chains as self-avoiding walks. Subsequent findings, however, reveal that the O(n) improvement is not correct. The combinatorial analysis presented here shows that, assuming a specific distribution (see Assumption 1) of mfe-structures within irreducible and arbitrary structures, the expected number of Λ^{∗}-candidates is Θ(n^{2}). However, the constant reduction is quite significant, being in the range of 96%. We furthermore show an analogous result for the sparsification of the Λ^{∗}-decomposition rule for RNA pseudoknotted structures of genus one. Finally, we observe that the effect of sparsification is sensitive to the employed energy model.
Keywords
Sparsification, Generating function, Dynamic programming
Background
RNA structures, diagrams and genus filtration
An RNA sequence is a linear, oriented sequence of the nucleotides (bases) A, U, G, C. These sequences “fold” by establishing bonds between pairs of nucleotides. In this paper, we only consider the Watson-Crick base pairs A-U and G-C and the wobble base pair U-G. The global conformation of an RNA molecule is determined by topological constraints encoded at the level of secondary structure, i.e., by the mutual arrangements of the base pairs [1].
We shall consider diagrams as fatgraphs, $\mathbb{G}$, that is, graphs G together with a collection of cyclic orderings, called fattenings. Each fatgraph $\mathbb{G}$ determines an oriented surface $F(\mathbb{G})$ [3, 4], which is connected if G is and has some associated genus g(G) ≥ 0 and number r(G) ≥ 1 of boundary components. Clearly, $F(\mathbb{G})$ contains G as a deformation retract [5]. Fatgraphs were first applied to RNA secondary structures in [6, 7].
A diagram $\mathbb{G}$ hence determines a unique surface $F(\mathbb{G})$ (with boundary). Filling the boundary components with discs we can pass from $F(\mathbb{G})$ to a surface without boundary. The Euler characteristic, χ, and the genus, g, of this surface are given by χ = v − e + r and $g=1-\frac{1}{2}\chi$, respectively, where v, e, r are the numbers of discs, ribbons and boundary components in $\mathbb{G}$ [5]. The genus of a diagram is that of its associated surface without boundary, and a diagram of genus g is referred to as a g-diagram.
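The genus formula can be made concrete with a small sketch. It is only an illustration under stated assumptions: a diagram is given as a list of arcs (i, j) with i < j, the backbone is collapsed to a single disc (so v = 1 and e equals the number of arcs, which leaves the genus unchanged), and boundary components are counted by walking along the arcs. The function name `genus` is hypothetical and not taken from the paper.

```python
def genus(arcs):
    """Genus of a diagram given as a list of arcs (i, j) with i < j.

    Assumption: the backbone is collapsed to one disc, so v = 1 and
    e = number of arcs. Unpaired vertices do not affect the genus.
    """
    if not arcs:
        return 0
    # Relabel arc endpoints as 0..m-1 in backbone order.
    endpoints = sorted(p for arc in arcs for p in arc)
    pos = {p: idx for idx, p in enumerate(endpoints)}
    m = len(endpoints)
    partner = [0] * m
    for i, j in arcs:
        partner[pos[i]] = pos[j]
        partner[pos[j]] = pos[i]
    # Boundary components are the cycles of x -> partner(x) + 1 (mod m):
    # follow an arc to its other end, then step along the backbone.
    seen = [False] * m
    r = 0
    for start in range(m):
        if seen[start]:
            continue
        r += 1
        x = start
        while not seen[x]:
            seen[x] = True
            x = (partner[x] + 1) % m
    chi = 1 - len(arcs) + r      # chi = v - e + r with v = 1
    return (2 - chi) // 2        # g = 1 - chi / 2
```

Under these assumptions two nested arcs give genus 0, whereas two crossing arcs (the simplest pseudoknot) give genus 1.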
A g-diagram without arcs of the form (i,i + 1) (1-arcs) is called a g-structure. A g-diagram that contains only vertices of degree three, i.e. does not contain any vertices not incident to arcs in the upper half-plane, is called a g-matching. A diagram is called irreducible if and only if it cannot be split into two by cutting the backbone without cutting an arc, see Figure 2.
Folding algorithms
Folded configurations are energetically somewhat optimal. Here energy is obtained by adding contributions of loops [8] contained in RNA secondary and pseudoknot structures. Any RNA structure has a unique and disjoint decomposition into such loops, which stems from the fatgraph [9, 10] interpretation of such structures, in which loops correspond to boundary components [11]. Additional constraints imply further properties, for instance certain minimum arc-length conditions [12] and the nonexistence of isolated bonds. An mfe RNA structure can be predicted in polynomial time by means of dynamic programming (DP) routines [12, 13].
The most commonly used tools for predicting simple RNA secondary structures, mfold [13] and the Vienna RNA Package [14], require O(n^{2}) space and O(n^{3}) time. In the following we omit “simple” and refer to secondary structures containing crossing arcs as pseudoknot structures.
Generalizing the matrices of the DP-routines of secondary structure folding [13, 14] to gap-matrices [15] leads to a DP-folding of pseudoknotted structures [15] (pknot‐R&E) with O(n^{4}) space and O(n^{6}) time complexity. The following references provide a certainly incomplete list of DP-approaches to RNA pseudoknot structure prediction using various structure classes characterized in terms of recursion equations and/or stochastic grammars: [9, 15–26]. The most efficient algorithm for pseudoknot structures is [22] (pknotsRG), having O(n^{2}) space and O(n^{4}) time complexity. This algorithm, however, considers only a restricted class of pseudoknots.
Note that RNA secondary structures are exactly structures of topological genus zero [27]. The topological classification of RNA structures [10, 11, 28] has recently been translated into an efficient DP-algorithm [9]. Fixing the topological genus of RNA structures implies that there are only finitely many types, the so-called irreducible shadows [11].
Sparsification
Sparsification has also been applied in the context of RNA-RNA interaction structures [30] as well as RNA pseudoknot structures [32]. In contrast to RNA secondary structures, however, not every decomposition rule in the DP-folding of RNA pseudoknot structures is amenable to sparsification. By construction, sparsification can only be applied for calculating mfe structures. Since the computation of the partition function [20, 33] needs to take into account all substructures, sparsification does not work there.
Sparsification [29, 31, 32] can be described as follows: let V = {v_{1}, v_{2}, …} be a set whose elements v_{i} are unions of pairwise disjoint intervals. Let furthermore L_{v} denote an optimal solution (here optimal means to maximize the scores) of the DP-routine over v. By assumption L_{v} is recursively obtained. Suppose we are given a decomposition rule Λ_{1}, for which the optimal solution L_{v} is $L_{v}=L_{v_{1}}+L_{v_{2}}+L_{v_{3}}$, where $v=v_{1}\,\dot{\cup}\,v_{2}\,\dot{\cup}\,v_{3}$. Then, under certain circumstances, the DP-routine may interpret L_{v} either as $(L_{v_{1}}+L_{v_{2}})+L_{v_{3}}$ or as $L_{v_{1}}+(L_{v_{2}}+L_{v_{3}})$, see Figure 3B. To be precise, this situation is encountered iff
there exists an optimal solution $L_{v_{1}^{\prime}}$ for a substructure over $v_{1}^{\prime}$, where $v_{1}^{\prime}=v_{1}\,\dot{\cup}\,v_{2}$, via Λ_{2} and L_{v} is obtained from $L_{v_{1}^{\prime}}$ and $L_{v_{3}}$ via Λ_{1},
there exists an optimal solution $L_{v_{2}^{\prime}}$ for a substructure over $v_{2}^{\prime}$, where $v_{2}^{\prime}=v_{2}\,\dot{\cup}\,v_{3}$, via Λ_{3} and L_{v} is obtained from $L_{v_{1}}$ and $L_{v_{2}^{\prime}}$ via Λ_{1}.
Note that if Λ_{2} is s-compatible to Λ_{1}, then Λ_{3} is s-compatible to Λ_{1}. To summarize:
Definition 1
(s-compatible) Suppose L_{v} is the optimal solution for S_{v} over v, $L_{v}=L_{v_{1}^{\prime}}+L_{v_{3}}$ under decomposition rule Λ_{1}, and $L_{v_{1}^{\prime}}$ is obtained from two optimal solutions $L_{v_{1}}$ and $L_{v_{2}}$ under rule Λ_{2}. Then Λ_{2} is called s-compatible to Λ_{1} if there exists some rule Λ_{3} such that $L_{v_{2}^{\prime}}=L_{v_{2}}+L_{v_{3}}$ and $L_{v}=L_{v_{1}}+L_{v_{2}^{\prime}}$.
Figure 3B depicts two such ways that realize the same optimal solution L_{v}. Sparsification prunes any such multiple computations of the same optimal value. Note that by symmetry, Λ_{2} and Λ_{3} are both s-compatible to Λ_{1}.
We next come to the important concept of candidates. The latter mark the essential computation paths for the DP-routine.
Definition 2
and we shall denote the set of Λ-candidates by Q^{Λ}.
By construction a Λ-candidate v is a union of disjoint intervals such that its optimal solution L_{v} cannot be obtained via a Λ-splitting. This optimal solution allows one to construct a non-unique arc-configuration (substructure) over v [13, 14], and the above Λ-splitting consequently translates into a splitting of this substructure. This connects the notion of Λ-candidates with that of substructures and shows that a Λ-candidate implies a substructure that is Λ-irreducible.
Lemma 1
[29, 32] Suppose L_{v} is obtained by selecting the optimal solution from the decomposition rules Λ_{1}, Λ_{2}, …, Λ_{n}. If Λ is s-compatible to all Λ_{i}, 1 ≤ i ≤ n, then L_{v} can be obtained via Λ-candidates.
In summary, as for the impact of sparsification, [29] claims that sparsification reduces the time complexity by a linear factor. This claim is based on the assumption that RNA molecules satisfy the polymer-zeta property [29]. Subsequent studies draw a slightly different picture: [31] concludes that sparsification requires O(nZ) time, where n denotes the length of the input sequence and Z is a sparsity parameter satisfying n ≤ Z < n^{2}. Recently, it has been shown in [34] that the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains O(n^{3}) under a wide variety of conditions.
Sparsification of RNA secondary structures
An interval [i,j] is a Λ^{∗}-candidate if the optimal solution over [i,j] is given by L_{i,j} = V_{i,j} > W_{i,j}. Indeed, [i,j] is a candidate iff [i,j] is in the candidate set of Λ^{∗}, and we denote the set $Q^{\Lambda^{\ast}}$ by Q. Suppose the optimal solution W_{i,j} is given by W_{i,j} = L_{i,q} + L_{q+1,j} and suppose we have L_{i,q} = L_{i,k} + L_{k+1,q}. Then, since [i,q] is not a candidate, Lemma 1 shows that we can compute W_{i,j} = L_{i,k} + L_{k+1,j}, where [i,k] is a candidate.
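The candidate mechanism can be sketched for the simplest scoring scheme. This is a minimal illustration, not the algorithm of [29, 31]. It assumes base-pair maximization instead of an energy model, reads V_{i,j} as the best score with i and j paired and W_{i,j} as the best score over all splittings of [i,j], and forbids 1-arcs through the hypothetical parameter `min_gap`. The function names are ours. The sparsified version only splits at positions k for which [i,k] is a candidate, and the unsparsified version serves as a cross-check.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs_full(seq, min_gap=1):
    """Unsparsified O(n^3) base-pair maximization, trying every split q."""
    n = len(seq)
    L = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best = max(L[i][q] + L[q + 1][j] for q in range(i, j))  # W_{i,j}
            if j - i > min_gap and (seq[i], seq[j]) in PAIRS:
                best = max(best, L[i + 1][j - 1] + 1)               # V_{i,j}
            L[i][j] = best
    return L[0][n - 1] if n else 0

def max_pairs_sparse(seq, min_gap=1):
    """Sparsified variant: split only where the left part [i,k] is a candidate.

    Returns (optimal score, total number of candidates).
    """
    n = len(seq)
    L = [[0] * n for _ in range(n)]
    candidates = [[] for _ in range(n)]  # candidates[i] = all k with [i,k] a candidate
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best = L[i + 1][j]           # position i stays unpaired
            for k in candidates[i]:      # every stored k satisfies k < j
                best = max(best, L[i][k] + L[k + 1][j])
            if j - i > min_gap and (seq[i], seq[j]) in PAIRS:
                v = L[i + 1][j - 1] + 1
                if v > best:             # V strictly beats every splitting
                    candidates[i].append(j)
                    best = v
            L[i][j] = best
    score = L[0][n - 1] if n else 0
    return score, sum(len(c) for c in candidates)
```

Both routines return the same score. The inner loop of the sparsified routine runs over the candidate list only, so its cost is governed by the number of candidates, which is the quantity analyzed in this paper.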
Sparsification on RNA pseudoknot structures
In the following we will study RNA pseudoknot structures of fixed topological genus; see Section “RNA structures, diagrams and genus filtration” for details. An algorithm folding such pseudoknot structures, gfold, has been presented in [9]. The decomposition rules that appear in gfold are reminiscent of those of pknot‐R&E, but as they restrict the genus of substructures, the iteration of gap-matrices is severely restricted and the effect of sparsification of these decompositions is significantly smaller.
In the following, we restrict our analysis of pseudoknotted structures to the decomposition rule Λ^{∗}, which splits an interval into two subsequent intervals. Put differently, Λ^{∗} cuts the backbone of an RNA pseudoknot structure of fixed genus g over one interval without cutting a bond.
Efficiency of sparsification
By construction, the fewer candidates the DP-routine encounters, the more efficient the sparsification. Thus it is of utmost importance to analyze the number of candidates. In the case of sparsification of RNA secondary structures we have one basic decomposition rule Λ^{∗} acting on intervals, namely Λ^{∗} splits an interval into two disjoint, subsequent intervals. The implied notion of a Λ^{∗}-irreducible substructure is that of a substructure nested in a maximal arc, where maximal refers to the partial order of two arcs (i,j) ≤ (i′,j′) iff i′ ≤ i ∧ j ≤ j′. This observation relates irreducibility to nesting of arcs and, following this line of thought, [29] identifies a specific property of polymer-chains introduced in [35, 36] to be of relevance for the size of candidate sets:
Definition 3
(Polymer-zeta property) Let $\mathbb{P}(i,j)$ denote the probability of a structure over an interval [i,j] under some decomposition rule Λ. Then we say Λ follows the polymer-zeta property if $\mathbb{P}(i,j)=b\,m^{-c}$ for some constants b, c > 0 and m = j − i.
Polymer-zeta comes from modeling the 2D-folding of a polymer chain as a self-avoiding walk (SAW) in a 2D lattice [37]. It implies that the probability of a base pair (i,j) depends only on the length of the arc, i.e. $\mathbb{P}(i,j)=\mathbb{P}(m)$, where m = j − i. The authors of [29] stipulate that RNA molecules satisfy the polymer-zeta property and approximate $\mathbb{P}(i,j)$ by $\mathbb{P}(m)=b\,m^{-c}$ using 50,000 mRNA sequences of an average length of 1992 nucleotides [38]. They find b ≈ 2.11 and c ≈ 1.47. The average probability $\mathbb{P}(m)$ is displayed in Figure 4, page 865 of [29] for increasing m. Furthermore, it is implied via Figure 6, page 867 of [29] that the average number of candidates converges to a constant, implying that sparsification of the DP-routine folding secondary structures takes Θ(n^{2}) time.
These findings have been questioned by [34], where it has been observed that the time complexity of a sparsified RNA folding algorithm based on energy minimization remains O(n^{3}), independently of the energy function used and the base composition of the RNA sequence. [34] argues that the significant effect of sparsification on the DP-routine is largely a finite-size effect: when the sequence length is below some threshold, the algorithm is dominated by the quadratic time factor. In this context, it may be worth pointing out that [31] noticed that the improvement of a sparsified base-pairing maximization algorithm depends heavily on the base composition of the input. Backofen explicitly parameterizes the cardinality of candidate sets in [31].
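As we read the argument of [29], a constant number of candidates per position rests on the summability of $b\,m^{-c}$ over all arc lengths m when c > 1. The following sketch only illustrates this arithmetic. The function names are hypothetical, and the upper bound b(1 + 1/(c − 1)) is the standard integral-test estimate, not a quantity taken from [29].

```python
def polymer_zeta_partial_sum(n, b=2.11, c=1.47):
    """Partial sum of b * m**(-c) over arc lengths m = 1..n."""
    return sum(b * m ** (-c) for m in range(1, n + 1))

def polymer_zeta_bound(b=2.11, c=1.47):
    """Integral-test upper bound on the infinite sum; only valid for c > 1."""
    if c <= 1:
        raise ValueError("the sum diverges for c <= 1")
    return b * (1.0 + 1.0 / (c - 1.0))

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, polymer_zeta_partial_sum(n), polymer_zeta_bound())
```

With the fitted values b ≈ 2.11 and c ≈ 1.47 the partial sums stay below the bound for every n, whereas for c ≤ 1 they grow without bound. This is why the condition c > 1 is decisive for the claimed linear-factor speed-up.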
Contribution
In this paper we study the sparsification of the decomposition rule Λ^{∗}[31, 32] for RNA secondary and RNA pseudoknot structures of fixed topological genus. Based on Assumption 1 below our paper provides a combinatorial framework for quantifying the effects of sparsification of the Λ^{∗} rule.
We shall prove that the candidate set [29, 31, 32] is indeed small. We compute the probability of an interval being a candidate for two different energy models. For both models, this is facilitated by computing the generating function (GF) of structures and the generating function of irreducible structures. By studying the asymptotics of the coefficients of these generating functions, we can compute the expected number of candidates of a uniformly random input sequence for large n. We show similar results for RNA pseudoknot structures of fixed topological genus. This provides new insights into the improvements of the sparsification of the concatenation-rule Λ^{∗} in the presence of cross-serial interactions. Our observations complement the detailed analysis of Backofen [31, 32]. We show that although for pseudoknot structures of fixed topological genus [10, 11] the effect of sparsification on the global time complexity is still unclear, the decomposition rule that splits an interval can be sped up significantly.
Methods
The concept of a partition function is close to that of a generating function. In the case $e^{w_{\delta}(\sigma)/RT}=1$, i.e., when each structure contributes equally regardless of the underlying sequence, the partition function equals [z^{n}]G(z), where G is the generating function and [z^{n}]G is the coefficient of the term z^{n}.
Two important energy models are the arc-based [39] and the loop-based [8] model, respectively. The loop-based energy-filtration is different from the notion of “stickiness” [40]. The compatibility of two positions when folding random sequences is taken to be 6/16, reminiscent of the probability that two given positions are compatible under the Watson-Crick and wobble base pair rules.
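The value 6/16 can be checked by enumeration. The sketch below assumes that the two positions carry independent, uniformly distributed nucleotides and that the admissible pairs are A-U, G-C and G-U in either orientation. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Unordered admissible pairs: Watson-Crick (A-U, G-C) and wobble (G-U).
ALLOWED = {frozenset("AU"), frozenset("GC"), frozenset("GU")}

def compatibility_probability():
    """Probability that two uniformly random nucleotides can form a base pair."""
    ordered_pairs = list(product("ACGU", repeat=2))           # 16 ordered pairs
    compatible = [p for p in ordered_pairs if frozenset(p) in ALLOWED]
    return Fraction(len(compatible), len(ordered_pairs))
```

The six compatible ordered pairs are AU, UA, GC, CG, GU and UG, which gives 6/16 = 3/8.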
Assumption 1
Asymptotics
As for g ≥ 1 we have the following situation[11]
Theorem 1
 (a) D_{g}(z) is algebraic and $\mathbf{D}_{g}(z)=\frac{1}{z^{2}-z+1}\,\mathbf{C}_{g}\left(\frac{z^{2}}{(z^{2}-z+1)^{2}}\right).$ (2)
 (b) The bivariate GF of g-structures over n vertices, containing exactly m arcs, E_{g}(z,t), is given by $\mathbf{E}_{g}(z,t)=\frac{1}{tz^{2}-z+1}\,\mathbf{D}_{g}\left(\frac{t\,z^{2}}{(t\,z^{2}-z+1)^{2}}\right).$ (4)
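The substitution in part (a) can be checked numerically in the classical genus-zero case. This is a sketch under assumptions: Theorem 1 is stated for g ≥ 1, and we assume here that the same substitution applied to the Catalan GF C_{0} yields the GF of secondary structures without 1-arcs. The expansion is compared with a direct recursion in which the first vertex is either unpaired or paired with vertex k ≥ 3. All function names are hypothetical.

```python
from math import comb

def mul(a, b, N):
    """Product of two power series given as coefficient lists, truncated at z^N."""
    out = [0] * N
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j >= N:
                break
            out[i + j] += ai * bj
    return out

def structures_via_substitution(N):
    """First N >= 2 coefficients of 1/(z^2-z+1) * C_0(z^2/(z^2-z+1)^2)."""
    inv = [0] * N                      # series of 1/(z^2 - z + 1)
    for n in range(N):
        inv[n] = (1 if n == 0 else 0)
        if n >= 1:
            inv[n] += inv[n - 1]
        if n >= 2:
            inv[n] -= inv[n - 2]
    inv2 = mul(inv, inv, N)
    total = [0] * N
    term = inv[:]                      # z^(2k) / (z^2-z+1)^(2k+1) for k = 0
    k = 0
    while 2 * k < N:
        catalan = comb(2 * k, k) // (k + 1)
        for n in range(N):
            total[n] += catalan * term[n]
        term = [0, 0] + mul(term, inv2, N)[: N - 2]   # multiply by z^2/(z^2-z+1)^2
        k += 1
    return total

def structures_via_recursion(N):
    """Secondary structures without 1-arcs, counted directly."""
    s = [0] * N
    for n in range(N):
        if n == 0:
            s[0] = 1
            continue
        # vertex 1 unpaired, or paired with vertex k (k >= 3)
        s[n] = s[n - 1] + sum(s[k - 2] * s[n - k] for k in range(3, n + 1))
    return s
```

Both routines produce the sequence 1, 1, 1, 2, 4, 8, 17, 37, … for n = 0, 1, 2, …, which supports the restored minus signs in eq. (2).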
Irreducible gstructures
In the context of Λ^{∗}-candidates we observed that irreducible substructures are of key importance. It is accordingly of relevance to understand the combinatorics of these structures. To this end let $\mathbf{D}_{g}^{\ast}(z)=\sum_{n=0}^{\infty}\mathbf{d}_{g}^{\ast}(n)\,z^{n}$ denote the GF of irreducible g-structures.
Lemma 2
For a proof of Lemma 2, see Section Proofs.
Theorem 2
 (a) the GF of irreducible g-structures over n vertices is given by $\mathbf{D}_{g}^{\ast}(z)=(z^{2}-z+1)\left(\frac{\mathbf{U}_{g}(u)}{(1-4u)^{3g-\frac{1}{2}}}+\frac{\mathbf{V}_{g}(u)}{(1-4u)^{3g-1}}\right),$ (5)
 (b) the bivariate GF of irreducible g-structures over n vertices, containing exactly m arcs, $\mathbf{E}_{g}^{\ast}(z,t)$, is given by $\mathbf{E}_{g}^{\ast}(z,t)=(tz^{2}-z+1)\left(\frac{\mathbf{U}_{g}(v)}{(1-4v)^{3g-\frac{1}{2}}}+\frac{\mathbf{V}_{g}(v)}{(1-4v)^{3g-1}}\right),$ (7)
where $v=\frac{tz^{2}}{(tz^{2}-z+1)^{2}}$.
We shall postpone the proof of Theorem 2 to Section Proofs.
The main result
Nussinovlike energy model
In the following we mimic some form of mfe-g-structures: inspired by the Nussinov energy model [39] we consider the weight of a g-structure over n vertices σ_{g,n} to be given by w(σ_{g,n}) = cℓ, where c is a constant contribution of a single arc and ℓ is the number of arcs in σ_{g,n} [40]. Then by Assumption 1, we have the weight function $W(\sigma_{g,n})=(6/16)^{\ell}\eta^{c\ell}=\left((6/16)\eta^{c}\right)^{\ell}$. Note that the case (6/16)η^{c} = 1 corresponds to the uniform distribution, i.e. all g-structures have identical weight.
This approach requires keeping track of the number of arcs, i.e. we need to employ bivariate GFs. In Theorem 1(b) we computed this bivariate GF and in Theorem 2(b) we derived from it $\mathbf{E}_{g}^{\ast}(z,t)$, the GF of irreducible g-structures over n vertices containing ℓ arcs.
is smaller than any singularity of $\frac{1}{\tau z^{2}-z+1}$. In this situation τ affects the constant a_{g,τ} and the exponential growth rate γ_{τ} but not the subexponential factor $n^{3(g-\frac{1}{2})}$. The latter stems from the singular expansion of C_{g}(z). Analogously, we derive the τ-parameterized family of GFs $\mathbf{D}_{g,\tau}^{\ast}(z)=\mathbf{E}_{g}^{\ast}(z,\tau)$. We set the contribution of a single arc c = 1 and the constant η = e, where e is the Euler number. Then we have the parameter τ = (6/16)e^{1} ≈ 1.0125. By abuse of notation we will omit the subscript τ, assuming τ = (6/16)e^{1}.
Theorem 3
where b_{ g }> 0 is a constant.
Proof
We prove the theorem by quantifying the probability of [i,j] being a Λ^{∗}-candidate. In this case any (not necessarily unique) substructure realizing the optimal solution L_{i,j} is Λ^{∗}-irreducible, and therefore an irreducible structure over [i,j].
from which we can conclude $\mathbb{E}_{g}(n)/\mathbb{M}(n)\sim \mathbf{d}_{g}^{\ast}(m)/\mathbf{d}_{g}(m)\sim b_{g}$ and the theorem is proved. □
Loop-based energy model
In this section we discuss the loop-based energy model of RNA secondary structure folding. To be precise, we invoke here trivariate GFs F(z,t,v) and F^{∗}(z,t,v) whose coefficients count the numbers of secondary structures and irreducible secondary structures over n vertices having ℓ arcs and energy j, respectively. This becomes necessary since the loop-based model distinguishes between arcs and energy. The “cancelation” effect or reparameterization of stickiness [40] to which we referred before does not appear in this context. Thus we need both an arc- as well as an energy-filtration.
A further complication emerges. In contrast to the GFs E_{g}(z,t) and $\mathbf{E}_{g}^{\ast}(z,t)$, the new GFs are not simply obtained by formally substituting $tz^{2}/(tz^{2}-z+1)^{2}$ into the power series D_{g}(z) and $\mathbf{D}_{g}^{\ast}(z)$ as bivariate terms. The more complicated energy model requires a specific recursion for irreducible secondary structures.
The energy model used in the prediction of secondary structures is more complicated than the simple arc-based energy model. Loops, which are formed by arcs as well as isolated vertices between the arcs, are considered to give energy contributions. Loops are categorized as hairpin loops (no nested arcs), interior loops (including bulge loops and stacks) and multi-loops (more than two arcs nested), see Figure 8. An arbitrary secondary structure can be uniquely decomposed into a collection of mutually disjoint loops. A result of the particular energy parameters [8] is that the energy model prefers interior loops, in particular stacks (no isolated vertex between two parallel arcs), and disfavors multi-loops. Based on this observation, we give a simplified energy model for a loop λ contained in a secondary structure which only depends on the loop type:

w(λ)=0.5 if λ is a hairpin loop,

w(λ)=1 if λ is an interior loop,

w(λ)=−5 if λ is a multiloop,
where σ is an arbitrary and σ^{′} is an irreducible secondary structure. Along these lines, ℓ, ℓ^{′} denote the number of arcs in σ and σ^{′}. In other words, we find a suitable parameterization which brings us back to a simple univariate GF whose coefficients count the sum of weights of structures over n vertices.
Lemma 3
Proof
where p denotes the minimum number of isolated vertices to be inserted. Depending on the types of loops formed by (i,n), we have

hairpin loops: $\frac{z}{1-z}$,

interior loops: $\mathbf{F}_{0}^{\ast}(z)\left(\frac{1}{1-z}\right)^{2}$,

multi-loops: there are at least two irreducible substructures, as well as isolated vertices, thus $\frac{1}{1-z}\sum_{i=2}^{\infty}\left(\mathbf{F}_{0}^{\ast}(z)\frac{1}{1-z}\right)^{i}=\frac{\left(\mathbf{F}_{0}^{\ast}(z)\frac{1}{1-z}\right)^{2}}{1-\mathbf{F}_{0}^{\ast}(z)\frac{1}{1-z}}\,\frac{1}{1-z}.$
which establishes the recursion. The uniqueness of the solution as a power series follows from the fact that each coefficient can evidently be recursively computed.
□
Lemma 4
where α ≈ 0.24 and β ≈ 2.88 are constants and γ ≈ 2.1673.
Proof
Solving eq. 14 we obtain a unique solution for $\mathbf{F}_{0}^{\ast}(z)$ whose coefficients are all positive. We observe that the dominant singularity of $\mathbf{F}_{0}^{\ast}(z)$ is ρ ≈ 0.4614. F_{0}(z) is a function of $\mathbf{F}_{0}^{\ast}(z)$, and we verify that the real root of minimal modulus of $1-\mathbf{F}_{0}^{\ast}(z)\frac{1}{1-z}=0$ is bigger than ρ. Then the supercritical paradigm [42] applies, and F_{0}(z) and $\mathbf{F}_{0}^{\ast}(z)$ have identical exponential growth rates. Furthermore, $\mathbf{F}_{0}^{\ast}(z)$ and F_{0}(z) have the same subexponential factor $n^{-\frac{3}{2}}$, hence the lemma. □
Theorem 4
where b=α/β≈0.08.
Proof
By Lemma 4 we have $\mathbf{f}_{0}^{\ast}(m)/\mathbf{f}_{0}(m)\sim b$, where b is a constant. The proof is completely analogous to that of Theorem 3. □
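The constant of Theorem 4 can be related to the reduction quoted in the abstract by simple arithmetic. The sketch assumes that b is the asymptotic fraction of intervals that are candidates, so that 1 − b is the fraction pruned by sparsification. This reading is our interpretation, and the function name is hypothetical.

```python
def candidate_fraction(alpha=0.24, beta=2.88):
    """Ratio b = alpha / beta of the two asymptotic constants of Lemma 4."""
    return alpha / beta

if __name__ == "__main__":
    b = candidate_fraction()
    print(f"b = {b:.4f}, pruned fraction = {1 - b:.4f}")
```

With the rounded constants α ≈ 0.24 and β ≈ 2.88 this gives b ≈ 0.083, i.e. a pruned fraction of a little under 0.92, in line with the constant improvement of 91% (theory) stated in the abstract.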
Conclusion
In this paper we quantify the effect of sparsification of the rule Λ^{∗}. This rule splits intervals and separates concatenated substructures. The sparsification of Λ^{∗} alone is claimed to provide a speed-up of up to a linear factor for the DP-folding of RNA secondary structures [29]. A similar conclusion is drawn in [30], where the sparsification of RNA-RNA interaction structures is shown to experience a linear reduction in time complexity as well. Both papers [29, 30] base their conclusion on the validity of the polymer-zeta property. However, [34] comes to a different conclusion, reporting a mere constant reduction in time complexity. While Λ^{∗} is the key for the time complexity reduction of secondary structure folding, it is conceivable that for pseudoknot structures there may exist non-sparsifiable rules, in which case the overall time complexity is not reduced.
In any case, the key is the set of candidates, and we provide an analysis of Λ^{∗}-candidates by combinatorial means. In general, the connection between candidates, i.e. unions of disjoint intervals, and the combinatorics of structures is actually established by the algorithm itself via backtracking: at the end of the DP-algorithm a structure is generated that realizes the previously computed energy as an mfe-structure. This connects intervals and substructures.
So, does the condition c > 1 in polymer-zeta apply in the context of RNA structures? In fact this condition would follow if the intervals in question were distributed as in uniformly sampled structures. This, however, is far from reasonable, since the mfe-algorithm deliberately designs some mfe-structure over the given interval. What the algorithm produces is in fact antagonistic to uniform sampling. We here wish to acknowledge the help of one anonymous referee in clarifying this point.
Our results imply that polymer-zeta does not hold. Our framework critically depends on a specific distribution of mfe-structures within irreducible and arbitrary structures, explicated in Assumption 1. We have cross-checked Assumption 1 with the number of candidates in DP-programs (using the same energy model), see Figure 7 and Figure 9. With this conclusion we are in accord with [31, 34] but provide an entirely different approach.
The non-validity of polymer-zeta has also been observed in the context of the limit distribution of the 5'-3' distances of RNA secondary structures [43]. Here it is observed that long arcs, to be precise arcs of length O(n), always exist. This is of course a contradiction to the polymer-zeta property in the case c > 1.
The key to quantifying the expected number of candidates is the singularity analysis of a pair of energy-filtered GFs, namely that of a class of structures and that of the subclass of all such structures that are irreducible. We show that for various energy models the singular expansions of both these functions are essentially equal, modulo some constant. This implies that the expected number of candidates is Θ(n^{2}), and all constants can explicitly be computed from a detailed singularity analysis. The good news is that, depending on the energy model, a significant constant reduction of around 96% can be obtained. This is in accordance with data produced in [31] for the mfe-folding of random sequences. There a reduction by 98% is reported for sequences of length ≥ 500.
Our findings are of relevance for numerous results that are formulated in terms of sizes of candidate sets [32]. These can now be quantified. It is certainly of interest to devise a full-fledged analysis of the loop-based energy model. While these computations are far from easy, our framework shows how to perform such an analysis.
Using the paradigm of gap-matrices, Backofen has shown [32] that the sparsification of the DP-folding of RNA pseudoknot structures exhibits additional instances where sparsification can be applied, see Figure 5B. Our results show that the expected number of candidates is Θ(n^{2}), where the constant reduction is around 90%. This is in fact very good news, since the sequence length in the context of RNA pseudoknot structure folding is in the order of hundreds of nucleotides. So sparsification of further instances does have a significant impact on the time complexity of the folding.
Proofs
In this section, we prove Lemma 2 and Theorem 2.
Consequently, the Claim holds for any g ≥ 1.
We can use the computation of [11] in order to observe $4\mathbf{P}_{g}(1/4)-\sum_{j=1}^{g-1}\mathbf{P}_{j}(1/4)\,\mathbf{P}_{g-j}(1/4)\ne 0$. In order to compute the bivariate GF, $\mathbf{E}_{g}^{\ast}(z,t)$, we only need to replace D_{g}(z) by E_{g}(z,t) in eq. (22), and the proof is completely analogous.
Declarations
Acknowledgements
We want to thank Thomas J.X. Li for discussions and comments. We want to thank an anonymous referee for pointing out an incorrect assumption in the first version of this paper.
Authors’ Affiliations
References
1. Bailor MH, Sun X, Al-Hashimi HM: Topology Links RNA Secondary Structure with Global Conformation, Dynamics, and Adaptation. Science. 2010, 327: 202-206. 10.1126/science.1181085
2. Tabaska JE, Cary RB, Gabow HN, Stormo GD: An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics. 1998, 14: 691-699. 10.1093/bioinformatics/14.8.691
3. Loebl M, Moffatt I: The chromatic polynomial of fatgraphs and its categorification. Adv. Math. 2008, 217: 1558-1587. 10.1016/j.aim.2007.11.016
4. Penner RC, Knudsen M, Wiuf C, Andersen JE: Fatgraph models of proteins. Comm Pure Appl Math. 2010, 63: 1249-1297. 10.1002/cpa.20340
5. Massey WS: Algebraic Topology: An Introduction. Springer-Verlag, New York, 1967.
6. Penner RC, Waterman MS: Spaces of RNA secondary structures. Adv. Math. 1993, 101: 31-49. 10.1006/aima.1993.1039
7. Penner RC: Cell decomposition and compactification of Riemann’s moduli space in decorated Teichmüller theory. Woods Hole Mathematics - perspectives in math and physics. Edited by: Tongring N, Penner RC. World Scientific, Singapore, 2004, 263-301. [arXiv: math.GT/0306190]
8. Mathews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999, 288: 911-940. 10.1006/jmbi.1999.2700
9. Reidys CM, Huang FWD, Andersen JE, Penner RC, Stadler PF, Nebel ME: Topology and prediction of RNA pseudoknots. Bioinformatics. 2011, 27: 1076-1085. 10.1093/bioinformatics/btr090
10. Bon M, Vernizzi G, Orland H, Zee A: Topological Classification of RNA Structures. J Mol Biol. 2008, 379: 900-911. 10.1016/j.jmb.2008.04.033
11. Andersen JE, Penner RC, Reidys CM, Waterman MS: Topological classification and enumeration of RNA structures by genus. J. Math. Biol. 2011, 10.1007/s00285-012-0594-x. [Preprint].
12. Smith T, Waterman M: RNA secondary structure. Math. Biol. 1978, 42: 31-49.
13. Zuker M: On finding all suboptimal foldings of an RNA molecule. Science. 1989, 244: 48-52. 10.1126/science.2468181
14. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163
15. Rivas E, Eddy SR: A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots. J Mol Biol. 1999, 285: 2053-2068. 10.1006/jmbi.1998.2436
16. Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theor Comp Sci. 1999, 210: 277-303. 10.1016/S0304-3975(98)00090-5
17. Akutsu T: Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr Appl Math. 2000, 104: 45-62. 10.1016/S0166-218X(00)00186-4
18. Lyngsø RB, Pedersen CN: RNA pseudoknot prediction in energy-based models. J Comp Biol. 2000, 7: 409-427. 10.1089/106652700750050862
19. Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics. 2003, 19 S1: i66-i73.
20. Dirks RM, Pierce NA: A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem. 2003, 24: 1664-1677. 10.1002/jcc.10296
21. Deogun JS, Donis R, Komina O, Ma F: RNA secondary structure prediction with simple pseudoknots. Proceedings of the second conference on Asia-Pacific bioinformatics (APBC 2004), Australian Computer Society. 2004, 239-246.
22. Reeder J, Giegerich R: Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics. 2004, 5: 104. 10.1186/1471-2105-5-104
23. Li H, Zhu D: A New Pseudoknots Folding Algorithm for RNA Structure Prediction. COCOON 2005, Volume 3595. Edited by: Wang L. Springer, Berlin, 2005, 94-103.
24. Matsui H, Sato K, Sakakibara Y: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics. 2005, 21: 2611-2617. 10.1093/bioinformatics/bti385
25. Kato Y, Seki H, Kasami T: RNA Pseudoknotted Structure Prediction Using Stochastic Multiple Context-Free Grammar. IPSJ Digital Courier. 2006, 2: 655-664.
26. Chen HL, Condon A, Jabbari H: An O(n^{5}) Algorithm for MFE Prediction of Kissing Hairpins and 4-Chains in Nucleic Acids. J Comp Biol. 2009, 16: 803-815. 10.1089/cmb.2008.0219
27. Waterman MS: Secondary structure of single-stranded nucleic acids. Adv Math (Suppl Studies). 1978, 1: 167-212.
28. Orland H, Zee A: RNA folding and large N matrix theory. Nuclear Physics B. 2002, 620: 456-476. 10.1016/S0550-3213(01)00522-3
29. Wexler Y, Zilberstein C, Ziv-Ukelson M: A study of accessible motifs and RNA complexity. J Comput Biol. 2007, 14 (6): 856-872. 10.1089/cmb.2007.R020
30. Salari R, Möhl M, Will S, Sahinalp C, Backofen R: Time and space efficient RNA-RNA interaction prediction via sparse folding. Proc of RECOMB. 2010, 6044: 473-490.
31. Backofen R, Tsur D, Zakov S, Ziv-Ukelson M: Sparse RNA folding: Time and space efficient algorithms. J Disc Algor. 2011, 9 (1): 12-31. 10.1016/j.jda.2010.09.001
32. Möhl M, Salari R, Will S, Backofen R, Sahinalp SC: Sparsification of RNA structure prediction including pseudoknots. Algorithms Mol Biol. 2010, 5: 39. 10.1186/1748-7188-5-39
33. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 1105-1119. 10.1002/bip.360290621
34. Dimitrieva S, Bucher P: Practicality and time complexity of a sparsified RNA folding algorithm. J Bioinfo Comput Biol. 2012, 10 (2): 1241007. 10.1142/S0219720012410077
35. Kafri Y, Mukamel D, Peliti L: Why is the DNA Denaturation Transition First Order?. Phys Rev Lett. 2000, 85: 4988-4991. 10.1103/PhysRevLett.85.4988
36. Kabakcioglu A, Stella AL: A scale-free network hidden in the collapsing polymer. Phys Rev E. 2005, 72: 055102.
37. Vanderzande C: Lattice models of polymers. Cambridge University Press, New York, 1998.
38. NCBI database. [http://www.ncbi.nlm.nih.gov/guide/dnarna/#downloads_]
39. Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for Loop Matching. SIAM J Appl Math. 1978, 35: 68-82. 10.1137/0135006
40. Nebel ME: Investigation of the Bernoulli model for RNA secondary structures. Bull Math Biol. 2003, 66 (5): 925-964.
41. Zagier D: On the distribution of the number of cycles of elements in symmetric groups. Nieuw Arch Wisk IV. 1995, 13: 489-495.
42. Flajolet P, Sedgewick R: Analytic Combinatorics. Cambridge University Press, New York, 2009.
43. Han HSW, Reidys CM: The 5’-3’ distance of RNA secondary structures. J Comput Biol. 2012, 19 (7): 867-878. 10.1089/cmb.2011.0301
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.