 Research
 Open Access
 Published:
Random generation of RNA secondary structures according to native distributions
Algorithms for Molecular Biology volume 6, Article number: 24 (2011)
Abstract
Background
Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest.
Results
In this article, we present a new general framework for deriving algorithms for the nonuniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic contextfree grammar. Briefly, the framework extends on the wellknown recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the nonuniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from reallife data by using a very detailed contextfree grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to wellknown sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time $\mathcal{O}\left({n}^{2}\right)$ for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worstcase time complexity $\mathcal{O}\left(m\cdot n\cdot log\left(n\right)\right)$ while other algorithms typically have a runtime in $\mathcal{O}\left(m\cdot {n}^{2}\right)$. Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.
Conclusion
A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.unikl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
Background and Introduction
The topic of random generation algorithms (also called samplers) has been widely studied by computer scientists. As stated in [1], it has been examined under different perspectives, including combinatorics, algorithmics (design and/or engineering), as well as probability theory, where two of the main motivations for random sampling are the testing of combinatorial properties of structures (e.g. conjectured structural properties, quantitative aspects), as well as the testing of properties of the corresponding algorithms (with respect to correctness and/or efficiency).
As considers software engineering, the socalled random testing approach is commonly used to test implementations of particular algorithms, as it is usually not feasible to consider all possible inputs and unknown which of these inputs are among the most interesting ones. In fact, this approach requires for the generation of random instances of program inputs that obey various sorts of syntactic and semantic constraints (where the random instances usually ought to be of a preliminarily fixed input size in order to be comparable to each other).
In the Bioinformatics area, algorithms for generating random biological sequences have been investigated for a long time (see e.g. [2, 3]). As stated in [4], random sequences are a topic of great interest in genome analysis, since according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Thus, random generation of combinatorial objects can be used in this context for simulations studies in order to isolate signal (unexpected events) from noise (statistically unavoidable regularities). In fact, according to [4], random biological sequences are for instance widely used for the detection of overrepresented and underrepresented motifs, as well as for determining whether scores of pairwise alignments are relevant or not: although there exist analytic approaches for these kinds of problems, for the most complex cases, it is often still necessary to be able to alternatively use a corresponding experimental approach (based on randomly generated sequences obtained from a computer programm). For this purpose, random sequences must obviously obey to a certain model that takes into account some relevant properties of actual reallife sequences, where such models are usually based on statistical parameters only. However, it is known that these classical models can be enriched by adding structural parameters (see [4]). Over the past years, several methods have been proposed for the random generation of more complex structures, where special attention has been paid to RNA secondary structures. RNA is a singlestranded nucleotide polymer and a major component of cellular processes (like DNA and proteins). An RNA strand is formed by linking together certain nucleotide units. The specific sequence of nucleotides along this chain is called the primary structure of the molecule. By pairing of nucleotides that are not linked in this chain (i.e. by the socalled effects of base pairing), the linear primary structure is folded into a threedimensional conformation, called the tertiary structure, which in many cases determines the function of the molecule. Most of the 3D structure is determined by the intramolecular basepairing interactions in the plane, which together form the secondary structure of the molecule. For this reason, pseudoknots (induced by crossing base pairs) are considered as tertiary interactions and are usually not permitted in the definition of secondary structure. As unknotted structures contain only nested base pairs and are thus essentially twodimensional, they can be modeled as planar graphs. This rather descriptive and commonly used planar graph model for RNA secondary structures was first formalized in [5]. An example is shown in Figure 1.
Most of the existing random generation algorithms for RNA secondary structures are used for predicting the structure of a given RNA sequence (see e.g. [6, 7]), while others can be employed for instance for evaluating structure comparison softwares [8]. Note that secondary structure prediction methods based on random sampling represent a nondeterministic counterpart to the uptodate most successful and popular physicsbased prediction methods that make use of the energy minimization paradigm and are realized by dynamic programming algorithms (see e.g. [9–12]). Random sampling also differs from the stochastical RNA structure prediction approach that is based on contextfree modeling of structural motifs and adding some statistical parameters observed in reallife data by assigning probabilities to the corresponding motifs (see e.g. [13–15]). Nevertheless, it should be mentioned that statistical sampling methods like [6, 7] used for RNA structure prediction are based on thermodynamics and thus inevitably inherit the problems and imprecisions related to energy minimizing methods, which are caused by the still incomplete commonly used free energy models for RNAs. In order to overcome these pitfalls, one could take the competing point of view and consider only typical structural information observed in a set of sample data as the basis for a new random generation method. If that information draws a realistic picture for all the different motifs of a molecule's folding, the corresponding sampling method is likely to produce realistic results. Accordingly, several authors made use of stochastic context free grammars and employed machinelearning techniques to train parameter values from a set of known secondary structures. Such grammars have widely been used in a predictive mode (see, e.g., [14]) but there are also successful examples of applications where the random sampling of derivation trees has been the core of the method (see, e.g., [16–18] but also [19] for examples). In the present paper, we follow that line of ideas and rely on the approach of the technical report [20] to develop a new algorithm for the (nonuniform) random generation of RNA secondary structures (without pseudoknots) according to a distribution induced by a set of sample RNA data (note that the algorithm actually generates secondary structures for a preliminary fixed size, not for a given RNA sequence of this size, which means we take the combinatorial point of view and completely abstract from sequence).
The main contribution of this manuscript is the derivation of a new and efficient algorithm for the random generation of RNA secondary structures according to an elaborate and thus very realistic model. For this purpose we use and generalize the approach from [20]. Particularly, our random generation method is based on a sophisticated contextfree grammar for unknotted structures which, in order to model the class of all considered RNA secondary structures as realistic as possible, distinguishes between all known structural motifs that may occur in unknotted RNA secondary structure. This means that any structural feature is modeled by one or more specific grammar rules with corresponding probabilities observed from reallife data. Note that this grammar is actually a special variant of the comprehensive grammar used in [21] for deriving a realistic RNA structure model and for performing the first ever analytical analysis of the expected free energy of a random secondary structure (of a specified RNA type). Actually, that grammar has been designed as a mirror of the famous Turner energy model [22, 23] which serves as the foundation for most of the existing physicsbased RNA structure prediction methods: all structural motifs for which there are different thermodynamic rules and parameters are created by distinct production rules (with corresponding probabilities).
According to [20], our sampling method involves a weighted unranking algorithm for obtaining the final structures. Briefly, considering an arbitrary structure class of size (cardinality) c, a corresponding unranking method uses a welldefined ordering of all class elements (according to a particular numbering scheme, the socalled ranking method) and for a given input number r ∈ {1,..., c} outputs the structure with rank r in the considered ordering. That way, the random sampling based on a stochastic grammar  building heavily on the use of small floating point numbers  is translated into an unranking algorithm using integer values only. Notably, a complete structure of size n is generated by recursively unranking the distinct structural components from the corresponding subclasses (of substructures with sizes less than n). In our case, the weighted unranking algorithm requires a precomputation step in worstcase time $\mathcal{O}\left({n}^{2}\right)$ for computing all weighted class sizes up to input size n. The worstcase complexity for generating a secondary structure of size n at random is then given by $\mathcal{O}\left(n\phantom{\rule{0.3em}{0ex}}log\phantom{\rule{0.3em}{0ex}}n\right)$ since we are ranking structures according to the boustrophedon order (see e.g. [7]).
By the end of this paper, we analyze the quality of randomly generated structures by considering some experimental results. First, we will consider statistical indicators of many important parameters related to particular structural motifs and compare the ones observed in the used sample set of real world RNA data to those observed in a corresponding set of random structures. Their comparison measures indicate that our method actually generates realistic RNA structures. Obviously, an algorithm which, for a given structure size n, produces random RNA secondary structures that are  related to expected shapes of such structures in most cases realistic is a major improvement over existing approaches which, for example, are only capable of generating secondary structures uniformly for size n. Furthermore, we will consider the two different free energy models defined in [21] for RNA secondary structure (with unknown RNA sequence) to get further evidence of the good quality of our random generation method (with respect to free energies and thus rather likely also with respect to appearance of the different structural motifs of RNA).
Prior Results and Basic Definitions
Uniform Random Generation
In the past, the problem of uniform random generation of combinatorial structures, that is the problem of randomly generating objects (of a preliminary fixed input size) of a specified class that have the same or similar properties, has been extensively studied. Special attention has been paid on the wide class of decomposable structures which are basically defined as combinatorial structures that can be constructed recursively in an unambiguous way.
In principle, two general (systematic) approaches have been developed for the uniform generation of these structures: First, the recursive method originated in [24] (to generate various data structures) and later systematized and extended in [25] (to decomposable data structures), where general combinatorial decompositions are used to generate objects at random based on counting possibilities. Second and more recently, the socalled Boltzmann method [1, 26], where random objects (under the corresponding Boltzmann model) have a fluctuating size, but objects with the same size invariably occur with the same probability. Note that according to [26], Boltzmann samplers may be employed for approximatesize (objects with a randomly varying size are drawn) as well as fixedsize (objects of a strictly fixed size are drawn) random generation and are an alternative to standard combinatorial generators based on the recursive method. However, fixedsize generation is considered the standard paradigm for the random generation of combinatorial structures.
(Admissible) Constructions and Specifications
According to [25], a decomposable structure is a structure that admits an equivalent combinatorial specification:
Definition 0.1 ( [25]). Let $\mathcal{A}=\left({\mathcal{A}}_{1},\mathrm{...},{\mathcal{A}}_{r}\right)$ be an rtuple of classes of combinatorial structures. A specification for $\mathcal{A}$ is a collection or r equations with the i th equation being of the form ${\mathcal{A}}_{i}={\varphi}_{i}\left({\mathcal{A}}_{1},...,{\mathcal{A}}_{r}\right)$, where ϕ_{ i }denotes a term built of the ${\mathcal{A}}_{j}$ using the constructions of disjoint union, cartesian product, sequence, set and cycle, as well as the initial (neutral and atomic) classes.
The needed formalities that will also be used in the sequel are given as follows:
Definition 0.2 ( [27]). If $\mathcal{A}$ is a combinatorial class, then ${\mathcal{A}}^{n}$ denotes the class of objects in $\mathcal{A}$ that have size (defined as number of atoms) n. Furthermore:

Objects of size 0 are called neutral objects or tags and a class consisting of a single neutral object ϵ is called a neutral class, which will be denoted by ε (ε_{1}, ε_{2},... to distinguish multiple neutral classes containing the objects ϵ_{1}, ϵ_{2}, ..., respectively).

Objects of size 1 are called atomic objects or atoms and a class consisting of a single atomic object is called an atomic class, which will be denoted by Ƶ(Ƶ_{ a }, Ƶ_{ b },... to distinguish the classes containing the atoms a,b,..., respectively).

If ${\mathcal{A}}_{1},...,{\mathcal{A}}_{k}$ are combinatorial classes and ϵ_{1}, ..., ϵ_{ k }are neutral objects, the combinatorial sum or disjoint union is defined as ${\mathcal{A}}_{1}+...+{\mathcal{A}}_{k}:=\left({\epsilon}_{1}\times {\mathcal{A}}_{1}\right)\cup ...\cup \left({\epsilon}_{k}\times {\mathcal{A}}_{k}\right)$ where ∪ denotes set theoretic union.

If $\mathcal{A}$ and $\mathcal{B}$ are combinatorial classes, the cartesian product is defined as $\mathcal{A}\times \mathcal{B}:=\left\{\left(\alpha ,\beta \right)\alpha \in \mathcal{A}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}\beta \in \mathcal{B}\right\}$, where size(α, β) = size(α) + size(β).
Note that the constructions of disjoint union, cartesian product, sequence, set and cycle are all admissible:
Definition 0.3 ( [27]). Let $\varphi $ be an mary construction that associates to a any collection of classes ${\mathcal{B}}_{1},...,{\mathcal{B}}_{m}$ a new class $\mathcal{A}:=\varphi \left[{\mathcal{B}}_{1},...,{\mathcal{B}}_{m}\right]$. The construction $\varphi $ is admissible iff the counting sequence (a_{ n }) of $\mathcal{A}$ only depends on the counting sequences (b_{1,n}),..., (b_{m,n}) of ${\mathcal{B}}_{1},...,{\mathcal{B}}_{m}$, where the counting sequence of a combinatorial class $\mathcal{A}$ is the sequence of integers (a_{ n })_{n≥0} for ${a}_{n}=\mathsf{\text{card}}\left({\mathcal{A}}^{n}\right)$.
The framework of (admissible) specifications obviously resembles that of contextfree grammars (CFGs) known from formal language theory (note that we assume the reader has basic knowledge of the notions concerning contextfree languages and grammars. An introduction can be found for instance in [28]). In order to translate a CFG into the framework of admissible constructions, it is sufficient to make each terminal symbol an atom and to assume each nonterminal A to represent a class $\mathcal{A}$ (the set of all words which can be derived from nonterminal A). However, for representing CFGs, only the admissible constructions disjoint union, cartesian product and sequence are needed: Words are constructed as cartesian products of atoms, sentential forms as cartesian products of atoms and the classes assigned to the corresponding nonterminal symbols. For instance, a production rule A → aB translates into the symbolic equation $\mathcal{A}=a\times \mathcal{B}$. Different production rules with the same lefthand side give rise to the union of the corresponding cartesian products. Nevertheless, it should be noted that [25] also shows how to reduce specifications to standard form, where the corresponding standard specifications constitute the basis of the recursive method for uniform random generation and extends the usual Chomsky normal form (CNF) for CFGs. Briefly, in standard specifications, all sums and products are binary and the constructions of sequences, sets and cycles are actually replaced with other constructions (for details see [25]).
The prime advantage of standard specifications is that they translate directly into procedures for computing the sizes of all combinatorial subclasses of the considered class $\mathcal{C}$ of combinatorial objects. This means they can be used to count the number of structures of a given size that are generated from a given nonterminal symbol. Moreover, standard specifications immediately translate into procedures for generating one such structure uniformly at random. The corresponding procedures (for class size calculations and structure generations) are actually required for (uniform) random generation of words of a given CFG by means of unranking.
Simply speaking, the unranking of decomposable structures (like for instance RNA secondary structures which can be uniquely decomposed into distinct structural components) works as follows: Each structure s in the combinatorial class ${\mathcal{S}}^{n}$ of all feasible structures having size n is given a number (rank) $i\in \left\{0,...,\mathsf{\text{card}}\left({\mathcal{S}}^{n}\right)1\right\}$, defined by a particular ranking method. Based on this ordering of the considered structure class ${\mathcal{S}}^{n}$, the corresponding unranking algorithm for a given input number $i\in \left\{0,...,\mathsf{\text{card}}\left({\mathcal{S}}^{n}\right)1\right\}$ computes the single structure $s\in {\mathcal{S}}^{n}$ having number i in the ranking scheme defined for class ${\mathcal{S}}^{n}$.
Note that in this context of unranking particular elements from a considered structure class, the corresponding algorithms make heavy use of their decomposability, as the distinct structural components are unranked from the corresponding subclasses. In fact, the class sizes can be derived according to the following recursion:
Note that when computing the sums for cartesian products, we can either consider the values for j in the sequential (also called lexicographic) order (1, 2, 3,..., n) or in the socalled boustrophedon order $\left(1,n,2,n1,...,\u2308\frac{n}{2}\u2309\right)$. In either case, given a fix number of considered combinatorial (sub)classes (or corresponding nonterminal symbols), the precomputation of all class size tables up to size n requires $\mathcal{O}\left({n}^{2}\right)$ operations on coefficients. One random generation step then needs $\mathcal{O}\left({n}^{2}\right)$ arithmetic operations when using the sequential method and $\mathcal{O}\left(n\cdot log\left(n\right)\right)$ operations when using the boustrophedon method (for details we refer to [25]). Obviously, using uniform unranking procedures to construct the i th structure of size n for a randomly drawn number i, any structure of size n is equiprobably generated. Consequently, in order to make sure that, for given size n and a sample set of random numbers i, the corresponding structures are in accordance with an appropriate probability distribution (as for instance observed from reallife RNA data), it is mandatory to use a corresponding nonuniform unranking method or an alternative nonuniform random generation approach.
NonUniform Random Generation
Coming back to the random testing problem from software engineering, we observe that generating objects of a given class of input data according to a uniform distribution is sufficient for testing the correctness of particular algorithms. However, if one intends to gather information about the "reallife behaviour" of the algorithm (e.g. with respect to runtime or space requirements), we need to perform simulations with input data that are as closely as possible related to corresponding application. This means to obtain suitable test data, we need to specify a distribution on the considered class that is similar to the one observed in real life and draw objects at random according to this (nonuniform) distribution. Deriving such a "realistic" distribution on a given class of objects can easily be done by modeling the class by an appropriate stochastic contextfree grammar (SCFG). Details will follow in the next section.
As regards RNA, it has been proven that both the combinatorial model (that is based on a uniform distribution such that all structures of a given size are equiprobable and that completely abstracts from the primary structure, see e.g. [29–31]) and the Bernoullimodel (which is capable of incorporating information on the possible RNA sequences for a given secondary structure, see e.g. [32–34]) for RNA secondary structures are rather unrealistic. However, modeling these structures by an appropriate SCFG yields a more realistic RNA model, where the probability distribution on all structures is determined from a database of real world RNA data (see e.g. [35, 36]).
Based on this observation, the problem of nonuniform random generation of combinatorial structures has been recently addressed in [20]. There, it is described how to get algorithms for the random generation of objects of a previously fixed size according to an arbitrary (nonuniform) distribution implied by a given SCFG. In principle, the construction scheme introduced in [20] extends on the recursive method for the (uniform) random generation [25] and adapted it to the problem of unranking of [37]: the basic principle is that any (complex) combinatorial class can be decomposed into (or can be constructed from) simpler classes by using admissible constructions.
Essentially, in [20], a new admissible construction called weighting has been introduced in order to make nonuniform random generation possible. By weighting, we understand the generation of distinguishable copies of objects. Formally:
Definition 0.4. If $\mathcal{A}$ is a combinatorial class and λ is an integer, the weighting of $\mathcal{A}$ by λ is defined as . We will call two objects from a combinatorial class copies of the same object iff they only differ in the tags added by weighting operations.
For example, if we weight the class $\mathcal{A}=\left\{a\right\}$ by two, we assume the result to be the set {a, a}; weighting $\mathcal{B}=\left\{b\right\}$ by three generates {b,b,b}. Thus, $2\mathcal{A}+3\mathcal{B}=\left\{a,a,b,b,b\right\}$ and within this class, a has relative frequency $\frac{2}{5}$, while b has relative frequency $\frac{3}{5}$. Hence, this way it becomes possible to regard nonuniformly distributed classes.
As weighting a class can be replaced by a disjoint union, $\mathsf{\text{size}}\left(\lambda \mathcal{A},n\right)=\lambda \cdot \mathsf{\text{size}}\left(\mathcal{A},n\right)$ and the complexity results from [37] also hold for weighted classes. Hence, the corresponding class size computations up to n need $\mathcal{O}\left({n}^{2}\right)$ time.
Stochastic ContextFree Grammars
As already mentioned, stochastic contextfree grammars (SCFGs) are a powerful tool for modeling combinatorial classes and the essence of the nonuniform random sampling approach that will be worked out in this article. Therefore, we will now give the needed background information.
Basic Concepts
Briefly, SCFGs are an extension of traditional CFGs: usual CFGs are only capable of modeling the class of all generated structures and thus inevitably induce a uniform distribution on the objects, while SCFGs additionally produce a (nonuniform) probability distribution on the considered class of objects. In fact, an SCFG is derived by equipping the productions of a corresponding CFG with probabilities such that the induced distribution on the generated language models as closely as possible the distribution of the sample data.
The needed formalities are given as follows:
Definition 0.5 ( [38]). A weighted contextfree grammar (WCFG) is a 5tuple $\mathcal{G}=\left(I,T,R,S,W\right)$, where I (resp. T) is an alphabet (finite set) of intermediate (resp. terminal) symbols (I and T are disjoint), S ∈ I is a distinguished intermediate symbol called axiom, R ⊂ I × (I ∪ T)* is a finite set of production rules and W : R → ℝ^{+} is a mapping such that each rule f ∈ R is equipped with a weight w_{ f }: = W(f). If $\mathcal{G}$ is a WCFG, then $\mathcal{G}$ is a stochastic contextfree grammar (SCFG) iff the following additional restrictions hold:

1.
For all f ∈ R, we have W(f) ∈ (0,1], which means the weights are probabilities.

2.
The probabilities are chosen in such a way that for all A ∈ I, we have ${\sum}_{f\in R,Q\left(f\right)=A}{w}_{f}=1$, where Q(f) denotes the premise of the production f, i.e. the first component A of a production rule (A, α) ∈ R. In the sequel, we will write w_{ f }: A → α instead of f = (A, α) ∈ R, w_{ f }= w(f).
However, at this point, we decided to not recall the basic concepts regarding SCFGs, as they are not really necessary for the understanding of this article. The interested reader is referred to the corresponding section in [21]. For a more fundamental introduction on stochastic contextfree languages, see for example [39]. In fact, the only information needed in the sequel is that if structures are modeled by a consistent SCFG, then the probability distribution on the production rules of the SCFG implies a probability distribution on the words of the generated language and thus on the modeled structures. To ensure that a SCFG gets consistent, one can for example assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample of words from the generated language. For unambiguous SCFGs, the relative frequencies can actually be counted efficiently, as for every word, there is only one leftmost derivation to consider.
Modeling RNA Secondary Structure via SCFGs
Besides the popular planar graph representation of unknotted secondary structures, many other ways of formalizing RNA folding have been described in literature. One wellestablished example is the so called barbracket representation, where a secondary structure is modeled as a string over the alphabet Σ: = {(,), }, with a bar  and a pair of corresponding brackets ( ) representing an unpaired nucleotide and two paired bases in the molecule, respectively (see, e.g. [30]). Obviously, both models abstract from primary structure, as they only consider the number of base pairs and unpaired bases and their positions. Moreover, there exists a onetoone correspondence between both representations, as illustrated by the following example:
Example 0.1. The secondary structure shown in Figure 1 has the following equivalent barbracket representation that can be decomposed into subwords corresponding to the basic structural motifs that are distinguished in stateoftheart thermodynamic models:
Note that the reading order of secondary structures is from left to right, which is due to the chemical structure of the molecule.
Consequently, secondary structures without pseudoknots can be encoded as words of a contextfree language and the class of all feasible structures can thus effectively be modeled via a corresponding CFG. Basically, that CFG can be constructed to describe a number of classical constraints (e.g. the presence of particular motifs in structures) and it can also express longrange interactions (e.g. base pairings). By extending it to a corresponding SCFG, we can also model the fact that specific motifs of RNA secondary structures are more likely to be folded at certain stages than others (and not all possible motifs are equiprobable at any folding stage).
In fact, it is known for a long time that SCFGs can be used to model RNA secondary structures (see e.g. [40]). Additionally, SCFGs have already been used successfully for the prediction of RNA secondary structure [14, 15]. Moreoever, they can be employed for identifying structural motifs as well as for deriving stochastic RNA models that are  with respect to the expected shapes  more realistic than other models [36]. Furthermore, note that an SCFG mirror of the famous Turner energy model has been used in [21] to perform the first analytical analysis of the free energy of RNA secondary structures; this SCFG marks a cornerstone between stochastic and pyhsicsbased approaches towards RNA structure prediction.
Random Generation With SCFGs
SCFGs can easily be used for the random generation of combinatorial objects according to the probability distribution induced by a sample set, where the only problem is that they do not allow the user to fix the length of generated structures. In particular, given an SCFG $\mathcal{G}$ and the corresponding language (combinatorial class) $\mathcal{L}\left(\mathcal{G}\right)$, a random word $w\in \mathcal{L}\left(\mathcal{G}\right)$ can be generated in the following way:

Start with the sentential form S (where S denotes the axiom of the grammar $\mathcal{G}$).

While there are nonterminal symbols (in the currently considered sentential form), do the following:1) Let A denote the leftmost nonterminal symbol.

2)
Draw a random number r from the interval (0,1].

3)
Substitute symbol A by the righthand side α of the production A → α determined by the random number r.

2)
This means consider all m ≥ 1 rules p_{1} : A → α_{1},..., p_{ m }: A → α_{ m }having lefthand side A, where according to the definition of SCFGs, ${\sum}_{i=1}^{m}{p}_{i}=1$ must hold. Then, find k ≥ 1 with ${\sum}_{i=1}^{k1}{p}_{i}<r\le {\sum}_{i=1}^{k}{p}_{i}$, i.e. determine k ≥ 1 with $r\in \left({\sum}_{i=1}^{k1}{p}_{i},{\sum}_{i=1}^{k}{p}_{i}\right]$. The production corresponding to the randomly drawn number r ∈ (0,1] is then given by A → α_{ k }and hence, in the currently considered sentential form, the nonterminal symbol A is substituted by α_{ k }.

If there are no more nonterminal symbols, then the currently considered sentential form is equal to a word $w\in \mathcal{L}\left(\mathcal{G}\right);w$ has been randomly generated.
Note that the choice of the production made in 3) according to the previously drawn random number is appropriate, since it is conform to the probability distribution on the grammar rules.
Example 0.2. Consider the language generated by the SCFG with productions ¾: S → ϵ and ¼: S → (S). Thus, we start with the sentential form S, then consider the leftmost nonterminal symbol, which is given by S, and draw a random number r ∈ (0,1]. If 0 <r ≤ ¾, the production determined by r is S → ϵ and thus, we get the empty word and are finished. Otherwise, ¾ <r ≤ ¾ + ¼ = 1, which means we have to consider A → (S) for the substitution in step 3) and thus obtain the sentential form (S). Afterwards, we must repeat the process, as there is still one nonterminal symbol left.
Unfortunately, there is one major problem that comes with this approach for the (nonuniform) random generation of combinatorial objects: The underlying (consistent) SCFG $\mathcal{G}$ implies a probability distribution on the whole language $\mathcal{L}\left(\mathcal{G}\right)$, such that we generate a word of arbitrary size. In order to fix the size, we can proceed along the following lines:

1)
We translate the grammar $\mathcal{G}$ into a new framework which allows to consider fixed sizes for the random generation, such that

2)
the distribution implied on $\mathcal{L}\left(\mathcal{G}\right)$ conditioned on any fixed size n is kept within the new framework.
A wellknown approach which allows for 1) is connected to the concept of admissible constructions used to describe a decomposable combinatorial class (see above). As the operations (like cartesian products, unions, and so on) used to construct the combinatorial objects are also used to define an order on them, it becomes possible to identify the i th object of a given size and the problem of generating objects uniformly at random reduces to the problem of unranking, that is the problem of constructing the object of order (rank) i, for i a random number (see e.g. [41]).
Remark. Some might think that with an appropriate SCFG (modeling a given class of objects) at hand, it is not really necessary to use an unranking method that implies cumbersome formalities such as admissible constructions and decomposable classes if we want to generate random objects of a fixed size n. As a matter of principle, they are right  we could also use a conditional sampling method: If we need to generate a word of size n from nonterminal symbol A, where there are m ≥ 1 rules f_{ i }= A → α_{ i }, 1 ≤ i ≤ m, having lefthand side A, then we just need to choose the next production f_{ i }according to
which is the posterior probability that we used production rule f_{ i }under the condition that a word of size n is generated.
Similarly, if the production rule is of the type A → BC (assuming the grammar is in Chomsky normal form (CNF), which does not pose a problem, as an unambiguous SCFG can be efficiently transformed into CNF [39]), we can choose a way to split size n into sizes j and n  j for the lengths generated from nonterminal symbols B and C. This requires precomputing n lengthdependent probabilities (i.e. all probabilities for generating a word of any length up to n) for each nonterminal symbol, which might seem to be similar (with respect to complexity) to precomputing all class sizes up to n for all considered combinatorial (sub)classes as needs to be done for unranking. However, there is a striking difference between the two approaches: While conditional sampling makes heavy use of rather small floating point values  with all the wellknown problems and discomforting details like underflows or using logarithms associated with it  our unranking approach builds on integer values only which we assume a major advantage. There is another striking difference: lengthdependent probabilities (which by the way yield a socalled lengthdependent SCFG (LSCFG), see [42], and already have been used in [43]), require a very rich training set. In fact, if the RNA data set used for determining the distribution induced by the grammar is not rich enough, then the corresponding stochastic RNA model is underestimated and its quality decreases. This is especially a problem when considering comprehensive CFGs that distinguish between many different structural motifs in order to get a realistic picture of the molecules' behaviour; such a grammar should however be preferred over simple lightweight grammars as basis for a nonuniform random generation method. Nevertheless, this problem does not surface when sticking to conventional probabilities and the corresponding traditional SCFG model. Actually, since we consider a huge CFG where all possible structural motifs are created by distinct productions, we generally obtain realistic probability distributions and RNA models (see [21]).
Finally note that of course we could make use of random sampling strategies originally designed to sample structures connected to a given sequence in order to generate a random secondary structure only. However, such algorithms typically use a linear time to sample a single base pair (see, e.g., [6]) such that the time to sample a complete structure is quadratic in its length. This causes no problems for the original application of such algorithms since the sequencedependent preprocessing which is part of their overall procedure is at least quadratic in time and thus the dominating part. Here our approach is of advantage (replacing a factor n by log(n)) and since our preprocessing only depends on the size of the structure to be generated it is performed once and stored to disk for later reuse. Last but not least we are not sure, if the different existing approaches just mentioned could easily be made as fast as ours by simple changes only.
Bottom line is that hooking up to unranking of combinatorial classes offers three significant benefit compared to conditional sampling, namely a fast sampling strategy, the usage of integers instead of floating point values and a greater independence of the richness of the training data (compared to lengthdependent models). For this reason, we assume our unranking algorithm a valuable contribution, even though it requires a more cumbersome framework.
Unranking of Combinatorial Objects
The problem of unranking can easily be solved along the composition of the objects at hand, i.e. the operations used for its construction, once we know the number of possible choices for each substructure. Assume for example we want to unrank objects from a class $\mathcal{C}=\mathcal{A}+\mathcal{B}$. We will assume all elements of $\mathcal{A}$ to be of smaller order than those of $\mathcal{B}$ (this way we use the construction of the class to imply an ordering). Finding the i th element of $\mathcal{C}$, i.e. unranking class $\mathcal{C}$, now becomes possible by deciding whether $i<\mathsf{\text{card}}\left(\mathcal{A}\right)$. In this case, we recursively call the unranking procedure for $\mathcal{A}$. Otherwise (i.e. if $i\ge \mathsf{\text{card}}\left(\mathcal{A}\right)$), we consider $\mathcal{B}$, searching for its ($i\mathsf{\text{card}}\left(\mathcal{A}\right)$))th element.
Formally, we first need to specify an order on all objects of the considered combinatorial class that have the same size. This can be done in a recursive way according to the admissible specification of the class:
Definition 0.6 ( [37]). Neutral and atomic classes contain only one element, such that there is only one possible ordering. Furthermore, let ${<}_{{C}^{n}}$ denote the ordering within the combinatorial class ${\mathcal{C}}^{n}$, then

If $\mathcal{C}={\mathcal{A}}_{1}+...+{\mathcal{A}}_{k}$ and $\gamma ,{\gamma}^{\prime}\in {\mathcal{C}}^{n}$, then $\gamma {<}_{{C}^{n}}{\gamma}^{\prime}$ iff
$$\left[\gamma \in {\left({\mathcal{A}}_{i}\right)}^{n}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{\gamma}^{\prime}\in {\left({\mathcal{A}}_{j}\right)}^{n}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}i<j\right]\phantom{\rule{0.3em}{0ex}}\mathsf{\text{or}}\phantom{\rule{0.3em}{0ex}}\left[\gamma ,{\gamma}^{\prime}\in {\left({\mathcal{A}}_{i}\right)}^{n}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}\gamma {<}_{{\left({\mathcal{A}}_{i}\right)}^{n}}{\gamma}^{\prime}\right].$$ 
If $\mathcal{C}=\mathcal{A}\times \mathcal{B}$ and $\gamma =\left(\alpha ,\beta \right),{\gamma}^{\prime}=\left({\alpha}^{\prime},{\beta}^{\prime}\right)\in {\mathcal{C}}^{n}$, then $\gamma {<}_{{C}^{n}}{\gamma}^{\prime}$ iff
$$\left[\mathsf{\text{size}}\left(\alpha \right)<\mathsf{\text{size}}\left({\alpha}^{\prime}\right)\right]\mathsf{\text{or}}\left[j=\mathsf{\text{size}}\left(\alpha \right)=\mathsf{\text{size}}\left({\alpha}^{\prime}\right)\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}\alpha {<}_{{\left(\mathcal{A}\right)}^{j}}{\alpha}^{\prime}\right]\mathsf{\text{or}}\left[\alpha ={\alpha}^{\prime}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}\beta {<}_{\left(\mathcal{B}\right)nj}{\beta}^{\prime}\right]$$
when considering the lexicographic order (1, 2, 3,..., n), which is induced by the specification ${\mathcal{C}}^{n}={\mathcal{A}}^{0}\times {\mathcal{B}}^{n}+{\mathcal{A}}^{1}\times {\mathcal{B}}^{n1}+{\mathcal{A}}^{2}\times {\mathcal{B}}^{n2}+...+{\mathcal{A}}^{n}\times {\mathcal{B}}^{0}$.

If $\mathcal{C}=\mathcal{A}\times \mathcal{B}$ and $\gamma =\left(\alpha ,\beta \right),{\gamma}^{\prime}=\left({\alpha}^{\prime},{\beta}^{\prime}\right)\in {\mathcal{C}}^{n}$, then $\gamma \gamma {<}_{{C}^{n}}{\gamma}^{\prime}$ iff
$$\begin{array}{c}\left[min\left(\mathsf{\text{size}}\left(\alpha \right),\mathsf{\text{size}}\left(\beta \right)\right)<min\left(\mathsf{\text{size}}\left({\alpha}^{\prime}\right),\mathsf{\text{size}}\left({\beta}^{\prime}\right)\right)\right]\mathsf{\text{or}}\\ \left[min\left(\mathsf{\text{size}}\left(\alpha \right),\mathsf{\text{size}}\left(\beta \right)\right)=min\left(\mathsf{\text{size}}\left({\alpha}^{\prime}\right),\mathsf{\text{size}}\left({\mathcal{B}}^{\prime}\right)\right)\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{size}}\left(\alpha \right)<\mathsf{\text{size}}\left({\alpha}^{\prime}\right)\right]\mathsf{\text{or}}\\ \left[j=\mathsf{\text{size}}\left(\alpha \right)=\mathsf{\text{size}}\left({\alpha}^{\prime}\right)\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{2.77695pt}{0ex}}\alpha {<}_{{\left(\mathcal{A}\right)}^{j}}{\alpha}^{\prime}\right]\mathsf{\text{or}}\left[\alpha ={\alpha}^{\prime}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}\beta {<}_{{\left(\mathcal{B}\right)}^{nj}}{\beta}^{\prime}\right]\end{array}$$
when considering the boustrophedon order $\left(1,n,2,n1,...,\u2308\frac{n}{2}\u2309\right)$, induced by the specification ${\mathcal{C}}^{n}={\mathcal{A}}^{0}\times {\mathcal{B}}^{n}+{\mathcal{A}}^{n}\times {\mathcal{B}}^{0}+{\mathcal{A}}^{1}\times {\mathcal{B}}^{n1}+{\mathcal{A}}^{n1}\times {\mathcal{B}}^{1}+...$
Considering ${<}_{{C}^{n}}$, the actual unranking algorithms are quite straightforward. Therefore, they will not be presented here and we refer to [20, 44] for details.
Recall that in [20], the basic approach towards nonuniform random generation is weighting of combinatorial classes, as this makes it possible that the classes are nonuniformly distributed. If those combinatorial classes are to correspond to a considered SCFG, we have to face the problem that the maximum likelihood (ML) training introduces rational weights for the production rules while weighting as an admissible construction needs integer arguments.
When translating rational probabilities into integral weights, we have to assure that the relative weight of each (unambiguously) generated word remains unchanged. This can be reached by scaling all productions by the same factor (common denominator of all probabilities), while ensuring that derivations are of equal length for words of the same size (ensured by using grammars in CNF). However, a much more elegant way is to scale each production according to its contribution to the length of the word generated, that is, productions lengthening the word by k will be scaled by c^{k}. Since we consider CFGs, the lengthening of a production of the form A → α is given by α  1. However, this rule leads to productions with a conclusion of length 1 not being reweighted, hence we have to assure that all those productions already have integral weights. Furthermore, ϵproductions need a special treatment. We don't want to discuss full details here and conclude by noticing that the reweighting normal form (RNF) keeps track of all possible issues:
Definition 0.7 ( [20]). If $\mathcal{G}=\left(I,T,R,S,W\right)$ is a WCFG, $\mathcal{G}$ is said to be in reweighting normal form (RNF) iff

1.
$\mathcal{G}$ is loopfree and ϵfree.

2.
For all A → α ∈ R with A = S, we have α ≤ 1.

3.
For all A → α ∈ R with A ≠ S, we have α > 1 or W(A → α) ∈ ℕ.

4.
For all A ∈ I there exists α ∈ (I ∪ T)* such that A → α ∈ R.
Note that the last condition (that any intermediate symbol occurs as premise of at least one production) is not required for reweighting, but necessary for the translation of a grammar into an admissible specification.
Definition 0.8 ( [20]). A WCFG $\mathcal{G}$ is called loopfree iff there exists no nonempty derivation A ⇒^{+} A for A ∈ I. It is called ϵfree iff there exists no (A, ϵ) ∈ R with A = S and there exists no (A, α_{1}Sα_{2}) ∈ R, where ϵ denotes the empty word.
If $\mathcal{G}$ and ${\mathcal{G}}^{\prime}$ are WCFGs, then $\mathcal{G}$ and ${\mathcal{G}}^{\prime}$ are said to be wordequivalent iff $\mathcal{L}\left(\mathcal{G}\right)=\mathcal{L}\left({\mathcal{G}}^{\prime}\right)$ and for each word $w\in \mathcal{L}\left(\mathcal{G}\right)$, we have W(w) = W'(w).
In [20], it is shown how to transform an arbitrary WCFG to a wordequivalent, loopfree and ϵfree grammar, that grammar to one in RNF and the latter to the corresponding admissible specification. Formally:
Theorem 0.1 ( [39]). If $\mathcal{G}$is a SCFG, there exists a SCFG ${\mathcal{G}}^{\prime}$in Chomsky normal form (CNF) that is wordequivalent to $\mathcal{G}$, and ${\mathcal{G}}^{\prime}$can be effectively constructed from $\mathcal{G}$.
The construction given in [39] assumes that $\mathcal{G}$ is ϵfree. It can however be extended to nonϵfree grammars by adding an additional step after the intermediate grammar $\mathcal{G}$ has been created (see e.g. [20]). Furthermore, it should be noted that an unambiguous grammar is inevitably loopfree.
Theorem 0.2 ( [20]). If $\mathcal{G}$is a loopfree, ϵfree WCFG, there exists a WCFG ${\mathcal{G}}^{\prime}$in RNF that is wordequivalent to $\mathcal{G}$and ${\mathcal{G}}^{\prime}$can be effectively constructed from $\mathcal{G}$.
Altogether, starting with an arbitrary unambiguous SCFG ${\mathcal{G}}_{0}$ that models the class of objects to be randomly generated, we have to proceed along the following lines:

Transform ${\mathcal{G}}_{0}$ to a corresponding ϵfree and loopfree SCFG ${\mathcal{G}}_{1}$.

Transform ${\mathcal{G}}_{1}$ into ${\mathcal{G}}_{2}$ in RNF (where all production weights are rational).

Reweight the production rules of ${\mathcal{G}}_{2}$ (such that all production weights are integral), yielding reweighted WCFG ${\mathcal{G}}_{3}$.

Transform ${\mathcal{G}}_{3}$ (with integral weights) into the corresponding admissible specification.

This specification (with weighted classes) can be translated directly into a recursion for the function size of all involved combinatorial (sub)classes (where class sizes are weighted) and

into generating algorithms for the specified (weighted) classes,

yielding the desired weighted unranking algorithm for generating random elements of $\mathcal{L}\left({\mathcal{G}}_{0}\right)$.
A small example that shows how to proceed from SCFG to reweighted normal form and the corresponding weighted combinatorial classes which allow for nonuniform generation by means of unranking is discussed in the Appendix.
Generating Random RNA Secondary Structures
We will now consider the previously discussed approach to construct a weighted unranking algorithm that generates random RNA secondary structures of a given size according to a realistic probability distribution. As for this paper, the corresponding probability distribution will be induced by a set of sample (SSU and LSU r)RNA secondary structures from the databases [45, 46], which will be referred to as biological database in the sequel. However, the presented algorithm can easily be used for any other distribution, which can be defined by a database of known RNA structures of a particular RNA type; our webservice implementation accessible at http://wwwagak.cs.unikl.de/NonUniRandGen is actually able to sample random secondary structures of any specified RNA type. A link to download an implementation of our algorithm (in Wolfram Mathematica) can be found there, too.
Considered Combinatorial Class
According to the common definition of RNA secondary structure, we decided to consider the combinatorial class of all RNA secondary structures without pseudoknots that meet the stereochemical constraint of hairpin loops consisting of at least 3 unpaired nucleotides, formally:
Definition 0.9 ( [21]). The language $\mathcal{L}$ containing exactly all RNA secondary structures is given by (note that according to this definition, completely unpaired structures are prohibited) $\mathcal{L}:={\mathcal{L}}_{u}{\mathcal{L}}_{lu}^{+}$, where ${\mathcal{L}}_{lu}:=({\mathcal{L}}_{l}){\mathcal{L}}_{u},{\mathcal{L}}_{u}:={\left\{\right\}}^{*}$ is the language of all barbracket representations of singlestranded regions and ${\mathcal{L}}_{l}$ is the language of all barbracket representations of other possible substructures, i.e. is the smallest language satisfying the following conditions:

1.
${\left\{\right\}}^{+}\backslash \left\{,\right\}\subset {\mathcal{L}}_{l}$ (barbracket representations of hairpin loops).

2.
If $w\in {\mathcal{L}}_{l}$, then $(w)\in {\mathcal{L}}_{l}$ (barbracket representation of a stacked pair).

3.
If $w\in {\mathcal{L}}_{l}$, then ${\left\{\right\}}^{+}\mathbf{(}\mathit{w}\mathbf{)}\subset {\mathcal{L}}_{l}$ and $(w){\left\{\right\}}^{+}\subset {\mathcal{L}}_{l}$ (barbracket representations of bulge loops).

4.
If $w\in {\mathcal{L}}_{l}$, then ${\left\{\right\}}^{+}\mathbf{(}\mathit{w}\mathbf{)}{\left\{\right\}}^{+}\subset {\mathcal{L}}_{l}$ (barbracket representations of interior loops).

5.
If ${w}_{1},...,{w}_{n}\in {\mathcal{L}}_{l}$ and n ≥ 2, then ${\mathcal{L}}_{u}\mathbf{\left(}{\mathit{w}}_{1}\mathbf{\right)}{\mathcal{L}}_{u}\mathbf{\left(}{\mathit{w}}_{2}\mathbf{\right)}\cdots {\mathcal{L}}_{u}\mathbf{\left(}{\mathit{w}}_{n}\mathbf{\right)}{\mathcal{L}}_{u}\subset {\mathcal{L}}_{l}$ (barbracket representations of multibranched loops).
The desired weighted unranking algorithm thus generates, for a given size n and a given number $i\in \left\{0,...,\mathsf{\text{card}}\left({\mathcal{L}}^{n}\right)1\right\}$, the i th secondary structure $s\in {\mathcal{L}}^{n}$, where $\mathsf{\text{card}}\left({\mathcal{L}}^{n}\right)=\mathsf{\text{size}}\left(\mathcal{L},n\right)$ is the number of elements in the weighted class ${\mathcal{L}}^{n}$.
Considered SCFG Model
First, we have to find a suitable SCFG that generates $\mathcal{L}$ and models the distribution of the sample data as closely as possible. To reach this goal, it is important to appropriately specify the set of production rules in order to guarantee that all substructures that have to be distinguished are generated by different rules. This is due to the fact that by using only one production rule f to generate different substructures (e.g. any unpaired nucleotides independent of the type of loop they belong to), there is only one weight (the probability p_{ f }of this production f) with which any of these substructures is generated, whereas the use of different rules f_{1},..., f_{ k }to distinguish between these substructures implies that they may be generated with different probabilities ${{p}_{f}}_{{}_{1}},...{p}_{{f}_{k}}$, where ${p}_{{f}_{1}}+...+{p}_{{f}_{k}}={p}_{f}$. This way, we ensure that more common substructures are generated with higher probabilities than less common ones.
Example 0.3. A (rather simple) unambiguous SCFG ${\mathcal{G}}_{s}$ generating the language $\mathcal{L}$ is given by:
This grammar unambiguously generates $\mathcal{L}$ for the following reasons:

Every sentential form C ( B ) C ( B ) ... ( B ) C obviously is generated in a unique way; this resembles $\mathcal{L}={\mathcal{L}}_{u}{\mathcal{L}}_{lu}^{+}$ and ${\mathcal{L}}_{lu}:=({\mathcal{L}}_{l}){\mathcal{L}}_{u}$ of $\mathcal{L}\prime s$ definition. The number of outermost pairs of brackets in the entire string uniquely determines the corresponding sentential form to be used.

Now B either generates a hairpinloop ^{≥3}, which unambiguously is possible by rules B →  C, C →  C and C → ϵ, or

B itself has to generate at least one additional pair of brackets. In this case, B → CA must be applied (only A can generate brackets) and then A → ( B ) C resp. A → ( B ) CA are used; the number of outermost brackets to be generated (from B under consideration) again uniquely determines that part of the derivation.
When changing the production w_{5}: B → CA used to generate any possible kloop for k ≥ 2 (any loop that is not a hairpin loop) with probability w_{5} into the two rules
where w_{5.1} + w_{5.2} = w_{5}, it becomes possible to generate any possible 2loop (i.e. a stacked pair, a bulge (on the left or on the right), or an interior loop) and all kinds of multiloops (i.e. any kloop with k ≥ 3) with different probabilities, which could increase the accuracy of the SCFG model. By additionally replacing the first of these two new rules, w_{5.1} : B → C ( B ) C, by the four productions
where (w_{5.1.1} + ... + w_{5.1.4}) + w_{5.2} = w_{5.1} + w_{5.2} = w_{5}, we can distinguish between the different types of 2loops more accurately, yielding a more realistic secondary structure model. In fact, in the case of significant differences of the new probabilities (w_{5.1.1}, ..., w_{5.1.4} and w_{5.2}), we can expect a huge improvement in the model's accuracy. Note that it is not hard to see that changes to a grammar like the ones just discussed do not change the language generated. However, this is not at all obvious with respect to ambiguity of the grammar.
According to the previously mentioned facts (and the corresponding illustrations by Example 0.3), we decided that the basis for our weighted unranking algorithm should be the following efree, loopfree and unambiguous (note that these are exactly the preliminary required conditions for the basis SCFG according to [20]) SCFG, which has been derived from the sophisticated SCFG presented in [21] that distinguishes between all known structural motifs that can be found in RNA secondary structure:
Definition 0.10. The unambiguous ϵfree SCFG ${\widehat{\mathcal{G}}}_{\mathsf{\text{sto}}}$ generating exactly the language $\mathcal{L}$ is given by ${\u011c}_{\mathsf{\text{sto}}}=\left({I}_{{\u011c}_{\mathsf{\text{sto}}}},{\sum}_{{\u011c}_{\mathsf{\text{sto}}}},{R}_{{\u011c}_{\mathsf{\text{sto}}}},{S}^{\prime}\right)$, where
${\sum}_{{\widehat{\mathcal{G}}}_{\text{sto}}}=\{(,),\}$ and ${R}_{{\u011c}_{\mathsf{\text{sto}}}}$ contains exactly the following rules:
Figures 2 and 3 illustrate by examples how (parts of) secondary structures are generated by this SCFG, where we used to denote the full parse tree for I ⇒* x (i.e. for consecutive applications of an arbitrary number of production rules that generate the subword x from the intermediate symbol I) in oder to obtain a more compact tree representation. In fact, it is easy to see that the overall structure is always produced by starting with the axiom S', while any particular substructure or structural motif that belongs to the combinatorial (sub)class $\mathcal{I}$ is created from the corresponding intermediate symbol I.
For our application it is crucial that ${\u011c}_{\mathsf{\text{sto}}}$  as claimed its definition  is unambiguous. To prove this, we first note that ${\u011c}_{\mathsf{\text{sto}}}$ has been constructed starting from a simple grammar which generates $\mathcal{L}$ by iteratively replacing one production by several ones (like we did in the previous example) in order to distinguish more and more structural motifs but without changing the language generated. Furthermore, a standard construction to make the grammar ϵfree has been applied. That way, we can be sure that ${\u011c}_{\mathsf{\text{sto}}}$ generates $\mathcal{L}$ (formally this fact easily follows by obvious bisimulation proofs for each substitution and by the proven correctness of the used construction to ensure ϵfreeness). To prove unambiguity, we translate ${\u011c}_{\mathsf{\text{sto}}}$ into a system of equations for its structure generating function (see [47] for details) $S\left[z\right]={\sum}_{w\in \mathcal{L}}d\left(w\right){z}^{\leftw\right}$ where d(w) denotes the number of derivation trees ${\u011c}_{\mathsf{\text{sto}}}$ offers for w. Eliminating all but the variable associated with the axiom and simplifying (for this step we made use of Mathematica) yields the single equation
This equation is exactly the one of our grammar ${\mathcal{G}}_{s}$ from Example 0.3 which proves that for all n both grammars have the same number of derivation trees for words of size n. Knowing that both grammars generate $\mathcal{L}$ and that ${\mathcal{G}}_{s}$ is unambiguous, the same can be concluded for ${\u011c}_{\mathsf{\text{sto}}}$.
Note that ${\u011c}_{\mathsf{\text{sto}}}$ contains more production rules (and more different nonterminal symbols) than the SCFG considered in [21], but this new grammar is ϵfree and additionally, the righthand side of every single production contains at most two nonterminal symbols, such that the resulting unranking algorithm has to consider less cases (i.e. less "else if ( )" cases). For details, see [20] and the Appendix.
Furthermore, it should be mentioned that we decided to assign relative frequencies to the production rules of ${\u011c}_{\mathsf{\text{sto}}}$, since such probabilities can be computed efficiently for unambiguous SCFGs. Moreover, by estimating the probabilities ${\widehat{p}}_{i},1\le i\le 54$, by their relative frequencies, the resulting grammar ${\u011c}_{\mathsf{\text{sto}}}$ has the consistency property, which means ${\u011c}_{\mathsf{\text{sto}}}$ provides a probability distribution on the language $\mathcal{L}\left({\u011c}_{\mathsf{\text{sto}}}\right)=\mathcal{L}$. In particular, it is wellknown that relative frequencies in our context yield a maximum likelihood (ML) estimator for the rule probabilities and thus a consistent estimator for the parameter set. We have trained the probabilities (relative frequencies) of ${\u011c}_{\mathsf{\text{sto}}}$ from the structures $s\in \mathcal{L}\left({\u011c}_{\mathsf{\text{sto}}}\right)$ given in our biological database. The resulting probabilities are given in Table 1, their floating point approximations, rounded to the third decimal place in Table 2.
In oder to see if overfitting is an issue for our sophisticated grammar and its rich parameter set, i.e. to see if our training set is large enough to derive reliable values for the rule probabilities, we performed the following experiments: We selected a random 90% (resp. 50%) portion of the original training set and reestimated the probabilities of all the grammar rules. This process was iterated 40 times, resulting in a sample of 40 parameter sets. Finally, for each parameter we determined its variance along this sample of size 40. The corresponding values lay between 0 (resulting for intermediate symbols without alternatives; for whose productions a probability of 1 is predetermined) and 2.87652 × 10^{6} (resp. 2.86242 × 10^{5}). We can conclude that overfitting is no issue in connection with our sophisticated grammar and the training set used.
Derivation of the Algorithm
The elaborate SCFG ${\u011c}_{\mathsf{\text{sto}}}$ is appropriate for being used as the basis for the desired weighed unranking method: after having determined the RNF of this SCFG and the corresponding weighted combinatorial classes, we easily find a recursion for the size function (in the same ways as discussed in Example App.4). Then, we can use the resulting weighted class sizes for the straightforward construction of the desired unranking algorithm.
In fact, for the construction of the complete algorithm, we simply have to use Algorithms 1 to 4 (Unranking of neutral classes, atomic classes, disjoint unions and cartesian products, respectively) and Algorithm 6 (Unranking of weighted classes) given in [20] as subroutines. However, to improve the worstcase complexity of the resulting unranking procedure from $\mathcal{O}\left({n}^{2}\right)$ to $\mathcal{O}\left(n\cdot log\left(n\right)\right)$ by using the boustrophedonic order instead of the sequential order, a simple change in Algorithm 4 (Unranking of cartesian products) is neccessary (see e.g. [7]).
A random RNA secondary structure of size n can easily be computed by drawing a random number $i\in \left\{0,\dots ,\mathsf{\text{size}}\left(\mathcal{L},n\right)1\right\}$ and then unranking the i th structure of size n. The worstcase runtime complexity of this procedure is equal to that of unranking and is thus given by $\mathcal{O}\left(n\cdot log\left(n\right)\right)$ when using the boustrophedonic order. By repeating this procedure m times, a set of m (not necessarily distinct) random RNA secondary structures of size n can be generated in time $\mathcal{O}\left(m\cdot n\cdot log\left(n\right)\right)$, where a preprocessing time of $\mathcal{O}\left({n}^{2}\right)$ is required for the computation of all (weighted) class sizes up to input length n.
A complete and detailed description of the derivation of our weighted unranking algorithm for (SSU and LSU r)RNA secondary structures can be found in the Appendix, since it is too comprehensive to be presented here and the different steps for its generation correspond to those described in [20].
Availability of Software
It may be of interest to the reader that this nonuniform random generation algorithm for RNA secondary structures has been implemented as a webservice which is accessible to the scientific community under http://wwwagak.cs.unikl.de/NonUniRandGen. Since it is relevant for researchers to have methods available for generating random structures that are realistic for a particular investigation, this webservice is also capable of allowing the user to specify the distribution from which the corresponding structures should be sampled (in the form of a set of secondary structure samples from which the parameters for our grammars are inferred). Furthermore, our Mathematica source code used to implement the webservice can be downloaded from our website and used under GNU public licence.
Discussion
The purpose of this section is to analyze the quality of randomly generated structures by considering some experimental results.
Parameters for Structural Motifs
As a first step, we decided to consider several important parameters related to particular structural motifs of RNA secondary structure and compare the observed statistical values derived from a native sample (here our biological database, i.e. the set of reallife RNA data that we used for deriving the distribution and thus the weights for the unranking algorithm) to those derived from a corresponding random sample (i.e. a set of random structures generated by our algorithm). In order to obtain an appropriate random sample, we have generated exactly one random structure of size n for each native RNA structure of size n given in our database, such that for each occurring size n, the random sample and the native sample contain the same number of structures having this size.
The determined results are presented in Table 3. Comparing the specific values of all different parameters, we can guess that our algorithm produces random RNA secondary structures that are, related to the different structural motifs and thus related to the expected shape of such structures, in most cases realistic. Obviously, this is a major improvement over existing approaches for the random generation of secondary structures of a given input size n (where the corresponding specific RNA sequence is not known, but only its length n), as those (sequenceindependent) methods are only capable of generating structures uniformly at random for input size n. Furthermore, with the SCFG model used here, we have an new model for RNA secondary structures at hand which realistically reflects the structure of an RNA molecule and its basic structural motifs.
Related Free Energies
For further investigation on the accuracy of our random generator, we take on a completely different point of view and consider thermodynamics. The reason behind this idea is that if an RNA secondary structure model induced by a SCFG shows a realistic behaviour (expectation and variance) with respect to minimum free energy, then it is rather likely that our grammar also shows a realistic picture for all the different structural motifs of a molecule's folding (as the free energy of a molecule's structure is defined as the sum of the energy contributions of all its substructures).
Since we do not know the corresponding RNA sequences for the randomly generated structures, we can not use one of the common sequencedependent thermodynamic models for RNAs. Therefore, we decided to consider both the static and dynamic free energy models (in the static model, averaged free energy contributions for the distinguished structural motifs are considered which can easily be derived from the training data (by sequence counting). These averaged values actually represent the free energy contributions that have to be added for the respective whole substructures. For the dynamic model, corresponding average values for lengthdependent free energy contributions (that depend on the number of unpaired or paired bases within particular substructures) are added for each component (unpaired base or base pair) in the respective motifs, such that in contrast to the static model, substructures of different lengths are assigned different free energy values) defined in [21] for RNA secondary structures with unknown sequence. These models are based on the wellknown Turner energy model [22, 23] and model parameters have been derived from the same biological database (of SSU and LSU rRNAs) that we consider in this article. In fact, both models have turned out to show a realistic behaviour and can therefore be used to judge the quality of random structures generated by our algorithm.
Unquantified Results
Similar to [21], we denote the free energy of a given secondary structure $s\in \mathcal{L}$ according to the static and dynamic model by g_{ stat }(s) and g_{ dyn }(s), respectively. Moreover, the expected free energy and corresponding variance that have been analytically derived in that paper for any n > 0 are denoted by ${\mu}_{energy,n}:=E\left[energy\left(s\right)\mid \mathsf{\text{size}}\left(s\right)=n\right]$ and ${\sigma}_{energy,n}^{2}:=V\left[energy\left(s\right)\mid \mathsf{\text{size}}\left(s\right)=n\right]$, respectively, where energy ∈ {g_{ stat }, g_{ dyn }}. The corresponding confidence interval for n > 0 and k > 1, which contains at least $\left(100\frac{100}{{k}^{2}}\right)$ percent of the energies in $\left\{energy\left(s\right)\mid s\in {\mathcal{L}}^{n}\right\}$ is denoted by I_{ energy,n }(k): = (μ_{ energy,n } k σ_{ energy },_{n}, μ_{ energy,n }+ k σ_{ energy,n }). As these analytical energy results from [21] and our unranking algorithm have been derived from the same database of reallife RNA data and by modeling the same class $\mathcal{L}$ of structures via very similar SCFGs, it seems adequate to use them for comparisons with the energies of our randomly generated structures.
Before we start with our comparisons, note that for any sample set $\mathcal{S}$ of secondary structures, we can calculate the corresponding energy points $EP\left(\mathcal{S},energy\right):=\left\{\left(\mathsf{\text{size}}\left(s\right),energy\left(s\right)\right)\mid s\in \mathcal{S}\right\}$, where energy ∈ {g_{ stat }, g_{ dyn }}. Obviously, we can also compute the corresponding "average energy points" $AvEP\left(\mathcal{S},energy\right):=\left\{\left(n,{\mu}_{n}:=\frac{1}{\mathsf{\text{card}}\left({\mathcal{S}}^{n}\right)}{\sum}_{s\in {\mathcal{S}}^{n}}energy\left(s\right)\right)\mid {\mathcal{S}}^{n}\ne \varnothing \right\}$ and the corresponding "energy variance points" $VarEP(\mathcal{S},energy):=\left\{(n,{\sigma}_{n}^{2}:=\frac{1}{\text{card}({\mathcal{S}}^{n})}{\displaystyle {\sum}_{s\in {\mathcal{S}}^{n}}{\left({\mu}_{n}energy(s)\right)}^{2}}\mid {\mathcal{S}}^{n}\ne \varnothing \right\}$, respectively. In the sequel, we will denote a random sample generated by our algorithm by ℛ and a native sample (biological database) by $\mathcal{N}$.
In order to obtain an appropriate random sample for our energy comparisons, we derived a large set of random structures by generating 1000 RNA secondary structures for each of the sizes n ∈ {500,1000,1500,..., 5000, 5500} with our weighted unranking algorithm. To compare the energies of our randomly generated structures to the corresponding confidence interval(s), we decided to consider any $k\in \left\{\sqrt{2},2,\sqrt{10},\sqrt{20}\right\}$, meaning the probability that the free energy of a random RNA secondary structure of size n lies within the corresponding interval is greater than 0.5, 0.75, 0.9, and 0.95, respectively.
Figure 4 shows a plot of the corresponding four confidence intervals (analytically derived, related to our biological data) along with the energy points for our random sample and for our native database, respectively, under the assumption of the static energy model. The corresponding plots for the dynamic energy model are shown in Figure 5. Looking at both figures, we immediately see that the energies for our set of randomly generated RNA secondary structures seem to fit to the ones for the considered RNA database and also to the corresponding analytically obtained energy results from [21]. This observation becomes even more clear by considering Figures 6 and 7. There, we compare the previously introduced "average energy points" and "energy variance points" to the analytically determined expected free energy and corresponding variance from [21], respectively.
Quantified Results
The previously considered energy comparisons have been presented only by unquantified plots. This may not be very satisfying, since it is obvious that the free energy would decrease with structure size and aside from this, it could have been expected that for large randomly generated sets of structures of a given size, the average energy and corresponding variance fit the analytically obtained energy results derived under the assumption of a basically equivalent SCFG model for secondary structures. Therefore, there is a need to consider some sort of quantification and additionally present corresponding quantified comparison results. What really matters is the degree to which the energy ranges of the random structures agree, in distribution, with our biological database. This means we have to find out if the energies related to a random sample (generated by our unranking method) and those related to a native sample (given by the structures in our biological database) come from a common distribution. Consequently, we have to consider the energies of a random sample and those of a native one as two independent sets of values and determine the extend to which their distributions coincide, or in other words to test for significant differences between these two sets. For this reason, we decided to apply one of the most common (nonparametric) significance tests known from statistics, the socalled MannWhitney Utest [48], which is widely used as statistical hypothesis test for assessing whether two independent samples of observations (with arbitrary sample sizes) come from the same distribution. It is also known as the Wilcoxon ranksum test [49] which however can only be applied for equal sample sizes.
Formally, this test is used to check whether the null hypothesis N_{0}  which states that the two independent samples X and Y are identically distributed (i.e. F(X) = F(Y))  can be accepted or else, has to be rejected. More specifically, the result of such a test, the socalled pvalue, is a probability answering the following question: If the two samples really have the same distribution, what is the probability that the observed difference is due to chance alone? In other words, were the deviations (differences between the two samples) the result of chance, or were they due to other factors and how much deviation can occur before one must conclude that something other than chance causes the differences? The pvalue is called statistically significant if it is unlikely that the differences occurred by chance alone, according to a preliminary chosen threshold probability, the significance level α (common choices are e.g. α ∈ {0.10,0.05,0.01}). If p ≥ α, the deviation is small enough that chance alone accounts for it; this is within the range of acceptable deviation. If p < α, we must conclude that some factor other than chance causes the deviation to be so great, this will lead us to decide that the two sets come from different distributions.
For our analysis, we again decided to generate the same numbers of random structures for any size as are given for this size in our biological database, such that random and native sample contain the same numbers of structures for any occurring size (and hence the sample sizes are equal). Moreover, note that the unquantified results presented in Figures 4 and 5 might yield the assumption that for any structure size, some energy values of randomly generated structures are scattered too widely around the corresponding expected value, such that those randomly drawn secondary structures can not be considered realistic (neither with respect to thermodynamics nor with respect to structural composition and expected shape). In an attempt to disprove that assumption, we decided to perform a series of Wilcoxon tests by considering a number of different random samples. These samples are created by obeying a specified energybased rejection scheme: Do not add a randomly generated structure of a given size to the sample if its free energy (according to the static or dynamic model or according to both models) lies outside the corresponding confidence interval(s). Formally, for any preliminary chosen value k > 1, a generated structure $s\in {\mathcal{L}}^{n}$ is added to the random sample iff
otherwise it is rejected. This means we accept only a specified deviation of the energy energy (s) of the random structure s from the corresponding expected free energy μ_{ energy,n }and reject structures whose energy differs too much from the expected value. Note that for k = ∞ (confidence interval I_{ energy,n }(k) contains 100 percent of the energies energy(s) of all $s\in {\mathcal{L}}^{n}$), no structures are rejected. Hence, in this case, the corresponding random sample corresponds to the usual (unrestricted) output of our algorithm.
The Wilcoxon test results for our native sample together with any of a number of random sample sets generated in the previously described restricted manner, respectively, can be found in Table 4. As we can see, the best results are achieved for the unrestricted sample sets, where all free energies of randomly generated structures were allowed during the sample creation process. Moreover, these two results (for the unrestricted case k = ∞) are not statistically significant when considering the common significance level α = 0.05, that is in both cases, we can assume that the energies of the random structures and those of the biological data follow a common distribution. These observations indicate that our weighted unranking algorithm produces random RNA secondary structures that are  related to the free energy of such structures (in expectation and variation)  in expectation realistic.
Besides that, it is obvious that the computed pvalues are much better for the dynamic energy model than for the static one. This underlines the suggestion made in [21] that, although both energy models have been proven to be realistic, due to the more realistic variation of free energies connected to varying loop length, the dynamic model should be used for possible applications. Since at least for the dynamic model, the random data fit very nicely with the native data, we can conclude that structures generated by our nonuniform random generation algorithm behave realistic with respect to free energies and  as the energy of the overall structure is assumed to be equal to the sum of the substructure energies  rather likely also with respect to appearance of the different structural motifs of RNA molecules.
Conclusion
Altogether, we can finally conclude that the nonuniform random generation method proposed in this article produces appropriate output and may thus be used (for research issues as well as for practical applications) to generate random RNA secondary structures. In fact, for any arbitrary type of (pseudoknotfree) RNA, a corresponding random sampler can be derived in the presented way. Actually, our webservice can be used for generating random secondary structures of any specified type of RNA. It just requires a database of known structures for the respective RNA type as input.
Note that in this work, we abstract from sequence and consider only the structure size as input for our algorithm. Thus, an interesting problem for future research would be to find a way to extend the presented realistic SCFG model to additionally deal with RNA sequence. In fact, this work and especially the considered elaborate SCFG could mark some sort of stepping stone towards new stochastic RNA secondary structure prediction methods realized by statistical random sampling.
Appendix
How to Construct a Weighted Unranking Algorithm from a Given SCFG
The purpose of this section is to give a rather small example for applying the construction scheme described in detail in [20] to proceed from an arbitrary SCFG to reweighted normal form (RNF) and then to the corresponding weighted combinatorial classes which allow for nonuniform generation by means of unranking.
Example App.4. Let us consider the SCFG ${\mathcal{G}}_{d}$, which contains the following rules:
To apply the approach presented in [20] to transform a given SCFG to RNF, the grammar needs to be ϵfree and loopfree. Thus, we first have to transform grammar ${\mathcal{G}}_{d}$ into the following one:
The transformation of ${\mathcal{G}}_{d}$ into RNF now works as follows: First, we have to gather all possible chains A → A_{1} → A_{2} → ... → α, where A ≠ S and α = 1. These chains are B → C, B → C →  and C → ; the rules B → C and C →  are then removed. Second, we have to replace each of these chains by a specific new rule. In fact, we have to add B^{C,ϵ}→ C, B^{,C}→  and C^{,ϵ}→  to the new set of productions. Consequently, our new rule set is now given by
Third, for each occurrence of a nonterminal symbol A in the conclusion of a production and each previously added new rule ${A}^{\alpha ,{A}_{1}{A}_{2}\dots}\to \alpha $ corresponding to a chain A → A_{1} → A_{2} → ... → α, add a specific new rule. This way, we obtain the following production set:
Fourth, each intermediate symbol that no longer occurs as premise in any of the productions has to be removed and fifth, each production of the form S → α, where S is the axiom and α > 1 has to be changed in a specific way. However, since in our case, there is obviously nothing left to do, the transformation of ${\mathcal{G}}_{d}$ into RNF is finished.
For ${\mathcal{G}}_{d}$ (in RNF), where all production weights are rational, we can determine the common denominator s of the weights of productions with premise S, as well as the common denominator c of the weights of the remaining productions (i.e., of the productions with premise B or C). Then, the reweighting of the production rules of (the RNF of) ${\mathcal{G}}_{d}$ is done by multiplying the weights of productions with source S by s, and the weights of the other productions A → α, where A ≠ S, by the factor c^{α1}. After that, we obtain the following reweighted grammar ${\mathcal{G}}_{d}^{\prime}$:
where each ${W}_{i}^{\prime},\phantom{\rule{0.3em}{0ex}}1\le i\le 8$, is integral.
The (now weighted) grammar can easily be translated into a corresponding admissible specification, which includes the weighting of all involved combinatorial (sub)classes, as described earlier. For the reweighted grammar ${\mathcal{G}}_{d}^{\prime}$, this specification is given by the following equations:
which can be simplified in the following way:
As described earlier, this specification (with weighted classes) derived from reweighted grammar ${\mathcal{G}}_{d}^{\prime}$ transforms immediately into a recursion for the function size of all needed combinatorial classes. For ${\mathcal{G}}_{d}^{\prime}$, the recursion for the function size has the following form:
This recursive size function (with weighted class sizes) can now be used for the straightforward construction of a corresponding algorithm for the nonuniform generation of elements of $\mathcal{L}\left({\mathcal{G}}_{d}\right)$ by means of unranking, as proposed in [20].
Derivation of the Algorithm
In this section, we give a complete and detailled description of the derivation of our weighted unranking algorithm for RNA secondary structures. The different steps are made according to the approach described in [20] to get an unranking algorithm that generates random RNA secondary structures of a given size n according to the distribution on all these structures.
Considered (unambiguous, ϵfree and loopfree) SCFG
First, note that in [21], to obtain the stochastic model for RNA secondary structures derived from realworld RNA data, the following unambiguous SCFG which unambiguously generates exactly the language $\mathcal{L}$ given in Definition 0.9 has been used:
Definition App.11. The unambiguous SCFG ${\mathcal{G}}_{\mathsf{\text{sto}}}$ generating exactly the language $\mathcal{L}$ is given by ${\mathcal{G}}_{\mathsf{\text{sto}}}=\left({I}_{{\mathcal{G}}_{\mathsf{\text{sto}}}},\phantom{\rule{0.3em}{0ex}}{\sum}_{{\mathcal{G}}_{\mathsf{\text{sto}}}},\phantom{\rule{0.3em}{0ex}}{R}_{{\mathcal{G}}_{\mathsf{\text{sto}}}},\phantom{\rule{0.3em}{0ex}}S\right)$, where
${\Sigma}_{{\mathcal{G}}_{\mathsf{\text{sto}}}}=\left\{\left(,\right),\phantom{\rule{0.3em}{0ex}}\right\}$ and ${\mathcal{R}}_{{\mathcal{G}}_{\mathsf{\text{sto}}}}$ contains exactly the following rules:
In this grammar, different intermediate symbols have been used to distinguish between different substructures. In fact, the reason why this grammar has so many production rules is that the grammar must be able to distinguish between all the different classes of substructures for which there are different free energy rules according to Turner's thermodynamic model considered in [21].
However, as ϵfreeness and loopfreeness are required preliminarily, we have to consider another unambiguous SCFG generating the same language $\mathcal{L}$, where we have to guarantee that the same substructures are distinguished as are distinguished in ${\mathcal{G}}_{\mathsf{\text{sto}}}$.
Using the usual way of transforming a nonϵfree grammar into an ϵfree one, the following definition can immediately be obtained from the previous one:
Definition App.12. The unambigous and ϵfree SCFG ${\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}$ generating exactly the language $\mathcal{L}$ is given by ${\mathcal{G}}_{sto}^{\prime}=({I}_{{\mathcal{G}}_{sto}^{\prime}},{\Sigma}_{{\mathcal{G}}_{sto}^{\prime}},{R}_{{\mathcal{G}}_{sto}^{\prime}},{S}^{\prime})$, where
${\Sigma}_{{\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}}=\left\{\left(,\right),\phantom{\rule{0.3em}{0ex}}\right\}$ and ${\mathcal{R}}_{{\mathcal{G}}_{sto}^{\prime}}$ contains exactly the following rules:
Unfortunately, the set of productions of ${\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}$ contains productions with up to 5 nonterminal symbols in the conclusion. This is not acceptable for our purpose, for the following reason: the desired unranking algorithm makes use of the size of combinatorial classes whose representations somehow are derived from CFGs with particular integer weights on their productions. If we constructed this WCFG by starting with the grammar ${\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}$, then this would yield a huge number of production rules. Consequently, the translation would imply a huge specification of the combinatorial classes and the corresponding function to compute their sizes and thus the corresponding unranking algorithm would have to distinguish between an unnecessarily and most importantly unacceptably large number of cases.
Nevertheless, the size of the production set of the weighted grammar underlying the desired unranking algorithm can be significantly reduced by starting with a modification of grammar ${\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}$ which has only production rules with minimum possible numbers of nonterminal symbols in the conclusion. In fact, by transforming ${\mathcal{G}}_{\mathsf{\text{sto}}}^{\prime}$ appropriately considering this observation, we obtained the SCFG ${\u011c}_{\mathsf{\text{sto}}}$:
Definition App.13. The unambiguous ϵfree SCFG ${\widehat{\mathcal{G}}}_{sto}$ generating exactly the language $\mathcal{L}$ is given by ${\hat{\mathcal{G}}}_{sto}=({I}_{{\hat{\mathcal{G}}}_{sto}},{\Sigma}_{{\hat{\mathcal{G}}}_{sto}},{R}_{{\hat{\mathcal{G}}}_{sto}},{S}^{\prime})$, where
${\sum}_{{\widehat{\mathcal{G}}}_{\mathsf{\text{sto}}}}=\left\{\left(,\right),\right)$ and ${R}_{{\widehat{\mathcal{G}}}_{\mathsf{\text{sto}}}}$ contains exactly the following rules:
Transforming our SCFG into RNF
Now, we can construct the desired weighted grammar that will be underlying our unranking algorithm: In the first step, we gather all possible chains of productions that do not lengthen the sentential form. In fact, we have to consider all rules A → α, A ≠ S', with α = 1, to obtain all such chains (note that these rules will be removed after step 1). Hence, we have to consider the following set ${R}_{rnf}^{1}$ of 22 production rules:
Thus, the following 32 chains are gathered in step 1: