Random generation of RNA secondary structures according to native distributions

  • Markus E Nebel1,

    Affiliated with

    • Anika Scheid1Email author and

      Affiliated with

      • Frank Weinberg1

        Affiliated with

        Algorithms for Molecular Biology20116:24

        DOI: 10.1186/1748-7188-6-24

        Received: 20 April 2011

        Accepted: 12 October 2011

        Published: 12 October 2011



        Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest.


        In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq2_HTML.gif while other algorithms typically have a runtime in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq3_HTML.gif . Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.


        A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://​wwwagak.​cs.​uni-kl.​de/​NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.


        Random generation stochastic context-free grammars RNA secondary structures

        Background and Introduction

        The topic of random generation algorithms (also called samplers) has been widely studied by computer scientists. As stated in [1], it has been examined under different perspectives, including combinatorics, algorithmics (design and/or engineering), as well as probability theory, where two of the main motivations for random sampling are the testing of combinatorial properties of structures (e.g. conjectured structural properties, quantitative aspects), as well as the testing of properties of the corresponding algorithms (with respect to correctness and/or efficiency).

        As considers software engineering, the so-called random testing approach is commonly used to test implementations of particular algorithms, as it is usually not feasible to consider all possible inputs and unknown which of these inputs are among the most interesting ones. In fact, this approach requires for the generation of random instances of program inputs that obey various sorts of syntactic and semantic constraints (where the random instances usually ought to be of a preliminarily fixed input size in order to be comparable to each other).

        In the Bioinformatics area, algorithms for generating random biological sequences have been investigated for a long time (see e.g. [2, 3]). As stated in [4], random sequences are a topic of great interest in genome analysis, since according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Thus, random generation of combinatorial objects can be used in this context for simulations studies in order to isolate signal (unexpected events) from noise (statistically unavoidable regularities). In fact, according to [4], random biological sequences are for instance widely used for the detection of over-represented and under-represented motifs, as well as for determining whether scores of pairwise alignments are relevant or not: although there exist analytic approaches for these kinds of problems, for the most complex cases, it is often still necessary to be able to alternatively use a corresponding experimental approach (based on randomly generated sequences obtained from a computer programm). For this purpose, random sequences must obviously obey to a certain model that takes into account some relevant properties of actual real-life sequences, where such models are usually based on statistical parameters only. However, it is known that these classical models can be enriched by adding structural parameters (see [4]). Over the past years, several methods have been proposed for the random generation of more complex structures, where special attention has been paid to RNA secondary structures. RNA is a single-stranded nucleotide polymer and a major component of cellular processes (like DNA and proteins). An RNA strand is formed by linking together certain nucleotide units. The specific sequence of nucleotides along this chain is called the primary structure of the molecule. By pairing of nucleotides that are not linked in this chain (i.e. by the so-called effects of base pairing), the linear primary structure is folded into a three-dimensional conformation, called the tertiary structure, which in many cases determines the function of the molecule. Most of the 3D structure is determined by the intramolecular base-pairing interactions in the plane, which together form the secondary structure of the molecule. For this reason, pseudoknots (induced by crossing base pairs) are considered as tertiary interactions and are usually not permitted in the definition of secondary structure. As unknotted structures contain only nested base pairs and are thus essentially two-dimensional, they can be modeled as planar graphs. This rather descriptive and commonly used planar graph model for RNA secondary structures was first formalized in [5]. An example is shown in Figure 1.
        Figure 1

        An RNA secondary structure. Unpaired and paired bases are represented by white and gray points, respectively.

        Most of the existing random generation algorithms for RNA secondary structures are used for predicting the structure of a given RNA sequence (see e.g. [6, 7]), while others can be employed for instance for evaluating structure comparison softwares [8]. Note that secondary structure prediction methods based on random sampling represent a non-deterministic counterpart to the up-to-date most successful and popular physics-based prediction methods that make use of the energy minimization paradigm and are realized by dynamic programming algorithms (see e.g. [912]). Random sampling also differs from the stochastical RNA structure prediction approach that is based on context-free modeling of structural motifs and adding some statistical parameters observed in real-life data by assigning probabilities to the corresponding motifs (see e.g. [1315]). Nevertheless, it should be mentioned that statistical sampling methods like [6, 7] used for RNA structure prediction are based on thermodynamics and thus inevitably inherit the problems and imprecisions related to energy minimizing methods, which are caused by the still incomplete commonly used free energy models for RNAs. In order to overcome these pitfalls, one could take the competing point of view and consider only typical structural information observed in a set of sample data as the basis for a new random generation method. If that information draws a realistic picture for all the different motifs of a molecule's folding, the corresponding sampling method is likely to produce realistic results. Accordingly, several authors made use of stochastic context free grammars and employed machine-learning techniques to train parameter values from a set of known secondary structures. Such grammars have widely been used in a predictive mode (see, e.g., [14]) but there are also successful examples of applications where the random sampling of derivation trees has been the core of the method (see, e.g., [1618] but also [19] for examples). In the present paper, we follow that line of ideas and rely on the approach of the technical report [20] to develop a new algorithm for the (non-uniform) random generation of RNA secondary structures (without pseudoknots) according to a distribution induced by a set of sample RNA data (note that the algorithm actually generates secondary structures for a preliminary fixed size, not for a given RNA sequence of this size, which means we take the combinatorial point of view and completely abstract from sequence).

        The main contribution of this manuscript is the derivation of a new and efficient algorithm for the random generation of RNA secondary structures according to an elaborate and thus very realistic model. For this purpose we use and generalize the approach from [20]. Particularly, our random generation method is based on a sophisticated context-free grammar for unknotted structures which, in order to model the class of all considered RNA secondary structures as realistic as possible, distinguishes between all known structural motifs that may occur in unknotted RNA secondary structure. This means that any structural feature is modeled by one or more specific grammar rules with corresponding probabilities observed from real-life data. Note that this grammar is actually a special variant of the comprehensive grammar used in [21] for deriving a realistic RNA structure model and for performing the first ever analytical analysis of the expected free energy of a random secondary structure (of a specified RNA type). Actually, that grammar has been designed as a mirror of the famous Turner energy model [22, 23] which serves as the foundation for most of the existing physics-based RNA structure prediction methods: all structural motifs for which there are different thermodynamic rules and parameters are created by distinct production rules (with corresponding probabilities).

        According to [20], our sampling method involves a weighted unranking algorithm for obtaining the final structures. Briefly, considering an arbitrary structure class of size (cardinality) c, a corresponding unranking method uses a well-defined ordering of all class elements (according to a particular numbering scheme, the so-called ranking method) and for a given input number r ∈ {1,..., c} outputs the structure with rank r in the considered ordering. That way, the random sampling based on a stochastic grammar - building heavily on the use of small floating point numbers - is translated into an unranking algorithm using integer values only. Notably, a complete structure of size n is generated by recursively unranking the distinct structural components from the corresponding subclasses (of substructures with sizes less than n). In our case, the weighted unranking algorithm requires a precomputation step in worst-case time http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif for computing all weighted class sizes up to input size n. The worst-case complexity for generating a secondary structure of size n at random is then given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq4_HTML.gif since we are ranking structures according to the boustrophedon order (see e.g. [7]).

        By the end of this paper, we analyze the quality of randomly generated structures by considering some experimental results. First, we will consider statistical indicators of many important parameters related to particular structural motifs and compare the ones observed in the used sample set of real world RNA data to those observed in a corresponding set of random structures. Their comparison measures indicate that our method actually generates realistic RNA structures. Obviously, an algorithm which, for a given structure size n, produces random RNA secondary structures that are - related to expected shapes of such structures -in most cases realistic is a major improvement over existing approaches which, for example, are only capable of generating secondary structures uniformly for size n. Furthermore, we will consider the two different free energy models defined in [21] for RNA secondary structure (with unknown RNA sequence) to get further evidence of the good quality of our random generation method (with respect to free energies and thus rather likely also with respect to appearance of the different structural motifs of RNA).

        Prior Results and Basic Definitions

        Uniform Random Generation

        In the past, the problem of uniform random generation of combinatorial structures, that is the problem of randomly generating objects (of a preliminary fixed input size) of a specified class that have the same or similar properties, has been extensively studied. Special attention has been paid on the wide class of decomposable structures which are basically defined as combinatorial structures that can be constructed recursively in an unambiguous way.

        In principle, two general (systematic) approaches have been developed for the uniform generation of these structures: First, the recursive method originated in [24] (to generate various data structures) and later systematized and extended in [25] (to decomposable data structures), where general combinatorial decompositions are used to generate objects at random based on counting possibilities. Second and more recently, the so-called Boltzmann method [1, 26], where random objects (under the corresponding Boltzmann model) have a fluctuating size, but objects with the same size invariably occur with the same probability. Note that according to [26], Boltzmann samplers may be employed for approximate-size (objects with a randomly varying size are drawn) as well as fixed-size (objects of a strictly fixed size are drawn) random generation and are an alternative to standard combinatorial generators based on the recursive method. However, fixed-size generation is considered the standard paradigm for the random generation of combinatorial structures.

        (Admissible) Constructions and Specifications

        According to [25], a decomposable structure is a structure that admits an equivalent combinatorial specification:

        Definition 0.1 ( [25]). Let http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq5_HTML.gif be an r-tuple of classes of combinatorial structures. A specification for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif is a collection or r equations with the ith equation being of the form http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq7_HTML.gif , where ϕ i denotes a term built of the http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq8_HTML.gif using the constructions of disjoint union, cartesian product, sequence, set and cycle, as well as the initial (neutral and atomic) classes.

        The needed formalities that will also be used in the sequel are given as follows:

        Definition 0.2 ( [27]). If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif is a combinatorial class, then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq9_HTML.gif denotes the class of objects in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif that have size (defined as number of atoms) n. Furthermore:

        • Objects of size 0 are called neutral objects or tags and a class consisting of a single neutral object ϵ is called a neutral class, which will be denoted by ε (ε 1 , ε 2,... to distinguish multiple neutral classes containing the objects ϵ 1, ϵ 2, ..., respectively).

        • Objects of size 1 are called atomic objects or atoms and a class consisting of a single atomic object is called an atomic class, which will be denoted by Ƶ(Ƶ a , Ƶ b ,... to distinguish the classes containing the atoms a,b,..., respectively).

        • If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq10_HTML.gif are combinatorial classes and ϵ 1, ..., ϵ k are neutral objects, the combinatorial sum or disjoint union is defined as http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq11_HTML.gif where ∪ denotes set theoretic union.

        • If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq12_HTML.gif are combinatorial classes, the cartesian product is defined as http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq13_HTML.gif , where size(α, β) = size(α) + size(β).

        Note that the constructions of disjoint union, cartesian product, sequence, set and cycle are all admissible:

        Definition 0.3 ( [27]). Let http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq14_HTML.gif be an m-ary construction that associates to a any collection of classes http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq15_HTML.gif a new class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq16_HTML.gif . The construction http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq14_HTML.gif is admissible iff the counting sequence (a n ) of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif only depends on the counting sequences (b 1,n ),..., (b m,n ) of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq15_HTML.gif , where the counting sequence of a combinatorial class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif is the sequence of integers (a n )n≥0 for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq17_HTML.gif .

        The framework of (admissible) specifications obviously resembles that of context-free grammars (CFGs) known from formal language theory (note that we assume the reader has basic knowledge of the notions concerning context-free languages and grammars. An introduction can be found for instance in [28]). In order to translate a CFG into the framework of admissible constructions, it is sufficient to make each terminal symbol an atom and to assume each non-terminal A to represent a class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif (the set of all words which can be derived from non-terminal A). However, for representing CFGs, only the admissible constructions disjoint union, cartesian product and sequence are needed: Words are constructed as cartesian products of atoms, sentential forms as cartesian products of atoms and the classes assigned to the corresponding non-terminal symbols. For instance, a production rule AaB translates into the symbolic equation http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq18_HTML.gif . Different production rules with the same left-hand side give rise to the union of the corresponding cartesian products. Nevertheless, it should be noted that [25] also shows how to reduce specifications to standard form, where the corresponding standard specifications constitute the basis of the recursive method for uniform random generation and extends the usual Chomsky normal form (CNF) for CFGs. Briefly, in standard specifications, all sums and products are binary and the constructions of sequences, sets and cycles are actually replaced with other constructions (for details see [25]).

        The prime advantage of standard specifications is that they translate directly into procedures for computing the sizes of all combinatorial subclasses of the considered class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq19_HTML.gif of combinatorial objects. This means they can be used to count the number of structures of a given size that are generated from a given non-terminal symbol. Moreover, standard specifications immediately translate into procedures for generating one such structure uniformly at random. The corresponding procedures (for class size calculations and structure generations) are actually required for (uniform) random generation of words of a given CFG by means of unranking.

        Simply speaking, the unranking of decomposable structures (like for instance RNA secondary structures which can be uniquely decomposed into distinct structural components) works as follows: Each structure s in the combinatorial class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq20_HTML.gif of all feasible structures having size n is given a number (rank) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq21_HTML.gif , defined by a particular ranking method. Based on this ordering of the considered structure class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq20_HTML.gif , the corresponding unranking algorithm for a given input number http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq21_HTML.gif computes the single structure http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq22_HTML.gif having number i in the ranking scheme defined for class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq20_HTML.gif .

        Note that in this context of unranking particular elements from a considered structure class, the corresponding algorithms make heavy use of their decomposability, as the distinct structural components are unranked from the corresponding subclasses. In fact, the class sizes can be derived according to the following recursion:

        Note that when computing the sums for cartesian products, we can either consider the values for j in the sequential (also called lexicographic) order (1, 2, 3,..., n) or in the so-called boustrophedon order http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq23_HTML.gif . In either case, given a fix number of considered combinatorial (sub)classes (or corresponding non-terminal symbols), the precomputation of all class size tables up to size n requires http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif operations on coefficients. One random generation step then needs http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif arithmetic operations when using the sequential method and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq24_HTML.gif operations when using the boustrophedon method (for details we refer to [25]). Obviously, using uniform unranking procedures to construct the ith structure of size n for a randomly drawn number i, any structure of size n is equiprobably generated. Consequently, in order to make sure that, for given size n and a sample set of random numbers i, the corresponding structures are in accordance with an appropriate probability distribution (as for instance observed from real-life RNA data), it is mandatory to use a corresponding non-uniform unranking method or an alternative non-uniform random generation approach.

        Non-Uniform Random Generation

        Coming back to the random testing problem from software engineering, we observe that generating objects of a given class of input data according to a uniform distribution is sufficient for testing the correctness of particular algorithms. However, if one intends to gather information about the "real-life behaviour" of the algorithm (e.g. with respect to runtime or space requirements), we need to perform simulations with input data that are as closely as possible related to corresponding application. This means to obtain suitable test data, we need to specify a distribution on the considered class that is similar to the one observed in real life and draw objects at random according to this (non-uniform) distribution. Deriving such a "realistic" distribution on a given class of objects can easily be done by modeling the class by an appropriate stochastic context-free grammar (SCFG). Details will follow in the next section.

        As regards RNA, it has been proven that both the combinatorial model (that is based on a uniform distribution such that all structures of a given size are equiprobable and that completely abstracts from the primary structure, see e.g. [2931]) and the Bernoulli-model (which is capable of incorporating information on the possible RNA sequences for a given secondary structure, see e.g. [3234]) for RNA secondary structures are rather unrealistic. However, modeling these structures by an appropriate SCFG yields a more realistic RNA model, where the probability distribution on all structures is determined from a database of real world RNA data (see e.g. [35, 36]).

        Based on this observation, the problem of non-uniform random generation of combinatorial structures has been recently addressed in [20]. There, it is described how to get algorithms for the random generation of objects of a previously fixed size according to an arbitrary (non-uniform) distribution implied by a given SCFG. In principle, the construction scheme introduced in [20] extends on the recursive method for the (uniform) random generation [25] and adapted it to the problem of unranking of [37]: the basic principle is that any (complex) combinatorial class can be decomposed into (or can be constructed from) simpler classes by using admissible constructions.

        Essentially, in [20], a new admissible construction called weighting has been introduced in order to make non-uniform random generation possible. By weighting, we understand the generation of distinguishable copies of objects. Formally:

        Definition 0.4. If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif is a combinatorial class and λ is an integer, the weighting of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif by λ is defined as http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq25_HTML.gif . We will call two objects from a combinatorial class copies of the same object iff they only differ in the tags added by weighting operations.

        For example, if we weight the class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq26_HTML.gif by two, we assume the result to be the set {a, a}; weighting http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq27_HTML.gif by three generates {b,b,b}. Thus, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq28_HTML.gif and within this class, a has relative frequency http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq29_HTML.gif , while b has relative frequency http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq30_HTML.gif . Hence, this way it becomes possible to regard non-uniformly distributed classes.

        As weighting a class can be replaced by a disjoint union, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq31_HTML.gif and the complexity results from [37] also hold for weighted classes. Hence, the corresponding class size computations up to n need http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif time.

        Stochastic Context-Free Grammars

        As already mentioned, stochastic context-free grammars (SCFGs) are a powerful tool for modeling combinatorial classes and the essence of the non-uniform random sampling approach that will be worked out in this article. Therefore, we will now give the needed background information.

        Basic Concepts

        Briefly, SCFGs are an extension of traditional CFGs: usual CFGs are only capable of modeling the class of all generated structures and thus inevitably induce a uniform distribution on the objects, while SCFGs additionally produce a (non-uniform) probability distribution on the considered class of objects. In fact, an SCFG is derived by equipping the productions of a corresponding CFG with probabilities such that the induced distribution on the generated language models as closely as possible the distribution of the sample data.

        The needed formalities are given as follows:

        Definition 0.5 ( [38]). A weighted context-free grammar (WCFG) is a 5-tuple http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq32_HTML.gif , where I (resp. T) is an alphabet (finite set) of intermediate (resp. terminal) symbols (I and T are disjoint), SI is a distinguished intermediate symbol called axiom, RI × (IT)* is a finite set of production rules and W : R → ℝ+ is a mapping such that each rule fR is equipped with a weight w f : = W(f). If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is a WCFG, then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is a stochastic context-free grammar (SCFG) iff the following additional restrictions hold:

        1. For all fR, we have W(f) ∈ (0,1], which means the weights are probabilities.

        2. The probabilities are chosen in such a way that for all AI, we have http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq34_HTML.gif , where Q(f) denotes the premise of the production f, i.e. the first component A of a production rule (A, α) ∈ R. In the sequel, we will write w f : Aα instead of f = (A, α) ∈ R, w f = w(f).

        However, at this point, we decided to not recall the basic concepts regarding SCFGs, as they are not really necessary for the understanding of this article. The interested reader is referred to the corresponding section in [21]. For a more fundamental introduction on stochastic context-free languages, see for example [39]. In fact, the only information needed in the sequel is that if structures are modeled by a consistent SCFG, then the probability distribution on the production rules of the SCFG implies a probability distribution on the words of the generated language and thus on the modeled structures. To ensure that a SCFG gets consistent, one can for example assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample of words from the generated language. For unambiguous SCFGs, the relative frequencies can actually be counted efficiently, as for every word, there is only one leftmost derivation to consider.

        Modeling RNA Secondary Structure via SCFGs

        Besides the popular planar graph representation of unknotted secondary structures, many other ways of formalizing RNA folding have been described in literature. One well-established example is the so called bar-bracket representation, where a secondary structure is modeled as a string over the alphabet Σ: = {(,), |}, with a bar | and a pair of corresponding brackets ( ) representing an unpaired nucleotide and two paired bases in the molecule, respectively (see, e.g. [30]). Obviously, both models abstract from primary structure, as they only consider the number of base pairs and unpaired bases and their positions. Moreover, there exists a one-to-one correspondence between both representations, as illustrated by the following example:

        Example 0.1. The secondary structure shown in Figure 1 has the following equivalent bar-bracket representation that can be decomposed into subwords corresponding to the basic structural motifs that are distinguished in state-of-the-art thermodynamic models:

        Note that the reading order of secondary structures is from left to right, which is due to the chemical structure of the molecule.

        Consequently, secondary structures without pseudoknots can be encoded as words of a context-free language and the class of all feasible structures can thus effectively be modeled via a corresponding CFG. Basically, that CFG can be constructed to describe a number of classical constraints (e.g. the presence of particular motifs in structures) and it can also express long-range interactions (e.g. base pairings). By extending it to a corresponding SCFG, we can also model the fact that specific motifs of RNA secondary structures are more likely to be folded at certain stages than others (and not all possible motifs are equiprobable at any folding stage).

        In fact, it is known for a long time that SCFGs can be used to model RNA secondary structures (see e.g. [40]). Additionally, SCFGs have already been used successfully for the prediction of RNA secondary structure [14, 15]. Moreoever, they can be employed for identifying structural motifs as well as for deriving stochastic RNA models that are - with respect to the expected shapes - more realistic than other models [36]. Furthermore, note that an SCFG mirror of the famous Turner energy model has been used in [21] to perform the first analytical analysis of the free energy of RNA secondary structures; this SCFG marks a cornerstone between stochastic and pyhsics-based approaches towards RNA structure prediction.

        Random Generation With SCFGs

        SCFGs can easily be used for the random generation of combinatorial objects according to the probability distribution induced by a sample set, where the only problem is that they do not allow the user to fix the length of generated structures. In particular, given an SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif and the corresponding language (combinatorial class) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq35_HTML.gif , a random word http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq36_HTML.gif can be generated in the following way:

        • Start with the sentential form S (where S denotes the axiom of the grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif ).

        • While there are non-terminal symbols (in the currently considered sentential form), do the following:
          1. 1.

            Let A denote the leftmost non-terminal symbol.

          2. 2.

            Draw a random number r from the interval (0,1].

          3. 3.

            Substitute symbol A by the right-hand side α of the production Aα determined by the random number r.


        This means consider all m ≥ 1 rules p 1 : Aα 1,..., p m : Aα m having left-hand side A, where according to the definition of SCFGs, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq37_HTML.gif must hold. Then, find k ≥ 1 with http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq38_HTML.gif , i.e. determine k ≥ 1 with http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq39_HTML.gif . The production corresponding to the randomly drawn number r ∈ (0,1] is then given by Aα k and hence, in the currently considered sentential form, the non-terminal symbol A is substituted by α k .

        • If there are no more non-terminal symbols, then the currently considered sentential form is equal to a word http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq40_HTML.gif has been randomly generated.

        Note that the choice of the production made in 3) according to the previously drawn random number is appropriate, since it is conform to the probability distribution on the grammar rules.

        Example 0.2. Consider the language generated by the SCFG with productions ¾: Sϵ and ¼: S → (S). Thus, we start with the sentential form S, then consider the leftmost non-terminal symbol, which is given by S, and draw a random number r ∈ (0,1]. If 0 <r ≤ ¾, the production determined by r is Sϵ and thus, we get the empty word and are finished. Otherwise, ¾ <r ≤ ¾ + ¼ = 1, which means we have to consider A → (S) for the substitution in step 3) and thus obtain the sentential form (S). Afterwards, we must repeat the process, as there is still one non-terminal symbol left.

        Unfortunately, there is one major problem that comes with this approach for the (non-uniform) random generation of combinatorial objects: The underlying (consistent) SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif implies a probability distribution on the whole language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq35_HTML.gif , such that we generate a word of arbitrary size. In order to fix the size, we can proceed along the following lines:

        1. 1)

          We translate the grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif into a new framework which allows to consider fixed sizes for the random generation, such that

        2. 2)

          the distribution implied on http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq35_HTML.gif conditioned on any fixed size n is kept within the new framework.


        A well-known approach which allows for 1) is connected to the concept of admissible constructions used to describe a decomposable combinatorial class (see above). As the operations (like cartesian products, unions, and so on) used to construct the combinatorial objects are also used to define an order on them, it becomes possible to identify the ith object of a given size and the problem of generating objects uniformly at random reduces to the problem of unranking, that is the problem of constructing the object of order (rank) i, for i a random number (see e.g. [41]).

        Remark. Some might think that with an appropriate SCFG (modeling a given class of objects) at hand, it is not really necessary to use an unranking method that implies cumbersome formalities such as admissible constructions and decomposable classes if we want to generate random objects of a fixed size n. As a matter of principle, they are right - we could also use a conditional sampling method: If we need to generate a word of size n from non-terminal symbol A, where there are m ≥ 1 rules f i = Aα i , 1 ≤ im, having left-hand side A, then we just need to choose the next production f i according to

        which is the posterior probability that we used production rule f i under the condition that a word of size n is generated.

        Similarly, if the production rule is of the type ABC (assuming the grammar is in Chomsky normal form (CNF), which does not pose a problem, as an unambiguous SCFG can be efficiently transformed into CNF [39]), we can choose a way to split size n into sizes j and n - j for the lengths generated from non-terminal symbols B and C. This requires precomputing n length-dependent probabilities (i.e. all probabilities for generating a word of any length up to n) for each non-terminal symbol, which might seem to be similar (with respect to complexity) to precomputing all class sizes up to n for all considered combinatorial (sub)classes as needs to be done for unranking. However, there is a striking difference between the two approaches: While conditional sampling makes heavy use of rather small floating point values - with all the well-known problems and discomforting details like underflows or using logarithms associated with it - our unranking approach builds on integer values only which we assume a major advantage. There is another striking difference: length-dependent probabilities (which by the way yield a so-called length-dependent SCFG (LSCFG), see [42], and already have been used in [43]), require a very rich training set. In fact, if the RNA data set used for determining the distribution induced by the grammar is not rich enough, then the corresponding stochastic RNA model is underestimated and its quality decreases. This is especially a problem when considering comprehensive CFGs that distinguish between many different structural motifs in order to get a realistic picture of the molecules' behaviour; such a grammar should however be preferred over simple lightweight grammars as basis for a non-uniform random generation method. Nevertheless, this problem does not surface when sticking to conventional probabilities and the corresponding traditional SCFG model. Actually, since we consider a huge CFG where all possible structural motifs are created by distinct productions, we generally obtain realistic probability distributions and RNA models (see [21]).

        Finally note that of course we could make use of random sampling strategies originally designed to sample structures connected to a given sequence in order to generate a random secondary structure only. However, such algorithms typically use a linear time to sample a single base pair (see, e.g., [6]) such that the time to sample a complete structure is quadratic in its length. This causes no problems for the original application of such algorithms since the sequence-dependent preprocessing which is part of their overall procedure is at least quadratic in time and thus the dominating part. Here our approach is of advantage (replacing a factor n by log(n)) and since our preprocessing only depends on the size of the structure to be generated it is performed once and stored to disk for later reuse. Last but not least we are not sure, if the different existing approaches just mentioned could easily be made as fast as ours by simple changes only.

        Bottom line is that hooking up to unranking of combinatorial classes offers three significant benefit compared to conditional sampling, namely a fast sampling strategy, the usage of integers instead of floating point values and a greater independence of the richness of the training data (compared to length-dependent models). For this reason, we assume our unranking algorithm a valuable contribution, even though it requires a more cumbersome framework.

        Unranking of Combinatorial Objects

        The problem of unranking can easily be solved along the composition of the objects at hand, i.e. the operations used for its construction, once we know the number of possible choices for each substructure. Assume for example we want to unrank objects from a class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq41_HTML.gif . We will assume all elements of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif to be of smaller order than those of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq12_HTML.gif (this way we use the construction of the class to imply an ordering). Finding the ith element of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq19_HTML.gif , i.e. unranking class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq19_HTML.gif , now becomes possible by deciding whether http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq42_HTML.gif . In this case, we recursively call the unranking procedure for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq6_HTML.gif . Otherwise (i.e. if http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq43_HTML.gif ), we consider http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq12_HTML.gif , searching for its ( http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq44_HTML.gif ))th element.

        Formally, we first need to specify an order on all objects of the considered combinatorial class that have the same size. This can be done in a recursive way according to the admissible specification of the class:

        Definition 0.6 ( [37]). Neutral and atomic classes contain only one element, such that there is only one possible ordering. Furthermore, let http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq45_HTML.gif denote the ordering within the combinatorial class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq46_HTML.gif , then

        • If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq47_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq48_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq49_HTML.gif iff
        • If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq50_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq51_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq52_HTML.gif iff

        when considering the lexicographic order (1, 2, 3,..., n), which is induced by the specification http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq53_HTML.gif .

        • If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq50_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq51_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq52_HTML.gif iff

        when considering the boustrophedon order http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq23_HTML.gif , induced by the specification http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq54_HTML.gif

        Considering http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq45_HTML.gif , the actual unranking algorithms are quite straightforward. Therefore, they will not be presented here and we refer to [20, 44] for details.

        Recall that in [20], the basic approach towards non-uniform random generation is weighting of combinatorial classes, as this makes it possible that the classes are non-uniformly distributed. If those combinatorial classes are to correspond to a considered SCFG, we have to face the problem that the maximum likelihood (ML) training introduces rational weights for the production rules while weighting as an admissible construction needs integer arguments.

        When translating rational probabilities into integral weights, we have to assure that the relative weight of each (unambiguously) generated word remains unchanged. This can be reached by scaling all productions by the same factor (common denominator of all probabilities), while ensuring that derivations are of equal length for words of the same size (ensured by using grammars in CNF). However, a much more elegant way is to scale each production according to its contribution to the length of the word generated, that is, productions lengthening the word by k will be scaled by c k . Since we consider CFGs, the lengthening of a production of the form Aα is given by |α| - 1. However, this rule leads to productions with a conclusion of length 1 not being reweighted, hence we have to assure that all those productions already have integral weights. Furthermore, ϵ-productions need a special treatment. We don't want to discuss full details here and conclude by noticing that the reweighting normal form (RNF) keeps track of all possible issues:

        Definition 0.7 ( [20]). If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq55_HTML.gif is a WCFG, http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is said to be in reweighting normal form (RNF) iff

        1. 1.

          http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is loop-free and ϵ-free.

        2. 2.

          For all AαR with A = S, we have |α| ≤ 1.

        3. 3.

          For all AαR with AS, we have |α| > 1 or W(Aα) ∈ ℕ.

        4. 4.

          For all AI there exists α ∈ (IT)* such that AαR.


        Note that the last condition (that any intermediate symbol occurs as premise of at least one production) is not required for reweighting, but necessary for the translation of a grammar into an admissible specification.

        Definition 0.8 ( [20]). A WCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is called loop-free iff there exists no nonempty derivation A+ A for AI. It is called ϵ-free iff there exists no (A, ϵ) ∈ R with A = S and there exists no (A, α 1 2) ∈ R, where ϵ denotes the empty word.

        If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif are WCFGs, then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif are said to be word-equivalent iff http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq57_HTML.gif and for each word http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq36_HTML.gif , we have W(w) = W'(w).

        In [20], it is shown how to transform an arbitrary WCFG to a word-equivalent, loop-free and ϵ-free grammar, that grammar to one in RNF and the latter to the corresponding admissible specification. Formally:

        Theorem 0.1 ( [39]). If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is a SCFG, there exists a SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif in Chomsky normal form (CNF) that is word-equivalent to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif , and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif can be effectively constructed from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif .

        The construction given in [39] assumes that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is ϵ-free. It can however be extended to non-ϵ-free grammars by adding an additional step after the intermediate grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif has been created (see e.g. [20]). Furthermore, it should be noted that an unambiguous grammar is inevitably loop-free.

        Theorem 0.2 ( [20]). If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif is a loop-free, ϵ-free WCFG, there exists a WCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif in RNF that is word-equivalent to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq56_HTML.gif can be effectively constructed from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq33_HTML.gif .

        Altogether, starting with an arbitrary unambiguous SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq58_HTML.gif that models the class of objects to be randomly generated, we have to proceed along the following lines:

        • Transform http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq58_HTML.gif to a corresponding ϵ-free and loop-free SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq59_HTML.gif .

        • Transform http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq59_HTML.gif into http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq60_HTML.gif in RNF (where all production weights are rational).

        • Reweight the production rules of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq60_HTML.gif (such that all production weights are integral), yielding reweighted WCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq61_HTML.gif .

        • Transform http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq61_HTML.gif (with integral weights) into the corresponding admissible specification.

        • This specification (with weighted classes) can be translated directly

        - into a recursion for the function size of all involved combinatorial (sub)classes (where class sizes are weighted) and

        - into generating algorithms for the specified (weighted) classes,

        yielding the desired weighted unranking algorithm for generating random elements of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq62_HTML.gif .

        A small example that shows how to proceed from SCFG to reweighted normal form and the corresponding weighted combinatorial classes which allow for non-uniform generation by means of unranking is discussed in the Appendix.

        Generating Random RNA Secondary Structures

        We will now consider the previously discussed approach to construct a weighted unranking algorithm that generates random RNA secondary structures of a given size according to a realistic probability distribution. As for this paper, the corresponding probability distribution will be induced by a set of sample (SSU and LSU r)RNA secondary structures from the databases [45, 46], which will be referred to as biological database in the sequel. However, the presented algorithm can easily be used for any other distribution, which can be defined by a database of known RNA structures of a particular RNA type; our webservice implementation accessible at http://​wwwagak.​cs.​uni-kl.​de/​NonUniRandGen is actually able to sample random secondary structures of any specified RNA type. A link to download an implementation of our algorithm (in Wolfram Mathematica) can be found there, too.

        Considered Combinatorial Class

        According to the common definition of RNA secondary structure, we decided to consider the combinatorial class of all RNA secondary structures without pseudoknots that meet the stereochemical constraint of hairpin loops consisting of at least 3 unpaired nucleotides, formally:

        Definition 0.9 ( [21]). The language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif containing exactly all RNA secondary structures is given by (note that according to this definition, completely unpaired structures are prohibited) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq64_HTML.gif , where http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq65_HTML.gif is the language of all bar-bracket representations of single-stranded regions and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq66_HTML.gif is the language of all bar-bracket representations of other possible substructures, i.e. is the smallest language satisfying the following conditions:

        1. 1.

          http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq67_HTML.gif (bar-bracket representations of hairpin loops).

        2. 2.

          If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq68_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq69_HTML.gif (bar-bracket representation of a stacked pair).

        3. 3.

          If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq68_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq70_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq71_HTML.gif (bar-bracket representations of bulge loops).

        4. 4.

          If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq68_HTML.gif , then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq72_HTML.gif (bar-bracket representations of interior loops).

        5. 5.

          If http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq73_HTML.gif and n ≥ 2, then http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq74_HTML.gif (bar-bracket representations of multibranched loops).


        The desired weighted unranking algorithm thus generates, for a given size n and a given number http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq75_HTML.gif , the ith secondary structure http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq76_HTML.gif , where http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq77_HTML.gif is the number of elements in the weighted class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq78_HTML.gif .

        Considered SCFG Model

        First, we have to find a suitable SCFG that generates http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif and models the distribution of the sample data as closely as possible. To reach this goal, it is important to appropriately specify the set of production rules in order to guarantee that all substructures that have to be distinguished are generated by different rules. This is due to the fact that by using only one production rule f to generate different substructures (e.g. any unpaired nucleotides independent of the type of loop they belong to), there is only one weight (the probability p f of this production f) with which any of these substructures is generated, whereas the use of different rules f 1,..., f k to distinguish between these substructures implies that they may be generated with different probabilities http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq79_HTML.gif , where http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq80_HTML.gif . This way, we ensure that more common substructures are generated with higher probabilities than less common ones.

        Example 0.3. A (rather simple) unambiguous SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq81_HTML.gif generating the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by:

        This grammar unambiguously generates http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif for the following reasons:

        • Every sentential form C ( B ) C ( B ) ... ( B ) C obviously is generated in a unique way; this resembles http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq82_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq83_HTML.gif of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq84_HTML.gif definition. The number of outermost pairs of brackets in the entire string uniquely determines the corresponding sentential form to be used.

        • Now B either generates a hairpin-loop | ≥3, which unambiguously is possible by rules B||| C, C| C and Cϵ, or

        • B itself has to generate at least one additional pair of brackets. In this case, BCA must be applied (only A can generate brackets) and then A( B ) C resp. A( B ) CA are used; the number of outermost brackets to be generated (from B under consideration) again uniquely determines that part of the derivation.

        When changing the production w 5: BCA used to generate any possible k-loop for k ≥ 2 (any loop that is not a hairpin loop) with probability w 5 into the two rules
        where w 5.1 + w 5.2 = w 5, it becomes possible to generate any possible 2-loop (i.e. a stacked pair, a bulge (on the left or on the right), or an interior loop) and all kinds of multiloops (i.e. any k-loop with k ≥ 3) with different probabilities, which could increase the accuracy of the SCFG model. By additionally replacing the first of these two new rules, w 5.1 : BC ( B ) C, by the four productions

        where (w 5.1.1 + ... + w 5.1.4) + w 5.2 = w 5.1 + w 5.2 = w 5, we can distinguish between the different types of 2-loops more accurately, yielding a more realistic secondary structure model. In fact, in the case of significant differences of the new probabilities (w 5.1.1, ..., w 5.1.4 and w 5.2), we can expect a huge improvement in the model's accuracy. Note that it is not hard to see that changes to a grammar like the ones just discussed do not change the language generated. However, this is not at all obvious with respect to ambiguity of the grammar.

        According to the previously mentioned facts (and the corresponding illustrations by Example 0.3), we decided that the basis for our weighted unranking algorithm should be the following e-free, loop-free and unambiguous (note that these are exactly the preliminary required conditions for the basis SCFG according to [20]) SCFG, which has been derived from the sophisticated SCFG presented in [21] that distinguishes between all known structural motifs that can be found in RNA secondary structure:

        Definition 0.10. The unambiguous ϵ-free SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif generating exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq86_HTML.gif , where
        http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq87_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq88_HTML.gif contains exactly the following rules:

        Figures 2 and 3 illustrate by examples how (parts of) secondary structures are generated by this SCFG, where we used http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq89_HTML.gif to denote the full parse tree for I ⇒* x (i.e. for consecutive applications of an arbitrary number of production rules that generate the subword x from the intermediate symbol I) in oder to obtain a more compact tree representation. In fact, it is easy to see that the overall structure is always produced by starting with the axiom S', while any particular substructure or structural motif that belongs to the combinatorial (sub)class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq90_HTML.gif is created from the corresponding intermediate symbol I.

        For our application it is crucial that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif - as claimed its definition - is unambiguous. To prove this, we first note that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif has been constructed starting from a simple grammar which generates http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif by iteratively replacing one production by several ones (like we did in the previous example) in order to distinguish more and more structural motifs but without changing the language generated. Furthermore, a standard construction to make the grammar ϵ-free has been applied. That way, we can be sure that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif generates http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif (formally this fact easily follows by obvious bi-simulation proofs for each substitution and by the proven correctness of the used construction to ensure ϵ-freeness). To prove unambiguity, we translate http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif into a system of equations for its structure generating function (see [47] for details) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq91_HTML.gif where d(w) denotes the number of derivation trees http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif offers for w. Eliminating all but the variable associated with the axiom and simplifying (for this step we made use of Mathematica) yields the single equation
        Figure 2

        Unique parse tree for the bar-bracket word considered in Example 0.1 that corresponds to the planar secondary structure from Figure 1.

        Figure 3

        Particular subtrees of the tree presented in Figure 2.


        This equation is exactly the one of our grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq81_HTML.gif from Example 0.3 which proves that for all n both grammars have the same number of derivation trees for words of size n. Knowing that both grammars generate http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif and that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq81_HTML.gif is unambiguous, the same can be concluded for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif .

        Note that http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif contains more production rules (and more different non-terminal symbols) than the SCFG considered in [21], but this new grammar is ϵ-free and additionally, the right-hand side of every single production contains at most two non-terminal symbols, such that the resulting unranking algorithm has to consider less cases (i.e. less "else if ( )" cases). For details, see [20] and the Appendix.

        Furthermore, it should be mentioned that we decided to assign relative frequencies to the production rules of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif , since such probabilities can be computed efficiently for unambiguous SCFGs. Moreover, by estimating the probabilities http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq92_HTML.gif , by their relative frequencies, the resulting grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif has the consistency property, which means http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif provides a probability distribution on the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq93_HTML.gif . In particular, it is well-known that relative frequencies in our context yield a maximum likelihood (ML) estimator for the rule probabilities and thus a consistent estimator for the parameter set. We have trained the probabilities (relative frequencies) of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif from the structures http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq94_HTML.gif given in our biological database. The resulting probabilities are given in Table 1, their floating point approximations, rounded to the third decimal place in Table 2.
        Table 1

        The probabilities (relative frequencies) for the production rules of the SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif obtained by training it using our biological database

        Nonterminal Nt

        Probabilities of Rules with Premise Nt























































        Table 2

        Floating point approximations of the probabilities (relative frequencies) for the production rules of the SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif (rounded to three decimal places)

        Nonterminal Nt

        Probabilities of Rules with Premise Nt























































        In oder to see if over-fitting is an issue for our sophisticated grammar and its rich parameter set, i.e. to see if our training set is large enough to derive reliable values for the rule probabilities, we performed the following experiments: We selected a random 90% (resp. 50%) portion of the original training set and re-estimated the probabilities of all the grammar rules. This process was iterated 40 times, resulting in a sample of 40 parameter sets. Finally, for each parameter we determined its variance along this sample of size 40. The corresponding values lay between 0 (resulting for intermediate symbols without alternatives; for whose productions a probability of 1 is predetermined) and 2.87652 × 10-6 (resp. 2.86242 × 10-5). We can conclude that over-fitting is no issue in connection with our sophisticated grammar and the training set used.

        Derivation of the Algorithm

        The elaborate SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif is appropriate for being used as the basis for the desired weighed unranking method: after having determined the RNF of this SCFG and the corresponding weighted combinatorial classes, we easily find a recursion for the size function (in the same ways as discussed in Example App-.4). Then, we can use the resulting weighted class sizes for the straightforward construction of the desired unranking algorithm.

        In fact, for the construction of the complete algorithm, we simply have to use Algorithms 1 to 4 (Unranking of neutral classes, atomic classes, disjoint unions and cartesian products, respectively) and Algorithm 6 (Unranking of weighted classes) given in [20] as subroutines. However, to improve the worst-case complexity of the resulting unranking procedure from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq24_HTML.gif by using the boustrophedonic order instead of the sequential order, a simple change in Algorithm 4 (Unranking of cartesian products) is neccessary (see e.g. [7]).

        A random RNA secondary structure of size n can easily be computed by drawing a random number http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq149_HTML.gif and then unranking the ith structure of size n. The worst-case runtime complexity of this procedure is equal to that of unranking and is thus given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq24_HTML.gif when using the boustrophedonic order. By repeating this procedure m times, a set of m (not necessarily distinct) random RNA secondary structures of size n can be generated in time http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq2_HTML.gif , where a preprocessing time of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq1_HTML.gif is required for the computation of all (weighted) class sizes up to input length n.

        A complete and detailed description of the derivation of our weighted unranking algorithm for (SSU and LSU r)RNA secondary structures can be found in the Appendix, since it is too comprehensive to be presented here and the different steps for its generation correspond to those described in [20].

        Availability of Software

        It may be of interest to the reader that this non-uniform random generation algorithm for RNA secondary structures has been implemented as a webservice which is accessible to the scientific community under http://​wwwagak.​cs.​uni-kl.​de/​NonUniRandGen. Since it is relevant for researchers to have methods available for generating random structures that are realistic for a particular investigation, this webservice is also capable of allowing the user to specify the distribution from which the corresponding structures should be sampled (in the form of a set of secondary structure samples from which the parameters for our grammars are inferred). Furthermore, our Mathematica source code used to implement the webservice can be downloaded from our website and used under GNU public licence.


        The purpose of this section is to analyze the quality of randomly generated structures by considering some experimental results.

        Parameters for Structural Motifs

        As a first step, we decided to consider several important parameters related to particular structural motifs of RNA secondary structure and compare the observed statistical values derived from a native sample (here our biological database, i.e. the set of real-life RNA data that we used for deriving the distribution and thus the weights for the unranking algorithm) to those derived from a corresponding random sample (i.e. a set of random structures generated by our algorithm). In order to obtain an appropriate random sample, we have generated exactly one random structure of size n for each native RNA structure of size n given in our database, such that for each occurring size n, the random sample and the native sample contain the same number of structures having this size.

        The determined results are presented in Table 3. Comparing the specific values of all different parameters, we can guess that our algorithm produces random RNA secondary structures that are, related to the different structural motifs and thus related to the expected shape of such structures, in most cases realistic. Obviously, this is a major improvement over existing approaches for the random generation of secondary structures of a given input size n (where the corresponding specific RNA sequence is not known, but only its length n), as those (sequence-independent) methods are only capable of generating structures uniformly at random for input size n. Furthermore, with the SCFG model used here, we have an new model for RNA secondary structures at hand which realistically reflects the structure of an RNA molecule and its basic structural motifs.
        Table 3

        Expectation and variance of important parameters related to particular structural motifs of RNA secondary structure


        Expected Value































































































































        Values are derived from a native sample (our biological database) and from a random sample, respectively. num x denotes the number of occurrences of motif x in one secondary structure and unp x (bps x ) denotes the number of accessible unpaired bases (base pairs) in one substructure of type x. unp, bps, urs denote unpaired bases, base pairs and unpaired regions, whereas e, h, s, b, i, m, hel denote exterior loop, hairpin loop, stacked pair, bulge loop, interior loop, multiloop and helix, respectively.

        Related Free Energies

        For further investigation on the accuracy of our random generator, we take on a completely different point of view and consider thermodynamics. The reason behind this idea is that if an RNA secondary structure model induced by a SCFG shows a realistic behaviour (expectation and variance) with respect to minimum free energy, then it is rather likely that our grammar also shows a realistic picture for all the different structural motifs of a molecule's folding (as the free energy of a molecule's structure is defined as the sum of the energy contributions of all its substructures).

        Since we do not know the corresponding RNA sequences for the randomly generated structures, we can not use one of the common sequence-dependent thermodynamic models for RNAs. Therefore, we decided to consider both the static and dynamic free energy models (in the static model, averaged free energy contributions for the distinguished structural motifs are considered which can easily be derived from the training data (by sequence counting). These averaged values actually represent the free energy contributions that have to be added for the respective whole substructures. For the dynamic model, corresponding average values for length-dependent free energy contributions (that depend on the number of unpaired or paired bases within particular substructures) are added for each component (unpaired base or base pair) in the respective motifs, such that in contrast to the static model, substructures of different lengths are assigned different free energy values) defined in [21] for RNA secondary structures with unknown sequence. These models are based on the well-known Turner energy model [22, 23] and model parameters have been derived from the same biological database (of SSU and LSU rRNAs) that we consider in this article. In fact, both models have turned out to show a realistic behaviour and can therefore be used to judge the quality of random structures generated by our algorithm.

        Unquantified Results

        Similar to [21], we denote the free energy of a given secondary structure http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq150_HTML.gif according to the static and dynamic model by g stat (s) and g dyn (s), respectively. Moreover, the expected free energy and corresponding variance that have been analytically derived in that paper for any n > 0 are denoted by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq151_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq152_HTML.gif , respectively, where energy ∈ {g stat , g dyn }. The corresponding confidence interval for n > 0 and k > 1, which contains at least http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq153_HTML.gif percent of the energies in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq154_HTML.gif is denoted by I energy,n (k): = (μ energy,n - kσ energy ,n, μ energy,n + kσ energy,n ). As these analytical energy results from [21] and our unranking algorithm have been derived from the same database of real-life RNA data and by modeling the same class http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif of structures via very similar SCFGs, it seems adequate to use them for comparisons with the energies of our randomly generated structures.

        Before we start with our comparisons, note that for any sample set http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq155_HTML.gif of secondary structures, we can calculate the corresponding energy points http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq156_HTML.gif , where energy ∈ {g stat , g dyn }. Obviously, we can also compute the corresponding "average energy points" http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq157_HTML.gif and the corresponding "energy variance points" http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq158_HTML.gif , respectively. In the sequel, we will denote a random sample generated by our algorithm by ℛ and a native sample (biological database) by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq159_HTML.gif .

        In order to obtain an appropriate random sample for our energy comparisons, we derived a large set of random structures by generating 1000 RNA secondary structures for each of the sizes n ∈ {500,1000,1500,..., 5000, 5500} with our weighted unranking algorithm. To compare the energies of our randomly generated structures to the corresponding confidence interval(s), we decided to consider any http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq160_HTML.gif , meaning the probability that the free energy of a random RNA secondary structure of size n lies within the corresponding interval is greater than 0.5, 0.75, 0.9, and 0.95, respectively.

        Figure 4 shows a plot of the corresponding four confidence intervals (analytically derived, related to our biological data) along with the energy points for our random sample and for our native database, respectively, under the assumption of the static energy model. The corresponding plots for the dynamic energy model are shown in Figure 5. Looking at both figures, we immediately see that the energies for our set of randomly generated RNA secondary structures seem to fit to the ones for the considered RNA database and also to the corresponding analytically obtained energy results from [21]. This observation becomes even more clear by considering Figures 6 and 7. There, we compare the previously introduced "average energy points" and "energy variance points" to the analytically determined expected free energy and corresponding variance from [21], respectively.
        Figure 4

        Plots of the confidence intervals http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq161_HTML.gif Intervals are shown for the static energy model (blue), for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq162_HTML.gif (top left to bottom right), together with the corresponding energy points http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq163_HTML.gif for the random sample (cyan) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq164_HTML.gif for the native sample (green).

        Figure 5

        Plots of the confidence intervals http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq165_HTML.gif Intervals are shown for the dynamic energy model (purple), for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq166_HTML.gif (top left to bottom right), together with the corresponding energy points http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq167_HTML.gif for the random sample (magenta) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq168_HTML.gif for the native sample (yellow).

        Figure 6

        Plot of expectations of the free energy. Plots show http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq169_HTML.gif (blue) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq170_HTML.gif (purple) of a random RNA secondary structure of size n, together with the "average energy points" http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq171_HTML.gif (cyan) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq172_HTML.gif (magenta) for the random sample.

        Figure 7

        Plot of variances of the free energy. Plots show http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq173_HTML.gif (blue) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq174_HTML.gif (purple) of a random RNA secondary structure of size n, together with the "energy variance points" http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq175_HTML.gif (cyan) and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq176_HTML.gif (magenta) for the random sample.

        Quantified Results

        The previously considered energy comparisons have been presented only by unquantified plots. This may not be very satisfying, since it is obvious that the free energy would decrease with structure size and aside from this, it could have been expected that for large randomly generated sets of structures of a given size, the average energy and corresponding variance fit the analytically obtained energy results derived under the assumption of a basically equivalent SCFG model for secondary structures. Therefore, there is a need to consider some sort of quantification and additionally present corresponding quantified comparison results. What really matters is the degree to which the energy ranges of the random structures agree, in distribution, with our biological database. This means we have to find out if the energies related to a random sample (generated by our unranking method) and those related to a native sample (given by the structures in our biological database) come from a common distribution. Consequently, we have to consider the energies of a random sample and those of a native one as two independent sets of values and determine the extend to which their distributions coincide, or in other words to test for significant differences between these two sets. For this reason, we decided to apply one of the most common (non-parametric) significance tests known from statistics, the so-called Mann-Whitney U-test [48], which is widely used as statistical hypothesis test for assessing whether two independent samples of observations (with arbitrary sample sizes) come from the same distribution. It is also known as the Wilcoxon rank-sum test [49] which however can only be applied for equal sample sizes.

        Formally, this test is used to check whether the null hypothesis N 0 - which states that the two independent samples X and Y are identically distributed (i.e. F(X) = F(Y)) - can be accepted or else, has to be rejected. More specifically, the result of such a test, the so-called p-value, is a probability answering the following question: If the two samples really have the same distribution, what is the probability that the observed difference is due to chance alone? In other words, were the deviations (differences between the two samples) the result of chance, or were they due to other factors and how much deviation can occur before one must conclude that something other than chance causes the differences? The p-value is called statistically significant if it is unlikely that the differences occurred by chance alone, according to a preliminary chosen threshold probability, the significance level α (common choices are e.g. α ∈ {0.10,0.05,0.01}). If pα, the deviation is small enough that chance alone accounts for it; this is within the range of acceptable deviation. If p < α, we must conclude that some factor other than chance causes the deviation to be so great, this will lead us to decide that the two sets come from different distributions.

        For our analysis, we again decided to generate the same numbers of random structures for any size as are given for this size in our biological database, such that random and native sample contain the same numbers of structures for any occurring size (and hence the sample sizes are equal). Moreover, note that the unquantified results presented in Figures 4 and 5 might yield the assumption that for any structure size, some energy values of randomly generated structures are scattered too widely around the corresponding expected value, such that those randomly drawn secondary structures can not be considered realistic (neither with respect to thermodynamics nor with respect to structural composition and expected shape). In an attempt to disprove that assumption, we decided to perform a series of Wilcoxon tests by considering a number of different random samples. These samples are created by obeying a specified energy-based rejection scheme: Do not add a randomly generated structure of a given size to the sample if its free energy (according to the static or dynamic model or according to both models) lies outside the corresponding confidence interval(s). Formally, for any preliminary chosen value k > 1, a generated structure http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq76_HTML.gif is added to the random sample iff

        otherwise it is rejected. This means we accept only a specified deviation of the energy energy (s) of the random structure s from the corresponding expected free energy μ energy,n and reject structures whose energy differs too much from the expected value. Note that for k = ∞ (confidence interval I energy,n (k) contains 100 percent of the energies energy(s) of all http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq76_HTML.gif ), no structures are rejected. Hence, in this case, the corresponding random sample corresponds to the usual (unrestricted) output of our algorithm.

        The Wilcoxon test results for our native sample together with any of a number of random sample sets generated in the previously described restricted manner, respectively, can be found in Table 4. As we can see, the best results are achieved for the unrestricted sample sets, where all free energies of randomly generated structures were allowed during the sample creation process. Moreover, these two results (for the unrestricted case k = ∞) are not statistically significant when considering the common significance level α = 0.05, that is in both cases, we can assume that the energies of the random structures and those of the biological data follow a common distribution. These observations indicate that our weighted unranking algorithm produces random RNA secondary structures that are - related to the free energy of such structures (in expectation and variation) - in expectation realistic.
        Table 4

        Significance results for statistical hypothesis testing, computed by the Wilcoxon rank-sum method

        Chosen Value of k

        Percent Within Corr. Interval

        Models Used for Rejection

        Model for Native Energies

        Model for Random Energies

        Resulting Wilcoxon p-Value (approx.)









































































































        1.515 10-7
































        Besides that, it is obvious that the computed p-values are much better for the dynamic energy model than for the static one. This underlines the suggestion made in [21] that, although both energy models have been proven to be realistic, due to the more realistic variation of free energies connected to varying loop length, the dynamic model should be used for possible applications. Since at least for the dynamic model, the random data fit very nicely with the native data, we can conclude that structures generated by our non-uniform random generation algorithm behave realistic with respect to free energies and - as the energy of the overall structure is assumed to be equal to the sum of the substructure energies - rather likely also with respect to appearance of the different structural motifs of RNA molecules.


        Altogether, we can finally conclude that the non-uniform random generation method proposed in this article produces appropriate output and may thus be used (for research issues as well as for practical applications) to generate random RNA secondary structures. In fact, for any arbitrary type of (pseudoknot-free) RNA, a corresponding random sampler can be derived in the presented way. Actually, our webservice can be used for generating random secondary structures of any specified type of RNA. It just requires a database of known structures for the respective RNA type as input.

        Note that in this work, we abstract from sequence and consider only the structure size as input for our algorithm. Thus, an interesting problem for future research would be to find a way to extend the presented realistic SCFG model to additionally deal with RNA sequence. In fact, this work and especially the considered elaborate SCFG could mark some sort of stepping stone towards new stochastic RNA secondary structure prediction methods realized by statistical random sampling.


        How to Construct a Weighted Unranking Algorithm from a Given SCFG

        The purpose of this section is to give a rather small example for applying the construction scheme described in detail in [20] to proceed from an arbitrary SCFG to reweighted normal form (RNF) and then to the corresponding weighted combinatorial classes which allow for non-uniform generation by means of unranking.

        Example App-.4. Let us consider the SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif , which contains the following rules:
        To apply the approach presented in [20] to transform a given SCFG to RNF, the grammar needs to be ϵ-free and loop-free. Thus, we first have to transform grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif into the following one:
        The transformation of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif into RNF now works as follows: First, we have to gather all possible chains AA 1A 2 → ... → α, where AS and |α| = 1. These chains are BC, BC| and C → |; the rules BC and C| are then removed. Second, we have to replace each of these chains by a specific new rule. In fact, we have to add B C,ϵ C, B |,C → | and C |,ϵ | to the new set of productions. Consequently, our new rule set is now given by
        Third, for each occurrence of a non-terminal symbol A in the conclusion of a production and each previously added new rule http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq183_HTML.gif corresponding to a chain AA 1A 2 → ... → α, add a specific new rule. This way, we obtain the following production set:

        Fourth, each intermediate symbol that no longer occurs as premise in any of the productions has to be removed and fifth, each production of the form Sα, where S is the axiom and |α| > 1 has to be changed in a specific way. However, since in our case, there is obviously nothing left to do, the transformation of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif into RNF is finished.

        For http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif (in RNF), where all production weights are rational, we can determine the common denominator s of the weights of productions with premise S, as well as the common denominator c of the weights of the remaining productions (i.e., of the productions with premise B or C). Then, the reweighting of the production rules of (the RNF of) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq182_HTML.gif is done by multiplying the weights of productions with source S by s, and the weights of the other productions Aα, where AS, by the factor c |α|-1. After that, we obtain the following reweighted grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq184_HTML.gif :

        where each http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq185_HTML.gif , is integral.

        The (now weighted) grammar can easily be translated into a corresponding admissible specification, which includes the weighting of all involved combinatorial (sub)classes, as described earlier. For the reweighted grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq184_HTML.gif , this specification is given by the following equations:
        which can be simplified in the following way:
        As described earlier, this specification (with weighted classes) derived from reweighted grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq184_HTML.gif transforms immediately into a recursion for the function size of all needed combinatorial classes. For http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq184_HTML.gif , the recursion for the function size has the following form:

        This recursive size function (with weighted class sizes) can now be used for the straightforward construction of a corresponding algorithm for the non-uniform generation of elements of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq186_HTML.gif by means of unranking, as proposed in [20].

        Derivation of the Algorithm

        In this section, we give a complete and detailled description of the derivation of our weighted unranking algorithm for RNA secondary structures. The different steps are made according to the approach described in [20] to get an unranking algorithm that generates random RNA secondary structures of a given size n according to the distribution on all these structures.

        Considered (unambiguous, ϵ-free and loop-free) SCFG

        First, note that in [21], to obtain the stochastic model for RNA secondary structures derived from real-world RNA data, the following unambiguous SCFG which unambiguously generates exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif given in Definition 0.9 has been used:

        Definition App-.11. The unambiguous SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq187_HTML.gif generating exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq188_HTML.gif , where
        http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq189_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq190_HTML.gif contains exactly the following rules:

        In this grammar, different intermediate symbols have been used to distinguish between different substructures. In fact, the reason why this grammar has so many production rules is that the grammar must be able to distinguish between all the different classes of substructures for which there are different free energy rules according to Turner's thermodynamic model considered in [21].

        However, as ϵ-freeness and loop-freeness are required preliminarily, we have to consider another unambiguous SCFG generating the same language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif , where we have to guarantee that the same substructures are distinguished as are distinguished in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq187_HTML.gif .

        Using the usual way of transforming a non-ϵ-free grammar into an ϵ-free one, the following definition can immediately be obtained from the previous one:

        Definition App-.12. The unambigous and ϵ-free SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq191_HTML.gif generating exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq192_HTML.gif , where
        http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq193_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq194_HTML.gif contains exactly the following rules:

        Unfortunately, the set of productions of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq191_HTML.gif contains productions with up to 5 non-terminal symbols in the conclusion. This is not acceptable for our purpose, for the following reason: the desired unranking algorithm makes use of the size of combinatorial classes whose representations somehow are derived from CFGs with particular integer weights on their productions. If we constructed this WCFG by starting with the grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq191_HTML.gif , then this would yield a huge number of production rules. Consequently, the translation would imply a huge specification of the combinatorial classes and the corresponding function to compute their sizes and thus the corresponding unranking algorithm would have to distinguish between an unnecessarily and most importantly unacceptably large number of cases.

        Nevertheless, the size of the production set of the weighted grammar underlying the desired unranking algorithm can be significantly reduced by starting with a modification of grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq191_HTML.gif which has only production rules with minimum possible numbers of non-terminal symbols in the conclusion. In fact, by transforming http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq191_HTML.gif appropriately considering this observation, we obtained the SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif :

        Definition App-.13. The unambiguous ϵ-free SCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif generating exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq195_HTML.gif , where
        http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq196_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq197_HTML.gif contains exactly the following rules:

        Transforming our SCFG into RNF

        Now, we can construct the desired weighted grammar that will be underlying our unranking algorithm: In the first step, we gather all possible chains of productions that do not lengthen the sentential form. In fact, we have to consider all rules Aα, AS', with |α| = 1, to obtain all such chains (note that these rules will be removed after step 1). Hence, we have to consider the following set http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq198_HTML.gif of 22 production rules:
        Thus, the following 32 chains are gathered in step 1:
        Furthermore, the 22 production rules contained in http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq198_HTML.gif are now removed. This results in the following set http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq199_HTML.gif of 32 rules:
        Additionally, in step 2, for each chain a new intermediate symbol and a new production are introduced. Thus, according to the 32 chains gathered in step 1, we here obtain the following set http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq200_HTML.gif of 32 new production rules:
        In step 3, for each occurrence of a non-terminal symbol in the conclusion of a production and each chain starting with this non-terminal symbol, we have to add a new production with the corresponding new intermediate symbol instead of the considered one. Thus, in step 3, the remaining 32 production rules from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq199_HTML.gif are transformed (according to http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq200_HTML.gif ) into the following set http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq201_HTML.gif of 79 new rules:
        In step 4, we must delete all intermediate symbols that no longer occur as premise. Obviously, intermediate symbols no longer occurring as premise of a production are
        We easily observe that the productions that contain at least one of these 4 intermediate symbols in the conclusion and thus have to be removed are exactly the following ones:

        Consequently, after the removal of these 6 rules from http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq201_HTML.gif , there still remain 73 new production rules.

        Finally in step 5, we must make sure that the conclusion of all productions with premise S' (axiom of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif that we started with) does not have a length greater than 1. However, since there is only one production with premise S' in our start grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq85_HTML.gif and the conclusion of this production has size 1, there is nothing to do. Thus, the resulting new grammar is given by:

        Definition App-.14. The WCFG http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq202_HTML.gif generating exactly the language http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq63_HTML.gif is given by http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq203_HTML.gif , where
        http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq204_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq205_HTML.gif contains exactly the following rules:
        whereas http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq206_HTML.gif contains exactly the following rules:

        Reweighting the Production Rules

        Now, the weights of the 73 production rules given in the subset of productions http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq205_HTML.gif have to be reweighted. In order to achieve this goal, we first have to compute the two common denominators s and c, where s is the common denominator of the weights of productions with premise S' (i.e., of productions number 1 to 3), and c is the common denominator of the weights of the remaining productions (i.e., of productions number 4 to 73) of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq205_HTML.gif . Using the rounded probabilities (weights) for the production rules of http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq202_HTML.gif as given in Table 5, we immediately find the smallest common denominators to be s = 10,000 and c = 10,000.
        Table 5

        Floating point approximations of the probabilities (weights) http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq207_HTML.gif , for the production rules of the grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq202_HTML.gif (rounded to four decimal places)

        Nonterminal Nt


        Weights of Rules with Premise Nt


        λ1 : = 1.0000,

        λ2 : = 0.0212,

        λ3 : = 0.0003,



        λ4 : = 0.9788,

        λ5 : = 0.0134,

        λ6 : = 0.0944,

        λ7 : = 0.0013,


        λ8 : = 0.8559,

        λ9 : = 0.1304,

        λ10 : = 0.0126,

        λ11 : = 0.0181,


        λ12 : = 0.0002,



        λ13 : = 0.9036,

        λ14 : = 0.0871,



        λ15 : = 0.7630,

        λ16 : = 0.0402,

        λ17 : = 0.0186,

        λ18 : = 0.0367,


        λ19 : = 0.0072,

        λ20 : = 0.0858,

        λ21 : = 0.0484,



        λ22 : = 0.3038,

        λ23 : = 0.1884,

        λ24 : = 0.3081,

        λ25 : = 0.1996,


        λ26 : = 1.0000,

        λ27 : = 0.3896,



        λ28 : = 0.6104,

        λ29 : = 0.2378,



        λ30 : = 0.0575,

        λ31 : = 0.3409,

        λ32 : = 0.6016,

        λ33: = 0.1211,


        λ34 : = 0.7987,

        λ35 : = 0.1608,



        λ36 : = 0.1085,

        λ37 : = 0.2144,

        λ38 : = 0.2011,

        λ39 : = 0.4760,


        λ40 : = 0.1713,

        λ41 : = 0.8287,



        λ42 : = 0.4150,

        λ43 : = 0.5850,



        λ44 : = 1.0000,

        λ45 : = 0.3243,



        λ46 : = 1.0000,

        λ47 : = 0.3243,



        λ48 : = 1.0000,

        λ49 : = 0.2928,



        λ50 : = 0.6757,

        λ51 : = 0.2191,



        λ52 : = 0.7072,

        λ53 : = 0.2071,



        λ54 : = 1.0000,

        λ55 : = 0.0510,

        λ56 : = 0.0036,

        λ57: = 0.0712,


        λ58 : = 0.0036,

        λ59 : = 0.0003,



        λ60 : = 0.9288,

        λ61 : = 0.1968,



        λ62 : = 0.4211,

        λ63 : = 0.5279,

        λ64 : = 0.1119,

        λ65 : = 0.0215,


        λ66 : = 0.0015,

        λ67 : = 0.0300,

        λ68 : = 0.0376,

        λ69 : = 0.0080,


        λ70 : = 0.0015,

        λ71 : = 0.0001,



        λ72 : = 0.7881,

        λ73 : = 0.1670.


        Note that for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq208_HTML.gif holds.

        The desired new weights for the considered set of productions http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq205_HTML.gif are then computed by multiplying the old weights of productions with source S' by s, and by multiplying the old weights of productions Aα, AS' (and http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq209_HTML.gif ), by c |α|-1.

        Formally, for the reweighted set of productions http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq205_HTML.gif , we get the following weights:
        The resulting integer weights can be found in Table 6.
        Table 6

        Integer weights http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq210_HTML.gif , for the production rules of the grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq202_HTML.gif

        Nonterminal Nt

        Integer weights of Rules with Premise Nt


        μ 1 : = 10000,

        μ 2 : = 212,


        μ 3 : = 3,



        μ 4 : = 9788,

        μ 5 : = 134,


        μ 6 : = 944,

        μ 7 : = 13,


        μ 8 : = 8559,

        μ 9 : = 1304,


        μ 10 : = 126,

        μ 11 : = 181,


        μ 12 : = 2,



        μ 13 : = 9036,

        μ 14 : = 871,


        μ 15 : = 76300000,

        μ 16 : = 4020000,


        μ 17 : = 1860000,

        μ 18 : = 3670000,


        μ 19 : = 720000,

        μ 20 : = 8580000,


        μ 21 : = 4840000,



        μ 22 : = 3038,

        μ 23 : = 1884,


        μ 24 : = 3081,

        μ 25 : = 1996,


        μ 26 : = 10000,

        μ 27 : = 3896,


        μ 28 : = 6104,

        μ 29 : = 2378,


        μ 30 : = 5750000,

        μ 31 : = 340900000000,


        μ 32 : = 6016000000000000,

        μ 33 : = 1211000000000000,


        μ 34 : = 7987,

        μ 35 : = 1608,


        μ 36 : = 10850000,

        μ 37 : = 214400000000,


        μ 38 : = 201100000000,

        μ 39 : = 4760000000000000,


        μ 40 : = 1713000000000000,

        μ 41 : = 828700000000,


        μ 42 : = 415000000000,

        μ 43 : = 585000000000,


        μ 44 : = 10000,

        μ 45 : = 3243,


        μ 46 : = 10000,

        μ 47 : = 3243,


        μ 48 : = 10000,

        μ 49 : = 2928,


        μ 50 : = 6757,

        μ 51 : = 2191,


        μ 52 : = 7072,

        μ 53 : = 2071,


        μ 54 : = 10000,

        μ 55 : = 510,


        μ 56 : = 36,

        μ 57 : = 712,


        μ 58 : = 36,

        μ 59 : = 3,


        μ 60 : = 9288,

        μ 61 : = 1968,


        μ 62 : = 4211,

        μ 63 : = 5279,


        μ 64 : = 1119,

        μ 65 : = 215,


        μ 66 : = 15,

        μ 67 : = 300,


        μ 68 : = 376,

        μ 69 : = 80,


        μ 70 : = 15,

        μ 71 : = 1,


        μ 72 : = 7881,

        μ 73 : = 1670.

        Note that for http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq211_HTML.gif holds.

        Transforming Reweighted Grammar into Admissible Specification

        Given the reweighted grammar http://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_140_IEq202_HTML.gif , we immediately obtain the following admissible specification of the corresponding combinatorial classes (note that this specification has already been simplified by removing classes that are only duplicates of others):
        Now, this(simplified) specification can easily be transformed into the following recursive form for the function size :

        From those recurrences, the desired algorithm can easily be constructed. As the complete presentation of this algorithm would be too comprehensive, we decided to omit it and instead refer to Algorithms 1 to 4 and 6 given in [20], since for the construction of our unranking algorithm, we had to use exactly these Algorithms as subroutines.



        AS thanks Carl-Zeiss-Stiftung for supporting her research. All authors wish to thank an anonymous reviewer for careful reading and helpful remarks and suggestions made for a previous version of this article.

        Authors’ Affiliations

        Department of Computer Science, University of Kaiserslautern


        1. Flajolet P, Fusy E, Pivoteau C: Boltzmann Sampling of Unlabelled Structures. In Proceedings of ANALCO'07 (Analytic Combinatorics and Algorithms) Conference. SIAM Press; 2007:201–211.
        2. Fitch WM: Random sequences. Journal of Molecular Biology 1983, 163:171–176.PubMedView Article
        3. Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985,2(6):256–538.
        4. Denise A, Ponty Y, Termier M: Random Generation of structured genomic sequences. Proceedings of RECOMB 2003 2003, 3. (poster)
        5. Waterman MS: Secondary Structure of Single-Stranded Nucleic Acids. Advances in Mathematics Supplementary Studies 1978, 1:167–212.
        6. Ding Y, Lawrence CE: A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Research 2003,31(24):7280–7301.PubMedView Article
        7. Ponty Y: Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: the boustrophedon method. Journal of Mathematical Biology 2008, 56:107–127.PubMedView Article
        8. Allali J, d'Aubenton Carafa Y, Chauve C, Denise A, Drevet C, Ferraro P, Gautheret D, Herrbach C, Leclerc F, de Monte A, Ouangraoua A, Sagot MF, Saule C, Termier M, Thermes C, Touzet H: Benchmarking RNA secondary structures comparison algorithms. Actes des Journées Ouvertes de Biologie, Informatique et Mathématiques - JOBIM'08 2008, 67–68.
        9. Wuchty S, Fontana W, Hofacker I, Schuster P: Complete Suboptimal Folding of RNA and the Stability of Secondary Structures. Biopolymers 1999, 49:145–165.PubMedView Article
        10. Zuker M: On Finding All Suboptimal Foldings of an RNA Molecule. Science 1989, 244:48–52.PubMedView Article
        11. Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003,31(13):3406–3415.PubMedView Article
        12. Hofacker IL: The Vienna RNA secondary structure server. Nucleic Acids Research 2003,31(13):3429–3431.PubMedView Article
        13. Dowell RD, Eddy SR: Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics 2004, 5:71.PubMedView Article
        14. Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999,15(6):446–454.PubMedView Article
        15. Knudsen B, Hein J: Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research 2003,31(13):3423–3428.PubMedView Article
        16. Pedersen J, Meyer I, Forsberg R, Simmonds P, Hein J: A comparative method for finding and folding RNA secondary structures in protein-coding regions. Nucleic Acids Reserach 2004, 32:4925–4936.View Article
        17. Pedersen JS, Forsberg R, Meyer IM, Hein J: An Evolutionary Model for Protein-Coding Regions with Conserved RNA Structure. Molecular Biology and Evolution 2004, 21:1913–1922.PubMedView Article
        18. Wiebe NJP, Meyer IM: ¡sc¿Transat¡/sc¿A Method for Detecting the Conserved Helices of Functional RNA Structures, Including Transient, Pseudo-Knotted and Alternative Structures. PLoS Comput Biol 2010,6(6):e1000823.PubMedView Article
        19. Gesell T, von Haeseler A: In silico sequence evolution with site-specific interactions along phylogenetic trees. Bioinformatics 2006, 22:716–722.PubMedView Article
        20. Weinberg F, Nebel ME: Non Uniform Generation of Combinatorial Objects. Tech. rep., Technische Universität Kaiserslautern; 2010.
        21. Nebel ME, Scheid A: Analysis of the Free Energy in a Stochastic RNA Secondary Structure Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010.
        22. Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 1998, 37:14719–14735.PubMedView Article
        23. Mathews DH, Sabina J, Zuker M, Turner DH: Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure. J Mol Biol 1999, 288:911–940.PubMedView Article
        24. Nijenhuis A, Wilf HS: Combinatorial Algorithms. 2nd edition. 1978. Academic Press
        25. Flajolet P, Zimmermann P, Van Cutsem B: A Calculus for the Random Generation of Combinatorial Structures. Theoretical Computer Science 1994,132(2):1–35.View Article
        26. Duchon P, Flajolet P, Louchard G, Schaeffer G: Boltzmann Samplers for the Random Generation of Combinatorial Structures. Combinatorics, Probability, and Computing, Volume 13 2004, 577–625. [Special issue on Analysis of Algorithms]
        27. Flajolet P, Sedgewick R: Analytic Combinatorics. Cambridge University Press; 2009.View Article
        28. Harrison MA: Introduction to Formal Language Theory. Addison-Wesley; 1978.
        29. Stein PR, Waterman MS: On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Mathematics 1978, 26:216–272.
        30. Viennot G, Chaumont MVD: Enumeration of RNA Secondary Structures by Complexity. Mathematics in medicine and biology, Lecture Notes in Biomathematics 1985, 57:360–365.View Article
        31. Nebel ME: Combinatorial Properties of RNA Secondary Structures. Journal of Computational Biology 2002,9(3):541–574.PubMedView Article
        32. Hofacker IL, Schuster P, Stadler PF: Combinatorics of RNA secondary structures. Discrete Applied Mathematics 1998, 88:207–237.View Article
        33. Nebel ME: Investigation of the Bernoulli-Model of RNA Secondary Structures. Bulletin of Mathematical Biology 2004, 66:925–964.PubMedView Article
        34. Zuker M, Sankoff D: RNA Secondary Structures and their Prediction. Bull Mathematical Biology 1984, 46:591–621.
        35. Nebel ME: On a statistical filter for RNA secondary structures. Tech. rep., Frankfurter Informatik-Berichte; 2002.
        36. Nebel ME: Identifying Good Predictions of RNA Secondary Structure. Proceedings of the Pacific Symposium on Biocomputing 2004, 423–434.
        37. Molinero X: Ordered Generation of Classes of Combinatorial Structures. PhD thesis. Universitat Politècnica de Catalunya; 2005.
        38. Fu KS, Huang T: Stochastic Grammars and Languages. International Journal of Computer and Information Sciences 1972,1(2):135–170.View Article
        39. Huang T, Fu KS: On Stochastic Context-Free Languages. Information Sciences 1971, 3:201–224.View Article
        40. Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D: Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research 1994, 22:5112–5120.PubMedView Article
        41. Liebehenschel J: Ranking and unranking of lexicographically ordered words: an average-case analysis. J Autom Lang Comb 1998,2(4):227–268.
        42. Weinberg F, Nebel ME: Extending Stochastic Context-Free Grammars for an Application in Bioin-formatics. 4th International Conference on Language and Automata Theory and Applications (LATA2010) 2010.
        43. Nawrocki EP, Eddy SR: Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Comput Biol 2007, 3:e56.PubMedView Article
        44. Martínez C, Molinero X: A generic approach for the unranking of labeled combinatorial classes. Random Struct. Algorithms 2001,19(3–4):472–497.
        45. Wuyts J, Rijk PD, de Peer YV, Winkelmans T, Wachter RD: The European Large Subunit Ribosomal RNA Database. Nucleic Acids Research 2001, 29:175–177.PubMedView Article
        46. Wuyts J, de Peer YV, Winkelmans T, Wachter RD: The European Database on Small Subunit Ribosomal RNA. Nucleic Acids Research 2002, 30:183–185.PubMedView Article
        47. Salomaa A, Soittola M: Automata-theoretic aspects of formal power series. Springer; 1978.
        48. Mann H, Whitney D: On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 1947, 18:50–60.View Article
        49. Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics Bulletin 1945, 1:80–83.View Article

        This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.