Open Access

Random generation of RNA secondary structures according to native distributions

Algorithms for Molecular Biology20116:24

DOI: 10.1186/1748-7188-6-24

Received: 20 April 2011

Accepted: 12 October 2011

Published: 12 October 2011

Abstract

Background

Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random object of a more complicated structure like RNA molecules or proteins are of interest.

Results

In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time O ( n 2 ) for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity O m n log ( n ) while other algorithms typically have a runtime in O ( m n 2 ) . Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.

Conclusion

A number of experimental results shows that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.

Keywords

Random generation stochastic context-free grammars RNA secondary structures

Background and Introduction

The topic of random generation algorithms (also called samplers) has been widely studied by computer scientists. As stated in [1], it has been examined under different perspectives, including combinatorics, algorithmics (design and/or engineering), as well as probability theory, where two of the main motivations for random sampling are the testing of combinatorial properties of structures (e.g. conjectured structural properties, quantitative aspects), as well as the testing of properties of the corresponding algorithms (with respect to correctness and/or efficiency).

As considers software engineering, the so-called random testing approach is commonly used to test implementations of particular algorithms, as it is usually not feasible to consider all possible inputs and unknown which of these inputs are among the most interesting ones. In fact, this approach requires for the generation of random instances of program inputs that obey various sorts of syntactic and semantic constraints (where the random instances usually ought to be of a preliminarily fixed input size in order to be comparable to each other).

In the Bioinformatics area, algorithms for generating random biological sequences have been investigated for a long time (see e.g. [2, 3]). As stated in [4], random sequences are a topic of great interest in genome analysis, since according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Thus, random generation of combinatorial objects can be used in this context for simulations studies in order to isolate signal (unexpected events) from noise (statistically unavoidable regularities). In fact, according to [4], random biological sequences are for instance widely used for the detection of over-represented and under-represented motifs, as well as for determining whether scores of pairwise alignments are relevant or not: although there exist analytic approaches for these kinds of problems, for the most complex cases, it is often still necessary to be able to alternatively use a corresponding experimental approach (based on randomly generated sequences obtained from a computer programm). For this purpose, random sequences must obviously obey to a certain model that takes into account some relevant properties of actual real-life sequences, where such models are usually based on statistical parameters only. However, it is known that these classical models can be enriched by adding structural parameters (see [4]). Over the past years, several methods have been proposed for the random generation of more complex structures, where special attention has been paid to RNA secondary structures. RNA is a single-stranded nucleotide polymer and a major component of cellular processes (like DNA and proteins). An RNA strand is formed by linking together certain nucleotide units. The specific sequence of nucleotides along this chain is called the primary structure of the molecule. By pairing of nucleotides that are not linked in this chain (i.e. by the so-called effects of base pairing), the linear primary structure is folded into a three-dimensional conformation, called the tertiary structure, which in many cases determines the function of the molecule. Most of the 3D structure is determined by the intramolecular base-pairing interactions in the plane, which together form the secondary structure of the molecule. For this reason, pseudoknots (induced by crossing base pairs) are considered as tertiary interactions and are usually not permitted in the definition of secondary structure. As unknotted structures contain only nested base pairs and are thus essentially two-dimensional, they can be modeled as planar graphs. This rather descriptive and commonly used planar graph model for RNA secondary structures was first formalized in [5]. An example is shown in Figure 1.
https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig1_HTML.jpg
Figure 1

An RNA secondary structure. Unpaired and paired bases are represented by white and gray points, respectively.

Most of the existing random generation algorithms for RNA secondary structures are used for predicting the structure of a given RNA sequence (see e.g. [6, 7]), while others can be employed for instance for evaluating structure comparison softwares [8]. Note that secondary structure prediction methods based on random sampling represent a non-deterministic counterpart to the up-to-date most successful and popular physics-based prediction methods that make use of the energy minimization paradigm and are realized by dynamic programming algorithms (see e.g. [912]). Random sampling also differs from the stochastical RNA structure prediction approach that is based on context-free modeling of structural motifs and adding some statistical parameters observed in real-life data by assigning probabilities to the corresponding motifs (see e.g. [1315]). Nevertheless, it should be mentioned that statistical sampling methods like [6, 7] used for RNA structure prediction are based on thermodynamics and thus inevitably inherit the problems and imprecisions related to energy minimizing methods, which are caused by the still incomplete commonly used free energy models for RNAs. In order to overcome these pitfalls, one could take the competing point of view and consider only typical structural information observed in a set of sample data as the basis for a new random generation method. If that information draws a realistic picture for all the different motifs of a molecule's folding, the corresponding sampling method is likely to produce realistic results. Accordingly, several authors made use of stochastic context free grammars and employed machine-learning techniques to train parameter values from a set of known secondary structures. Such grammars have widely been used in a predictive mode (see, e.g., [14]) but there are also successful examples of applications where the random sampling of derivation trees has been the core of the method (see, e.g., [1618] but also [19] for examples). In the present paper, we follow that line of ideas and rely on the approach of the technical report [20] to develop a new algorithm for the (non-uniform) random generation of RNA secondary structures (without pseudoknots) according to a distribution induced by a set of sample RNA data (note that the algorithm actually generates secondary structures for a preliminary fixed size, not for a given RNA sequence of this size, which means we take the combinatorial point of view and completely abstract from sequence).

The main contribution of this manuscript is the derivation of a new and efficient algorithm for the random generation of RNA secondary structures according to an elaborate and thus very realistic model. For this purpose we use and generalize the approach from [20]. Particularly, our random generation method is based on a sophisticated context-free grammar for unknotted structures which, in order to model the class of all considered RNA secondary structures as realistic as possible, distinguishes between all known structural motifs that may occur in unknotted RNA secondary structure. This means that any structural feature is modeled by one or more specific grammar rules with corresponding probabilities observed from real-life data. Note that this grammar is actually a special variant of the comprehensive grammar used in [21] for deriving a realistic RNA structure model and for performing the first ever analytical analysis of the expected free energy of a random secondary structure (of a specified RNA type). Actually, that grammar has been designed as a mirror of the famous Turner energy model [22, 23] which serves as the foundation for most of the existing physics-based RNA structure prediction methods: all structural motifs for which there are different thermodynamic rules and parameters are created by distinct production rules (with corresponding probabilities).

According to [20], our sampling method involves a weighted unranking algorithm for obtaining the final structures. Briefly, considering an arbitrary structure class of size (cardinality) c, a corresponding unranking method uses a well-defined ordering of all class elements (according to a particular numbering scheme, the so-called ranking method) and for a given input number r {1,..., c} outputs the structure with rank r in the considered ordering. That way, the random sampling based on a stochastic grammar - building heavily on the use of small floating point numbers - is translated into an unranking algorithm using integer values only. Notably, a complete structure of size n is generated by recursively unranking the distinct structural components from the corresponding subclasses (of substructures with sizes less than n). In our case, the weighted unranking algorithm requires a precomputation step in worst-case time O ( n 2 ) for computing all weighted class sizes up to input size n. The worst-case complexity for generating a secondary structure of size n at random is then given by O ( n log n ) since we are ranking structures according to the boustrophedon order (see e.g. [7]).

By the end of this paper, we analyze the quality of randomly generated structures by considering some experimental results. First, we will consider statistical indicators of many important parameters related to particular structural motifs and compare the ones observed in the used sample set of real world RNA data to those observed in a corresponding set of random structures. Their comparison measures indicate that our method actually generates realistic RNA structures. Obviously, an algorithm which, for a given structure size n, produces random RNA secondary structures that are - related to expected shapes of such structures -in most cases realistic is a major improvement over existing approaches which, for example, are only capable of generating secondary structures uniformly for size n. Furthermore, we will consider the two different free energy models defined in [21] for RNA secondary structure (with unknown RNA sequence) to get further evidence of the good quality of our random generation method (with respect to free energies and thus rather likely also with respect to appearance of the different structural motifs of RNA).

Prior Results and Basic Definitions

Uniform Random Generation

In the past, the problem of uniform random generation of combinatorial structures, that is the problem of randomly generating objects (of a preliminary fixed input size) of a specified class that have the same or similar properties, has been extensively studied. Special attention has been paid on the wide class of decomposable structures which are basically defined as combinatorial structures that can be constructed recursively in an unambiguous way.

In principle, two general (systematic) approaches have been developed for the uniform generation of these structures: First, the recursive method originated in [24] (to generate various data structures) and later systematized and extended in [25] (to decomposable data structures), where general combinatorial decompositions are used to generate objects at random based on counting possibilities. Second and more recently, the so-called Boltzmann method [1, 26], where random objects (under the corresponding Boltzmann model) have a fluctuating size, but objects with the same size invariably occur with the same probability. Note that according to [26], Boltzmann samplers may be employed for approximate-size (objects with a randomly varying size are drawn) as well as fixed-size (objects of a strictly fixed size are drawn) random generation and are an alternative to standard combinatorial generators based on the recursive method. However, fixed-size generation is considered the standard paradigm for the random generation of combinatorial structures.

(Admissible) Constructions and Specifications

According to [25], a decomposable structure is a structure that admits an equivalent combinatorial specification:

Definition 0.1 ( [25]). Let A = ( A 1 , ... , A r ) be an r-tuple of classes of combinatorial structures. A specification for A is a collection or r equations with the i th equation being of the form A i = ϕ i A 1 , . . . , A r , where ϕ i denotes a term built of the A j using the constructions of disjoint union, cartesian product, sequence, set and cycle, as well as the initial (neutral and atomic) classes.

The needed formalities that will also be used in the sequel are given as follows:

Definition 0.2 ( [27]). If A is a combinatorial class, then A n denotes the class of objects in A that have size (defined as number of atoms) n. Furthermore:

  • Objects of size 0 are called neutral objects or tags and a class consisting of a single neutral object ϵ is called a neutral class, which will be denoted by ε (ε1, ε2,... to distinguish multiple neutral classes containing the objects ϵ1, ϵ2, ..., respectively).

  • Objects of size 1 are called atomic objects or atoms and a class consisting of a single atomic object is called an atomic class, which will be denoted by Ƶ(Ƶ a , Ƶ b ,... to distinguish the classes containing the atoms a,b,..., respectively).

  • If A 1 , . . . , A k are combinatorial classes and ϵ1, ..., ϵ k are neutral objects, the combinatorial sum or disjoint union is defined as A 1 + . . . + A k : = ε 1 × A 1 . . . ε k × A k where denotes set theoretic union.

  • If A and B are combinatorial classes, the cartesian product is defined as A × B : = α , β | α A and β B , where size(α, β) = size(α) + size(β).

Note that the constructions of disjoint union, cartesian product, sequence, set and cycle are all admissible:

Definition 0.3 ( [27]). Let ϕ be an m-ary construction that associates to a any collection of classes B 1 , . . . , B m a new class A : = ϕ B 1 , . . . , B m . The construction ϕ is admissible iff the counting sequence (a n ) of A only depends on the counting sequences (b1,n),..., (bm,n) of B 1 , . . . , B m , where the counting sequence of a combinatorial class A is the sequence of integers (a n )n≥0 for a n = card A n .

The framework of (admissible) specifications obviously resembles that of context-free grammars (CFGs) known from formal language theory (note that we assume the reader has basic knowledge of the notions concerning context-free languages and grammars. An introduction can be found for instance in [28]). In order to translate a CFG into the framework of admissible constructions, it is sufficient to make each terminal symbol an atom and to assume each non-terminal A to represent a class A (the set of all words which can be derived from non-terminal A). However, for representing CFGs, only the admissible constructions disjoint union, cartesian product and sequence are needed: Words are constructed as cartesian products of atoms, sentential forms as cartesian products of atoms and the classes assigned to the corresponding non-terminal symbols. For instance, a production rule AaB translates into the symbolic equation A = a × B . Different production rules with the same left-hand side give rise to the union of the corresponding cartesian products. Nevertheless, it should be noted that [25] also shows how to reduce specifications to standard form, where the corresponding standard specifications constitute the basis of the recursive method for uniform random generation and extends the usual Chomsky normal form (CNF) for CFGs. Briefly, in standard specifications, all sums and products are binary and the constructions of sequences, sets and cycles are actually replaced with other constructions (for details see [25]).

The prime advantage of standard specifications is that they translate directly into procedures for computing the sizes of all combinatorial subclasses of the considered class C of combinatorial objects. This means they can be used to count the number of structures of a given size that are generated from a given non-terminal symbol. Moreover, standard specifications immediately translate into procedures for generating one such structure uniformly at random. The corresponding procedures (for class size calculations and structure generations) are actually required for (uniform) random generation of words of a given CFG by means of unranking.

Simply speaking, the unranking of decomposable structures (like for instance RNA secondary structures which can be uniquely decomposed into distinct structural components) works as follows: Each structure s in the combinatorial class S n of all feasible structures having size n is given a number (rank) i 0 , . . . , card S n - 1 , defined by a particular ranking method. Based on this ordering of the considered structure class S n , the corresponding unranking algorithm for a given input number i 0 , . . . , card S n - 1 computes the single structure s S n having number i in the ranking scheme defined for class S n .

Note that in this context of unranking particular elements from a considered structure class, the corresponding algorithms make heavy use of their decomposability, as the distinct structural components are unranked from the corresponding subclasses. In fact, the class sizes can be derived according to the following recursion:
size( C , n ): =  1 C is neutral and n = 0 , 0 C is neutral and n 0 , 1 C is atomic and n = 1 , 0 C is atomic and n 1 , i = 1 k size A i , n C = A 1 + . . . + A k , j = 0 n s i z e A , j size B , n - j C = A × B .

Note that when computing the sums for cartesian products, we can either consider the values for j in the sequential (also called lexicographic) order (1, 2, 3,..., n) or in the so-called boustrophedon order 1 , n , 2 , n - 1 , . . . , n 2 . In either case, given a fix number of considered combinatorial (sub)classes (or corresponding non-terminal symbols), the precomputation of all class size tables up to size n requires O ( n 2 ) operations on coefficients. One random generation step then needs O ( n 2 ) arithmetic operations when using the sequential method and O n log ( n ) operations when using the boustrophedon method (for details we refer to [25]). Obviously, using uniform unranking procedures to construct the i th structure of size n for a randomly drawn number i, any structure of size n is equiprobably generated. Consequently, in order to make sure that, for given size n and a sample set of random numbers i, the corresponding structures are in accordance with an appropriate probability distribution (as for instance observed from real-life RNA data), it is mandatory to use a corresponding non-uniform unranking method or an alternative non-uniform random generation approach.

Non-Uniform Random Generation

Coming back to the random testing problem from software engineering, we observe that generating objects of a given class of input data according to a uniform distribution is sufficient for testing the correctness of particular algorithms. However, if one intends to gather information about the "real-life behaviour" of the algorithm (e.g. with respect to runtime or space requirements), we need to perform simulations with input data that are as closely as possible related to corresponding application. This means to obtain suitable test data, we need to specify a distribution on the considered class that is similar to the one observed in real life and draw objects at random according to this (non-uniform) distribution. Deriving such a "realistic" distribution on a given class of objects can easily be done by modeling the class by an appropriate stochastic context-free grammar (SCFG). Details will follow in the next section.

As regards RNA, it has been proven that both the combinatorial model (that is based on a uniform distribution such that all structures of a given size are equiprobable and that completely abstracts from the primary structure, see e.g. [2931]) and the Bernoulli-model (which is capable of incorporating information on the possible RNA sequences for a given secondary structure, see e.g. [3234]) for RNA secondary structures are rather unrealistic. However, modeling these structures by an appropriate SCFG yields a more realistic RNA model, where the probability distribution on all structures is determined from a database of real world RNA data (see e.g. [35, 36]).

Based on this observation, the problem of non-uniform random generation of combinatorial structures has been recently addressed in [20]. There, it is described how to get algorithms for the random generation of objects of a previously fixed size according to an arbitrary (non-uniform) distribution implied by a given SCFG. In principle, the construction scheme introduced in [20] extends on the recursive method for the (uniform) random generation [25] and adapted it to the problem of unranking of [37]: the basic principle is that any (complex) combinatorial class can be decomposed into (or can be constructed from) simpler classes by using admissible constructions.

Essentially, in [20], a new admissible construction called weighting has been introduced in order to make non-uniform random generation possible. By weighting, we understand the generation of distinguishable copies of objects. Formally:

Definition 0.4. If A is a combinatorial class and λ is an integer, the weighting of A by λ is defined as https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_IEq25_HTML.gif . We will call two objects from a combinatorial class copies of the same object iff they only differ in the tags added by weighting operations.

For example, if we weight the class A = { a } by two, we assume the result to be the set {a, a}; weighting B = { b } by three generates {b,b,b}. Thus, 2 A + 3 B = { a , a , b , b , b } and within this class, a has relative frequency 2 5 , while b has relative frequency 3 5 . Hence, this way it becomes possible to regard non-uniformly distributed classes.

As weighting a class can be replaced by a disjoint union, size λ A , n = λ size A , n and the complexity results from [37] also hold for weighted classes. Hence, the corresponding class size computations up to n need O ( n 2 ) time.

Stochastic Context-Free Grammars

As already mentioned, stochastic context-free grammars (SCFGs) are a powerful tool for modeling combinatorial classes and the essence of the non-uniform random sampling approach that will be worked out in this article. Therefore, we will now give the needed background information.

Basic Concepts

Briefly, SCFGs are an extension of traditional CFGs: usual CFGs are only capable of modeling the class of all generated structures and thus inevitably induce a uniform distribution on the objects, while SCFGs additionally produce a (non-uniform) probability distribution on the considered class of objects. In fact, an SCFG is derived by equipping the productions of a corresponding CFG with probabilities such that the induced distribution on the generated language models as closely as possible the distribution of the sample data.

The needed formalities are given as follows:

Definition 0.5 ( [38]). A weighted context-free grammar (WCFG) is a 5-tuple G = I , T , R , S , W , where I (resp. T) is an alphabet (finite set) of intermediate (resp. terminal) symbols (I and T are disjoint), S I is a distinguished intermediate symbol called axiom, R I × (I T)* is a finite set of production rules and W : R+ is a mapping such that each rule f R is equipped with a weight w f : = W(f). If G is a WCFG, then G is a stochastic context-free grammar (SCFG) iff the following additional restrictions hold:
  1. 1.

    For all f R, we have W(f) (0,1], which means the weights are probabilities.

     
  2. 2.

    The probabilities are chosen in such a way that for all A I, we have f R , Q ( f ) = A w f = 1 , where Q(f) denotes the premise of the production f, i.e. the first component A of a production rule (A, α) R. In the sequel, we will write w f : Aα instead of f = (A, α) R, w f = w(f).

     

However, at this point, we decided to not recall the basic concepts regarding SCFGs, as they are not really necessary for the understanding of this article. The interested reader is referred to the corresponding section in [21]. For a more fundamental introduction on stochastic context-free languages, see for example [39]. In fact, the only information needed in the sequel is that if structures are modeled by a consistent SCFG, then the probability distribution on the production rules of the SCFG implies a probability distribution on the words of the generated language and thus on the modeled structures. To ensure that a SCFG gets consistent, one can for example assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample of words from the generated language. For unambiguous SCFGs, the relative frequencies can actually be counted efficiently, as for every word, there is only one leftmost derivation to consider.

Modeling RNA Secondary Structure via SCFGs

Besides the popular planar graph representation of unknotted secondary structures, many other ways of formalizing RNA folding have been described in literature. One well-established example is the so called bar-bracket representation, where a secondary structure is modeled as a string over the alphabet Σ: = {(,), |}, with a bar | and a pair of corresponding brackets ( ) representing an unpaired nucleotide and two paired bases in the molecule, respectively (see, e.g. [30]). Obviously, both models abstract from primary structure, as they only consider the number of base pairs and unpaired bases and their positions. Moreover, there exists a one-to-one correspondence between both representations, as illustrated by the following example:

Example 0.1. The secondary structure shown in Figure 1 has the following equivalent bar-bracket representation that can be decomposed into subwords corresponding to the basic structural motifs that are distinguished in state-of-the-art thermodynamic models:
https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Equb_HTML.gif

Note that the reading order of secondary structures is from left to right, which is due to the chemical structure of the molecule.

Consequently, secondary structures without pseudoknots can be encoded as words of a context-free language and the class of all feasible structures can thus effectively be modeled via a corresponding CFG. Basically, that CFG can be constructed to describe a number of classical constraints (e.g. the presence of particular motifs in structures) and it can also express long-range interactions (e.g. base pairings). By extending it to a corresponding SCFG, we can also model the fact that specific motifs of RNA secondary structures are more likely to be folded at certain stages than others (and not all possible motifs are equiprobable at any folding stage).

In fact, it is known for a long time that SCFGs can be used to model RNA secondary structures (see e.g. [40]). Additionally, SCFGs have already been used successfully for the prediction of RNA secondary structure [14, 15]. Moreoever, they can be employed for identifying structural motifs as well as for deriving stochastic RNA models that are - with respect to the expected shapes - more realistic than other models [36]. Furthermore, note that an SCFG mirror of the famous Turner energy model has been used in [21] to perform the first analytical analysis of the free energy of RNA secondary structures; this SCFG marks a cornerstone between stochastic and pyhsics-based approaches towards RNA structure prediction.

Random Generation With SCFGs

SCFGs can easily be used for the random generation of combinatorial objects according to the probability distribution induced by a sample set, where the only problem is that they do not allow the user to fix the length of generated structures. In particular, given an SCFG G and the corresponding language (combinatorial class) L ( G ) , a random word w L ( G ) can be generated in the following way:

  • Start with the sentential form S (where S denotes the axiom of the grammar G ).

  • While there are non-terminal symbols (in the currently considered sentential form), do the following:1) Let A denote the leftmost non-terminal symbol.
    1. 2)

      Draw a random number r from the interval (0,1].

       
    2. 3)

      Substitute symbol A by the right-hand side α of the production Aα determined by the random number r.

       

This means consider all m ≥ 1 rules p1 : Aα1,..., p m : Aα m having left-hand side A, where according to the definition of SCFGs, i = 1 m p i = 1 must hold. Then, find k ≥ 1 with i = 1 k - 1 p i < r i = 1 k p i , i.e. determine k ≥ 1 with r i = 1 k - 1 p i , i = 1 k p i . The production corresponding to the randomly drawn number r (0,1] is then given by Aα k and hence, in the currently considered sentential form, the non-terminal symbol A is substituted by α k .

  • If there are no more non-terminal symbols, then the currently considered sentential form is equal to a word w L ( G ) ; w has been randomly generated.

Note that the choice of the production made in 3) according to the previously drawn random number is appropriate, since it is conform to the probability distribution on the grammar rules.

Example 0.2. Consider the language generated by the SCFG with productions ¾: Sϵ and ¼: S → (S). Thus, we start with the sentential form S, then consider the leftmost non-terminal symbol, which is given by S, and draw a random number r (0,1]. If 0 <r ≤ ¾, the production determined by r is Sϵ and thus, we get the empty word and are finished. Otherwise, ¾ <r ≤ ¾ + ¼ = 1, which means we have to consider A → (S) for the substitution in step 3) and thus obtain the sentential form (S). Afterwards, we must repeat the process, as there is still one non-terminal symbol left.

Unfortunately, there is one major problem that comes with this approach for the (non-uniform) random generation of combinatorial objects: The underlying (consistent) SCFG G implies a probability distribution on the whole language L ( G ) , such that we generate a word of arbitrary size. In order to fix the size, we can proceed along the following lines:
  1. 1)

    We translate the grammar G into a new framework which allows to consider fixed sizes for the random generation, such that

     
  2. 2)

    the distribution implied on L ( G ) conditioned on any fixed size n is kept within the new framework.

     

A well-known approach which allows for 1) is connected to the concept of admissible constructions used to describe a decomposable combinatorial class (see above). As the operations (like cartesian products, unions, and so on) used to construct the combinatorial objects are also used to define an order on them, it becomes possible to identify the i th object of a given size and the problem of generating objects uniformly at random reduces to the problem of unranking, that is the problem of constructing the object of order (rank) i, for i a random number (see e.g. [41]).

Remark. Some might think that with an appropriate SCFG (modeling a given class of objects) at hand, it is not really necessary to use an unranking method that implies cumbersome formalities such as admissible constructions and decomposable classes if we want to generate random objects of a fixed size n. As a matter of principle, they are right - we could also use a conditional sampling method: If we need to generate a word of size n from non-terminal symbol A, where there are m ≥ 1 rules f i = Aα i , 1 ≤ im, having left-hand side A, then we just need to choose the next production f i according to
Prob A α i * x | size ( x ) = n Prob A * x | size ( x ) = n ,

which is the posterior probability that we used production rule f i under the condition that a word of size n is generated.

Similarly, if the production rule is of the type ABC (assuming the grammar is in Chomsky normal form (CNF), which does not pose a problem, as an unambiguous SCFG can be efficiently transformed into CNF [39]), we can choose a way to split size n into sizes j and n - j for the lengths generated from non-terminal symbols B and C. This requires precomputing n length-dependent probabilities (i.e. all probabilities for generating a word of any length up to n) for each non-terminal symbol, which might seem to be similar (with respect to complexity) to precomputing all class sizes up to n for all considered combinatorial (sub)classes as needs to be done for unranking. However, there is a striking difference between the two approaches: While conditional sampling makes heavy use of rather small floating point values - with all the well-known problems and discomforting details like underflows or using logarithms associated with it - our unranking approach builds on integer values only which we assume a major advantage. There is another striking difference: length-dependent probabilities (which by the way yield a so-called length-dependent SCFG (LSCFG), see [42], and already have been used in [43]), require a very rich training set. In fact, if the RNA data set used for determining the distribution induced by the grammar is not rich enough, then the corresponding stochastic RNA model is underestimated and its quality decreases. This is especially a problem when considering comprehensive CFGs that distinguish between many different structural motifs in order to get a realistic picture of the molecules' behaviour; such a grammar should however be preferred over simple lightweight grammars as basis for a non-uniform random generation method. Nevertheless, this problem does not surface when sticking to conventional probabilities and the corresponding traditional SCFG model. Actually, since we consider a huge CFG where all possible structural motifs are created by distinct productions, we generally obtain realistic probability distributions and RNA models (see [21]).

Finally note that of course we could make use of random sampling strategies originally designed to sample structures connected to a given sequence in order to generate a random secondary structure only. However, such algorithms typically use a linear time to sample a single base pair (see, e.g., [6]) such that the time to sample a complete structure is quadratic in its length. This causes no problems for the original application of such algorithms since the sequence-dependent preprocessing which is part of their overall procedure is at least quadratic in time and thus the dominating part. Here our approach is of advantage (replacing a factor n by log(n)) and since our preprocessing only depends on the size of the structure to be generated it is performed once and stored to disk for later reuse. Last but not least we are not sure, if the different existing approaches just mentioned could easily be made as fast as ours by simple changes only.

Bottom line is that hooking up to unranking of combinatorial classes offers three significant benefit compared to conditional sampling, namely a fast sampling strategy, the usage of integers instead of floating point values and a greater independence of the richness of the training data (compared to length-dependent models). For this reason, we assume our unranking algorithm a valuable contribution, even though it requires a more cumbersome framework.

Unranking of Combinatorial Objects

The problem of unranking can easily be solved along the composition of the objects at hand, i.e. the operations used for its construction, once we know the number of possible choices for each substructure. Assume for example we want to unrank objects from a class C = A + B . We will assume all elements of A to be of smaller order than those of B (this way we use the construction of the class to imply an ordering). Finding the i th element of C , i.e. unranking class C , now becomes possible by deciding whether i < card ( A ) . In this case, we recursively call the unranking procedure for A . Otherwise (i.e. if i card ( A ) ), we consider B , searching for its ( i - card ( A ) ))th element.

Formally, we first need to specify an order on all objects of the considered combinatorial class that have the same size. This can be done in a recursive way according to the admissible specification of the class:

Definition 0.6 ( [37]). Neutral and atomic classes contain only one element, such that there is only one possible ordering. Furthermore, let < C n denote the ordering within the combinatorial class C n , then

  • If C = A 1 + . . . + A k and γ , γ C n , then γ < C n γ iff
    γ ( A i ) n and γ ( A j ) n and i < j or γ , γ ( A i ) n and γ < ( A i ) n γ .
  • If C = A × B and γ = ( α , β ) , γ = ( α , β ) C n , then γ < C n γ iff
    size ( α ) < size ( α ) or j = size ( α ) = size ( α ) and α < ( A ) j α or α = α and β < ( B ) n - j β

when considering the lexicographic order (1, 2, 3,..., n), which is induced by the specification C n = A 0 × B n + A 1 × B n - 1 + A 2 × B n - 2 + . . . + A n × B 0 .

  • If C = A × B and γ = ( α , β ) , γ = ( α , β ) C n , then γ γ < C n γ iff
    min ( size ( α ) , size ( β ) ) < min ( size ( α ) , size ( β ) ) or min ( size ( α ) , size ( β ) ) = min ( size ( α ) , size ( B ) ) and size ( α ) < size ( α ) or j = size ( α ) = size ( α ) and α < ( A ) j α or α = α and β < ( B ) n - j β

when considering the boustrophedon order 1 , n , 2 , n - 1 , . . . , n 2 , induced by the specification C n = A 0 × B n + A n × B 0 + A 1 × B n - 1 + A n - 1 × B 1 + . . .

Considering < C n , the actual unranking algorithms are quite straightforward. Therefore, they will not be presented here and we refer to [20, 44] for details.

Recall that in [20], the basic approach towards non-uniform random generation is weighting of combinatorial classes, as this makes it possible that the classes are non-uniformly distributed. If those combinatorial classes are to correspond to a considered SCFG, we have to face the problem that the maximum likelihood (ML) training introduces rational weights for the production rules while weighting as an admissible construction needs integer arguments.

When translating rational probabilities into integral weights, we have to assure that the relative weight of each (unambiguously) generated word remains unchanged. This can be reached by scaling all productions by the same factor (common denominator of all probabilities), while ensuring that derivations are of equal length for words of the same size (ensured by using grammars in CNF). However, a much more elegant way is to scale each production according to its contribution to the length of the word generated, that is, productions lengthening the word by k will be scaled by c k . Since we consider CFGs, the lengthening of a production of the form Aα is given by |α| - 1. However, this rule leads to productions with a conclusion of length 1 not being reweighted, hence we have to assure that all those productions already have integral weights. Furthermore, ϵ-productions need a special treatment. We don't want to discuss full details here and conclude by noticing that the reweighting normal form (RNF) keeps track of all possible issues:

Definition 0.7 ( [20]). If G = I , T , R , S , W is a WCFG, G is said to be in reweighting normal form (RNF) iff
  1. 1.

    G is loop-free and ϵ-free.

     
  2. 2.

    For all Aα R with A = S, we have |α| ≤ 1.

     
  3. 3.

    For all Aα R with AS, we have |α| > 1 or W(Aα) .

     
  4. 4.

    For all A I there exists α (I T)* such that Aα R.

     

Note that the last condition (that any intermediate symbol occurs as premise of at least one production) is not required for reweighting, but necessary for the translation of a grammar into an admissible specification.

Definition 0.8 ( [20]). A WCFG G is called loop-free iff there exists no nonempty derivation A + A for A I. It is called ϵ-free iff there exists no (A, ϵ) R with A = S and there exists no (A, α12) R, where ϵ denotes the empty word.

If G and G are WCFGs, then G and G are said to be word-equivalent iff L ( G ) = L ( G ) and for each word w L ( G ) , we have W(w) = W'(w).

In [20], it is shown how to transform an arbitrary WCFG to a word-equivalent, loop-free and ϵ-free grammar, that grammar to one in RNF and the latter to the corresponding admissible specification. Formally:

Theorem 0.1 ( [39]). If G is a SCFG, there exists a SCFG G in Chomsky normal form (CNF) that is word-equivalent to G , and G can be effectively constructed from G .

The construction given in [39] assumes that G is ϵ-free. It can however be extended to non-ϵ-free grammars by adding an additional step after the intermediate grammar G has been created (see e.g. [20]). Furthermore, it should be noted that an unambiguous grammar is inevitably loop-free.

Theorem 0.2 ( [20]). If G is a loop-free, ϵ-free WCFG, there exists a WCFG G in RNF that is word-equivalent to G and G can be effectively constructed from G .

Altogether, starting with an arbitrary unambiguous SCFG G 0 that models the class of objects to be randomly generated, we have to proceed along the following lines:

  • Transform G 0 to a corresponding ϵ-free and loop-free SCFG G 1 .

  • Transform G 1 into G 2 in RNF (where all production weights are rational).

  • Reweight the production rules of G 2 (such that all production weights are integral), yielding reweighted WCFG G 3 .

  • Transform G 3 (with integral weights) into the corresponding admissible specification.

  • This specification (with weighted classes) can be translated directly- into a recursion for the function size of all involved combinatorial (sub)classes (where class sizes are weighted) and
    • into generating algorithms for the specified (weighted) classes,

yielding the desired weighted unranking algorithm for generating random elements of L ( G 0 ) .

A small example that shows how to proceed from SCFG to reweighted normal form and the corresponding weighted combinatorial classes which allow for non-uniform generation by means of unranking is discussed in the Appendix.

Generating Random RNA Secondary Structures

We will now consider the previously discussed approach to construct a weighted unranking algorithm that generates random RNA secondary structures of a given size according to a realistic probability distribution. As for this paper, the corresponding probability distribution will be induced by a set of sample (SSU and LSU r)RNA secondary structures from the databases [45, 46], which will be referred to as biological database in the sequel. However, the presented algorithm can easily be used for any other distribution, which can be defined by a database of known RNA structures of a particular RNA type; our webservice implementation accessible at http://wwwagak.cs.uni-kl.de/NonUniRandGen is actually able to sample random secondary structures of any specified RNA type. A link to download an implementation of our algorithm (in Wolfram Mathematica) can be found there, too.

Considered Combinatorial Class

According to the common definition of RNA secondary structure, we decided to consider the combinatorial class of all RNA secondary structures without pseudoknots that meet the stereochemical constraint of hairpin loops consisting of at least 3 unpaired nucleotides, formally:

Definition 0.9 ( [21]). The language L containing exactly all RNA secondary structures is given by (note that according to this definition, completely unpaired structures are prohibited) L : = L u L l u + , where l u : = ( l ) u u : = { | } * is the language of all bar-bracket representations of single-stranded regions and L l is the language of all bar-bracket representations of other possible substructures, i.e. is the smallest language satisfying the following conditions:
  1. 1.

    { | } + \ { | , | | } L l (bar-bracket representations of hairpin loops).

     
  2. 2.

    If w L l , then ( w ) l (bar-bracket representation of a stacked pair).

     
  3. 3.

    If w L l , then { | } + ( w ) L l and ( w ) { | } + l (bar-bracket representations of bulge loops).

     
  4. 4.

    If w L l , then { | } + ( w ) { | } + L l (bar-bracket representations of interior loops).

     
  5. 5.

    If w 1 , . . . , w n L l and n ≥ 2, then L u ( w 1 ) L u ( w 2 ) L u ( w n ) L u L l (bar-bracket representations of multibranched loops).

     

The desired weighted unranking algorithm thus generates, for a given size n and a given number i { 0 , . . . , card ( L n ) - 1 } , the i th secondary structure s L n , where card ( L n ) = size ( L , n ) is the number of elements in the weighted class L n .

Considered SCFG Model

First, we have to find a suitable SCFG that generates L and models the distribution of the sample data as closely as possible. To reach this goal, it is important to appropriately specify the set of production rules in order to guarantee that all substructures that have to be distinguished are generated by different rules. This is due to the fact that by using only one production rule f to generate different substructures (e.g. any unpaired nucleotides independent of the type of loop they belong to), there is only one weight (the probability p f of this production f) with which any of these substructures is generated, whereas the use of different rules f1,..., f k to distinguish between these substructures implies that they may be generated with different probabilities p f 1 , . . . p f k , where p f 1 + . . . + p f k = p f . This way, we ensure that more common substructures are generated with higher probabilities than less common ones.

Example 0.3. A (rather simple) unambiguous SCFG G s generating the language L is given by:
w 1 : S s C A , w 2 : A ( B ) C , w 3 : A ( B ) C A , w 4 : B | | | C , w 5 : B C A , w 6 : C ϵ , w 7 : C | C .

This grammar unambiguously generates L for the following reasons:

  • Every sentential form C ( B ) C ( B ) ... ( B ) C obviously is generated in a unique way; this resembles L = L u L l u + and l u : = ( l ) u of L s definition. The number of outermost pairs of brackets in the entire string uniquely determines the corresponding sentential form to be used.

  • Now B either generates a hairpin-loop |≥3, which unambiguously is possible by rules B||| C, C| C and Cϵ, or

  • B itself has to generate at least one additional pair of brackets. In this case, BCA must be applied (only A can generate brackets) and then A( B ) C resp. A( B ) CA are used; the number of outermost brackets to be generated (from B under consideration) again uniquely determines that part of the derivation.

When changing the production w5: BCA used to generate any possible k-loop for k ≥ 2 (any loop that is not a hairpin loop) with probability w5 into the two rules
w 5 . 1 : B C ( B ) C , w 5 . 2 : B C ( B ) C A ,
where w5.1 + w5.2 = w5, it becomes possible to generate any possible 2-loop (i.e. a stacked pair, a bulge (on the left or on the right), or an interior loop) and all kinds of multiloops (i.e. any k-loop with k ≥ 3) with different probabilities, which could increase the accuracy of the SCFG model. By additionally replacing the first of these two new rules, w5.1 : BC ( B ) C, by the four productions
w 5 . 1 . 1 : B ( B ) , w 5 . 1 . 2 : B C ( B ) , w 5 . 1 . 3 : B ( B ) C , w 5 . 1 . 4 : B C ( B ) C ,

where (w5.1.1 + ... + w5.1.4) + w5.2 = w5.1 + w5.2 = w5, we can distinguish between the different types of 2-loops more accurately, yielding a more realistic secondary structure model. In fact, in the case of significant differences of the new probabilities (w5.1.1, ..., w5.1.4 and w5.2), we can expect a huge improvement in the model's accuracy. Note that it is not hard to see that changes to a grammar like the ones just discussed do not change the language generated. However, this is not at all obvious with respect to ambiguity of the grammar.

According to the previously mentioned facts (and the corresponding illustrations by Example 0.3), we decided that the basis for our weighted unranking algorithm should be the following e-free, loop-free and unambiguous (note that these are exactly the preliminary required conditions for the basis SCFG according to [20]) SCFG, which has been derived from the sophisticated SCFG presented in [21] that distinguishes between all known structural motifs that can be found in RNA secondary structure:

Definition 0.10. The unambiguous ϵ-free SCFG G ^ sto generating exactly the language L is given by Ĝ sto = I Ĝ sto , Ĝ sto , R Ĝ sto , S , where
I G ^ sto = { S , E , S , T , C , A , L , G , D , B , F , H , P , Q , R , V , W , O , J , K , M , X , Y , Z , N , U } ,
G ^ sto = { ( , ) , | } and R Ĝ sto contains exactly the following rules:
p ^ 1 : S E , p ^ 2 : E S , p ^ 3 : E S C , p ^ 4 : S A , p ^ 5 : S T A , p ^ 6 : T E , p ^ 7 : T C , shape of exterior loop p ^ 8 : C | , p ^ 9 : C C | , strands in exterior loop p ^ 10 : A ( L ) , initiate helix p ^ 11 : L A , p ^ 12 : L M , initiate stacked pair or multiple loop p ^ 13 : L P , p ^ 14 : L Q , p ^ 15 : L R , initiate interior loop p ^ 16 : L F , p ^ 17 : L G , initiate hairpin loop or bulge loop p ^ 18 : G A | , p ^ 19 : G A D , p ^ 20 : G | A , p ^ 21 : G D A , shape of bulge loop p ^ 22 : D B | , p ^ 23 : B | , p ^ 24 : B B | , strands in bulge loop p ^ 25 : F | | | , p ^ 26 : F | | | | , p ^ 27 : F | | | | H , p ^ 28 : H | , p ^ 29 : H H | , hairpin loop p ^ 30 : P | A | , p ^ 31 : P | A | | , p ^ 32 : P | | A | , p ^ 33 : P | | A | | , small interior loops p ^ 34 : Q | | O | | , p ^ 35 : Q | | V | , p ^ 36 : R | O | | , p ^ 37 : R | | W | , p ^ 38 : V J O , p ^ 39 : W J A , p ^ 40 : O A K , other interior loops p ^ 41 : J | , p ^ 42 : J J | , p ^ 43 : K | , p ^ 44 : K K | , strands in interior loop p ^ 46 : M X Y , p ^ 46 : X A , p ^ 47 : X U A , p ^ 48 : Y Z , p ^ 49 : Z X , p ^ 50 : Z X N , p ^ 51 : N Z , p ^ 52 : N U , multiple loop p ^ 53 : U | , p ^ 54 : U U | , strands in multiple loop

Figures 2 and 3 illustrate by examples how (parts of) secondary structures are generated by this SCFG, where we used https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_IEq89_HTML.gif to denote the full parse tree for I * x (i.e. for consecutive applications of an arbitrary number of production rules that generate the subword x from the intermediate symbol I) in oder to obtain a more compact tree representation. In fact, it is easy to see that the overall structure is always produced by starting with the axiom S', while any particular substructure or structural motif that belongs to the combinatorial (sub)class I is created from the corresponding intermediate symbol I.

For our application it is crucial that Ĝ sto - as claimed its definition - is unambiguous. To prove this, we first note that Ĝ sto has been constructed starting from a simple grammar which generates L by iteratively replacing one production by several ones (like we did in the previous example) in order to distinguish more and more structural motifs but without changing the language generated. Furthermore, a standard construction to make the grammar ϵ-free has been applied. That way, we can be sure that Ĝ sto generates L (formally this fact easily follows by obvious bi-simulation proofs for each substitution and by the proven correctness of the used construction to ensure ϵ-freeness). To prove unambiguity, we translate Ĝ sto into a system of equations for its structure generating function (see [47] for details) S [ z ] = w L d ( w ) z | w | where d(w) denotes the number of derivation trees Ĝ sto offers for w. Eliminating all but the variable associated with the axiom and simplifying (for this step we made use of Mathematica) yields the single equation
https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig2_HTML.jpg
Figure 2

Unique parse tree for the bar-bracket word considered in Example 0.1 that corresponds to the planar secondary structure from Figure 1.

https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig3_HTML.jpg
Figure 3

Particular subtrees of the tree presented in Figure 2.

z 5 + S [ z ] ( 1 + z ) ( 1 + z ( 2 S [ z ] ( 1 + z ) z + z 4 ) ) = 0

This equation is exactly the one of our grammar G s from Example 0.3 which proves that for all n both grammars have the same number of derivation trees for words of size n. Knowing that both grammars generate L and that G s is unambiguous, the same can be concluded for Ĝ sto .

Note that Ĝ sto contains more production rules (and more different non-terminal symbols) than the SCFG considered in [21], but this new grammar is ϵ-free and additionally, the right-hand side of every single production contains at most two non-terminal symbols, such that the resulting unranking algorithm has to consider less cases (i.e. less "else if ( )" cases). For details, see [20] and the Appendix.

Furthermore, it should be mentioned that we decided to assign relative frequencies to the production rules of Ĝ sto , since such probabilities can be computed efficiently for unambiguous SCFGs. Moreover, by estimating the probabilities p ^ i , 1 i 54 , by their relative frequencies, the resulting grammar Ĝ sto has the consistency property, which means Ĝ sto provides a probability distribution on the language L ( Ĝ sto ) = L . In particular, it is well-known that relative frequencies in our context yield a maximum likelihood (ML) estimator for the rule probabilities and thus a consistent estimator for the parameter set. We have trained the probabilities (relative frequencies) of Ĝ sto from the structures s L Ĝ sto given in our biological database. The resulting probabilities are given in Table 1, their floating point approximations, rounded to the third decimal place in Table 2.
Table 1

The probabilities (relative frequencies) for the production rules of the SCFG Ĝ sto obtained by training it using our biological database

Nonterminal Nt

Probabilities of Rules with Premise Nt

S'

p ^ 1 : = 1

E

p ^ 2 : = 137 6476 , p ^ 3 : = 6339 6476 ,

S

p ^ 4 : = 177 12952 , p ^ 5 : = 12775 12952 ,

T

p ^ 6 : = 11086 12775 , p ^ 7 : = 1689 12775 ,

C

p ^ 8 : = 14367 148978 , p ^ 9 : = 134611 148978 ,

A

p ^ 10 : = 1 ,

L

p ^ 11 : = 605069 792975 , p ^ 12 : = 31912 792975 , p ^ 13 : = 4912 264325 , p ^ 14 : = 5821 158595 ,

 

p ^ 15 : = 1893 264325 , p ^ 16 : = 2723 31719 , p ^ 17 : = 38399 792975 ,

G

p ^ 18 : = 11667 38399 , p ^ 19 : = 7235 38399 , p ^ 20 : = 11831 38399 , p ^ 21 : = 7666 38399 ,

D

p ^ 22 : = 1 ,

B

p ^ 23 : = 4967 12748 , p ^ 24 : = 7781 12748 ,

F

p ^ 25 : = 3912 68075 , p ^ 26 : = 23208 68075 , p ^ 27 : = 8191 13615 ,

H

p ^ 28 : = 8191 40700 , p ^ 29 : = 32509 40700 ,

P

p ^ 30 : = 533 4912 , p ^ 31 : = 1053 4912 , p ^ 32 : = 2963 14736 , p ^ 33 : = 7015 14736 ,

Q

p ^ 34 : = 4986 29105 , p ^ 35 : = 24119 29105 ,

R

p ^ 36 : = 2357 5679 , p ^ 37 : = 3322 5679 ,

V

p ^ 38 : = 1 ,

W

p ^ 39 : = 1 ,

O

p ^ 40 : = 1 ,

J

p ^ 41 : = 27441 84620 , p ^ 42 : = 57179 84620 ,

K

p ^ 43 : = 15731 53725 , p ^ 44 : = 37994 53725 ,

M

p ^ 45 : = 1 ,

X

p ^ 46 : = 6196 87035 , p ^ 47 : = 80839 87035 ,

Y

p ^ 48 : = 1 ,

Z

p ^ 49 : = 2812 55123 , p ^ 50 : = 52311 55123 ,

N

p ^ 51 : = 7737 17437 , p ^ 52 : = 9700 17437 ,

U

p ^ 53 : = 109939 518817 , p ^ 54 : = 408878 518817 ,

Table 2

Floating point approximations of the probabilities (relative frequencies) for the production rules of the SCFG Ĝ sto (rounded to three decimal places)

Nonterminal Nt

Probabilities of Rules with Premise Nt

S'

p ^ 1 : = 1 . 000 ,

E

p ^ 2 : = 0 . 021 , p ^ 3 : = 0 . 979 ,

S

p ^ 4 : = 0 . 014 , p ^ 5 : = 0 . 986 ,

T

p ^ 6 : = 0 . 868 , p ^ 7 : = 0 . 132 ,

C

p ^ 8 : = 0 . 096 , p ^ 9 : = 0 . 904 ,

A

p ^ 10 : = 1 . 000

L

p ^ 11 : = 0 . 763 , p ^ 12 : = 0 . 040 , p ^ 13 : = 0 . 019 , p ^ 14 : = 0 . 037 ,

 

p ^ 15 : = 0 . 007 , p ^ 16 : = 0 . 086 , p ^ 17 : = 0 . 048 ,

G

p ^ 18 : = 0 . 304 , p ^ 19 : = 0 . 188 , p ^ 20 : = 0 . 308 , p ^ 21 : = 0 . 200 ,

D

p ^ 22 : = 1 . 000

B

p ^ 23 : = 0 . 390 , p ^ 24 : = 0 . 610 ,

F

p ^ 25 : = 0 . 057 , p ^ 26 : = 0 . 341 , p ^ 27 : = 0 . 602 ,

H

p ^ 28 : = 0 . 201 , p ^ 29 : = 0 . 799 ,

P

p ^ 30 : = 0 . 109 , p ^ 31 : = 0 . 214 , p ^ 32 : = 0 . 201 , p ^ 33 : = 0 . 476 ,

Q

p ^ 34 : = 0 . 171 , p ^ 35 : = 0 . 829 ,

R

p ^ 36 : = 0 . 415 , p ^ 37 : = 0 . 585 ,

V

p ^ 38 : = 1 . 000

W

p ^ 39 : = 1 . 000

O

p ^ 40 : = 1 . 000

J

p ^ 41 : = 0 . 324 , p ^ 42 : = 0 . 676 ,

K

p ^ 43 : = 0 . 293 , p ^ 44 : = 0 . 707 ,

M

p ^ 45 : = 1 . 0000

X

p ^ 46 : = 0 . 071 , p ^ 47 : = 0 . 929 ,

Y

p ^ 48 : = 1 . 0000

Z

p ^ 49 : = 0 . 051 , p ^ 50 : = 0 . 949 ,

N

p ^ 51 : = 0 . 444 , p ^ 52 : = 0 . 556 ,

U

p ^ 53 : = 0 . 212 , p ^ 54 : = 0 . 788 .

In oder to see if over-fitting is an issue for our sophisticated grammar and its rich parameter set, i.e. to see if our training set is large enough to derive reliable values for the rule probabilities, we performed the following experiments: We selected a random 90% (resp. 50%) portion of the original training set and re-estimated the probabilities of all the grammar rules. This process was iterated 40 times, resulting in a sample of 40 parameter sets. Finally, for each parameter we determined its variance along this sample of size 40. The corresponding values lay between 0 (resulting for intermediate symbols without alternatives; for whose productions a probability of 1 is predetermined) and 2.87652 × 10-6 (resp. 2.86242 × 10-5). We can conclude that over-fitting is no issue in connection with our sophisticated grammar and the training set used.

Derivation of the Algorithm

The elaborate SCFG Ĝ sto is appropriate for being used as the basis for the desired weighed unranking method: after having determined the RNF of this SCFG and the corresponding weighted combinatorial classes, we easily find a recursion for the size function (in the same ways as discussed in Example App-.4). Then, we can use the resulting weighted class sizes for the straightforward construction of the desired unranking algorithm.

In fact, for the construction of the complete algorithm, we simply have to use Algorithms 1 to 4 (Unranking of neutral classes, atomic classes, disjoint unions and cartesian products, respectively) and Algorithm 6 (Unranking of weighted classes) given in [20] as subroutines. However, to improve the worst-case complexity of the resulting unranking procedure from O ( n 2 ) to O n log ( n ) by using the boustrophedonic order instead of the sequential order, a simple change in Algorithm 4 (Unranking of cartesian products) is neccessary (see e.g. [7]).

A random RNA secondary structure of size n can easily be computed by drawing a random number i { 0 , ,  size ( L , n ) - 1 } and then unranking the i th structure of size n. The worst-case runtime complexity of this procedure is equal to that of unranking and is thus given by O n log ( n ) when using the boustrophedonic order. By repeating this procedure m times, a set of m (not necessarily distinct) random RNA secondary structures of size n can be generated in time O m n log ( n ) , where a preprocessing time of O ( n 2 ) is required for the computation of all (weighted) class sizes up to input length n.

A complete and detailed description of the derivation of our weighted unranking algorithm for (SSU and LSU r)RNA secondary structures can be found in the Appendix, since it is too comprehensive to be presented here and the different steps for its generation correspond to those described in [20].

Availability of Software

It may be of interest to the reader that this non-uniform random generation algorithm for RNA secondary structures has been implemented as a webservice which is accessible to the scientific community under http://wwwagak.cs.uni-kl.de/NonUniRandGen. Since it is relevant for researchers to have methods available for generating random structures that are realistic for a particular investigation, this webservice is also capable of allowing the user to specify the distribution from which the corresponding structures should be sampled (in the form of a set of secondary structure samples from which the parameters for our grammars are inferred). Furthermore, our Mathematica source code used to implement the webservice can be downloaded from our website and used under GNU public licence.

Discussion

The purpose of this section is to analyze the quality of randomly generated structures by considering some experimental results.

Parameters for Structural Motifs

As a first step, we decided to consider several important parameters related to particular structural motifs of RNA secondary structure and compare the observed statistical values derived from a native sample (here our biological database, i.e. the set of real-life RNA data that we used for deriving the distribution and thus the weights for the unranking algorithm) to those derived from a corresponding random sample (i.e. a set of random structures generated by our algorithm). In order to obtain an appropriate random sample, we have generated exactly one random structure of size n for each native RNA structure of size n given in our database, such that for each occurring size n, the random sample and the native sample contain the same number of structures having this size.

The determined results are presented in Table 3. Comparing the specific values of all different parameters, we can guess that our algorithm produces random RNA secondary structures that are, related to the different structural motifs and thus related to the expected shape of such structures, in most cases realistic. Obviously, this is a major improvement over existing approaches for the random generation of secondary structures of a given input size n (where the corresponding specific RNA sequence is not known, but only its length n), as those (sequence-independent) methods are only capable of generating structures uniformly at random for input size n. Furthermore, with the SCFG model used here, we have an new model for RNA secondary structures at hand which realistically reflects the structure of an RNA molecule and its basic structural motifs.
Table 3

Expectation and variance of important parameters related to particular structural motifs of RNA secondary structure

Parameter

Expected Value

Variance

 

Random

Native

Random

Native

numunp

848.179

839.956

98964.7

103426.

numbps

420.848

424.96

27785.3

31310.9

numurs

179.73

181.822

4959.96

5117.47

nume

1.

1.

0.

0.

numh

36.6983

36.4818

196.935

185.596

nums

321.18

324.26

16538.8

19343.4

numb

20.6061

20.5782

87.1894

50.3103

numi

26.1442

26.538

125.66

194.769

numm

16.2197

17.1018

57.8874

41.0261

numhel

99.6683

100.7

1549.24

1492.84

unpe

106.014

79.8382

4039.69

3897.61

unph

6.93534

6.93188

18.4264

77.464

unps

--

--

--

--

unpb

1.9948

1.99596

3.10283

6.87868

unpi

7.14617

7.08869

16.5725

31.1197

unpm

16.0122

16.2577

87.4906

195.497

unphel

--

--

--

--

bpse

9.41479

6.94105

29.1956

6.30949

bpsh

--

--

--

--

bpss

1.

1.

0.

0.

bpsb

1.

1.

0.

0.

bpsi

1.

1.

0.

0.

bpsm

2.68212

2.72734

1.12921

1.21643

bpshel

4.22249

4.22006

13.6266

5.52299

Values are derived from a native sample (our biological database) and from a random sample, respectively. num x denotes the number of occurrences of motif x in one secondary structure and unp x (bps x ) denotes the number of accessible unpaired bases (base pairs) in one substructure of type x. unp, bps, urs denote unpaired bases, base pairs and unpaired regions, whereas e, h, s, b, i, m, hel denote exterior loop, hairpin loop, stacked pair, bulge loop, interior loop, multiloop and helix, respectively.

Related Free Energies

For further investigation on the accuracy of our random generator, we take on a completely different point of view and consider thermodynamics. The reason behind this idea is that if an RNA secondary structure model induced by a SCFG shows a realistic behaviour (expectation and variance) with respect to minimum free energy, then it is rather likely that our grammar also shows a realistic picture for all the different structural motifs of a molecule's folding (as the free energy of a molecule's structure is defined as the sum of the energy contributions of all its substructures).

Since we do not know the corresponding RNA sequences for the randomly generated structures, we can not use one of the common sequence-dependent thermodynamic models for RNAs. Therefore, we decided to consider both the static and dynamic free energy models (in the static model, averaged free energy contributions for the distinguished structural motifs are considered which can easily be derived from the training data (by sequence counting). These averaged values actually represent the free energy contributions that have to be added for the respective whole substructures. For the dynamic model, corresponding average values for length-dependent free energy contributions (that depend on the number of unpaired or paired bases within particular substructures) are added for each component (unpaired base or base pair) in the respective motifs, such that in contrast to the static model, substructures of different lengths are assigned different free energy values) defined in [21] for RNA secondary structures with unknown sequence. These models are based on the well-known Turner energy model [22, 23] and model parameters have been derived from the same biological database (of SSU and LSU rRNAs) that we consider in this article. In fact, both models have turned out to show a realistic behaviour and can therefore be used to judge the quality of random structures generated by our algorithm.

Unquantified Results

Similar to [21], we denote the free energy of a given secondary structure s L according to the static and dynamic model by g stat (s) and g dyn (s), respectively. Moreover, the expected free energy and corresponding variance that have been analytically derived in that paper for any n > 0 are denoted by μ e n e r g y , n : = E e n e r g y ( s )  size ( s ) = n and σ e n e r g y , n 2 : = V e n e r g y ( s )  size ( s ) = n , respectively, where energy {g stat , g dyn }. The corresponding confidence interval for n > 0 and k > 1, which contains at least 100 - 100 k 2 percent of the energies in { e n e r g y ( s ) s L n } is denoted by I energy,n (k): = (μ energy,n - k σ energy ,n, μ energy,n + k σ energy,n ). As these analytical energy results from [21] and our unranking algorithm have been derived from the same database of real-life RNA data and by modeling the same class L of structures via very similar SCFGs, it seems adequate to use them for comparisons with the energies of our randomly generated structures.

Before we start with our comparisons, note that for any sample set S of secondary structures, we can calculate the corresponding energy points E P ( S , e n e r g y ) : = { ( size ( s ) , e n e r g y ( s ) ) s S } , where energy {g stat , g dyn }. Obviously, we can also compute the corresponding "average energy points" A v E P ( S , e n e r g y ) : = ( n , μ n : = 1 card ( S n ) s S n e n e r g y ( s ) ) S n and the corresponding "energy variance points" V a r E P ( S , e n e r g y ) : = { ( n , σ n 2 : = 1 card ( S n ) s S n ( μ n e n e r g y ( s ) ) 2 S n } , respectively. In the sequel, we will denote a random sample generated by our algorithm by and a native sample (biological database) by N .

In order to obtain an appropriate random sample for our energy comparisons, we derived a large set of random structures by generating 1000 RNA secondary structures for each of the sizes n {500,1000,1500,..., 5000, 5500} with our weighted unranking algorithm. To compare the energies of our randomly generated structures to the corresponding confidence interval(s), we decided to consider any k { 2 , 2 , 10 , 20 } , meaning the probability that the free energy of a random RNA secondary structure of size n lies within the corresponding interval is greater than 0.5, 0.75, 0.9, and 0.95, respectively.

Figure 4 shows a plot of the corresponding four confidence intervals (analytically derived, related to our biological data) along with the energy points for our random sample and for our native database, respectively, under the assumption of the static energy model. The corresponding plots for the dynamic energy model are shown in Figure 5. Looking at both figures, we immediately see that the energies for our set of randomly generated RNA secondary structures seem to fit to the ones for the considered RNA database and also to the corresponding analytically obtained energy results from [21]. This observation becomes even more clear by considering Figures 6 and 7. There, we compare the previously introduced "average energy points" and "energy variance points" to the analytically determined expected free energy and corresponding variance from [21], respectively.
https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig4_HTML.jpg
Figure 4

Plots of the confidence intervals I g s t a t , n ( k ) . Intervals are shown for the static energy model (blue), for k { 2 , 2 , 10 , 20 } (top left to bottom right), together with the corresponding energy points E P ( , g s t a t ) for the random sample (cyan) and E P ( N , g s t a t ) for the native sample (green).

https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig5_HTML.jpg
Figure 5

Plots of the confidence intervals I g d y n , n ( k ) . Intervals are shown for the dynamic energy model (purple), for k { 2 , 2 , 10 , 20 } (top left to bottom right), together with the corresponding energy points E P ( , g d y n ) for the random sample (magenta) and E P ( N , g d y n ) for the native sample (yellow).

https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig6_HTML.jpg
Figure 6

Plot of expectations of the free energy. Plots show μ g s t a t , n (blue) and μ g d y n , n (purple) of a random RNA secondary structure of size n, together with the "average energy points" A v E P ( , g s t a t ) (cyan) and A v E P ( , g d y n ) (magenta) for the random sample.

https://static-content.springer.com/image/art%3A10.1186%2F1748-7188-6-24/MediaObjects/13015_2011_Article_140_Fig7_HTML.jpg
Figure 7

Plot of variances of the free energy. Plots show σ g s t a t , n 2 (blue) and σ g d y n , n 2 (purple) of a random RNA secondary structure of size n, together with the "energy variance points" V a r E P ( , g s t a t ) (cyan) and V a r E P ( , g d y n ) (magenta) for the random sample.

Quantified Results

The previously considered energy comparisons have been presented only by unquantified plots. This may not be very satisfying, since it is obvious that the free energy would decrease with structure size and aside from this, it could have been expected that for large randomly generated sets of structures of a given size, the average energy and corresponding variance fit the analytically obtained energy results derived under the assumption of a basically equivalent SCFG model for secondary structures. Therefore, there is a need to consider some sort of quantification and additionally present corresponding quantified comparison results. What really matters is the degree to which the energy ranges of the random structures agree, in distribution, with our biological database. This means we have to find out if the energies related to a random sample (generated by our unranking method) and those related to a native sample (given by the structures in our biological database) come from a common distribution. Consequently, we have to consider the energies of a random sample and those of a native one as two independent sets of values and determine the extend to which their distributions coincide, or in other words to test for significant differences between these two sets. For this reason, we decided to apply one of the most common (non-parametric) significance tests known from statistics, the so-called Mann-Whitney U-test [48], which is widely used as statistical hypothesis test for assessing whether two independent samples of observations (with arbitrary sample sizes) come from the same distribution. It is also known as the Wilcoxon rank-sum test [49] which however can only be applied for equal sample sizes.

Formally, this test is used to check whether the null hypothesis N0 - which states that the two independent samples X and Y are identically distributed (i.e. F(X) = F(Y)) - can be accepted or else, has to be rejected. More specifically, the result of such a test, the so-called p-value, is a probability answering the following question: If the two samples really have the same distribution, what is the probability that the observed difference is due to chance alone? In other words, were the deviations (differences between the two samples) the result of chance, or were they due to other factors and how much deviation can occur before one must conclude that something other than chance causes the differences? The p-value is called statistically significant if it is unlikely that the differences occurred by chance alone, according to a preliminary chosen threshold probability, the significance level α (common choices are e.g. α {0.10,0.05,0.01}). If pα, the deviation is small enough that chance alone accounts for it; this is within the range of acceptable deviation. If p < α, we must conclude that some factor other than chance causes the deviation to be so great, this will lead us to decide that the two sets come from different distributions.

For our analysis, we again decided to generate the same numbers of random structures for any size as are given for this size in our biological database, such that random and native sample contain the same numbers of structures for any occurring size (and hence the sample sizes are equal). Moreover, note that the unquantified results presented in Figures 4 and 5 might yield the assumption that for any structure size, some energy values of randomly generated structures are scattered too widely around the corresponding expected value, such that those randomly drawn secondary structures can not be considered realistic (neither with respect to thermodynamics nor with respect to structural composition and expected shape). In an attempt to disprove that assumption, we decided to perform a series of Wilcoxon tests by considering a number of different random samples. These samples are created by obeying a specified energy-based rejection scheme: Do not add a randomly generated structure of a given size to the sample if its free energy (according to the static or dynamic model or according to both models) lies outside the corresponding confidence interval(s). Formally, for any preliminary chosen value k > 1, a generated structure s L n is added to the random sample iff
[ g s t a t I g s t a t , n ( k )  (variant “static”) ]  or  [ g d y n I g d y n , n ( k )  (variant “dynamic”) ]  or  [ g s t a t I g s t a t , n ( k )  and  g d y n I g d y n , n ( k )  (variant “both”) ] ;

otherwise it is rejected. This means we accept only a specified deviation of the energy energy (s) of the random structure s from the corresponding expected free energy μ energy,n and reject structures whose energy differs too much from the expected value. Note that for k = ∞ (confidence interval I energy,n (k) contains 100 percent of the energies energy(s) of all s L n ), no structures are rejected. Hence, in this case, the corresponding random sample corresponds to the usual (unrestricted) output of our algorithm.

The Wilcoxon test results for our native sample together with any of a number of random sample sets generated in the previously described restricted manner, respectively, can be found in Table 4. As we can see, the best results are achieved for the unrestricted sample sets, where all free energies of randomly generated structures were allowed during the sample creation process. Moreover, these two results (for the unrestricted case k = ∞) are not statistically significant when considering the common significance level α = 0.05, that is in both cases, we can assume that the energies of the random structures and those of the biological data follow a common distribution. These observations indicate that our weighted unranking algorithm produces random RNA secondary structures that are - related to the free energy of such structures (in expectation and variation) - in expectation realistic.
Table 4

Significance results for statistical hypothesis testing, computed by the Wilcoxon rank-sum method

Chosen Value of k

Percent Within Corr. Interval

Models Used for Rejection

Model for Native Energies

Model for Random Energies

Resulting Wilcoxon p-Value (approx.)

  

Dynamic

Dynamic

Dynamic

0.0008438

10 3 11 1 . 00504

1

Static

Static

Static

1.872·10-9

  

Both

Dynamic

Dynamic

0.000507

  

Both

Static

Static

1.851·10-10

  

Dynamic

Dynamic

Dynamic

0.001567

2 5 19 1 . 02598

5

Static

Static

Static

1.454·10-10

  

Both

Dynamic

Dynamic

0.0002654

  

Both

Static

Static

1.009·10-9

  

Dynamic

Dynamic

Dynamic

0.001374

10 3 1 . 05409

10

Static

Static

Static

3.526·10-9

  

Both

Dynamic

Dynamic

0.0004116

  

Both

Static

Static

9.018·10-10

  

Dynamic

Dynamic

Dynamic

0.003618

2 3 1 . 15470

25

Static

Static

Static

2.530·10-7

  

Both

Dynamic

Dynamic

0.001228

  

Both

Static

Static

1.162·10-7

  

Dynamic

Dynamic

Dynamic

0.02394

2 1 . 41421

50

Static

Static

Static

1.278·10-6

  

Both

Dynamic

Dynamic

0.001389

  

Both

Static

Static

1.515 10-7

  

Dynamic

Dynamic

Dynamic

0.1184

2

75

Static

Static

Static

0.001034

  

Both

Dynamic

Dynamic

0.0495

  

Both

Static

Static

0.0009445

100

--

Dynamic

Dynamic

0.4007

  

--

Static

Static

0.08961

Besides that, it is obvious that the computed p-values are much better for the dynamic energy model than for the static one. This underlines the suggestion made in [21] that, although both energy models have been proven to be realistic, due to the more realistic variation of free energies connected to varying loop length, the dynamic model should be used for possible applications. Since at least for the dynamic model, the random data fit very nicely with the native data, we can conclude that structures generated by our non-uniform random generation algorithm behave realistic with respect to free energies and - as the energy of the overall structure is assumed to be equal to the sum of the substructure energies - rather likely also with respect to appearance of the different structural motifs of RNA molecules.

Conclusion

Altogether, we can finally conclude that the non-uniform random generation method proposed in this article produces appropriate output and may thus be used (for research issues as well as for practical applications) to generate random RNA secondary structures. In fact, for any arbitrary type of (pseudoknot-free) RNA, a corresponding random sampler can be derived in the presented way. Actually, our webservice can be used for generating random secondary structures of any specified type of RNA. It just requires a database of known structures for the respective RNA type as input.

Note that in this work, we abstract from sequence and consider only the structure size as input for our algorithm. Thus, an interesting problem for future research would be to find a way to extend the presented realistic SCFG model to additionally deal with RNA sequence. In fact, this work and especially the considered elaborate SCFG could mark some sort of stepping stone towards new stochastic RNA secondary structure prediction methods realized by statistical random sampling.

Appendix

How to Construct a Weighted Unranking Algorithm from a Given SCFG

The purpose of this section is to give a rather small example for applying the construction scheme described in detail in [20] to proceed from an arbitrary SCFG to reweighted normal form (RNF) and then to the corresponding weighted combinatorial classes which allow for non-uniform generation by means of unranking.

Example App-.4. Let us consider the SCFG G d , which contains the following rules:
w 1 : S B , w 2 : B ( B ) , w 3 : B | C , w 4 : C ϵ , w 5 : C | C .
To apply the approach presented in [20] to transform a given SCFG to RNF, the grammar needs to be ϵ-free and loop-free. Thus, we first have to transform grammar G d into the following one:
w ^ 1 : S B , w ^ 2 : B ( B ) , w ^ 3 : B C , w ^ 4 : C | , w ^ 5 : C | C .
The transformation of G d into RNF now works as follows: First, we have to gather all possible chains AA1A2 → ... → α, where AS and |α| = 1. These chains are BC, BC| and C → |; the rules BC and C| are then removed. Second, we have to replace each of these chains by a specific new rule. In fact, we have to add BC,ϵC, B|,C→ | and C|,ϵ| to the new set of productions. Consequently, our new rule set is now given by
w ^ 1 : S B , w ^ 2 : B ( B ) , w ^ 5 : C | C , 1 : B C , ϵ C , 1 : B | , C | , 1 : C | , ϵ | .
Third, for each occurrence of a non-terminal symbol A in the conclusion of a production and each previously added new rule A α , A 1 A 2 α corresponding to a chain AA1A2 → ... → α, add a specific new rule. This way, we obtain the following production set:
w ^ 1 : S B , w ^ 1 w ^ 3 : S B C , ϵ , w ^ 1 w ^ 3 w ^ 4 : S B | , C , w ^ 2 : B ( B ) , w ^ 2 w ^ 3 : B ( B C , ϵ ) , w ^ 2 w ^ 3 w ^ 4 : B ( B | , C ) , w ^ 5 : C | C , w ^ 5 w ^ 4 : C | C | , ϵ , 1 : B C , ϵ C , 1 : B | , C | , 1 : C | , ϵ | .

Fourth, each intermediate symbol that no longer occurs as premise in any of the productions has to be removed and fifth, each production of the form Sα, where S is the axiom and |α| > 1 has to be changed in a specific way. However, since in our case, there is obviously nothing left to do, the transformation of G d into RNF is finished.

For G d (in RNF), where all production weights are rational, we can determine the common denominator s of the weights of productions with premise S, as well as the common denominator c of the weights of the remaining productions (i.e., of the productions with premise B or C). Then, the reweighting of the production rules of (the RNF of) G d is done by multiplying the weights of productions with source S by s, and the weights of the other productions Aα, where AS, by the factor c|α|-1. After that, we obtain the following reweighted grammar G d :
w 1 : S B , w 2 : S B C , ϵ , w 3 : S B | , C , w 4 : B ( B ) , w 5 : B ( B C , ϵ ) , w 6 : B ( B | , C ) , w 7 : C | C , w 8 : C | C | , ϵ , 1 : B C , ϵ C , 1 : B | , C | , 1 : C | , ϵ | ,

where each W i , 1 i 8 , is integral.

The (now weighted) grammar can easily be translated into a corresponding admissible specification, which includes the weighting of all involved combinatorial (sub)classes, as described earlier. For the reweighted grammar G d , this specification is given by the following equations:
S 1 = B , S 2 = B C , ε , S 3 = B | , C , B 1 = Z ( × B × Z ) , B 2 = Z ( × B C , ε × Z ) , B 3 = Z ( × B | , C × Z ) , C 1 = Z | × C , C 2 = Z | × C | , ε , B C , ε = C , B | , C = Z | , C | , ε = Z | , S = w 1 S 1 + w 2 S 2 + w 3 S 3 , B = w 4 B 1 + w 5 B 2 + w 6 B 3 , C = w 7 C 1 + w 8 C 2 ,
which can be simplified in the following way:
1 = Z ( × × Z ) , 2 = Z ( × C × Z ) , 3 = Z ( × Z | × Z ) , C 1 = Z | × C , C 2 = Z | × Z | , S = w 1 + w 2 C + w 3 Z | , = w 4 1 + w 5 2 + w 6 3 , C = w 7 C 1 + w 8 C 2 .
As described earlier, this specification (with weighted classes) derived from reweighted grammar G d transforms immediately into a recursion for the function size of all needed combinatorial classes. For G d , the recursion for the function size has the following form:
size ( I , n ) : = size ( B , n - 2 ) I = B 1 , size ( C , n - 2 ) I = B 2 , 1 I = B 3  and  n = 3 , size ( C , n - 1 ) I = C 1 , 1 I = C 2  and  n = 2 , w 1 size ( B , n ) + w 2 size ( C , n ) + w 3 1 I = S  and  n = 1 , w 1 size ( B , n ) + w 2 size ( C , n ) + w 3 0 I = S  and  n 1 , w 4 size ( B 1 , n ) + w 5 size ( B 2 , n ) + w 6 size ( B 3 , n ) I = B , w 7 size ( C 1 , n ) + w 8 size ( C 2 , n ) I = C , 0 else .

This recursive size function (with weighted class sizes) can now be used for the straightforward construction of a corresponding algorithm for the non-uniform generation of elements of L ( G d ) by means of unranking, as proposed in [20].

Derivation of the Algorithm

In this section, we give a complete and detailled description of the derivation of our weighted unranking algorithm for RNA secondary structures. The different steps are made according to the approach described in [20] to get an unranking algorithm that generates random RNA secondary structures of a given size n according to the distribution on all these structures.

Considered (unambiguous, ϵ-free and loop-free) SCFG

First, note that in [21], to obtain the stochastic model for RNA secondary structures derived from real-world RNA data, the following unambiguous SCFG which unambiguously generates exactly the language L given in Definition 0.9 has been used:

Definition App-.11. The unambiguous SCFG G sto generating exactly the language L is given by G sto = ( I G sto , G sto , R G sto , S ) , where
I G s t o = { S , T , C , A , L , G , B , F , H , P , Q , R , J , K , M , N , U } ,
Σ G sto = { ( , ) , | } and R G sto contains exactly the following rules:
p 1 : S T A C , p 2 : T T A C , p 3 : T C , p 4 : C C | , p 5 : C ϵ , p 6 : A ( L ) , p 7 : L ( L ) , p 8 : L M , p 9 : L P , p 10 : L Q , p 11 : L R , p 12 : L F , p 13 : L G , p 14 : G ( L ) | , p 15 : G ( L ) B | | , p 16 : G | ( L ) , p 17 : G | | B ( L ) , p 18 : B B | , p 19 : B ϵ , p 20 : F | | | , p 21 : F | | | | , p 22 : F | | | | | H , p 23 : H H | , p 24 : H ϵ , p 25 : P | ( L ) | , p 26 : P | ( L ) | | , p 27 : P | | ( L ) | , p 28 : P | | ( L ) | | , p 29 : Q | | ( L ) K | | | , p 30 : Q | | | J ( L ) K | | , p 31 : R | ( L ) K | | | , p 32 : R | | | J ( L ) | , p 33 : J J | , p 34 : J ϵ , p 35 : K K | , p 36 : K ϵ , p 37 : M U ( L ) U ( L ) N , p 38 : N U ( L ) N , p 39 : N U , p 40 : U U | , p 41 : U ϵ .

In this grammar, different intermediate symbols have been used to distinguish between different substructures. In fact, the reason why this grammar has so many production rules is that the grammar must be able to distinguish between all the different classes of substructures for which there are different free energy rules according to Turner's thermodynamic model considered in [21].

However, as ϵ-freeness and loop-freeness are required preliminarily, we have to consider another unambiguous SCFG generating the same language L , where we have to guarantee that the same substructures are distinguished as are distinguished in G sto .

Using the usual way of transforming a non-ϵ-free grammar into an ϵ-free one, the following definition can immediately be obtained from the previous one:

Definition App-.12. The unambigous and ϵ-free SCFG G sto generating exactly the language L is given by G s t o = ( I G s t o , Σ G s t o , R G s t o , S ) , where
I G s t o = { S , S , T , C , A , L , G , B , F , H , P , Q , R , J , K , M , N , U } ,
Σ G sto = { ( , ) , | } and R G s t o contains exactly the following rules:
p 0 : S S , p 1 : S A , p 2 : S A C , p 3 : S T A , p 4 : S T A C , p 5 : T A , p 6 : T A C , p 7 : T T A , p 8 : T T A C , p 9 : T C , p 10 : C | , p 11 : C C | , p 12 : A ( L ) , p 13 : L ( L ) , p 14 : L M , p 15 : L P , p 16 : L Q , p 17 : L R , p 18 : L F , p 19 : L G , p 20 : G ( L ) | , p 21 : G ( L ) | | , p 22 : G ( L ) B | | , p 23 : G | ( L ) , p 24 : G | | ( L ) , p 25 : G | | B ( L ) , p 26 : B | , p 27 : B B | , p 28 : F | | | , p 29 : F | | | | , p 30 : F | | | | | , p 31 : F | | | | | H , p 32 : H | , p 33 : H H | , p 34 : P | ( L ) | , p 35 : P | ( L ) | | , p 36 : P | | ( L ) | , p 37 : P | | ( L ) | | , p 38 : Q | | ( L ) | | | , p 39 : Q | | ( L ) K | | | , p 40 : Q | | | ( L ) | | , p 41 : Q | | | J ( L ) | | , p 42 : Q | | | ( L ) K | | , p 43 : Q | | | J ( L ) K | | , p 44 : R | ( L ) | | | , p 45 : R | ( L ) K | | | , p 46 : R | | | ( L ) | , p 47 : R | | | J ( L ) | , p 48 : J | , p 49 : J J | , p 50 : K | , p 51 : K K | , p 52 : M ( L ) ( L ) , p 53 : M U ( L ) ( L ) , p 54 : M ( L ) U ( L ) , p 55 : M ( L ) ( L ) N , p 56 : M U ( L ) U ( L ) , p 57 : M U ( L ) ( L ) N , p 58 : M ( L ) U ( L ) N , p 59 : M U ( L ) U ( L ) N , p 60 : N ( L ) , p 61 : N U ( L ) , p 62 : N ( L ) N , p 63 : N U ( L ) N , p 64 : N U , p 65 : U | , p 66 : U U | .

Unfortunately, the set of productions of G sto contains productions with up to 5 non-terminal symbols in the conclusion. This is not acceptable for our purpose, for the following reason: the desired unranking algorithm makes use of the size of combinatorial classes whose representations somehow are derived from CFGs with particular integer weights on their productions. If we constructed this WCFG by starting with the grammar G sto , then this would yield a huge number of production rules. Consequently, the translation would imply a huge specification of the combinatorial classes and the corresponding function to compute their sizes and thus the corresponding unranking algorithm would have to distinguish between an unnecessarily and most importantly unacceptably large number of cases.

Nevertheless, the size of the production set of the weighted grammar underlying the desired unranking algorithm can be significantly reduced by starting with a modification of grammar G sto which has only production rules with minimum possible numbers of non-terminal symbols in the conclusion. In fact, by transforming G sto appropriately considering this observation, we obtained the SCFG Ĝ sto :

Definition App-.13. The unambiguous ϵ-free SCFG G ^ s t o generating exactly the language L is given by G ^ s t o = ( I G ^ s t o , Σ G ^ s t o , R G ^ s t o , S ) , where
I G ^ sto = { S , E , S , T , C , A , L , G , D , B , F , H , P , Q , R , V , W , O , J , K , M , X , Y , Z , N , U } ,
G ^ sto = { ( , ) , | ) and R G ^ sto contains exactly the following rules:
p ^ 1 : S E , p ^ 2 : E S , p ^ 3 : E S C , p ^ 4 : S A , p ^ 5 : S T A , p ^ 6 : T E , p ^ 7 : T C , p ^ 8 : C | , p ^ 9 : C C | , p ^ 10 : A ( L ) , p ^ 11 : L A , p ^ 12 : L M , p ^ 13 : L P , p ^ 14 : L Q , p ^ 15 : L R , p ^ 16 : L F , p ^ 17 : L G , p ^ 18 : G A | , p ^ 19 : G A D , p ^ 20 : G | A , p ^