Asymptotic structural properties of quasi-random saturated structures of RNA

Background RNA folding depends on the distribution of kinetic traps in the landscape of all secondary structures. Kinetic traps in the Nussinov energy model are precisely those secondary structures that are saturated, meaning that no base pair can be added without introducing either a pseudoknot or base triple. In previous work, we investigated asymptotic combinatorics of both random saturated structures and of quasi-random saturated structures, where the latter are constructed by a natural stochastic process. Results We prove that for quasi-random saturated structures with the uniform distribution, the asymptotic expected number of external loops is O(logn) and the asymptotic expected maximum stem length is O(logn), while under the Zipf distribution, the asymptotic expected number of external loops is O(log2n) and the asymptotic expected maximum stem length is O(logn/log logn). Conclusions Quasi-random saturated structures are generated by a stochastic greedy method, which is simple to implement. Structural features of random saturated structures appear to resemble those of quasi-random saturated structures, and the latter appear to constitute a class for which both the generation of sampled structures as well as a combinatorial investigation of structural features may be simpler to undertake.


Background
RNA is an important biomolecule, now known to play both an information carrying role, as in retroviruses, such as HIV, whose genome consists of RNA, as well as a catalytic role, as in the the peptidyl transferase catalysis by RNA, which concatenates an amino acid to a growing peptide chain in the formation of a protein on the ribosome [1]. It has recently emerged that RNA plays a wide range of previously unsuspected roles in many biological processes, including retranslation of the genetic code (selenocysteine insertion [2], ribosomal frameshift [3]), transcriptional and translational gene regulation [4,5], temperature sensitive conformational switches [6,7], chemical modification of specific nucleotides in the ribosome [8], regulation of alternative splicing [9], etc. The diverse and biologically important functions performed by RNA molecules depend for the most part on RNA tertiary structure, which is known to be constrained by secondary structure, the latter acting as a scaffold for tertiary contact formation [10]. For this reason, much work has focused on RNA secondary structure prediction [11][12][13][14] and on the kinetics of RNA folding [15][16][17]. In [18], Stein and Waterman pioneered work on asymptotic combinatorics of RNA secondary structures, where they developed recurrence relations to count the number of secondary structures. These recurrence relations were later modified by Nussinov and Jacobson [19] and especially by Zuker [20] to compute the minimum free energy secondary structure.
Formally, a secondary structure for a given RNA nucleotide sequence a 1 , . . . , a n is a set S of base pairs (i, j), such that (i) if (i, j) ∈ S then a i , a j form either a Watson-Crick (AU,UA,CG,GC) or wobble (GU) base pair, (ii) if (i, j) ∈ S then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two paired bases), (iii) if (i, j) ∈ S then for all j = j http://www.almob.org/content/8/1 /24 and i = i, (i , j) ∈ S and (i, j ) ∈ S (nonexistence of base triples), (iv) if (i, j) ∈ S and (k, ) ∈ S, then it is not the case that i < k < j < (nonexistence of pseudoknots). For the purposes of this paper, following Stein and Waterman [18], we consider the homopolymer model of RNA, in which condition (i) is dropped, thus entailing that any base can pair with any other base, and we modify condition (ii) so that θ = 1. With inessential additional complications in the combinatorics, we could handle the situation where θ is any fixed positive constant.
For a given RNA sequence, a saturated secondary structure is one such that no base pair can be added without introducing either a pseudoknot or base triple; in other words, saturated structures have a maximal number of base pairs, while the Nussinov minimum energy structure has a maximum number of base pairs. Since the kinetics of RNA structure formation depend on secondary structure energy landscape, and more particularly on the distribution of kinetic traps (saturated structures), in previous work we have designed an algorithm to compute the number of saturated structures [21], determine the asymptotic number of saturated secondary structures [22] and the expected number of base pairs in saturated and quasi-random saturated structures [23].
Secondary structures are conveniently displayed in Vienna dot bracket notation, consisting of a balanced parenthesis expression with dots, where an unpaired nucleotide at position i is depicted by a dot at that position, while a base pair (i, j) is depicted by the presence of matching left and right parentheses located respectively at positions i and j. The minimum free energy secondary structure of the selenocysteine insertion (SECIS) sequence fruA, given by CCUCGAGGGGAACCCGAAAGGGACCCGAGAGG ((((..(((...(((....))).)))..)))) is a saturated structure. In contrast, the following structure for the Gag/pro ribosomal frameshift site of mouse mammary tumor virus [24] is not only not saturated, but includes a pseudoknot, as shown by the square bracket notation necessary to show the crossing base pairs. Having defined saturated structure, we now define a stochastic greedy process to generate random saturated structures, technically denoted quasi-random saturated structures. This notion was defined in [23], where we showed that the expected number of base pairs in quasirandom saturated structures is 0.340633 · n, just slightly more than the expected number 0.337361 · n of base pairs in all saturated structures.
Consider the following stochastic process to generate a saturated structure. Suppose that n bases are arranged in sequential order on a line. Select the base pair (1, u) by choosing u, where θ + 2 ≤ u ≤ n, at random with probability 1/(n − θ − 1). The base pair joining 1 and u partitions the line into two parts. The left region has k bases strictly between 1 and u, where k ≥ θ, and the right region contains the remaining n − k − 2 bases properly contained within endpoints k +2 and n (see Figure 1). Proceed recursively on each of the two parts. Observe that the secondary structures produced by our stochastic process will always base pair with the leftmost available base, and that the resulting structure is always saturated. Note that the probability p i,j that (i, j) is a base pair in a saturated structure is not the same as the probability q i,j that (i, j) is a base pair in a quasi-random saturated structure (this was shown in [23], using a program we wrote to generate saturated structures).

Results and discussion
With these definitions, we are now in a position to state some results concerning structural features of (quasi) random saturated structures. Under the uniform distribution, we show that the asymptotic expected number of external loops is O(log n), and the expected maximum stem length is O(log n). In contrast, under the Zipf distribution, the asymptotic expected number of external loops is O(log 2 n), and the expected maximum stem length is O(log n/ log log n) a .
In the literature on RNA combinatorics ( [18] and subsequent papers), combinatorial results have been proved for the homopolymer as well as for the Bernouilli model, in which latter one assumes a stickiness parameter p = 2(p A p U + p G p U + p G p C ) that any two positions can basepair. To the best of our knowledge, the current paper appears to be one of the first combinatorial analyses of RNA secondary structures, which involves the Zipf distribution for base pairs. http://www.almob.org/content/8/1/24 Base 1 is base-paired by selecting a random base u such there are at least θ unpaired bases enclosed between 1 and u. By iterating this procedure, we obtain a greedy stochastic algorithm to sample quasi-random secondary structures.

Conclusions
Saturated secondary structures form natural kinetic traps in the energy landscape with respect to the Nussinov energy model [19], in that it is energetically unfavorable to move from a saturated structure to any neighboring structure that differs by one base pair. However, there is currently no program to sample saturated secondary structures with respect to the Nussinov energy (given either a homopolymer or an RNA sequence), although the programs we developed in [21,22] could be extended to do so for both homopolymers and RNA sequences. (Note that the program RNAsat, described in [25], can sample saturated structures in the Turner energy landscape, and the program RNAlocopt, described in [26], can sample locally optimal structures in the Turner energy landscape). In contrast, it is extremely simple to implement a program to sample quasi-random saturated structures, thus permitting one to easily obtain an idea of various structural features in the ensemble of quasirandom structures. We expect many structural features to be approximately shared between the random saturated structures and quasi-random saturated structures -for instance, as earlier mentioned, the expected number of base pairs in quasi-random saturated structures is 0.340633 · n, while the expected number of base pairs in saturated structures is 0.337361 · n, almost the same value [23].
Generally, it requires substantial effort involving the application of deep results from complex analysis, such as the Flajolet-Odlyzko theorem [27] or the Drmota-Lalley-Woods theorem [28][29][30] (see also the text by Flajolet and Sedgewick [31]) to prove asymptotic results, such as the fact that the asymptotic number of saturated structures is 1.07427 · n −3/2 · 2.35467 n and the asymptotic expected number of base pairs is 0.337361 · n, and the asymptotic expected number of hairpins is 0.323954 · 1.69562 n [23]. In contrast, the argument given in this paper is elementary, not requiring complex analysis. Taken together we believe that the stochastic greedy method, described in Figure 1, performs reasonably well in sampling saturated structures, that appear to be representative of the ensemble of all saturated structures, and supports a combinatorial analysis that may be simpler than that required for all saturated structures.

Structural properties of quasi-random saturated secondary structures
Given secondary structure S, an external base pair is a base pair (i, j) ∈ S, which is not interior to any other base pair of S; i.e. there is no ( are said to close the corresponding external loops; see Figure 2. The number of external loops of a given secondary structure S is defined to be the total number of external base pairs in S. We define a stem of length k to be a sequence of nested base pairs (see Figure 3) The stem length of a given secondary structure S is defined here to be the maximum length of all stems in S; i.e. the maximum number of nested base pairs in S.
Our study of structural properties of random saturated secondary structures is facilitated by defining a graph that resembles the graph on page 333 of [32]; however, note that the formal definition is slightly different than that of [32]. Given a secondary structure S on the nucleotide sequence [1, n], define the associated graph G(S) = (V , E), whose vertex set V consists of base pairs v = (i, j) in S, and whose undirected edge set E consists Figure 4 depicts the graph G(S) associated with the saturated secondary structure S.
In general G(S) is a forest; i.e., a set of trees. In the sequel we determine the size of several structural parameters of random saturated secondary structures, in particular, expected stem length and expected number of external loops. These parameters are studied both for the uniform and Zipf distributions. Before proceeding any further, we first define the probability distributions to be considered.

Probability distributions
Zipf 's law is the observation first made by the deceased Harvard linguist, George Kingsley Zipf, that the frequency http://www.almob.org/content/8/1/24 1 n Figure 2 A sequence of external base pairs. p i of English words, when graphed against their rank i (in the list of English words sorted in decreasing order with respect to frequency), obeys the power law p i ≈ i −α . More generally, Zipf 's law is the statement of a power law, when plotting frequency against rank (Zipf 's first law) or when plotting frequency against reverse rank (Zipf 's second law). In bioinformatics, Zipf 's law has been observed in the frequency/rank plot of differentially expressed gene in microarray data [33], as well as in the frequency/rank plot for protein structures [34], where there are a few very frequent structures, and very many rare structures.
In the remainder of the paper, we consider probability distributions related to Zipf 's law. A node, say 1 ≤ u ≤ n, is chosen at random with the α-Zipf distribution, if the probability that a given base pair is defined to be the α-harmonic number of n − 1. The expected number of base pairs for arbitrary threshold θ is denoted by E θ n , for random saturated secondary structures on n bases, generated by the α-Zipf stochastic process. E 0 n satisfies the following recursive formula for all n ≥ 2.
Observe that when α = 0 the α-Zipf distribution is the same as the uniform distribution, while if α = 1, we have the (classical) Zipf distribution [35]. Moreover, observe that as α increases, "shorter" base pairs are being selected with higher probability by the stochastic process described in equation (1).
The stochastic process of generating random saturated secondary structures, according to equation (1), is of the "divide-and-conquer" type, very common in computer science, where well-known algorithms such as QUICK-SORT choose a division point according to the uniform distribution. Stochastic algorithms of this kind have been intensively studied for the uniform distribution. Known results suggest that the probability distribution for the number of base pairs in random saturated structures, generated by the earlier described stochastic process (uniform choice of base pairs) is asymptotically Gaussian (see [36] and [37]). We also note that structural features of trees have been well studied including the expected depth and the exact distribution of the depth; see, for instance, [36,38,39]. In the sequel, we consider a random binary search tree with n nodes obtained by inserting n i.i.d. random variables X 1 , . . . , X n . Careful analysis of [36] and [39] implies our results in the section on the uniform distribution. However we will use a different and simpler technique that enables the analysis not only for the uniform distribution in the following section concerning the Uniform Distribution, but also for the Zipf distribution in the section following this section.
An important observation concerns the threshold θ considered above. All the results proved in this section are "upper bounds" and therefore it is easily seen that they are valid for any threshold θ ≥ 0. Therefore to simplify proofs in the sequel we consider the case of threshold θ = 0.

Uniform distribution
The main theorem of this section concerns stem length and number of external loops of random saturated structures S, generated by a natural stochastic process associated with the tree graph G(S). Throughout the remainder 1 n of the paper, we state results in terms of random saturated structures, although we intend to mean only those structures generated by the stochastic process associated with the graph G(S); we will distinguish between the uniform and α-Zipf variant of the stochastic process. Without this convention, statements of lemmas and theorems would be too cumbersome.

Theorem 1. With high probability, the number of external loops and the maximum stem length of random saturated structures generated by the uniform distribution variant is O(log n).
Proof. Before we give the proof of the main theorem it will be necessary to give the proof of two lemmas. In the first lemma we consider the expected number of external loops.

Lemma 1. With high probability, the number of external loops is O(log n).
Proof. We define a sequence of random variables X 1 , X 2 , . . . , X t by induction as follows. Let X 1 be the random variable selecting a base k chosen among 2, 3, . . . , n randomly and independently with the uniform distribution in order to form a base pair (1, k). By induction, assume that X 1 , . . . , X t have been defined. Let X t+1 be the random variable selecting a base k chosen among X t + 2, X t + 3, . . . , n randomly and independently with the uniform distribution in order to form a base pair (X t + 1, k). Next we estimate bounds on E[ X t ], for all t. Indeed, observe that P[ Next we compute the conditional probability where the last inequality is valid for k + 3 ≤ n. Finally, we can estimate We are interested in determining the behavior of the random variable, whose value is the number of external loops in random saturated structures. http://www.almob.org/content/8/1/24 T n = min{t : X t+1 ≥ (n + 1)/2}.
Next we prove the following lemma.

Lemma 2. With high probability, the maximum stem length is O(log n).
Proof. According to the recursive construction, at each stage after a base pair is chosen at random in the subsequent stages, base pairs are nested within this base pair. Therefore, the maximum stem length equals the maximum number of nested base pairs. This latter number can also be obtained as follows. We define the following sequence Y 1 , Y 2 , . . . , Y t of random variables. A base is chosen among 2, 3, . . . , n randomly and independently with the uniform distribution. Let Y 1 be the resulting random variable. By induction, assume that Y 1 , . . . , Y t have been defined. To define the random variable Y t+1 , a base is chosen among t + 2, . . . , Y t − 1 randomly and independently with the uniform distribution. Clearly, this procedure halts when Y t ≤ t + 2 and it follows that the maximum number of nested base pairs is also the number t of iterations before halting. Therefore we are interested in knowing the behavior of the random variable (notice the dependence of the random variable T on n). Observe that since by definition Y i+1 is chosen among i + 2, i + 3, . . . , Y i − 1 randomly and independently with the uniform distribution, for any integer k ≥ i + 2, Using well-known identities on conditional probabilities, we can derive the following equalities.
In particular, since E[ We are not yet completely done with the proof of Lemma 2. The proof shows that with high probability, the leftmost sequence of base pairs given by the recursive construction has length at most O(log n). We would like to prove the same for any sequence of nested base pairs. To this effect, define random intervals I s , where s is a finite sequence of 0s and 1s, by induction on the length of s. Consider the interval I ∅ = [ 1, n]. Assuming that I s = [ a s , b s ] has already been defined, we consider a random process that splits it at random into two subintervals, i.e., choose an integer r ∈ I s randomly and independently with the uniform distribution and let I s0 = [ a s , r] and it follows that the expected length of I s is at most 2 −|s| . Now consider the random variable T which is defined as follows T = min{k : ∃s(|s| = k & I s = ∅)}(notice the dependence of the random variable T on n) and observe that T > k if and only if ∀s(|s| = k ⇒ I s = ∅). Therefore , (for all sequences s such that |s| = k) As a consequence we conclude that P[ T > (1 + ) log n)] ≤ n − . This completes the proof of Lemma 2.
Finally, we can complete the proof of the main result of Theorem 1 since this is now immediate from Lemmas 1 and 2.

Zipf distribution
It is possible to consider other probability distributions like Zipf and generalized a-Zipf. The Zipf distribution (first considered in [35]) is perhaps the most interesting because it favors base pairs at a shorter distance. A base pair (1, u), is chosen at random with the Zipf distribution. I.e., the probability that the base pair (1, u) is selected is http://www.almob.org/content/8/1/24 1 k is defined to be the (n − 1)st harmonic number. As before, the chord joining 1 and u partitions the ring into two parts. One part has k bases between 1 and u, where k ≤ n − 2, and the other part has the remaining n − k − 2 bases (see Figure 1).
Define Z n to be the expected number of base pairs of a random saturated secondary structure with n bases, where n ≥ 2. A base pair (1, u) is added as follows. Select u ≥ 2 at random among 2, 3, . . . , n with probability 1 (u−1)H(n−1) . This gives rise to the following formula for all n ≥ 2. The main theorem of this section concerns the overall structure of random secondary structures. Proof. Before we give the proof, it will be necessary to give the proof of two lemmas. In the first lemma we look at the number of external loops.

Lemma 3. With high probability, the number of external loops is O(log 2 n).
Proof. We define a sequence of random variables X 1 , X 2 , . . . , X t by induction as follows. Let X 1 be the random variable resulting when the base pair (1, k) is formed by a selecting a base k among 2, 3, . . . , n randomly and independently with the Zipf distribution. By induction, assume that X 1 , . . . , X t have been defined. Let X t+1 be the random variable resulting when the base pair (X t + 1, k) is formed by selecting a base k is chosen among X t + 1, X t + 2, . . . , n randomly and independently with the Zipf distribution. Next we compute E[ X t ], for all t. Indeed, observe that P[X 1 = k] = 1 (k−1)H(n−1) and Next we compute the conditional probability Finally, we can calculate Elementary calculations using this last inequality show that We are interested in determining the behavior of the random variable, whose value is the number of external loops; i.e. the size of the largest sequence of external base pairs. Define the random variable T n = min{t : X t+1 ≥ n − 1}.
The next result concerns the maximum stem length. We can prove the following result.

Lemma 4. With high probability, the maximum stem length is O(log n/ log log n).
Proof. According to the recursive construction, at each stage after a base pair is chosen at random in the subsequent stages base pairs are nested within this base pair. Therefore, the maximum stem length is equal to the maximum number of nested base pairs. This latter number can also be obtained, by investigating a sequence of random variables Y 1 , Y 2 , . . . , Y t , defined as follows. Choose a base among 2, 3, . . . , n − 1 randomly and independently with the Zipf distribution. Let Y 1 be the resulting random variable. By induction, assume that Y 1 , . . . , Y t have been defined. To define the random variable Y t+1 , a base is chosen among t + 2, t + 3, . . . , Y t − 1 randomly and independently with the Zipf distribution. Clearly, this procedure halts when Y t = 1 and it follows that the maximum number of nested base pairs is also the number t of iterations before halting. Therefore we are interested to know the behavior of the random variable (notice the dependence of the random variable T on n). Observe that since by definition Y i+1 is chosen among i + 2, i + 3, . . . , Y i − 1 randomly and independently with the Zipf distribution, for any integer k ≥ i + 2, .
Consider the random variable E[Y i+1 |Y i ] whose value at k is equal to E[Y i+1 |Y i = k]. Using well-known identities on conditional probabilities we can derive the following inequalities.
where we used the fact that the fraction n/H(n) is monotone increasing in n. In particular, since E[Y 1 ] = n−2 H(n−2) , we conclude that E[Y t ] ≤ n−2 H(t+1)·H(t)···H (2) . Finally, we can derive In particular, log n ln ln n ≤ n − .
The proof shows that the leftmost sequence of base pairs given by the recursive construction of the random secondary structure has length at most O(log n/ log log n) with high probability. We would like to prove the same for any sequence of nested base pairs. It is easily seen that a proof similar to the one presented above works. This completes the proof of Lemma 4.
If we now combine Lemmas 3 and 4 we derive the proof of Theorem 2. Endnote a Throughout this paper all logarithms are in base 2.