The complete folding event is governed by the chemical master equation [8]. To introduce the reduction of the time-dependent RNA folding problem to stochastic chemical kinetics, which describes the time evolution of a well-stirred chemically reacting system, the Appendix follows references [22, 23] closely in summarizing the formulation leading to Gillespie's Stochastic Simulation Algorithm (SSA).

Our goal is to model RNA secondary-structure folding so that it reduces to the algorithm whose pseudocode is given in the Appendix; we therefore describe a reduction to the Stochastic Simulation Algorithm (SSA). The rationale for formulating the problem this way is that, after the reduction, it becomes easier to devise an efficient version of the SSA for RNA folding kinetics in which multiple runs are performed in parallel (see SSA version II and the discussion thereafter).

Using the Vienna notation found in [4, 24], the RNA sequence at time *t* is represented as two strings, each of the length of the RNA sequence. One is over the character set {A,U,G,C} and gives the order of the nucleotides; we call it the 'sequence string' from here on. The other is a string of balanced parentheses over the character set {., (, )}, known as "dot-bracket" notation, describing the secondary structure of the RNA sequence (a dot means no base pairing, and each matched pair of open and close parentheses represents one base pair); we call it the 'structure string'. Note that while the former does not change over time, the latter does.
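As a concrete illustration of the two strings (the toy hairpin below is our own example, not taken from [4, 24]):

```python
sequence  = "GGGAAAUCCC"   # 'sequence string': fixed over time
structure = "(((....)))"   # 'structure string': changes as the fold proceeds

def is_balanced(structure):
    """A valid structure string has matched, balanced parentheses."""
    depth = 0
    for ch in structure:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
    return depth == 0

assert len(sequence) == len(structure)  # both strings span the whole sequence
print(is_balanced(structure))  # → True
```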

In attempting to simulate the secondary-structure changes of a given RNA sequence over time, let ${S}_{R}\left(t\right)$ denote the random variable holding the structure string of the RNA at time *t*, when the sequence string is known to be R. In the setting of this simulation, ${S}_{R}\left(0\right)$ is set to the all-dot string "....." (of the length of R), the initial open state without any base pairings. Our goal is to predict ${S}_{R}\left(t\right)$ for some parameter *t*. In particular, we would like to predict how much time it takes for an RNA sequence to fold into its 'optimal' state, defined as the structure whose Gibbs free energy is minimal. To formalize this, we denote the optimal structure for the sequence R by Op(R). Thus, our simulation goal is to find the smallest *t* for which ${S}_{R}\left(t\right)=Op\left(R\right)$.

Having defined our goal, we introduce some more notation to explain the reduction. ${F}_{R}=\left\{{s}_{R}^{1},{s}_{R}^{2},\dots ,{s}_{R}^{M\left(R\right)}\right\}$ is the finite set (of size M(R)) of feasible structure strings for the sequence string R, where feasible means satisfying the biological constraints. Now, for some 1 ≤ i ≤ M(R), a single-step move of ${s}_{R}^{i}$ is a structure string ${s}_{R}^{j}\in {F}_{R}$ such that ${s}_{R}^{j}$ and ${s}_{R}^{i}$ differ only by omitting a pair of parentheses, adding a pair, or flipping a pair, as described in [4]. We define the neighbourhood of ${s}_{R}^{i}$, denoted ${N}_{R}^{i}=\left\{{s}_{R}^{{i}_{1}},{s}_{R}^{{i}_{2}},\dots ,{s}_{R}^{{i}_{L}}\right\}$, as the subset of ${F}_{R}$ that can be reached from ${s}_{R}^{i}$ within a single step, unless ${s}_{R}^{i}=Op\left(R\right)$, in which case we define ${N}_{R}^{i}$ to be the empty set.

Gillespie's SSA deals with simulations of the reactions of a system [22, 23]. A possible reaction from a molecule of structure ${s}_{a}$ into a molecule of structure ${s}_{b}$ is denoted ${s}_{a}\to {s}_{b}$. We define a shortened notation for a set of possible reactions:

${s}_{a}\to {s}_{B}=\left\{{s}_{{b}_{1}},{s}_{{b}_{2}},\dots ,{s}_{{b}_{p}}\right\}={s}_{a}\to {s}_{{b}_{1}},{s}_{a}\to {s}_{{b}_{2}},\dots ,{s}_{a}\to {s}_{{b}_{p}}\text{.}$

(1)

The reduction of the input is done by treating single-step moves of the RNA structure as 'possible reactions'. Using the notation above, the total set of possible moves is:

${s}_{R}^{1}\to {N}_{R}^{1},{s}_{R}^{2}\to {N}_{R}^{2},\dots ,{s}_{R}^{M\left(R\right)}\to {N}_{R}^{M\left(R\right)}$. But since the simulation is, at any specific time, at a state ${s}_{R}^{i}$ for some 1 ≤ i ≤ M(R), we are only left with the feasible moves of ${s}_{R}^{i}\to {N}_{R}^{i}$. These moves constitute the reaction set of Gillespie's algorithm. Now, the probability factor for each move to occur within a time Δt is calculated from Gibbs free energy considerations using the program RNAeval, available in the Vienna RNA package, along with a stochastic Monte Carlo feature. The equation used in our implementation is based on the simulated annealing approach [25] and known as the Metropolis [26] step:

${a}_{\mathit{ij}}=\begin{cases}\exp\left(-\Delta G/RT\right), & \text{if } \Delta G\ge 0,\\ 1, & \text{otherwise.}\end{cases}$

(2)

$\text{A}=\left[{a}_{\mathit{ij}}\right]$ is the transition probability matrix, with $\Delta G={G}_{j}^{0}-{G}_{i}^{0}$ when considering the rate of a transition to *j* while being at *i*. For the SSA algorithm version I below, ${G}_{i}^{0}$ is the Gibbs free energy of secondary structure *i*; for the SSA algorithm version II that follows, it is the sum of the free energies over an n-tuple of secondary structures. Note that in our implementation we used Vienna's RNAeval program to obtain the Gibbs free energy values of the RNA structures. In addition, as in Vienna's Kinfold, a Kawasaki step can be used instead of the Metropolis step in the equation above. We now have an input problem that properly fits Gillespie's SSA. Thus, we can use the following algorithm:
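Equation (2) can be sketched as a small function; the value of RT below assumes free energies in kcal/mol at 37 °C and is our illustrative choice, and in the actual implementation the ΔG values come from RNAeval:

```python
import math

# Assumed constant: R = 1.9872e-3 kcal/(mol*K) times T = 310.15 K (37 °C).
RT = 0.6163

def metropolis_factor(delta_g, rt=RT):
    """Metropolis factor a_ij of Eq. (2): exp(-dG/RT) for uphill moves
    (dG >= 0), and 1 for downhill moves (dG < 0).
    delta_g is G_j^0 - G_i^0, e.g. as evaluated by RNAeval."""
    return math.exp(-delta_g / rt) if delta_g >= 0 else 1.0
```

Downhill moves are always assigned the maximal factor 1, so the dynamics favour descending the energy landscape while still allowing uphill escapes from local minima.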

### SSA for RNA folding, version I – simulating one RNA-structure fold

- 1.
The current structure is ${s}_{R}^{i}$ for some 1 ≤ i ≤ M(R). While the stopping condition ${s}_{R}^{i}=Op\left(R\right)$ is not met:

- 2.
Calculate the neighbourhood ${N}_{R}^{i}$.

- 3.
Evaluate the probability factor of each member of ${N}_{R}^{i}$, and the total sum of the factors. We define ${a}_{R}^{k}$ as the probability factor of ${s}_{R}^{i}\to {s}_{R}^{k}$ if ${s}_{R}^{k}\in {N}_{R}^{i}$, and 0 otherwise. We also denote by ${a}_{\mathit{sum}}$ the total sum of all the factors.

- 4.
Draw two independent uniform (0,1) random numbers: ξ_{1} and ξ_{2}.

- 5.
Set j to be the smallest integer satisfying $\sum _{k=1}^{j}{a}_{R}^{k}>{\xi}_{1}{a}_{\mathit{sum}}\text{.}$

- 6.
Set $\tau =\frac{ln\left(1/{\xi}_{2}\right)}{{a}_{\mathit{sum}}}\text{.}$

- 7.
Set the current structure to be ${s}_{R}^{j}$, and the time to be (t + τ). Return to step 1.
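The seven steps above can be sketched as follows; this is a sketch only, where `neighbours` and `factor` are our placeholder names standing in for the single-step move enumeration and the RNAeval-based Metropolis factor, not the paper's implementation:

```python
import math
import random

def ssa_fold(start, target, neighbours, factor, rng=random.random):
    """SSA version I sketch: simulate one fold from `start` until `target`.
    `neighbours(s)` returns the single-step moves N_R^i of structure s;
    `factor(s, s2)` returns the probability factor a of the move s -> s2."""
    s, t = start, 0.0
    while s != target:                       # step 1: stopping condition
        moves = neighbours(s)                # step 2: the neighbourhood
        a = [factor(s, m) for m in moves]    # step 3: probability factors
        a_sum = sum(a)
        xi1, xi2 = rng(), rng()              # step 4: two uniform (0,1) draws
        threshold, acc, j = xi1 * a_sum, 0.0, 0
        for j, ak in enumerate(a):           # step 5: smallest j with
            acc += ak                        #   partial sum > xi1 * a_sum
            if acc > threshold:
                break
        t += math.log(1.0 / xi2) / a_sum     # step 6: time increment tau
        s = moves[j]                         # step 7: take the chosen move
    return s, t
```

A toy usage on a three-state chain a → b → c (with all factors equal) terminates at "c" with a positive folding time; any real use would plug in dot-bracket structures and RNAeval energies.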

At this point, a useful observation is that we can expand the model to run many simulations simultaneously in this way. If we have n different RNA sequences with sequence strings ${R}_{1},{R}_{2},{R}_{3},\dots ,{R}_{n}$, as with all possible single-point mutations, the possible moves are ${s}_{{R}_{1}}^{1}\to {N}_{{R}_{1}}^{1},{s}_{{R}_{1}}^{2}\to {N}_{{R}_{1}}^{2},\dots ,{s}_{{R}_{1}}^{M\left({R}_{1}\right)}\to {N}_{{R}_{1}}^{M\left({R}_{1}\right)},{s}_{{R}_{2}}^{1}\to {N}_{{R}_{2}}^{1},\dots ,{s}_{{R}_{n}}^{1}\to {N}_{{R}_{n}}^{1},{s}_{{R}_{n}}^{2}\to {N}_{{R}_{n}}^{2},\dots ,{s}_{{R}_{n}}^{M\left({R}_{n}\right)}\to {N}_{{R}_{n}}^{M\left({R}_{n}\right)}$. Let ${s}_{{R}_{1}}^{{i}_{1}},{s}_{{R}_{2}}^{{i}_{2}},\dots ,{s}_{{R}_{n}}^{{i}_{n}}$ be the current states of the n RNA sequences; then we are left with the feasible moves ${s}_{{R}_{1}}^{{i}_{1}}\to {N}_{{R}_{1}}^{{i}_{1}},{s}_{{R}_{2}}^{{i}_{2}}\to {N}_{{R}_{2}}^{{i}_{2}},\dots ,{s}_{{R}_{n}}^{{i}_{n}}\to {N}_{{R}_{n}}^{{i}_{n}}$.

Using these formulations, we suggest a somewhat optimized and generalized variation of the algorithm above. In the following version, the indices *i* and *j* no longer correspond directly to an explicit $\left[{a}_{\mathit{ij}}\right]$ matrix instance of the Metropolis step's equation. Instead, although we still perform a Metropolis step, they correspond to a much larger matrix that is defined only implicitly.

### SSA for RNA folding, version II – simulating numerous RNA-structure folds

- 1.
The initial structure array is ${S}_{\mathit{arr}}=<{s}_{{R}_{1}}^{{i}_{1}},{s}_{{R}_{2}}^{{i}_{2}},{s}_{{R}_{3}}^{{i}_{3}},\dots ,{s}_{{R}_{n}}^{{i}_{n}}>$.

- 2.
Calculate ${N}_{\mathit{arr}}=<{N}_{{R}_{1}}^{{i}_{1}},{N}_{{R}_{2}}^{{i}_{2}},{N}_{{R}_{3}}^{{i}_{3}},\dots ,{N}_{{R}_{n}}^{{i}_{n}}>$.

- 3.
Evaluate the probability factor of each member of every ${N}_{{R}_{j}}^{{i}_{j}}\in {N}_{\mathit{arr}}$, and the total sum of the factors. We denote by ${a}_{{R}_{j}}^{{k}_{j}}$ the factor related to the k'th member of ${N}_{{R}_{j}}^{{i}_{j}}$, and by ${a}_{\mathit{sum}}$ the total sum of all the factors related to ${N}_{{R}_{1}}^{{i}_{1}},{N}_{{R}_{2}}^{{i}_{2}},{N}_{{R}_{3}}^{{i}_{3}},\dots ,{N}_{{R}_{n}}^{{i}_{n}}\text{.}$ Ordering all the factors by some total order, we obtain a series $\left({a}_{k}\right)$. Each member of $\left({a}_{k}\right)$ corresponds to a specific member of ${N}_{\mathit{arr}}$, and we denote this mapping to the corresponding ${N}_{\mathit{arr}}$ indices by $\left({m}_{k}\right)$.

- 4.
While ${S}_{\mathit{arr}}$ is not equal to $<Op\left({R}_{1}\right),Op\left({R}_{2}\right),Op\left({R}_{3}\right),\dots ,Op\left({R}_{n}\right)>$:

- 5.
Draw two independent uniform (0,1) random numbers: ξ_{1} and ξ_{2}.

- 6.
Set j to be the smallest integer satisfying $\sum _{k=1}^{j}{a}_{k}>{\xi}_{1}{a}_{\mathit{sum}}$. We denote by j* and k* the indices for which ${a}_{j}={a}_{{R}_{k*}}^{j*}\text{.}$

- 7.
Set $\tau =\frac{ln\left(1/{\xi}_{2}\right)}{{a}_{\mathit{sum}}}\text{.}$

- 8.
Set the ${m}_{j}$'th component of ${S}_{\mathit{arr}}$ to be ${s}_{{R}_{k*}}^{j*}$, and the time to be (t + τ).

- 9.
Recalculate the ${m}_{j}$'th neighbourhood (with its corresponding factors), and update ${N}_{\mathit{arr}}$ as well as $\left({a}_{k}\right)$ and ${a}_{\mathit{sum}}$. Return to step 4.
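One iteration of the loop above (steps 5–9) can be sketched as follows; the flat series $(a_k)$ and the mapping $(m_k)$ appear as parallel Python lists, and `neigh_of` and `factor` are our placeholder helpers, not part of the Vienna package:

```python
import math
import random

def ssa_step_many(states, neigh_of, factor, rng=random.random):
    """One SSA version II iteration: all molecules' moves share a single
    flat factor series (a_k); (m_k) maps each factor back to its molecule.
    Returns (molecule index, new structure for that molecule, tau).
    `neigh_of(i)` lists molecule i's single-step moves; `factor(i, s2)`
    gives the probability factor of moving molecule i to structure s2."""
    a, m, moves = [], [], []
    for i in range(len(states)):             # steps 2-3: build N_arr,
        for s2 in neigh_of(i):               #   (a_k) and (m_k)
            a.append(factor(i, s2))
            m.append(i)
            moves.append(s2)
    a_sum = sum(a)
    xi1, xi2 = rng(), rng()                  # step 5: two uniform draws
    threshold, acc, j = xi1 * a_sum, 0.0, 0
    for j, ak in enumerate(a):               # step 6: smallest j with
        acc += ak                            #   partial sum > xi1 * a_sum
        if acc > threshold:
            break
    tau = math.log(1.0 / xi2) / a_sum        # step 7: time increment
    return m[j], moves[j], tau               # step 8: update m_j'th entry
```

Only the chosen molecule's neighbourhood has to be recomputed afterwards (step 9); the other entries of $(a_k)$ are unchanged, which is the source of the efficiency gain over n independent runs.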

Repetitions of the same experiment, i.e. setting ${R}_{1}={R}_{2}={R}_{3}=\dots ={R}_{n}$, are useful to estimate the smallest *t* for which ${S}_{R}\left(t\right)=Op\left(R\right)$. This information is, as stated at the outset, what we sought, and what we actually implemented. In this special case, memoization can be useful for the expensive step 9 above. Because folding patterns tend to be very repetitive, even a relatively small memoization table may save a significant amount of computation time. We implemented a simple memoization and were able to run it on short sequences (up to 40 nt). We measured running times with a memoization that stores the neighbourhoods of 10,000 different structures; the allocated RAM was large enough not to require swapping. We obtained a significant speedup: 43 seconds with memoization vs. 684 seconds without for a sequence of size 35 with 100 simultaneous simulations, and approximately 6 seconds with memoization vs. 60 seconds without for a sequence of size 20 with 1000 simultaneous simulations.

When implementing memoization, keeping all possible neighbourhood sets in memory is not feasible, since their number grows too fast with respect to the size of the sequence. But because most of the transitions occur in the basins of the local minima, in terms of the energy values of the structures, keeping a fixed number of sets may suffice to decrease the computation time. We suggest a cache-like LRU (least recently used) algorithm for deciding which information is likely to be reusable among all neighbourhood sets and probabilities computed during a run. In addition, the memoization can be made more efficient by calculating measures that assess the "folding progress" of all molecules.
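A minimal sketch of the suggested LRU cache for neighbourhood sets; the capacity of 10,000 mirrors the measurement above, and `compute` is a placeholder for the expensive neighbourhood-plus-RNAeval step:

```python
from collections import OrderedDict

class LRUNeighbourhoodCache:
    """LRU cache over structure strings: keeps the neighbourhood sets
    (and their factors) of the most recently visited structures only."""
    def __init__(self, compute, capacity=10_000):
        self.compute, self.capacity = compute, capacity
        self.cache = OrderedDict()            # insertion order = LRU order
        self.hits = self.misses = 0

    def get(self, structure):
        if structure in self.cache:
            self.hits += 1
            self.cache.move_to_end(structure)  # mark as recently used
            return self.cache[structure]
        self.misses += 1
        value = self.compute(structure)        # the expensive step
        self.cache[structure] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return value
```

Since transitions cluster in the basins of local minima, a revisited structure hits the cache and skips the recomputation entirely.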
Since folding times may well be distributed over several orders of magnitude, one may want to give the molecules that have "fallen behind" some time to catch up, so that all molecules fold at approximately the same pace, without large time deviations that are problematic because one has to wait for the slowest molecules to terminate. A simple and practical candidate for such a measure is the base pair distance between the starting and stopping structures. In addition, we claim that the strategy outlined in SSA for RNA folding, version II above is better tailored for biological problems in which it is not necessary to wait for all molecules to reach their target structure. For such problems, our approach has several advantages. First, when a particular molecule finishes folding, it frees its memoization resources and also reduces the size of the probability space, in the sense that it makes the transitions of the remaining molecules more probable to be the next to occur.
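The base pair distance mentioned above can be computed as the number of base pairs present in exactly one of the two structures; a minimal sketch over dot-bracket strings:

```python
def base_pair_set(structure):
    """The set of base pairs (i, j) encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            pairs.add((stack.pop(), i))
    return pairs

def base_pair_distance(s1, s2):
    """Number of pairs present in exactly one of the two structures
    (the symmetric difference of the two pair sets)."""
    return len(base_pair_set(s1) ^ base_pair_set(s2))

print(base_pair_distance("(((....)))", "((......))"))  # → 1
```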

Aside from memoization, an advantage of the SSA for RNA folding version II approach presented above over a repetition of single-structure simulations is that we can stop the simulation after an elapsed simulation time τ and extract the folding times of all the experiments that have already folded within time at most τ. Moreover, with this approach, no single long-lasting simulation can delay the intermediate results of the overall run. It is in our interest that the molecules that are last to fold not constitute a bottleneck for the whole computation. In our setting, if we could use an anytime computation approach in which the probability is revealed gradually, what we might give up by not waiting for the last molecules to terminate is just the extent of a long probability tail. If the whole computation is stopped before all molecules have terminated their folding, it may again be useful to calculate their base pair distances to the stopping structures in order to estimate the error incurred by not letting all molecules finish. For some problems the error may be small enough, or the computation can be resumed until the approximation is satisfactory for the particular problem's needs. It should also be noted that if the sequence strings ${R}_{1},{R}_{2},{R}_{3},\dots ,{R}_{n}$ are different mutants of the same wildtype, the ideas discussed above can still be used to considerably reduce the computation time.