- Research
- Open Access
- Published:

# Hierarchical folding of multiple sequence alignments for the prediction of structures and RNA-RNA interactions

*Algorithms for Molecular Biology*
**volume 5**, Article number: 22 (2010)

## Abstract

### Background

Many regulatory non-coding RNAs (ncRNAs) function through complementary binding with mRNAs or other ncRNAs, *e.g*., microRNAs, snoRNAs and bacterial sRNAs. Predicting these RNA interactions is essential for functional studies of putative ncRNAs or for the design of artificial RNAs. Many ncRNAs show clear signs of undergoing compensating base changes over evolutionary time. Here, we postulate that a non-negligible part of the existing RNA-RNA interactions contain preserved but covarying patterns of interactions.

### Methods

We present a novel method that takes compensating base changes across the binding sites into account. The algorithm works in two steps on two pre-generated multiple alignments. In the first step, individual base pairs with high reliability are found using the PETfold algorithm, which includes evolutionary and thermodynamic properties. In step two (where high reliability base pairs from step one are constrained as unpaired), the principle of cofolding is combined with hierarchical folding. The final prediction of *intra*- and *inter*-molecular base pairs consists of the reliabilities computed from the constrained expected accuracy scoring, which is an extended version of that used for individual multiple alignments.

### Results

We derived a rather extensive algorithm. One of the advantages of our approach (in contrast to other RNA-RNA interaction prediction methods) is the application of covariance detection and prediction of pseudoknots between *intra*- and *inter*-molecular base pairs. As a proof of concept, we show an example and discuss the strengths and weaknesses of the approach.

## Background

Predicting RNA-RNA interactions is a rapidly growing area within RNA bioinformatics and is essential for the process of assigning function to known as well as *de novo* predicted non-coding RNAs (ncRNAs) such as those identified in *in silico* screens for RNA structures [1–7]. This candidate information along with the data generated from deep sequencing analyses emphasise the need to predict RNA-RNA interactions. In part, this is because there currently is no high-throughput method available for the reliable analysis of RNA-RNA interactions; however, computational prediction of RNA-RNA interactions is also essential for the identification of putative targets of known and *de novo* predicted ncRNAs. With the main exception of microRNA target prediction, the current approaches essentially evaluate the stabilities of the common complexes between ncRNAs and target RNAs by computing the overall free energy using two major strategies (see, *e.g*., [8] for a recent review).

The first strategy, represented through the implementations of RNAup[9] and IntaRNA[10], uses pre-calculated values for all possible regions of interaction to determine the energy required to make that site accessible (called the ED-value for the energy difference). The ED-value is then used to calculate a combined energy of the energy given by the duplex formed by the two interaction regions and the ED-values of both interaction regions. RNAup has a complexity of *O*(*n*^{3} + *nw*^{5}), whereas IntaRNA has a complexity of *O*(*n*^{2}), which makes it fast enough to be used in genome-wide screens. Both methods are able to predict complex interactions, like kissing hairpins, *as long as* the interaction is restricted to one region. However, there are well-known examples where several interaction sites were found, especially for longer ncRNAs. A prominent example is the interaction between OxyS and *fhlA* shown in [11].

The second strategy for RNA-RNA interaction predictions is usually handled with a class of approaches that simultaneously predict a common structure for both RNAs including their interaction. Some of the first approaches, *e.g*., pairfold[12], RNAcofold[13] and the method presented by Dirks *et al*. as part of the NUpack package [14], concatenate the two sequences using a special linker character. Then, a modified version of the usual RNA folding methods (like Mfold[15] and RNAfold[16]) is applied to cope with the linker symbol to predict the correct energies. Otherwise, a loop containing the linker symbol would be treated like a hairpin or internal loop, leading to incorrect energy values.

The main disadvantage of the concatenation approach is that the set of candidate joint structures becomes restricted. For this reason, double kissing hairpin interactions (like in OxyS-*fhlA*) cannot be considered. However, alternative (but also most resource demanding) methods have been introduced and extend the class of allowed joint structures. The IRIS tool [17] allowed several kissing hairpins using a maximum number of base pair energy model. Then, Alkan *et al*. [18] presented a more realistic energy model and showed the NP-completeness of an unrestricted model. Both approaches predict structures with minimum free energy.

A more stable approach is to consider the partition function because it allows the calculation of interaction probabilities and melting temperatures. This problem was solved independently by Chitsaz *et al*. [19] and Huang *et al*. [20]. In [21], hybrid probabilities were calculated. These approaches have high time complexities of *O*(*n*^{6}), which makes them infeasible for genome-wide applications. Methods to reduce the complexity range from approximation approaches [22, 23] to sparsification of the dynamical programming matrix [24].

Here, we present an algorithm for the prediction of RNA-RNA interactions in existing multiple alignments of RNA sequences. Its rationale is based on the assumption that a non-negligible amount of the RNA-RNA interactions contain compensatory base changes across the binding sites. The algorithm presented herein is an extension of the PETfold algorithm [25] and makes further use of the principles from RNAcofold[13] and computational strategies for hierarchical folding, *e.g*. [26, 27]. The latter approach was chosen due to the high computational costs of pseudoknot searches.

## Algorithm

The main idea of the introduced method is to use a hierarchical approach to predict an interaction by predicting reliable base pairs within a ncRNA and a mRNA (or another ncRNA), which is followed by prediction of reliable base pairs in the combined sequence. Via this approach, we are able to predict combined pseudoknotted structures, like kissing hairpins, that would be missed otherwise. In both steps, we apply a combined scoring method that predicts consensus base pairs from an alignment using evolutionary conservation and thermodynamic stability information.

The scoring for the first step is according to the standard PETfold approach, where we use thresholds for reliable base pairs that have been identified according to training on more than 30 verified interactions in bacteria, which is described later. For the second step, we define a constrained version of the PETfold scoring scheme.

Throughout this paper, we consider the concatenation of the two alignments and subsequently (in the base pair prediction process) the concatenation of the corresponding structures. *σ* will denote a set of base pairs, where the substructures in each part (*e.g*., ncRNA, mRNA and the base pairs that participate in the interaction) in respective alignments are concatenated or nested (in the dot bracket notation, these substructures have alignment lengths of the ncRNA and mRNA respectively). We use (*i*, *j*) to denote a Watson-Crick or G-U wobble base pair between columns *i* and *j*. This base pair could be an *intra*-molecular pair in each of the RNA molecules (ncRNA or mRNA) or an *inter*-molecular pair that is involved in an interaction between molecules.

Depending on the context, *σ* will either be interpreted as a specific structure that implicitly defines the single-stranded positions or as a partial structure that describes an ensemble of structures. In the first case, we define the set of single-stranded positions of a sequence *s* as

In the second case, we use ℰ(*σ*) = {*σ'*|*σ'* ⊇ *σ*} to denote the ensemble of all specific structures *σ'* extending *σ*. (*s*) denotes the set of nested secondary structures that are defined for the sequence *s*. We use the same notation for the consensus structures of a given multiple alignment with *n* sequences *s*^{1} ... *s*^{n}. In this case, a position 1 ≤ *i* ≤ || refers to a column in the alignment. Furthermore, we use *s* ∈ to indicate a sequence *s*^{1} ... *s*^{n}from the alignment.

The algorithm, like PETfold is a maximum expected scoring approach that combines the evolutionary probabilities Pr^{ev}[*σ*|] of a consensus structure, *σ*, given an alignment, with the thermodynamic probabilities of the associated structures in each sequence. Pr^{ev}[*σ*|] is generated using the stochastic context-free grammar (SCFG) from the Pfold model [28]. The Pfold model allows the computation of the probability Pr[*σ*|, *T*, *M*] of a consensus structure *σ* given an alignment , a phylogenetic tree *T* for that alignment, and a general background model *M* for secondary structures. Because the tree *T* is calculated from the alignment , and *M* is constant, we use Pr^{ev}[*σ*|] as short for Pr[*σ*|, *T*, *M*].

The (secondary structure) model itself is based on a SCFG that provides a distribution of secondary structures for a given alignment. The combined probability of an alignment and a consensus structure *σ* is

where Pr[*σ*|*T*, *M*] is the prior distribution of secondary structures and Pr[|*T*, *σ* ] is the probability of the alignment, given a known consensus structure. This is then transformed into Pr[, *σ*|*T*, *M*] by applying the Bayesian rule, and further into the posterior distribution Pr[*σ*|, *T*, *M*] of consensus structures *σ* by dividing by Pr[|*T*, *M*], which is the sum of all parse trees for an alignment given *T* and *M*. Note that the comma sign here is just a shortcut for ∧, *i.e*. Pr[*A*, *B*] = Pr[*A* ∧ *B*]. We will still use ∧ where it is appropriate.

The probability distributions themselves are formed as follows. For Pr[|*T*, *σ* ], there is an independent evaluation of all base pairs and single-stranded positions:

where is the i*th* column of , and for the constrained folding, where ( resp.) is the constrained structure on the first (second resp.) of the two concatenated alignments. For the prior model, the probability Pr[*σ*|*T*, *M*] provides an overall distribution of the secondary structures, which is estimated from rRNA and tRNA sequences. *M* is given by the following simple SCFG:

The evolutionary model and the prior model for RNA structures used in the Pfold model are combined into a single SCFG that provides a distribution over Pr[*A*, *σ*|*T*, *M*] (see additional file 1 for details).

To model the thermodynamic probabilities, we define *σ* (*s*^{k}, ) as the structure for the *k*-th sequence *s*^{k}of an alignment associated with the consensus structure *σ* of . Pr^{th}[*σ* (*s*^{k}, )|*s*^{k}] is the corresponding thermodynamic probability as defined by McCaskill's partition function approach [29].

Using the maximum expected scoring approach, these probabilities are transformed into reliabilities in a two-step approach. Throughout the paper, (*i*) is used to denote the reliability of a single-stranded region at alignment position *i* and (*i*, *j*) the reliability of a consensus base pair (*i*, *j*), where ℓ = 1, 2 refers to Step 1 or Step 2 of the combined approach.

### Refresher: PETfold scoring

Here, we briefly recall the scoring of PETfold, which is a maximum expected accuracy scoring method. For simplicity, we will exclude a description of the scoring of single-stranded positions. However, they are scored the same way as in the original PETfold approach; for more details, see [25].

The PETfold score is the sum of the evolutionary accuracy values plus the average sum of the thermodynamic accuracy values. For the evolutionary part, we compute the expected accuracy (or overlap) EA^{ev}(*σ*) of a specific consensus structure *σ* with all possible consensus structures, which are weighted according to their probabilities:

Recall that Pr^{ev}[*σ'*|] denotes the evolutionary probability of a structure *σ'* according to the Pfold SCFG as described above. |*σ* ∩ *σ'*| is the number of base pairs that are common between *σ* and *σ'* and thus denotes the overlap between these two structures.

For the thermodynamic part, the expected accuracy EA^{th}(*σ*) of *σ* with all structures for all sequences according to the thermodynamic ensembles is defined by

The combined expected accuracy consists of both parts, generally weighted with 1 for the conservation portion and *β* for the thermodynamic accuracy:

where *n* is the previously described number of sequences in the alignment. As shown previously [25], this final score can be calculated using the base pair reliabilities, where the combined reliability ℛ_{bp}(*i*, *j*) for a base pair (*i*, *j*) is given by

where (*i*, *j*, *s*) is the base pair probability of the pair (*k*, *l*) associated with columns (*i*, *j*) in sequence *s*. These reliabilities are calculated with an inside/outside algorithm and are central to the hierarchical approach presented in the following sections. The expected accuracy can then be calculated from the base pair reliabilities by

The consensus structure with the maximal reliability is then calculated using a Nussinov-style algorithm [30], where the base pairs are evaluated with reliabilities.

### Step 1: Intra-molecular partial structures

We use two alignments and of sequences and , where is a ncRNA and is its target sequence. For convenience, we adopt the convention of RNAcofold and assume that the positions in are numbered 1 ≤ *i* ≤ || and the positions in are numbered || + 1 ≤ *i* ≤ || + ||.

#### Selection of the initial structure

In the first step of the pipeline, we obtain the base pair reliabilities from Equation (4), which we denote (*i*, *j*). Using these reliabilities, the partial (constrained) structures and are determined independently for and . In the following steps, let be either or and *σ*^{p} be the partial structure calculated for . This is done by selecting only base pairs (*i*, *j*) with

where *δ* is a cut-off that must be ≥ 0.5 to avoid crossing structures. This is similar to the method by which consensus structures are predicted for single sequences [31] and has been shown to be more reliable for the prediction of consensus structures from alignments.

Here, however, we also have to estimate the contribution of each of the partial structures to the complete solution. Because the set of base pairs from a predicted consensus structure do not necessarily form a reasonable structure, we account for this by introducing a second threshold *γ*. High values for this threshold guarantee that each sequence used to create the consensus structure has a high likelihood and that the approximation, which we apply in the second step (as will be described by Equation (14)), is accurate.

To find the optimal value of the reliability threshold *δ*, its value is increased until the resulting ensemble of structures ℰ(*σ*^{p}) that are compatible with the partial structure *σ*^{p} is probable enough in the evolutionary model, in the thermodynamic model, or in both models, which is when

Here, Pr^{ev}[ℰ(*σ*^{p})|] (= Pr^{ev}[ℰ(*σ*^{p})|, *T*, *M*]) is the probability of the partial structure *σ*^{p} given the alignment in the evolutionary model *M* and tree *T*. This can be calculated from Pfold with the SCFG that combines the prior structural model with evolutionary information from the alignment (see additional file 1) as follows:

The term Pr[|*T*, *M*] has already been calculated (personal communication with Bjarne Knudsen) in Pfold as the sum of all possible parse trees for an alignment , given *T*, *M*:

Here, we add the calculation of

to Pfold by summing over all possible parse trees that are compatible with *σ*.

Pr^{th}[ℰ(*σ*^{p}(*s*, ))|*s*] is the probability of the partial structure *σ*^{p} given a sequence *s* in the thermodynamic model. This probability can be calculated using constrained partition folding as follows:

where is the free energy of the whole ensemble (as determined by RNAfold with parameters -p -d2) and is the free energy of the ensemble of structures ℰ(*σ*^{p}(*s*, )) with the base pairs in *σ*^{p}(*s*, ) as constraints, which can be calculated by RNAfold with parameters -C -p -d2.

#### Extension of constrained stems

Reliable *intra*-molecular base pairs are constrained as single-stranded in Step 2 of the algorithm because we are interested in pseudoknots of the concatenated sequence and the interactions in these induced loop regions. The drawback of this *ansatz* is that *intra*-molecular stems get instable because of intermediate unbased constraints. Thus, we may get incomplete stems. To deal with this problem, we extend the constrained stems. Inner and outer base pairs are added as long as the average reliability of the inner or outer extended stem, respectively, is larger than the threshold *δ*, and the probability of the partial structure is greater than or equal to *γ* either in the evolutionary or the thermodynamic model. That is, the average reliability of the total, extended stem has to be larger than a threshold.

Step 1 is summarised as pseudocode in Figure 1.

### Step 2: Constrained expected accuracy scoring

In the following, *s*_{1}& *s*_{2} denote the concatenated sequences of the two sequences *s*_{1}, *s*_{2} using the additional linker symbol & as done in RNAcofold. For Step 2 of the scoring, we calculate the expected accuracy of the ensemble of structures *σ* of *s*_{1}& *s*_{2}, which constitutes an interaction under the constraint that *σ* contains the partial reliable structures and of *s*_{1} and *s*_{2}, respectively. Because we use the numbering convention of RNAcofold, the union of the two partial structures and is the partial structure of *s*_{1}& *s*_{2}.

Now we have two problems to solve. On the one hand, we want to calculate the constrained accuracy given the partial structures and , which is defined as

On the other hand, we have to find a combined score for the partial structures and , and the interaction *σ*_{int} to evaluate the quality of an predicted interaction. The score must be maximal according to Equation (8).

We will demonstrate the problem and our solution for the thermodynamic folding. However, the same analysis applies to the evolutionary part, which is described later.

#### The thermodynamic part

The simplest formal solution to this problem would be to investigate directly the expected accuracy of joint structures *σ*:

where is the expected accuracy of a structure in one sequence pair *s*_{1}& *s*_{2} ∈ .

However, this would require that we compute the distribution Pr^{th}[*σ*|*s*_{1}& *s*_{2}], which can be done by a partition function approach for interacting structures. This is NP-complete in the full model [18] and even *O*(*n*^{6}) in a restricted model [19, 20], which is why the two-step approach is necessary. In the following, we ignore the index "th" for simplicity.

The relationship between and EA(*σ*) is now quantified. In the following, for a structure *σ*, we use *σ*_{1} ∪ *σ*_{2} ∪ *σ*_{int} to denote the partition of the base pairs of the first sequence, *σ*_{1}, the base pairs of the second sequence, *σ*_{2}, and the interacting base pairs, *σ*_{int}. Furthermore, for the partial structure *σ*, we use ℰ_{1}(*σ*) to denote the set of structures that extends *σ* using base pairs within the first sequence, *i.e*.,

The ensembles ℰ_{1,int}(*σ*), ℰ_{2,int}(*σ*) and ℰ_{1,2}(*σ*) are defined analogously.

Our approach uses one simplification, namely the assumption that the reliabilities for *intra*-molecular base pairs are dominated by the *intra*-molecular folding. This is equivalent to the assumption that the two structures fold independently. We formulate this as follows:

Because *σ*_{1} and *σ*_{2} are *partial joint* structures, this can be written using the ensemble function

The implication of this assumption is that the probabilities of the two structures *σ*_{1} and *σ*_{2} are merged independently into the joint probability Pr[ℰ_{int}(*σ*_{1} ∪ *σ*_{2})|*s*_{1}& *s*_{2}], see Equation (11) below. First, note that for two partial structures

by definition. Hence,

Intuitively, Pr[ℰ_{1,int}(*σ*_{2})|*s*_{1}& *s*_{2}] should be the same as Pr[*σ*_{2}|*s*_{2}]. This can be derived using the total probability formula:

Combining these equations we obtain the independence property:

Now we will use this property to relate to EA(*σ*). The independence property, as described in Equation (9), and the additivity of the expectation is the implication of the expected accuracy of a joint structure, which is the sum of the expected accuracy of the *intra*-molecular structures and the expected accuracy of the *inter*-molecular portion. To illustrate this, note that for any *σ*, *σ'*

by definition. Hence, by the additivity of the expectation we get

Now we can rewrite the first term using the independence property as follows:

which is the expected accuracy of *σ*_{1} in the sequence *s*_{1}. Analogously, we can do this for the second term . Thus, is the sum of the expected accuracies in the first and the second sequences and the expected accuracy of the interaction:

For the expected accuracy of the interaction

we still need to define Pr[*σ*|*s*_{1}& *s*_{2}]. For every *σ* = *σ*_{1} ∪ *σ*_{2} ∪ *σ*_{int},

Thus, in principle, to calculate the expected accuracy EA^{th, int} (*σ*) for the interaction, we must sum over all structures in *σ*_{1} and *σ*_{2}:

Because this is not feasible, we restrict ourselves to an ensemble of structures. Thus, instead of summing over all possible *σ*_{1} and *σ*_{2}, we use the partial structures and that were determined in the first step and approximate EA^{th,int}(*σ*) by

The second sum can now be simplified as follows:

where Equation (11') indicates the variation of the independence assumption of Equation (11) for the structure ensembles (see additional file 1). Thus, we finally have

Now is the constrained folding, where the positions covered by and are fixed. However, we have the problem that these structures might contain pseudoknots. Recall that the positions in and are fixed for folding and that we are considering all structures *σ* that contain and are nested on . Technically, we solve the problem using the fact that the set of structures that is nested on *σ*_{int}*and* compatible with is selected by considering all structures where the positions of are constrained as single-stranded. This implies that we use constrained cofolding via RNAcofold (parameters -C -p -d2), and the constraint ... *x*_{1}*x*_{1} ... & ... *x*_{2}*x*_{2} ..., where *x*_{1} (resp. *x*_{2}) denotes a position from (resp. ) that is constrained as single-stranded. The main difference is that the energy contributions could be slightly different, and therefore, we obtain only an approximation of the real distribution. For example, an extension of a helix in would be evaluated as an internal loop or hairpin. Note that this is not a major problem because we are mainly interested in the *inter*-molecular base pairs between *s*_{1} and *s*_{2} in this step. However, the recursion scheme of RNAcofold could easily be adapted to use new symbols for base pair constraints and a scoring scheme that is common to hierarchical approaches of pseudoknot structure prediction, which would avoid these problems.

Finally, we can rewrite the thermodynamic accuracy as the sum of probabilities as indicated in Equation (5). As shown in Equation (12), for a base pair (*i*, *j*) ∈ (ℓ = 1, 2), we want to use the probability of the associated sequence. To avoid competition with the probabilities for the *intra*-molecular base pairs calculated from RNAcofold, we set all of these base pairs to the same probability as described in Equation (7). For the *inter*-molecular base pairs, we use the base pair probabilities as provided by RNAcofold with constraints, which model from the constrained cofolding. However, these raw base pair probabilities (in the following denoted by ) are calculated under the constraint of and have therefore (to obtain the final base pair probabilities) to be multiplied by as indicated by Equation (14). Thus, we can score each base pair as follows:

where the 1 reflects the fixed reliability. However, we deviate from this scoring to weaken the independence assumption for the *intra*-molecular base pairs, which allows us to determine new *intra*-molecular base pairs from the constrained application of RNA-cofold. Thus, we score only the base pairs from the partial structures and with the probability in the associated sequence. In addition, to avoid competition with the probabilities for these base pairs calculated from RNAcofold, we simply set all of these base pairs to the same probability as described in Equation (7). To summarise, given the partial consensus structures and for an alignment as calculated in Step 1, the probability for a base pair (*i*, *j*) in sequence *s*_{1} & *s*_{2} ∈ in the second step is:

#### Single-stranded probabilities

Single-stranded probabilities are integrated in a similar way as the base pair probabilities, but with different weighting. The single-stranded probabilities are as follows:

Given the structure *σ* on an alignment with *m* columns, the set of all single-stranded positions in the consensus structure is denoted as ss(*σ*) = {*i* ∉ *σ*|1 ≤ *i* ≤ *m*}. Taking this into consideration, the complete version of Equation (2) is

and the evolutionary accuracy is determined similarly. The combined score is the sum of the base pair reliabilities and single-stranded reliabilities (weighted with the parameter *α*). For details, see [25].

#### The evolutionary part

The calculation for the presented thermodynamic accuracy is purely based on constrained folding. To obtain the complete constrained folding, we use the same approach for the evolutionary accuracy by applying a version of Pfold[28] that incorporates the constraints. For that purpose, the raw structural reliabilities (*i*, *j*) and (*i*) are calculated by the constrained folding with Pfold using the phylogenetic tree deduced from the concatenated alignment. As a linker, three prior-free columns are inserted between both alignments. The evolutionary reliabilities (*i*, *j*) for a base pair (*i*, *j*) and (*i*) for a single-stranded position *i* are calculated in the same manner as in Equation (16):

as well as in Equation (17):

The probabilities of the partial structures and are calculated as described in Equation (6). Step 2 is summarised as pseudocode in 1.

#### The final scoring

To summarise the reliabilities, a combined structure will be determined using the Nussinov algorithm on the following reliabilities:

where and are defined as above, as in Equation (16) and as in Equation (17).

Note that the base pairs in have a weight of 0 during folding of the constrained structure to allow for pseudoknot formation. Finally, we add the base pairs in to the constrained structure of Step 2. The flow of the structure reliabilities in the pipeline is summarised in Figure 3.

## Results and discussion

The algorithm presented herein was implemented in PETcofold (Seemann *et al*., submitted). As a proof of concept, we present an example of a bacterial sRNA-mRNA interaction. The in-depth analysis is described elsewhere (Seemann *et al*., submitted).

### Joint structure prediction of bacterial sRNA OxyS and its target mRNA fhlA

The small RNA OxyS represses the translation of the mRNA *fhlA*, which is mediated through base pairing at the ribosome binding site [11]. However, the OxyS-*fhlA* interaction involves a second binding site within the coding region of *fhlA*. Both interaction sites reside in stem loops such that OxyS and *fhlA* form a double kissing hairpin interaction.

Figure 4 shows the alignment and joint secondary structure prediction of the OxyS-*fhlA* complex, *i.e*., the secondary structures of OxyS and *fhlA* and the interaction between them, as predicted by our algorithm. The result of the prediction without extending the constrained stems is shown in Figure 4a, and the result with the extension of the constrained stems is shown in Figure 4b.

For OxyS-*fhlA*, our algorithm was able to consistently predict one of the two interaction sites. The second interaction site, which is situated in the *fhlA* coding region, was only predicted when the constrained stems were not extended in Step 1 of our algorithm. Otherwise the stem of *fhlA* that resides the second interaction site was extended both by inner and outer base pairs. Consequently, the unpaired region of the hairpin containing the second interaction site became shorter such that no interaction was predicted at this site.

### Algorithmic restrictions and potentials

The algorithm supports pseudoknots between the i*ntra*-molecular and *inter*-molecular base pairs, while the time complexity of *O*(*N* × *I* × *L*^{3}) is much lower than that of other approaches with the same ability. The time complexity is in the magnitude of PET-fold for the added sequence length *L* of both alignments, and it is linear with respect to the number of sequences *N* in the alignments and the number of iterations *I* in the adaptation of *δ* to find probable partial structures (*I* < *M*/2, where *M* is the sequence length of the longer alignment).

Pfold combines a SCFG with evolutionary information from an alignment of RNA sequences through an explicit evolutionary model. It is not clear whether the model learned from tRNA and rRNA secondary structures is appropriate for RNA cofolding. To avoid the bias of a wrong prior probability distribution Pr[*σ*|*T*, *M*] during the evolutionary scoring of the cofolding step, we omitted all SCFG rule probabilities as well as base pair substitution rates. In these cases, all base pair probabilities were calculated independently, and the different substitution rates of base pairs were ignored; thus, we observed that the entire performance decreased. A possible optimisation would be an adapted prior distribution for the cofolding step, which could be generated when sufficient verified RNA-RNA interaction data becomes available. However, the prediction of RNA secondary structure using evolutionary history is robust for different evolutionary speeds and substitution rate variations [32]. Hence, it is reasonable to assume that the deviation is fairly low using the prior probability distribution of *intra*-molecular structures.

The presented method assumes that both alignments have the same evolutionary history during the cofolding step. A more accurate method would consider independent phylogenetic trees for both RNAs, such as in Step 1, and a common tree for the interaction site. However, we do not know the interacting region in advance; thus, we would need an expectationm maximization algorithm, which would increase the running time of the algorithm to an unreasonable level. Furthermore, the energy contribution of the cofolding step might be slightly biased due to the constraint of the partial structures as single-stranded. We partly solve the resulting *intra*-molecular false predictions by extending the reliable stems in the partial structures, and, as already mentioned above, the RNAcofold algorithm and scoring scheme could be adapted to handle base pair constraints as single-stranded.

Furthermore, there is a limitation of the presented method with regard to interaction sites that are located outside conserved RNA structures. These regions are hard to align if they are, in addition, sequentially unconserved. Thus, our method will most likely miss binding sites located in unstructured and otherwise unrelated regions, *e.g*., miRNA target sites in UTR regions. However, once a correct alignment is found for these regions, then the presented approach still works if the interaction region is conserved or shows enough covariation.

Our algorithm is able to predict pseudoknots between the *intra*-molecular and (inter-)molecular base pairs. In addition, we are interested in more pseudoknots that can be predicted in a similar way using a pipeline of constrained structures. During an iteration of Step 2, additional reliable partial *inter*-molecular structures are constrained as long as new reliable base pairs appear. The final consensus structure is the union of all cofolding base pairs and the partial structures. The main unsolved problem is the weighted combination of the decreasing partial structure probabilities in one scoring scheme when the amount of constraints increases with each iteration.

## Conclusions

In summary, we introduced an extension of the PET-fold algorithm for the identification of interactions between two sets of multiple aligned RNA sequences, which exploits compensating base changes while taking *intra*-molecular partial structures and interaction sites into account. The implementation of the algorithm in PETcofold and its application are described in Seemann *et al*. (submitted).

## References

- 1.
Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, Stadler PF: Genome-wide mapping of conserved RNA secondary structure structures predicts thousands of functional non-coding RNAs in human. Nature Biotechnology. 2005, 23: 1383-1390. 10.1038/nbt1144

- 2.
Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006, 2: e33- 10.1371/journal.pcbi.0020033

- 3.
Torarinsson E, Sawera M, Havgaard JH, Fredholm M, Gorodkin J: Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Research. 2006, 16: 885-889. [(Erratum in: Genome Res. 2006 16:1439)]. 10.1101/gr.5226606

- 4.
Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics. 2006, 7: 173- 10.1186/1471-2105-7-173

- 5.
Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF: Structured RNAs in the ENCODE selected regions of the human genome. Genome Research. 2007, 17: 852-864. 10.1101/gr.5650707

- 6.
Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Computational Biology. 2007, 3: e65- 10.1371/journal.pcbi.0030065

- 7.
Torarinsson E, Yao Z, Wiklund ED, Bramsen JB, Hansen C, Kjems J, Tommerup N, Ruzzo WL, Gorodkin J: Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions. Genome Research. 2008, 18: 242-251. 10.1101/gr.6887408

- 8.
Backofen R, Hess WR: Computational prediction of sRNAs and their targets in bacteria. RNA Biol. 2010, 7:

- 9.
Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, Hofacker IL: Thermodynamics of RNA-RNA binding. Bioinformatics. 2006, 22 (10): 1177-82. 10.1093/bioinformatics/btl024

- 10.
Busch A, Richter AS, Backofen R: IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics. 2008, 24 (24): 2849-56. 10.1093/bioinformatics/btn544

- 11.
Argaman L, Altuvia S:

*fhlA*repression by OxyS RNA: kissing complex formation at two sites results in a stable antisense-target RNA complex. Journal of Molecular Biology. 2000, 300 (5): 1101-12. 10.1006/jmbi.2000.3942 - 12.
Andronescu M, Zhang ZC, Condon A: Secondary structure prediction of interacting RNA molecules. Journal of Molecular Biology. 2005, 345 (5): 987-1001. 10.1016/j.jmb.2004.10.082

- 13.
Bernhart SH, Tafer H, Mückstein U, Flamm C, Stadler PF, Hofacker IL: Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol. 2006, 1: 3- 10.1186/1748-7188-1-3

- 14.
Dirks RM, Bois JS, Schaeffer JM, Winfree E, Pierce NA: Thermodynamic Analysis of Interacting Nucleic Acid Strands. SIAM Review. 2007, 49: 65-88. 10.1137/060651100

- 15.
Zuker M: Prediction of RNA secondary structure by energy minimization. Methods in Molecular Biology. 1994, 25: 267-94.

- 16.
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatshefte Chemie. 1994, 125: 167-188. 10.1007/BF00818163

- 17.
Pervouchine DD: IRIS: intermolecular RNA interaction search. Genome Inform. 2004, 15 (2): 92-101.

- 18.
Alkan C, Karakoc E, Nadeau JH, Sahinalp SC, Zhang K: RNA-RNA interaction prediction and antisense RNA target search. Journal of Computational Biology. 2006, 13 (2): 267-82. 10.1089/cmb.2006.13.267

- 19.
Chitsaz H, Salari R, Sahinalp SC, Backofen R: A partition function algorithm for interacting nucleic acid strands. Bioinformatics. 2009, 25 (12): i365-73. 10.1093/bioinformatics/btp212

- 20.
Huang FWD, Qin J, Reidys CM, Stadler PF: Partition function and base pairing probabilities for RNA-RNA interaction prediction. Bioinformatics. 2009, 25 (20): 2646-54. 10.1093/bioinformatics/btp481

- 21.
Huang FWD, Qin J, Reidys CM, Stadler PF: Target prediction and a statistical sampling algorithm for RNA-RNA interaction. Bioinformatics. 2010, 26 (2): 175-81. 10.1093/bioinformatics/btp635

- 22.
Chitsaz H, Backofen R, Sahinalp SC: biRNA: Fast RNA-RNA Binding Sites Prediction. Proc. of the 9th Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of Lecture Notes in Computer Science. Edited by: Salzberg S, Warnow T. 2009, 25-36.

- 23.
Salari R, Backofen R, Sahinalp SC: Fast prediction of RNA-RNA Interaction. Proc. of the 9th Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of Lecture Notes in Computer Science. Edited by: Salzberg S, Warnow T. 2009, 261-272.

- 24.
Salari R, Möhl M, Will S, Sahinalp SC, Backofen R: Time and space efficient RNA-RNA interaction prediction via sparse folding. Proc of RECOMB 2010. 2010

- 25.
Seemann SE, Gorodkin J, Backofen R: Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Research. 2008, 36: 6355-6362. 10.1093/nar/gkn544

- 26.
Gaspin C, Westhof E: An interactive framework for RNA secondary structure prediction with a dynamical treatment of constraints. J Mol Biol. 1995, 254: 163-174. 10.1006/jmbi.1995.0608

- 27.
Jabbari H, Condon A, Pop A, Pop C, Zhao Y: HFold: RNA Pseudoknotted Secondary Structure Prediction Using Hierarchical Folding. In Algorithms in Bioinformatics, 7th International Workshop, WABI Philadelphia, PA, USA, September 8-9, 2007, Proceedings. Edited by: Giancarlo R, Hannenhalli S. 2007, 323-334.

- 28.
Knudsen B, Hein JJ: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999, 15: 446-454. 10.1093/bioinformatics/15.6.446

- 29.
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6-7): 1105-19. 10.1002/bip.360290621

- 30.
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ: Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics. 1978, 35: 68-82. 10.1137/0135006

- 31.
Ding Y, Chan CY, Lawrence CE: RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA. 2005, 11 (8): 1157-66. 10.1261/rna.2500605

- 32.
Knudsen B, Andersen ES, Damgaard C, Kjems J, Gorodkin J: Evolutionary rate variation and RNA secondary structure prediction. Comput Biol Chem. 2004, 28 (3): 219-226. 10.1016/j.compbiolchem.2004.04.001

- 33.
Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview Version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009, 25 (9): 1189-91. 10.1093/bioinformatics/btp033

## Acknowledgements

We thank Bjarne Knudsen for inspiring discussions about extensions of the Pfold method.

This work was supported by the Lundbeck Foundation (grant 374/06 to J.G.), the Danish Research Council for technology and production (grant 274-09-0282 to J.G.), the Danish Center for Scientific Computation (J.G.), the German Federal Ministry of Education and Research (BMBF grant 0313921 FRISYS to R.B.) and the German Research Foundation (DFG grant BA 2168/2-1 SPP 1258 to A.S.R. and R.B.).

## Author information

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

SES, RB and JG designed the algorithm. SES implemented the algorithm. ASR designed and performed the analysis of the algorithm. RB drafted the manuscript. All authors contributed to the manuscript, read and approved the final manuscript.

Stefan E Seemann, Andreas S Richter contributed equally to this work.

## Electronic supplementary material

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

## About this article

### Cite this article

Seemann, S.E., Richter, A.S., Gorodkin, J. *et al.* Hierarchical folding of multiple sequence alignments for the prediction of structures and RNA-RNA interactions.
*Algorithms Mol Biol* **5, **22 (2010) doi:10.1186/1748-7188-5-22

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Base Pair
- Partial Structure
- Parse Tree
- Joint Structure
- Consensus Structure