RNA-RNA interaction prediction using genetic algorithm

Background RNA-RNA interaction plays an important role in the regulation of gene expression and cell development. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. In the RNA-RNA interaction prediction problem, two RNA sequences are given as inputs and the goal is to find the optimal secondary structure of two RNAs and between them. Some different algorithms have been proposed to predict RNA-RNA interaction structure. However, most of them suffer from high computational time. Results In this paper, we introduce a novel genetic algorithm called GRNAs to predict the RNA-RNA interaction. The proposed algorithm is performed on some standard datasets with appropriate accuracy and lower time complexity in comparison to the other state-of-the-art algorithms. In the proposed algorithm, each individual is a secondary structure of two interacting RNAs. The minimum free energy is considered as a fitness function for each individual. In each generation, the algorithm is converged to find the optimal secondary structure (minimum free energy structure) of two interacting RNAs by using crossover and mutation operations. Conclusions This algorithm is properly employed for joint secondary structure prediction. The results achieved on a set of known interacting RNA pairs are compared with the other related algorithms and the effectiveness and validity of the proposed algorithm have been demonstrated. It has been shown that time complexity of the algorithm in each iteration is as efficient as the other approaches.


Background
Major successes have been achieved in the treatment of some cancers, including colon, breast and pancreatic by suppressing the gene expression involved in the development of these diseases using RNA-RNA interaction. The interaction between two RNAs is known as the newest and the most efficient method for gene silencing. It has been shown that the small interfering RNAs (siRNAs) can be used for silencing their target mRNAs [1]. Furthermore, small RNAs (sRNAs) play an important role in the regulation of gene expression. They usually bind to their target mRNAs to prevent their translation [2].
In RNA-RNA Interaction Prediction (RRIP) problem, two RNA sequences are given as inputs and the goal is to find the minimum free energy Secondary Structure of Interacting RNAs (SSIR). To tackle this problem, some algorithms have been proposed by research groups. Andronescu et al. [3] proposed a method based on dynamic programming in which two RNA sequences are concatenated as a single sequence and its secondary structure is calculated [3]. Another approach calculates the partition function of a secondary structure complex of multiple interacting RNAs [4]. This method rigorously extends those models of secondary structure to the multistranded case. The tools such as RNAhybrid [5], UNAFold [6] and RNAduplex from ViennaRNA package [7] reduce computational time complexity by ignoring all the internal base pairings in both RNAs. RNAup [8,9] extends the standard partition function approach to RNA secondary structures and employs the single (unpaired) regions on each RNA to find the interaction between them. RNAplex [10,11] finds the possible hybridization sites for RNA in the large RNA databases based on a slight simplification of the energy model. In this model, the loop energy is assumed to be a function of the loop size.
Recently, a novel algorithm based on the multiple context free grammars was introduced in [12]. Accordingly, two real values called transition and emission probabilities are specified for each rule of the grammar. Then, a derivation tree is constructed for the grammar based on the rules with high probability.
In heuristic based approaches, inRNAs [13] firstly predicts the loop regions in the native structure of each sequence, and then finds the optimal non-conflicting interaction between two RNAs. IntaRNA [14] combines the accessibility of target sites as well as the existence of a user-definable seed to find RNA-RNA interaction. Minimizing the joint free energy between two RNA molecules under a number of energy models with growing complexity was introduced in [2]. Another interesting heuristic approach for this problem was presented in [15]. This algorithm employs some dot matrices representation of all possible base pairs for finding the secondary structure of each RNA and between the two RNAs.
An approximation algorithm was presented in [1], where an RNA-RNA interaction graph is created in which every edge represents a possible bond in or between two RNAs. A set of edges is found to maximize the number of bonds. A statistical sampling algorithm was introduced in [16] based on some modifications to the grammars. It calculates the interaction probabilities for any given single region on RNA. RactIP [17] predicts RNA-RNA interaction using integer programming. Accordingly, it uses the approximate information of the internal and external base pairing probabilities of joint structures as an objective function of integer programming. PETcofold [18] employs covariance information in the internal and external base pairs to predict SSIR of two multiple alignments of RNA sequences. InteRNA [19] reduces the time and space complexity of RRIP problem described by Alkan et al. [2] using dynamic programming sparsification.
One of the pitfalls of the most existing algorithms is their high computational time to predict RNA-RNA interaction, while a number of them have not been performed on some RNA pairs to predict binding sites between two single regions of RNAs. Alkan et al. [2] proved that RNA-RNA interaction prediction is an NP-complete problem.
In this paper, we propose a new genetic algorithm called GRNAs as an appropriate solution for the RRIP problem. This algorithm can be performed on some standard RNA pairs with high accuracy. In this method, at first, all possible stems in each RNA as well as all possible hybrid regions between two RNAs are extracted from a dot matrix. The initial population consists of some individuals, where each of them is an SSIR obtained from some randomly extracted stems and hybrid regions of the dot matrix. The minimum free energy is computed for each individual as a fitness value. For each generation, some individuals are selected to mate based on their fitness values and form a new population. Then, mutation operation is done on a few individuals. The population generation terminates when the free energy of an individual is minimum enough. Finally, one of the best individuals is selected as an optimal SSIR. The algorithm is conducted on some real datasets and compared with some other algorithms to investigate efficiency and validity of the proposed method. The time and space complexity of the proposed method in each iteration is 0(l 2 + |P|), where l and |P| indicate the sum of the length of two RNAs and the length of an individual, respectively. The results show that the accuracy of the algorithm is as efficient as the other related methods.
The rest of this paper is organized as follows. In Section 2, some definitions and notations are described. In Section 3, a genetic algorithm called GRNAs is presented to predict RNA-RNA interaction. The results and conclusion are discussed in Sections 4 and 5, respectively.

Definitions and notations
An RNA molecule is composed of a long, usually singlestranded chain of nucleotide units; adenine (A), cytosine (C), guanine (G) and uracil (U). Thus, R = r 1 r 2 … r n in 5 ' − 3 ' direction is an RNA sequence, where |R| = n and r i ∈ {A, C, G, U} (1 ≤ i ≤ n). The RNA structure is formed by the creation of hydrogen bonds between Watson-Crick complementary bases (A − U and C − G) and a Wobble base pair (G − U).
In an RNA secondary structure, each base interacts with at most one other base, and no base pairs cross each other. Two bases r i and r j (1 ≤ i < j ≤ n) of the base pair (r i , r j ) are represented by ' (' and ') ', respectively and each unpaired base is declared by '. '. A stem consists of subsequent base pairs and a loop is one sequence of consecutive unpaired bases.
A secondary structure of two interacting RNAs, R 1 and R 2 , contains the set of stems in each RNA and the hybrid regions between two RNAs as well as loops. Each hybrid region consists of subsequent hybrid base pairs between two RNAs. Two bases r i ∈ R 1 and r j ∈ R 2 of the hybrid base pair (r i , r j ) are represented by ' [' and '] ', respectively.
Example. Let R 1 = CGGUUUGAGGUCCG and R 2 = ACUACCGAAAAGUU be two RNA sequences. The SSIR of the two RNAs is shown as follows; In this example, each RNA has one stem. In the left hand RNA, one stem is found by the production of bonding between CGG and the reverse CCG (GCC). There are two hybrid regions between the sequences R 1 and R 2 . The first one is produced by binding between UUU and the reverse AAA (AAA). The second one is generated by binding between GGU and the reverse ACC (CCA).

A new genetic algorithm for RNA-RNA interaction prediction
Genetic algorithm is an optimization method based on evolutionary biology that is widely used to solve search and optimization problems [20][21][22]. In this section, a new genetic algorithm, GRNAs, is presented to predict RNA-RNA interaction. In the following, initial population, fitness function, crossover and mutation operations are introduced.

Initial population
In the proposed algorithm, two RNA sequences R ' = r ' 1 r ' 2 … r ' n (|R ' | = n) and R " = r " 1 r " 2 … r " m (|R " | = m) are given as inputs. The two RNAs R ' and R " are converted to the sequence R = r 1 r 2 … r l as follows; where N is an arbitrary character to distinguish between two sequences and l = m + n + 1. A dot matrix M R lÂl is made, where the axes in the dot matrix correspond to the two sequences R and reverse R,as follows; where r i and r l-j+1 (1 ≤ i, j ≤ l) are the i-th and l-j+1-th nucleotides in the sequence R = r 1 r 2 … r l , respectively. Each right-skewed consecutive value of 1's which is parallel to the main diagonal in the dot matrix is selected as a sub-diagonal. Each sub-diagonal shows a possible stem in each RNA or hybrid region between two RNAs. Set D R shows all sub-diagonals in the dot matrix as follows; where i and j indicate the start position of the row and the column of a sub-diagonal with t+1 consecutive 1's, respectively. Hence, each <i,j,t> is a set of consecutive base pairs as follows; According to the prior knowledge, we know that Watson-Crick base pairs occur more than Wobble in RNA structures. In this regard, we compute the percent of G-U pairs on our dataset approximated 14%. So, G-U pairs are removed from the sub-diagonals including more than 14% G-U pairs. For each For each individual P, a |D R | -tuple is randomly made as; In other words, individual P contains those subdiagonals that their related x in I is equal to 1. Here, x k = 1 means sub-diagonal d k ∈ P, while x k = 0 points to d k ∉ P. Then, the individual P is constructed as follows; where the set C k (modified k-th sub-diagonal) is obtained as follows; Here, C k ∈ P is a set of base pairs in d k without any common base pairs in the previous sub-diagonals, d t (1 ≤ t < k), of individual P. Finally, C k is modified by removing the lonely base pairs from it as follows; Notice that a set of the produced individuals creates an initial population.

Fitness function
For each individual P, let S and H represent two RNAs secondary structures and binding sites between the two RNAs, respectively. Therefore, the fitness function is defined as follows; where for C ∈ {S, H}, MFE(C) denotes the minimum free energy of structure C. We apply RNAeval.exe [7] to compute minimum free energies of secondary structures and binding sites separately.

Crossover
Crossover operation is performed between the individuals with the rate of 0.9. The good and mediocre individuals are transferred to the next population. The remaining individuals are consecutively selected for crossover operation.
Let P 1 and P 2 are selected as parents. For each individual P i , 1 ≤ i ≤ 2, a |D R | -tuple is defined as follows: In this procedure, a random position k, 1 ≤ k ≤ |D R |, is selected and I 1 and I 2 are crossed. Then, I ' 1 and I ' 2 are formed as follows: In the following, two new individuals P ' 1 and P ' 2 are generated from I ' 1 and I ' 2 similar to the described method in the initial population (refers to the Equation 1), respectively.

Mutation
Mutation operation is done with the rate of 0.1 on a few randomly selected individuals in each generation. Assume that P is an individual selected for mutation. For the individual P, a |D R | -tuple is obtained by: An item x j , where x j = 1 is randomly selected and replaced to 0. Then, another x k (x k = 0) is replaced to 1. The new individual, P, is obtained from the changed I based on the proposed method in initial population (refers to the Equation 1). Finally, if C k ∉ P (all the base pairs of d k have overlapping with the existence subdiagonals in P), the other x i (x i = 0) is selected to replace with 1. This process continues until C k ∈ P or the defined number of generations is reached. When mutation is performed on a number of individuals, they will be increasingly sorted based on their fitness values.

Termination of the GRNAs algorithm
The GRNAs algorithm terminates when the best individual in definite generations will not be changed or the defined number of generations be reached. After the termination of the algorithm, one of the best individuals is selected as the best folding of two RNAs and the best interaction between the two RNAs. On the other hand, for storing the h individuals of length |P| we need O(h. |P|) space complexity. Furthermore, the population in the algorithm uses both dot matrix and an array of sub-diagonals. Hence, the storage complexity of these two types is O(l 2 ), where l denotes sum of the length of the two RNAs. Thus the total space complexity of GRNAs is O(h|P| + l 2 ) which is simplified with O(l 2 + |P|).

Results and discussion
The GRNAs has been performed on a machine with two-Core Intel(R) Duo processor T6670 2.20 GHz and 4 GB RAM to predict the interaction structure between two RNAs. The proposed genetic algorithm is performed on two well-known datasets of RNA-RNA interactions. The first set contains: R1inv-R2inv, Tar-Tar*, DIS-DIS, CopA-CopT and IncRNA 54 -RepZ in the Escherichia coli bacteria [12]. The joint secondary structures of this dataset include kissing hairpins. We evaluate the performance of joint secondary structure prediction of this dataset.
To evaluate the prediction accuracy of the GRNAs, F-measure (F) and Matthews Correlation Coefficient (MCC) [18] are calculated using sensitivity (Sn) and positive predictive value (PPV). Assume that the number of correctly predicted base pairs, the number of false predicted base pairs and the number of unpredicted base pairs are indicated by TP, FP and FN, respectively. So, Sn, PPV, F, and MCC are defined as follows: Table 1 shows the results of joint secondary structure prediction of our algorithm, GRNAs, in Matthews Correlation Coefficient [18]. It is also compared to the state-of-the-art methods such as PETcofold [18], the sparsified version of inteRNA [19], Pairfold [3], RactIP [17] and RNAcofold [7]. The MCC evaluates the joint structure, i.e. both the binding sites between the two RNAs and the secondary structure of each single RNA. In two pairs MicA-ompA and OxyS-fhlA, PETcofold has the best MCC value and in other two pairs RyhB-uoffur and RyhB-sodB, GRNAs has the highest MCC value.
We also compared GRNAs with four state-of-the-art methods: inRNAs, IntaRNA, RNAup and RactIP. Table 2 shows the results of prediction in binding sites in sensitivity and positive predictive values on the datasets Table 1 The results of joint secondary structure prediction of GRNAs in MCC in comparison to the PETcofold and other joint structure prediction methods such as RNAcofold, inteRNA, Pairfold and RactIP  Table 2 The results of binding sites prediction of GRNAs in sensitivity and positive predictive value on the datasets [12,14] in comparison to inRNAs, IntaRNA, RNAup and RactIP  [12,14] using the proposed approach and mentioned methods. Here, only external base pairs are considered to measure accuracy. According to the second half of Table 2, the average positive predictive value on datasets [12,14] is 96.03%. This table shows that our method is comparable with the existing methods. Table 3 indicates the accuracy of GRNAs in F-measure with considering binding sites. In GRNAs, the average F-measure is 89%. The results shown in Tables 2 and 3 indicate that GRNAs works as efficient as the other methods in average sensitivity, positive predictive value and F-measure. Our genetic algorithm randomly selects the subdiagonals to make individuals. Therefore different individuals with variety sub-diagonals are constructed. Due to the nature of proposed genetic approach, some of the RNA-RNA interactions can be predicted more accurate than the other algorithms. For example the accuracy rate of Tar-Tar* is obtained 100%, while maximum accuracy of the other approaches is 90.9%.
We compare the computational time complexity of GRNAs and state-of-the-are methods. The time and space complexity of several algorithms (TIRNA [15], App (approximation algorithm to predict SSIR) [14], ripalign [23], and other methods in the Tables 1, 2 and 3) are given in Table 4. As it is shown, the time complexity of GRNAs in each iteration is O(l 2 + |P|), where l and |P| indicate sum of the length of the two RNAs and the length of an individual, respectively. Also, Space complexity of the proposed method is O(l 2 + |P|).

Conclusion
In this paper, a new genetic algorithm was introduced for solving RNA-RNA interaction prediction problem. In this algorithm, all possible stems in each RNA and hybrid regions between two RNAs are extracted from a dot matrix showing all possible base pairs. Initial population is formed based on some stems and hybrid regions of the dot matrix. Minimum free energy is considered as a fitness function. Crossover operation is done between some consecutive individuals in the population. Mutation is taken on a few randomly selected individuals. Population generation continues until the minimum free energy of the best individual becomes minimal enough. Finally, one of the best individuals is selected to form RNA-RNA interaction structure. The proposed algorithm was tested on several RNA-RNA interaction datasets. The experimental results indicate a high accuracy of GRNAs. Furthermore, time and space complexity of GRNAs is as efficient as the other related studies.

Availability
The program of GRNAs is available at http://mostafa.ut. ac.ir/grnas. Table 3 The results of binding sites prediction of GRNAs in F-measure on the datasets [12,14]    Here, l and |P| indicate the sum of the length of two RNAs and the length of an individual, respectively.