 Research
 Open Access
 Published:
StreAM\(T_g\): algorithms for analyzing coarse grained RNA dynamics based on Markov models of connectivitygraphs
Algorithms for Molecular Biology volume 12, Article number: 15 (2017)
Abstract
Background
In this work, we present a new coarse grained representation of RNA dynamics. It is based on adjacency matrices and their interactions patterns obtained from molecular dynamics simulations. RNA molecules are wellsuited for this representation due to their composition which is mainly modular and assessable by the secondary structure alone. These interactions can be represented as adjacency matrices of k nucleotides. Based on those, we define transitions between states as changes in the adjacency matrices which form Markovian dynamics. The intense computational demand for deriving the transition probability matrices prompted us to develop StreAM\(T_g\), a streambased algorithm for generating such Markov models of kvertex adjacency matrices representing the RNA.
Results
We benchmark StreAM\(T_g\) (a) for random and RNA unit sphere dynamic graphs (b) for the robustness of our method against different parameters. Moreover, we address a riboswitch design problem by applying StreAM\(T_g\) on six long term molecular dynamics simulation of a synthetic tetracycline dependent riboswitch (500 ns) in combination with five different antibiotics.
Conclusions
The proposed algorithm performs well on large simulated as well as real world dynamic graphs. Additionally, StreAM\(T_g\) provides insights into nucleotide based RNA dynamics in comparison to conventional metrics like the rootmean square fluctuation. In the light of experimental data our results show important design opportunities for the riboswitch.
Background
The computational design of switchable and catalytic ribonucleic acids (RNA) becomes a major challenge for synthetic biology [1]. So far, available models and simulation tools to design and analyze functionally complex RNA based devices are very limited [2]. Although several tools are available to assess secondary as well as tertiary RNA structure [3], current capabilities to simulate dynamics are still underdeveloped [4] and rely heavily on atomistic molecular dynamics (MD) techniques [5]. RNA structure is largely modular and composed of repetitive motifs [4] that form structural elements such as hairpins and stems based on hydrogenbonding patterns [6]. Such structural modules play an important role for nano design [1, 7].
In order to understand RNA dynamics [8, 14] we develop a new method to quantify all possible structural transitions, based on a coarse grained, transferable representation of different module sizes. The computation of Markov State Models (MSM) have recently become practical to reproduce longtime conformational dynamics of biomolecules using data from MD simulations [15].
To this end, we convert MD trajectories into dynamic graphs and derive the Markovian dynamics in the space of adjacency matrices. Aggregated matrices for each nucleotide represent RNA coarse grained dynamics. However, a full investigation of all transitions is computationally expensive.
To address this challenge we extend StreaM—a streambased algorithm for counting 4vertex motifs in dynamic graphs with an outstanding performance for analyzing (bio)molecular trajectories [16]. The extension StreAM computes one transition matrix for a single set of vertices or a full set for combinatorial many matrices. To gain insight into global folding and stability of an RNA molecule, we propose StreAM\(T_g\): It combines all adjacencybased Markov models for a nucleotide into one global weighted stochastic transition matrix \(T_g(a)\). However, deriving Markovian dynamics from MD simulations of RNA is an emerging method to describe folding pathways [13] or to elucidate the kinetics of stacking interactions [11]. Especially MSM of atomistic aptamer simulations like the theophylline [12] and thrombin aptamer could help to understand structurefunction relationships as well as the folding process [18]. Nonetheless, all the methods mentioned above rely on Root Mean Square Deviation (RMSD) computations in combination with clustering in order to identify relevant transition states. For StreAM\(T_g\), the transition states are given by small adjacency matrices representing structural motifs.
The remainder of this paper is structured as follows: In “Our approach for coarse grained analysis”, we introduce the concept of StreAM\(T_g\) as well as our biological test setup. We describe details of the algorithm in “Algorithm”. We present runtime evaluations as well as application scenario of our algorithm in “Evaluation” for a synthetic tetracycline (TC) dependent riboswitch (TCAptamer). Furthermore, we investigate the influence upon ligand binding of four different TC derivates and compare them with a conventional method. Finally, we summarize our work in “Summary, conclusion, and future work”.
Our approach for coarse grained analysis
Structural representation of RNA
Predicting the function of complex RNA molecules depends critically on understanding both, their structure as well as their conformational dynamics [17, 19]. To achieve the latter we propose a new coarse grained RNA representation. For our approach, we start with an MD simulation to obtain a trajectory of the RNA. We reduce these simulated trajectories to nucleotides represented by their (\(C3'\)) atoms. From there, we represent RNA structure as an undirected graph [20] using each \(C3'\) as a vertex and distance dependent interactions as edges [3]. It is well known that nucleotidebased molecular interactions take place between more than one partner [21]. For this reason interactions exist for several edges observable in the adjacency matrix (obtained via a Euclidean distance cutoff) of \(C3'\) coordinates at a given timestep. The resulting edges represent, e.g., strong local interactions such as WatsonCrick pairing, Hoogsteen, or \(\pi{}\pi\)stacking.
Our algorithm estimates adjacency matrix transition rates of a given set of vertices (nucleotides) and builds a Markov model. Moreover, by deriving all Markov models of all possible combinations of vertices, we can reduce them afterwards into a global weighted transition matrix for each vertex representing the ensemble that the nucleotide modeled as a vertex is immersed in.
Dynamic graphs, their analysis, and Markovian dynamics
A graph \(G = (V,E)\) is an ordered pair of vertices \(V = \{v_1, v_2, \dots v_{V}\}\) and edges E. We refer to a single vertex of V as a. Here, we only consider undirected graphs without selfloops, i.e., \(E \subseteq \{\{v,w\}: v,w \in V, v \ne w\}\). We define a selfloop as an edge that connects a vertex to itself. For a subset \(V'\) of the vertex set V, we refer to \(G[V'] = (V', E'), \; E' := \{\{v,w\} \in E : v, w \in V'\}\) as the \(V'\) induced subgraph of G. We refer to the powerset of V as \(\mathbb {P}(V)\). The adjacency matrix \(A(G) = A_{i,j}\) (Eq. 1) of a graph G is a \(V \times V\) matrix, defined as follows:
Here, the symbol \(\Diamond\) denotes for an undefined matrix entry. We denote the set of all adjacency matrices of size k as \(\mathcal{A}_k\), with \(\mathcal{A}_k = 2^{\frac{k \cdot (k1)}{2}}\). In our current implementation k can takes values in \(\{2,3,4,5,6,7,8,9,10\}\). With concat(A), we denote the rowbyrow concatenation of all defined values of an adjacency matrix A. We define the adjacency id of a matrix A as the numerical value of the binary interpretation of its concatenation, i.e., \(id(A) = concat(A)_2 \in \mathbb {N}\). We refer to \(id(V') := id(A(G[V']))\) as the adjacency id of the \(V'\)induced subgraph of G. For example, the concatenation of the adjacency matrix of graph \(G_1[V']\) (shown in Fig. 1) is \(concat(A(G_1[V'])) = \text {011011}\) and its adjacency id is \(id(V') = 011011_2 = 27_{10}\).
As a dynamic graph \(G_t = (V, E_t)\), we consider a graph whose edge set changes over time. For each point in time \(t \in [1,\tau ]\), we consider \(G_t\) as the snapshot or state of the dynamic graph at that time. The transition of a dynamic graph \(G_{t1}\) to the next state \(G_{t}\) is described by a pair of edge sets which contain the edges added to and removed from \(G_{t1}\), i.e., \((E^+_{t}, E^_{t})\). We refer to these changes as a batch, defined as follows: \(E^+_{t} := E_{t} \backslash E_{t1}\) and \(E^_{t} := E_{t1} \backslash E_{t}\). The batch size is referred as \(\delta _t=E^+_t+E^_t\) and the average batch size is refered as \(\delta _{avg}\) and is defined as \(\frac{\sum _t \delta _t}{\tau }\).
The analysis of dynamic graphs is commonly performed using stream or batchbased algorithms. Both output the desired result for each snapshot \(G_t\). Streambased algorithms take a single update to the graph as input, i.e., the addition or removal of an edge e. Batchbased algorithms take a pair \((E^+_{t+1},E^_{t+1})\) as input. They can always be implemented by executing a streambased algorithm for each edge addition \(e \in E^+_{t+1}\) and removal \(e \in E^_{t+1}\). We refer to \(id_t(V')\) as the adjacency id of the \(V'\)induced subgraph of each snapshot of \(G_t\). The result of analyzing the adjacency id of \(V'\) for a dynamic graph \(G_t\) is a list \((id_t(V'): t \in [1,\tau ])\). We consider each pair \((id_t(V'), id_{t+1}(V'))\) as an adjacency transition of \(V'\) and denote the set of all transitions as \(\mathcal{T}(V')\). Then, we define the local transition matrix \(T(V')\) of \(V'\) as a \(\mathcal{A}_k \times \mathcal{A}_k\) matrix, which contains the number of transitions between any two adjacency ids over time, i.e., \(T_{i,j}(V') := (i+1,j+1) \in \mathcal{T}(V')\) for an adjacency size k. From \(T(V')\), we can derive a Markov model to describe these transitions.
By combining all possible \(T(V')\) where \(V' \in \mathbb {P}(V): V' = k\) and \(a \in V'\), we derive a transition tensor \(C_{a}(V)\). Thus \(C_{a}(V)\) has the dimensions of \(\mathcal{A}_k \times \mathcal{A}_k \times (k1)! \left( {\begin{array}{c}V\\ k1\end{array}}\right)\).
We define the weighting matrix \(W(V')\) with the dimensions of \(\mathcal{A}_k \times (k1)! \left( {\begin{array}{c}V\\ k1\end{array}}\right)\). \(W(V')\) contains the weighting for every subset \(V' \in C_{a}(V)\). It is defined as \(W(V'):= \frac{S(V')}{\sum _{V' \in C_a(V)} S(V')}\). Here, \(S(V')\) is a matrix containing the sum of every transition between adjacency \(id(V')\) and every other \(id(V')\) of the same matrix \(T(V')\) for all \(V' \in C_a(V)\). Hence \(S(V')\) has the dimensions \(\mathcal{A}_k \times (k1)! \left( {\begin{array}{c}V\\ k1\end{array}}\right)\). Thus \(W(V')\) is considered as the local distribution weighted by its global distribution of transitions matrices of \(V'\). Finaly, we define a global transition matrix, a vertex a is immeresd in, as \(T_g(a)=\sum _{V' \in C_{a}(V)} W(V') \times T(V')\) with the dimensions \(\mathcal{A}_k \times \mathcal{A}_k\).
For a local or global transition matrix the respective dominant eigenvector^{Footnote 1} is called \(\pi\) and represents the stationary distribution attained for infinite (or very long) times. The corresponding conformational entropy of the ensemble of motifs is \(H:=\sum _i \pi _i \cdot \log \pi _i\). The change in conformational entropy upon, e.g., binding a ligand is then given as \(\Delta H = H_{wt}H_{complex}\).
MD simulation setup
We use a structure of a synthetic tetracycline binding riboswitch (PDB: 3EGZ, chain B, resolution: 2.2 Å, Fig. 2) [23] and perform six simulations: the TCAptamer with five different tetracycline types in complex and one without tetracycline. As tetracycline binding alters the structural entropy of the molecule [24] our proposed method should be able to detect changes in (local) dynamics due the presence of tetracycline. All simulations were performed using the GROMACS software package (version 2016). For water molecules, we used the TIP3P model, the RNA interact through the CHARMM force field, while the tetracycline analogs interact through a modified CHARMM force field from Aleksandrov and Simonson [25, 26]. The systems were first energy minimized and equilibrated for 1 ns in the NVTensemble at a temperature of 300 K and for 5 ns in the NpTensemble at a temperature of 300 K and a pressure of 1 bar. During the equilibration, temperature was controlled using the velocityrescale thermostat [27] (\(\tau _{\text {T}} = {0.1}~{\mathrm{ps}}\)) and pressure was controlled using the Berendsen barostat [28] (\(\tau _{\text {P}}={0.5}~{\mathrm{ps}}\)). Isothermal compressibility was set to \({4.5}\times 10^{5}\,\mathrm{bar}^{1}\), which is the corresponding value for water. Production runs were performed for 500 ns. The temperature was controlled using the NoséHoover thermostat [29, 30] (\(\tau _{\text {T}} = {1}~{\mathrm{ps}}\)) and pressure was controlled using the ParrinelloRahman barostat [31] (\(\tau _{\text {P}}={1}~{\mathrm{ps}}\)) during the production runs. Bond lengths were constrained using the LINCS [32] algorithm. The LennardJones nonbonded interactions were evaluated using a cutoff distance of 1.2 nm. The electrostatic interactions were evaluated using the particle mesh Ewald method with a real space cutoff 1.2 nm and a gridspacing 0.12 nm. Longrange corrections to energy and pressure due to the truncation of LennardJones potential were accounted for. The equations of motion were integrated using a 2 fs time step.
Tetracycline derivates
For the comparison of TC derivates we use tetracycline (tc), doxycycline (dc), anhydrotetracycline (atc) and 6deoxy6demythyltetracycline (ddtc) in our MD simulation. These four analogs share the characteristic 4ringstructure and functional groups of all tetracyclines. Still, the possibility and the mode of interaction with the RNA is an open question. The first ring of tetracycline carries a dimethylamino group, while the third ring carries a hydroxy and a methyl group facing towards the same direction away from the 4ringsystem. The detailed chemical structures are shown in Fig. 3. In comparison to these two rings the fourth, aromatic ring has an especially small steric volume on this side of the molecule. From tc over dc and atc to ddtc this steric volume is further reduced by shifting the aforementioned hydroxy and methyl group away from the fourth ring or eliminating some of them entirely. Note, that our graphbased approach is capable to easily distinguish between different modes of interaction upon changes in the, e.g., the sidechains of the rings. The molecular data of tc, dc, atc and ddtc was created using the Avogadro software [33]. Structures were manually constructed and moved into the extended conformation described to be 3 kcal/mol more stable than its twisted alternative by Alexandrov et al. [24]. The molecules were then fitted to the position of 7chlorotetracycline (7cltc) bound in the TCAptamer structure used for simulation. Note, that the geometry of 7cltc was already present in the crystal structure of the TCAptamer. All considered antibiotics show different properties upon ligand binding. They range from high activity (tc, 7cltc) to weak activity (dc, ddtc, atc) based on in vivo experiments [34].
Workflow
RNA trajectory and contact probability
An RNA trajectory X is represented as a list of T frames \(X = (\vec {x}_{t_0},\vec {x}_{t_1}, \ldots )\). Each frame \(\vec {x}_t\in \mathbb {R}^{3n}\) contains the threedimensional coordinates of the simulated system of the n atoms at the respective point in time t. We define a binary contact matrix B(t) with dimensions \(V \times V\). Its entries scan range between \(\{0,1\}\). A single contact \(B_{i,j}(t)\) between one pair of atom coordinates \(\vec {r}_{i}(t)\) and \(\vec {r}_{j}(t)\) is generated if their Euclidean distance [L2norm, \(L2(\ldots )\)] is shorter than d. Thus B(t) entries are defined as follows:
The contact probability of one pair of atom coordinates \(\vec {r}_{i}\) and \(\vec {r}_{j}\) is defined as:
Graph transformation
All considered MD simulations have a total length of 500 ns using an integration stepsize of 2 fs. We created snapshots every 250 ps resulting in 100,000 frames. We generated dynamic graphs \(G_t = (V, E_t)\) containing \(V=65\) vertices (Table 1), each modelling a nucleic \(3C'\) (Fig. 2). This resolution is sufficient to represent both small secondary structure elements as well as large quaternary RNA complexes [35, 36]. We create undirected edges between two vertices in case their Euclidean cutoff (d) is shorter than \(\{ d \in N  10 \le d \le 15 \}\) Å (cmp. Table 1).
Markov state models (MSM) of local adjacency and global transition matrix
StreAM counts adjacency transitions (e.g. as a set \(\mathcal{T}(V')\)) of an induced subgraph for a given adjacency size. Now the transition matrix \(T(V')\) can be derived from \(\mathcal{T}(V')\) but not all possible states are necessarily visited in a given, finite simulation, although a “missing state” potentially might occur in longer simulations. In order to allow for this, we introduce a minimal pseudocount [37] of \(P_k=\frac{1}{\mathcal{A}_k}\). All models that fullfill \(\{V' \in \mathbb {P}(V) : V' = k, a \in V'\}\) have the same matrix dimension and thus can be envisioned to be combined in a tensor \(C_a(V)\). Now, \(C_{a~i,j,l}(V)\) is one entry of the tensor of transitions between adjacency id i and j in the l th transition matrix \(T(V')\) with \(l=\left( {\begin{array}{c}V\\ k1\end{array}}\right) \times k1\). Thus \(C_a(V)\) contains all \(T(V')\) a specific vertex is immersed in and due to this it contains all possible information of local markovian dynamics. To derive \(T_g(a)\) every entry \(C_{a~i,j,l}(V)\) is normalized by the count of all transitions of i in all matrices \(S(V)_{j,l} = \sum _{i} C_{a~i,j,l}(V)\). For a given set of l transition matrices \(T(V')\) we can combine them into a global model with respect to their probability:
Stationary distribution and entropy
As \(T_g(a)\) (Eq. 4) is a row stochastic matrix we can compute its dominant eigenvector from a spectral decomposition. It represents a basic quantity of interest: the stationary probability \(\vec {\pi }:=\left( \pi _1, \ldots , \pi _{i},\ldots \right)\) of microstates i [37]. To this end we used the markovchain library in R [38, 39]. For measuring the changes in conformational entropy \(H := \sum _{i=1}^{\mathcal{A}_k}{\pi _i \cdot \log \pi _i}\) upon binding a ligand, we define \(\Delta H = H_{wt}H_{complex}\), form a stationary distribution.
Conventional analysis: root mean square fluctuation (RMSF)
The flexibility of an atom can be quantitatively assessed by its Rootmeansquare fluctuation (RMSF). This measure is the time average L2norm \(L2(\ldots )\) of one particular atom’s position \(\vec {r}_{i}(t)\) to its timeaveraged position \(\bar{\vec {r}_{i}}\). The RMSF of an nucleotide i (represented by its respective \(C3'\) atom) is defined as:
Algorithm
Overview
In this section, we introduce the required algorithms to compute \(T_g(a)\). First, we describe StreAM, a streambased algorithm for computing the adjacency \(id(V')\) for a given \(V'\). Afterwards we describe, the batchbased computation using StreAM \(_B\) to derive \(id_t(V')\). By computing the adjacency id of a dynamic graph \(G_t[V']\) we derive a list \((id_t(V'): t \in [1,\tau ])\) where each pair \([id_t(V'), id_{t+1}(V')]\) represents an adjacency transition. The respective transitions are than stored in \(\mathcal{T}(V')\). Now, a single \(T(V')\) can be derived by counting the transitions in \(\mathcal{T}(V')\). At last we introduce StreAM\(T_g\), an algorithm for the computation of a global transition matrix \(T_g(a)\) for a given vertex a from a dynamic graph \(G_t[V]\). To this end, StreAM\(T_g\) computes the tensor \(C_a(V)\) which includes every single matrix \(T(V')\) where \(V' \in \mathbb {P}(V)\) and \(V' = k\) with vertex \(a \in V'\). Finally, StreAM\(T_g\) computes \(T_g(a)\) from \(C_a(V)\).
StreAM and StreAM\(_B\).
We compute the adjacency id \(id(V')\) for vertices \(V' \subseteq V\) in the dynamic graph \(G_t\) using the streambased algorithm StreAM, as described in Algorithm 1. Here, \(id(V') \in [0,\mathcal {A}_{V'})\) is the unique identifier of the adjacency matrix of the subgraph \(G[V']\). Each change to \(G_t\) consists of the edge \(\{a,b\}\) and a type to mark it as addition or removal (abbreviated to add,rem). In addition to edge and type, StreAM takes as input the ordered list of vertices \(V'\) and their current adjacency id.
An edge \(\{a,b\}\) is only processed by StreAM in case both a and b are contained in \(V'\). Otherwise, its addition or removal has clearly no impact on \(id(V')\).
Assume \(pos(V',a), pos(V',b) \in [1,k]\) to be the positions of vertices a and b in \(V'\). Then, \(i = min(pos(V',a), pos(V',b))\) and \(j = max(pos(V',a), pos(V',b))\) are the row and column of adjacency matrix \(A(G[V'])\) that represent the edge \(\{a,b\}\). In the bit representation of its adjacency id \(id(V')\), this edge is represented by the bit \((i1) \cdot k + j  i \cdot (i+1)/2\). When interpreting this bit representation as a number, an addition or removal of the respective edge corresponds to the addition or subtraction of \(2^{k \cdot (k1) / 2  ((i1) \cdot k + j  i \cdot (i+1)/2)}\). This operation is performed to update \(id(V')\) for each edge removal or addition. In the following, we refer to this position as \(e(a,b,V') := \frac{V' \cdot (V'1)}{2}  [(i1) \cdot V' + j  \frac{i \cdot (i+1)}{2}]\).
Furthermore, in Algorithm 2 we show StreAM \(_B\) for the batchbased computation of the adjacency id for vertices \(V'\)
StreAM\(T_g\)
For the design or redesign of aptamers it is crucial to provide experimental researchers informations about e.g. dynamics at the nulceotide level. To this end, StreAM\(T_g\) combines every adajcencybased transition matrix, one nucleotide participates in, into a global model \(T_g(a)\). This model can be derived for every nucleotide of the regarded RNA structure and contains all the structural transition of a nuclotide between the complete ensemble of remaining nucleotides. In order to do this, we present StreAM\(T_g\), an algorithm for the computation of global transition matrices, one particular vertex is participating in, given in Algorithm 3. A full computation with StreAM\(T_g\) can be divided into the following steps. The first step is the computation of all possible Markov models that fulfill \(V' \in \mathbb {P}(V) : V' = k\) with StreAM for a given k with \(k \in [2,10]\). This results in \(\left( {\begin{array}{c}V\\ k\end{array}}\right) \cdot k!=\frac{V!}{\left( Vk\right) !}\) combinations. Afterwards, StreAM\(T_g\) sorts the matrices by vertex id into different sets, each with the size of \(\left( {\begin{array}{c}V\\ k1\end{array}}\right) \cdot (k1)!\). For each vertex a, StreAM\(T_g\) combines the obtained \(T(V')\) that fulfill \(a \in V'\) in a transition tensor \(C_a(V)\), which is normalized by \(W(V')\) the global distribution of transition states a vertex is immersing in, taking the whole ensemble into account. \(W(V')\) can be directly computed from \(C_a(V)\) (e.g. “Dynamic graphs, their analysis, and Markovian dynamics”)
StreAM\(T_g\) optimization using precomputed contact probability
The large computational demands for a full computation of the \(\left( {\begin{array}{c}V\\ k\end{array}}\right) \cdot k!=\frac{V!}{\left( Vk\right) !}\) transition matrices to derive a set of \(T_g(a)\), motivated us to implement an optimization: The number of Markov models can be reduced by considering only adjacencies including possible contacts between at least two vertices of \(G_t = (V, E_t)\). This can be precomputed before the full computation by considering the contact probability \(P(X,\vec {r}_{i},\vec {r}_{j})\) between vertices. To this end we only compute transition matrices forming a contact within the dynamic graph with \(P(X,\vec {r}_{i},\vec {r}_{j}) > 0\).
Evaluation
Objectives
As StreAM\(T_g\) is intended to analyze large MD trajectories we first measure the speed of StreAM for computing a single \(\mathcal{T}(V')\) to estimate overall computational resources. With this in mind, we benchmark different \(G_t\) with increasing adjacency size k (Table 1). Furthermore, we need to quantify the dependence of computational speed with respect to \(\delta _{t}\). Note, \(\delta _{t}\) represents changes in conformations within \(G_t\). For the full computation of \(T_g(a)\), we want to measure computing time in order to benchmark StreAM\(T_g\) by increasing network size V and k for a given system due to exponentially increasing matrix dimensions \(\mathcal{A}_k = 2^{\frac{k \cdot (k1)}{2}}\) (\(k=3\) 8, \(k=4\) 64, \(k=5\) 1,024, \(k=6\) 32,768, \(k=7\) 2,097,152 size of matrix dimensions). We expect due to combinatorial complexity of matrix computation a linear relation between V and speed and an exponential relation between increasing k and speed. To access robustness of influence of d robustness regarding the computation of \(T_g(a)\) stationary distribution \(\vec {\pi }\). We expect a strong linear correlation between derived stationary distributions. Details are shown in “Robustness against threshold”. We compare Markovian dynamics between the native TCAptamer and the structure in complex with 7cltc with experimental data. We discuss the details in “Workflow” and “Application to molecular synthetic biology”. Furthermore, we want to illustrate the biological relevance by applying it to a riboswitch design problem; this is shown in detail in “Application to molecular synthetic biology”. For the last part, we investigate the ligand binding of four different TC derivates using StreAM\(T_g\) and compare them with a classical metric (e.g. RMSF) in “Comparison of tetracycline derivates”.
Evaluation setup
All benchmarks were performed on a machine with four Intel(R) Xeon(R) CPU E52687W v2 processors with 3.4GHz running a Debian operating system. We implemented StreAM in Java; all sources are available in a GitHub repository.^{Footnote 2} The final implementation StreAM\(T_g\) is integrated in a Julia repository.^{Footnote 3} We created plots using the AssayToolbox library for R [39, 40]. We generate all random graphs using a generator for dynamic graphs^{Footnote 4} derived for vertex combination.
Runtime dependencies of StreAM on adjacency size
For every dynamic graph \(G_t(V,E_t)\), we selected a total number of 100,000 snapshots to measure StreAM runtime performance. In order to perform benchmarks with increasing k, we chose randomly nodes \(k \in [3, 10]\) and repeated this 500 times for different numbers of snapshots (every 10,000 steps). We determined the slope (speed \(\frac{frames}{ms}\)) of compute time vs. k for random and MD graphs with different parameters (Table 1).
Runtime dependence of StreAM on batch size
We measured runtime performance of StreAM for the computation of a set of all transitions \(\mathcal{T}(V')\) with different adjacency sizes k as well as dynamic networks with increasing batch sizes. To test StreAM batch size dependencies, 35 random graphs were drawn with increasing batch size and constant numbers of vertex and edges. All graphs contained 100,000 snapshots and k is calculated from 500 random combinations of vertices.
Runtime dependencies of StreAM\(T_g\) on network size
We benchmarked the full computation of \(T_g(a)\) with different \(k \in [3, 5]\) for increasing network sizes V. Therefore we performed a full computation with StreAM. StreAM\(T_g\) sorts the obtained transition list, converts them into transition matrices and combines them into a global Markov model for each vertex.
Runtime evaluation
Figure 4b shows computational speeds for each dynamic graph. Speed decreases linearly with a small slope (Fig. 4a). While this is encouraging the computation of transition matrices for \(k > 5\) is still prohibitively expensive due to the exponential increase of the matrix dimensions with \(2^{\frac{k \cdot (k1)}{2}}\). For \(G_t\) obtained from MD simulations, we observe fast speeds due to small batch sizes (Table 1).
Figure 4b reveals that \(T_{cpu}\) increases linearly with increasing V and with k exponentially. We restrict the \(T_g(a)\) full computation to \(k<5\). In Fig. 4c, speed decreases linearly with \(\delta _{t}\). As \(\delta _{t}\) represents the changes between snapshots our observation has implications for the choice of MD integration step lengths as well as trajectory granularity.
Performance enhancing by precomputed contact probability
The exponential increase of transition matrix dimensions with \(2^{\frac{k \cdot (k1)}{2}}\) is an obvious disadvantage of the proposed method. However, there exist several \(T(V')\) where every vertex is never in contact with another vertex from the set. These adjacencies remain only in one state during the whole simulation. To avoid the computation of the respective Markov models we precomputed \(P(X,\vec {r}_{i},\vec {r}_{j})\) of all vertices. Thus only combinations are considered with \(P(X,\vec {r}_{i},\vec {r}_{j}) > 0\). This procedure leads to a large reduction of \(T_{cpu}\) due to fewer number of matrices to be computed to derive \(T_g(a)\). To illustrate this reduction, we compute the number of adjacencies left after a precomputation of \(P(X,\vec {r}_{i},\vec {r}_{j})\) as a function of d for the TCAptamer simulation without TC. The remaining number of transition matrices for adjacency sizes \(k=3,4,5\) are shown in Fig. 5b. For further illustration we show the graph of the RNA molecule obtained for a cutoff of \(d=15\) Å in Fig. 5a.
We can observe that using a precomputation of \(P(X,\vec {r}_{i},\vec {r}_{j})\) to a full computation of \(T_g(a)\) hardly depends on the Euclidean cutoff (d) for all considered adjacencies. The reduced computational costs in case of a full computation can be expressed by a significant smaller number of transition matrices left to compute for all considered adjacency sizes \(k=3,4,5\). For example if we use \(k=4\) and \(d=13\) Å we have to compute 16,248,960 transition matrices, if we use a precomputation of \(P(X,\vec {r}_{i},\vec {r}_{j})\) we can reduce this value to 2,063,100, this roughly eightfold. Furthermore, in case of new contact formation due to an increased d the number of transition matrices can increase.
Robustness against threshold
Here, we investigate the influence of threshold d for the full computation of \(T_g(a)\). To this end, we created dynamic graphs with different \(d \in [11, 15]\) Å of the TCAptamer simulation without TC. Here, we focus on a simple model with an adjacency size of \(k=3\), thus with eight states. In particular, we focus on the local adjacency matrix of combination 52, 54 and 51 because these nucleotides are important for TC binding and stabilization of intermediates.
To access the overall robustness of a full computation of \(T_g(a)\) we compute the stationary distribution for every \(T_g(a)\) and afterwards we compare them with each other. For the comparison we use the Pearson product moment correlation (Pearson’s r). Figure 6 illustrates the comparison of stationary distributions obtained from 65 \(T_g(a)\) for unit sphere dynamic graphs with different d.
The obtained Pearson correlations r are also shown in Fig. 6 (a, upper triangle). We observed a high robustness expressed by an overall high correlation (\(r= 0.938\) to \(r = 0.98\)) of the dynamic graphs created with different d. However transient states disappear with increasing threshold d (Fig. 6b). This observation stems from the fact that the obtained graph becomes more and more densely connected. One consequence of a high threshold d is that the adjacency remain in the same state.
Accuracy of StreAM
In this section we discuss the accuracy of StreAM for the computation of a set of all transitions \(\mathcal{T}(V')\) on finite data samples. Our approach estimates the transition probabilities from a trajectory as frequencies of occurrences. It could be shown that uncertainties derived from a transition matrix (e.g derived from a molecular dynamics simulation) decreases with increasing simulation time [22]. Thus the error and bias in our estimator are driven by the available data set size to derive \(\mathcal{T}(V')\). Additionally, there is an implicit influence of k on the accuracy since the number of k determines the transition matrix dimensions. Consequently, the available trajectory (system) data must be at least larger than the number of entries in the transition matrix to be estimated in order to use StreAM.
Application to molecular synthetic biology
This section is devoted to investigate possible changes in Markovian dynamics of the TCAptamer upon binding of 7cltc. This particular antibiotic is part of the crystal structure of the TCAptamer thus structure of 7cltc has the correct geometry and orientation of functional groups.
For both simulations of “Workflow”, we computed 16,248,960 transition matrices and combined them into 65 global models (one for each vertex of the riboswitch). To account for both the pairinteractions and potential stacking effects we focus on \(k=4\)vertex adjacencies and use dynamic RNA graphs with \(d=13\) Å. One global transition matrix contains all the transitions a single nucleotide participates in. The stationary distribution and the implied entropy (changes) help to understand the effects of ligand binding and potential improvements on this (the design problem at hand). The \(\Delta H\) obtained are shown in Fig. 7.
A positive value of \(\Delta H\) in Fig. 7 indicates a loss of conformational entropy upon ligand binding. Interestingly, the binding loop as well as complexing nucleotides gain entropy. This is due to the fact of rearrangements between the nucleotides in spatial proximity to the ligand because 70% of the accessible surface area of TC is buried within the binding pocket L3 [23]. Experiments confirmed that local rearrangement of the binding pocket are necessary to prevent a possible release of the ligand [41]. Furthermore crystallographic studies have revealed that the largest changes occur in L3 upon TC binding [23]. Furthermore, we observe the highest entropy difference for nucleotide G51. Experimental data reveals that G51 crosslinks to tetracycline when the complex is subjected to UV irradiation [42]. These findings suggest a strong interaction with TC and thus a dramatic, positive change in \(\Delta H\). Nucleotides A52 and U54 show a positive entropy difference inside L3. Interestingly, molecular probing experiments show that G51, A52, and U54 of L3 are—in the absence of the antibiotic—the most modified nucleotides [23, 34]. Clearly, they change their conformational flexibility upon ligand binding due they direct interaction with the solvent. U54 further interacts with A51,A52,A53 and A55 building the core of the riboswitch [23]. Taken together, these observations reveal that U54 is necessary for the stabilization of L3. A more flexible dynamics (\(\Delta H\)) will change the configuration of the binding pocket and promotes TC release.
Comparison of tetracycline derivates
In this section, we want to investigate possible changes in configuration entropy by binding of different TC derivates. Moreover, we want to contrast StreAM\(T_g\) to conventional metrics like RMSF (Eq. 5) using the entropy of the stationary distributions obtained from \(T_g(a)\). Therefore, we simulated a set consisting of four different antibiotics (atc, dc, ddtc, tc) in complex with the riboswitch of “Workflow”. The structures of all derivates, each with different functional groups and different chemical properties, are shown in Fig. 3. For this approach we use a precomputation of \(P(X,\vec {r_{i}},\vec {r_{j}})\) to reduce the number of transition matrices for a full computation of \(T_g(a)\). Hence for all four simulations of TC derivates, we computed 1,763,208 (for tc), 1,534,488 (for atc), 2,685,816 (for dc) and 2,699,280 (for ddtc) transition matrices and combined them into 65 global models \(T_g(a)\) each. Similar to “Application to molecular synthetic biology”, we compute \(\Delta H = H_{wt}H_{complex}\) from the stationary distribution as well as \(\Delta RMSF = RMSF_{wt}RMSF_{complex}\) from individual RMSF computations. The results are shown in Fig. 8.
The \(\Delta RMSF\) in Fig. 8b and in \(\Delta H\) Fig. 8a shows a similar picture in terms of nucleotide dynamics. If we focus on atc we can observe a loss of conformational entropy upon ligand binding for almost every nucleotide. Considering this example the RMSF only detects a significant loss of nucleotidebased dynamics ranging from nucleotide 37–46. However, for dc, we observe the same effects like for dc. Contrary to this observation we detect, for ddtc, an increase in dynamic upon ligand binding as well as negative \(\Delta RMSF\) values. For tc, we observe a similar picture as for 7cltc (“Comparison of tetracycline derivates”). In a next step, we want to compare the obtained differences in stationary distribution with experimental values. To this end,we use an experimental metric: xfold values. A xfold value describes the efficiency of regulation in vivo and is given as the ratio of fluorescence without and with antibiotic in the experimental setup [43]. Unfortunately, atc reveals no experimental dynamics due to growth inhibition caused by the toxicity of the respective tc derivative [43]. In contrast to atc, dc and ddtc show only a weak performance (xfold = 1.1) in comparison to tc (xfold = 5.8) and 7cltc (xfold = 3.8) [43]. On the one hand, atc and dc appear overall too rigid and on the other hand ddtc too flexible to obtain a stable bound structure, implying insufficient riboswitch performance. For our design criterion of high xfold, we conclude that only certain nucleotides are allowed to be affected upon ligand binding. In particular, we need flexible nucleotides for the process of induced ligand binding (like nucleotide G51 Fig. 7) and stabilization of the complex intermediates (“Application to molecular synthetic biology”). Additionally, the switch needs rigidity for nucleotides building the stem region of the TCAptamer upon ligand binding (like nucleotides A51, A52 and A53 Fig. 7).
Summary, conclusion, and future work
Simulation tools to design and analyze functionally RNA based devices are nowadays very limited. In this study, we developed a new method StreAM\(T_g\) to analyze structural transitions, based on a coarse grained representation of RNA MD simulations, in order to gain insights into RNA dynamics. We demonstrate that StreAM\(T_g\) fulfills our demands for a method to extract the coarsegrained Markovian dynamics of motifs of a complex RNA molecule. Moreover StreAM\(T_g\) provides valuable insights into nucleotide based RNA dynamics in comparison to conventional metrics like the RMSF.
The effects observed in a designable riboswitch can be related to known experimental facts, such as conformational altering caused by ligand binding. Hence StreAM\(T_g\) derived Markov models in an abstract space of motif creation and destruction. This allows for the efficient analysis of large MD trajectories.
Thus we hope to elucidate molecular relaxation timescales, spectral analysis in relation to singlemolecule studies, as well as transition path theory in the future. At present, we use it for the design of switchable synthetic RNA based circuits in living cells [2, 44].
To broaden the application areas of StreAM\(T_g\) we will extend it to proteins as well as evolutionary graphs mimicking the dynamics of molecular evolution in sequence space [45].
Notes
 1.
Guaranteed to exist due to the PerronFrobenius theorem with an eigenvalue of \(\lambda = 1\).
 2.
 3.
 4.
Abbreviations
 MD:

molecular dynamics
 RMSF:

rootmeansquare fluctuation
 TC:

tetracycline
 dc:

doxycycline
 atc:

anhydrotetracycline
 ddtc:

6deoxy6demythyltetracycline
 7cltc:

7chlorotetracycline
References
 1.
Schlick T. Mathematical and biological scientists assess the state of the art in Rna science at an Ima workshop, Rna in biology, bioengineering, and biotechnology. Int J Multiscale Comput Eng. 2010;8(4):369–78.
 2.
Carothers JM, Goler JA, Juminaga D, Keasling JD. Modeldriven engineering of RNA devices to quantitatively program gene expression. Science. 2011;334(6063):1716–9.
 3.
Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP. Computational approaches for RNA energy parameter estimation. RNA. 2010;16(12):2304–18.
 4.
Laing C, Schlick T. Computational approaches to RNA structure prediction, analysis, and design. Curr Opin Struct Biol. 2011;21(3):306–18.
 5.
III TEC. Simulation and modeling of nucleic acid structure, dynamics and interactions. Curr Opin Struct Biol. 2004;14(3):360–7.
 6.
Zhang M, Perelson AS, Tung CS. RNA Structural Motifs. eLS. 2011;1–4. doi:10.1002/9780470015902.a0003132.pub2.
 7.
Parisien M, Major F. The MCfold and MCSym pipeline infers RNA structure from sequence data. Nature. 2008;452(7183):51–5.
 8.
Alder BJ, Wainwright TE. Studies in molecular dynamics. J Chem Phys. 1959;30:459–66.
 9.
Huang X, Yao Y, Bowman GR, Sun J, Guibas LJ, Carlsson G, Pande VS. Constructing multiresolution markov state models (msms) to elucidate rna hairpin folding mechanisms. In: Biocomputing 2010. World Scientific. 2012. p. 228–39. doi: 10.1142/97898142952910025.
 10.
Gu C, Chang HW, Maibaum L, Pande VS, Carlsson GE, Guibas LJ. Building Markov state models with solvent dynamics. BMC Bioinform. 2013;14(2):8. doi:10.1186/1471210514S2S8.
 11.
Pinamonti G, Zhao J, Condon DE, Paul F, Noé F, Turner DH, Bussi G. Predicting the kinetics of RNA oligonucleotides using Markov state models. J Chem Theory Comput. 2017;13(2):926–34. doi:10.1021/acs.jctc.6b00982.
 12.
Warfield BM, Anderson PC. Molecular simulations and Markov state modeling reveal the structural diversity and dynamics of a theophyllinebinding RNA aptamer in its unbound state. PLoS ONE. 2017;12:e0176229. doi:10.1371/journal.pone.0176229.
 13.
Bottaro S, GilLey A, Bussi G. RNA folding pathways in stop motion. Nucleic Acids Res. 2016;44(12):5883–91. doi:10.1093/nar/gkw239.
 14.
Shapiro BA, Yingling YG, Kasprzak W, Bindewald E. Bridging the gap in RNA structure prediction. Curr Opin Struct Biol. 2007;17(2):157–65.
 15.
Chodera JD, Noé F. Markov state models of biomolecular conformational dynamics. Curr Opin Struct Biol. 2014;25:135–44.
 16.
Schiller B, Jager S, Hamacher K, Strufe T. Stream—a streambased algorithm for counting motifs in dynamic graphs. Berlin: Springer LNCS; 2015. p. 53–67.
 17.
Jonikas MA, Radmer RJ, Laederach A, Das R, Pearlman S, Herschlag D, Altman RB. Coarsegrained modeling of large RNA molecules with knowledgebased potentials and structural filters. RNA. 2009;15(2):189–99.
 18.
Zeng X, Zhang L, Xiao X, Jiang Y, Guo Y, Yu X, Pu X, Li M. Unfolding mechanism of thrombinbinding aptamer revealed by molecular dynamics simulation and Markov State Model. Sci Rep. 2016;6:24065. doi:10.1038/srep24065.
 19.
Manzourolajdad A, Arnold J. Secondary structural entropy in RNA switch (Riboswitch) identification. BMC Bioinform. 2015;16(1):133.
 20.
Gan HH, Pasquali S, Schlick T. Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nuc Acids Res. 2003;31(11):2926–43.
 21.
Stombaugh J, Zirbel CL, Westhof E, Leontis NB. Frequency and isostericity of RNA base pairs. Nucleic Acids Res. 2009;37(7):2294–312.
 22.
Metzner P, Noé F, Schütte C. Estimating the sampling error: distribution of transition matrices and functions of transition matrices for given trajectory data. Phys Rev E Stat Nonlin Soft Matter Phys. 2009;80(2):1–33. doi:10.1103/PhysRevE.80.021106.
 23.
Xiao H, Edwards TE, FerréD’Amaré AR. Structural basis for specific, highaffinity tetracycline binding by an in vitro evolved aptamer and artificial riboswitch. Chem Biol. 2008;15(10):1125–37.
 24.
Wunnicke D, Strohbach D, Weigand JE, Appel B, Feresin E, Suess B, Muller S, Steinhoff HJ. Ligandinduced conformational capture of a synthetic tetracycline riboswitch revealed by pulse EPR. RNA. 2011;17(1):182–8.
 25.
Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. Charmm: a program for macromolecular energy, minimization, and dynamics calculations. J Comp Chem. 1983;4(2):187–217.
 26.
Aleksandrov A, Simonson T. Molecular mechanics models for tetracycline analogs. J Comp Chem. 2009;30(2):243–55. doi:10.1002/jcc.21040.
 27.
Bussi G, Donadio D, Parrinello M. Canonical sampling through velocity rescaling. J Chem Phys. 2007;126(1):014101. doi:10.1063/1.2408420.
 28.
Berendsen HJC, Postma JPM, van Gunsteren WF, DiNola A, Haak JR. Molecular dynamics with coupling to an external bath. J Chem Phys. 1984;81(8):3684. doi:10.1063/1.448118.
 29.
Nosé S. A molecular dynamics method for simulations in the canonical ensemble. Mol Phys. 1984;52(2):255–68. doi:10.1080/00268978400101201.
 30.
Hoover WG. Canonical dynamics: equilibrium phasespace distributions. Phys Rev A. 1985;31(3):1695–7. doi:10.1103/PhysRevA.31.1695.
 31.
Parrinello M, Rahman A. Polymorphic transitions in single crystals: a new molecular dynamics method. J Appl Phys. 1981;52(12):7182. doi:10.1063/1.328693.
 32.
Hess B, Bekker H, Berendsen HJC, Fraaije JGEM. LINCS: a linear constraint solver for molecular simulations. J Comput Chem. 1997;18(12):1463–72.
 33.
Hanwell MD, Curtis DE, Lonie DC, Vandermeersch T, Zurek E, Hutchison GR. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J Cheminform. 2012;4(1):17. doi:10.1186/17582946417.
 34.
Hanson S, Bauer G, Fink B, Suess B. Molecular analysis of a synthetic tetracyclinebinding riboswitch. RNA. 2005;11:503–11.
 35.
Deigan KE, Li TW, Mathews DH, Weeks KM. Accurate SHAPEdirected RNA structure determination. PNAS. 2009;106(1):97–102.
 36.
Hamacher K, Trylska J, McCammon JA. Dependency map of proteins in the small ribosomal subunit. PLoS Comput Biol. 2006;2(2):1–8.
 37.
Senne M, Trendelkampschroer B, Mey ASJS, Schütte C, Noe F. EMMA: a software package for Markov model building and analysis. Theory Comput J Chem. 2012;8(7):2223–38. doi:10.1021/ct300274u.
 38.
Spedicato GA. Markovchain: discrete time Markov chains made easy. (2015). R package version 0.4.3
 39.
Team RDC. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2008.
 40.
Buß O, Jager S, Dold SM, Zimmermann S, Hamacher K, Schmitz K, Rudat J. Statistical evaluation of HTS assays for enzymatic hydrolysis of \(\beta\)Keto esters. PloS ONE. 2016;11(1):e146104.
 41.
Reuss A, Vogel M, Weigand J, Suess B, Wachtveitl J. Tetracycline determines the conformation of its aptamer at physiological magnesium concentrations. Biophys J. 2014;107(12):2962–71.
 42.
Berens C, Thain A, Schroeder R. A tetracyclinebinding rna aptamer. Bioorg Med Chem. 2001;9(10):2549–56.
 43.
Müller M, Weigand JE, Weichenrieder O, Suess B. Thermodynamic characterization of an engineered tetracyclinebinding riboswitch. Nucleic Acids Res. 2006;34(9):2607. doi:10.1093/nar/gkl347.
 44.
Cameron DE, Bashor CJ, Collins JJ. A brief history of synthetic biology. Nature reviews. Microbiology. 2014;12(5):381–90.
 45.
Lenz O, Keul F, Bremm S, Hamacher K, von Landesberger T. Visual analysis of patterns in multiple amino acid mutation graphs. In: IEEE conference on visual analytics science and technology (VAST). 2014. p. 93–102.
 46.
Jager S, Schiller B, Strufe T, Hamacher K. StreamT_{g} : algorithms for analyzing coarse grained RNA dynamics based on markov models of connectivitygraphs. Berlin: Springer; 2016.
 47.
Schrödinger L. The PyMOL molecular graphics system, Version 1.8. 2015.
Authors' contributions
Conceptualization of research work by SJ and KH. Data preparation by SJ, PB and MB. Implementation by SJ and BS. Benchmarks and Analysis of algorithm by SJ. Graphics by SJ, BS and PB. Writing of the manuscript by SJ, BS, KH, TS, MB and PB. Valuable suggestions to improve the manuscript by SJ, TS and KH. This Paper is an extended version of the research article: StreAM\(T_g\) : Algorithms for Analyzing Coarse Grained RNA Dynamics Based on Markov Models of ConnectivityGraphs [46]. All authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declares that they have no competing interests.
Availability of data and materials
Generator for dynamic graphs: https://github.com/BenjaminSchiller/DNA.datasets Implementation of StreAM\(T_g\): http://www.cbs.tudarmstadt.de/streAMTg.tar.gz Implementation of StreAM and StreAM \(_B\): https://github.com/BenjaminSchiller/Stream
Funding
The Authors gratefully acknowledge financial support by the LOEWE project CompuGene of the Hessen State Ministry of Higher Education, Research and the Arts. Parts of this work have also been supported by the DFG, through the Cluster of Excellence cfaed as well as the CRC HAEC.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Jager, S., Schiller, B., Babel, P. et al. StreAM\(T_g\): algorithms for analyzing coarse grained RNA dynamics based on Markov models of connectivitygraphs. Algorithms Mol Biol 12, 15 (2017). https://doi.org/10.1186/s1301501701050
Received:
Accepted:
Published:
Keywords
 RNA
 Markovian dynamics
 Dynamic graphs
 Molecular dynamics
 Coarse graining
 Synthetic biology