Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?

Cui, Xuefeng; Li, Shuai Cheng; Bu, Dongbo; Alipanahi, Babak; Li, Ming

doi:10.1186/1748-7188-8-5

Research
Open access
Published: 25 February 2013

Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?

Xuefeng Cui¹,
Shuai Cheng Li²,
Dongbo Bu³,
Babak Alipanahi¹ &
…
Ming Li¹

Algorithms for Molecular Biology volume 8, Article number: 5 (2013) Cite this article

9939 Accesses
2 Citations
Metrics details

Abstract

Background

Previous studies show that the same type of bond lengths and angles fit Gaussian distributions well with small standard deviations on high resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any research done to evaluate how accurately we can model protein structures with dihedral angles and ideal bond lengths and angles.

Here, we introduce the protein structure idealization problem. We focus on the protein backbone structure idealization. We describe a fast O(n m/ε) dynamic programming algorithm to find an idealized protein backbone structure that is approximately optimal according to our scoring function. The scoring function evaluates not only the free energy, but also the similarity with the target structure. Thus, the idealized protein structures found by our algorithm are guaranteed to be protein-like and close to the target protein structure.

We have implemented our protein structure idealization algorithm and idealized the high resolution protein structures with low sequence identities of the CULLPDB_PC30_RES1.6_R0.25 data set. We demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also applied our algorithm to refine protein pseudo-structures determined in NMR experiments.

Background

When studying the functions of a protein, it is crucial to know the three-dimensional structure consisting of the Cartesian coordinates of all the atoms of the protein. These atoms are bonded together by inter-atomic forces called chemical bonds. It has been observed that the bond lengths and angles of the same type assume a Gaussian distribution with a small standard deviation (STDEV) in high resolution protein structure data. Typically, the bond lengths on protein backbones have STDEVs between 0.019Å and 0.033Å and the bond angles on protein backbones have STDEVs between 1.5° and 2.7° [1, 2]. These results suggest the possibility of modeling protein structures with the mean values of bond lengths and angles, which are often referred to as ideal values.

Ideal bond lengths and angles have been widely used in nuclear magnetic resonance (NMR) protein structure determination [3, 4] and in protein structure prediction [5–9]. Moreover, stereochemical restraints are also used in X-ray protein structure determination [10, 11]. In protein structure prediction, the main advantage of using ideal bond lengths and angles is a reduction in the search space for the target protein structure [12, 13]. Specifically, if the target protein has n amino acids, the number of n, C_α and C atoms on the backbone is 3n, and thus the Cartesian search space for the idealized backbone structure has a degree of freedom of 9n[12, 13]. However, if all bond lengths and angles have ideal values, the protein backbone structure can be represented by a series of bond torsion angles in the feasible bond torsion angle space. In this case, the degree of freedom is reduced to approximately one tenth of that in the Cartesian space [12, 13].

Although ideal bond lengths and angles have been widely used and accepted, we are not aware of any research done to evaluate how accurately it is possible to model protein structures with dihedral angles. This motivates us to solve what we call the protein structure idealization problem: Given the coordinates of the target protein structure, find the coordinates of the optimal idealized protein structure. Here, an idealized protein structure is a protein structure with bond lengths and angles that are ideal with respect to a given scoring function; the function depends on the resultant structure’s free energy, as well as its similarity with the target structure. Thus, the idealized protein structure is taken to be a protein-like structure that is close to the target protein structure.

First, we solve the protein structure idealization problem by idealizing the backbone structure and then idealizing the side-chain structure. This approach is widely accepted because previous research suggests that the backbone conformation is archived before the side-chain conformations are archived [14]. In our work, Ω dihedral angles are rounded to be either 0° or 180°. Some discussions on the properness of idealizing Ω dihedral angles can be found in [15, 16].

We introduce a novel dynamic programming algorithm with a run-time complexity of O(n/ε⁸), where ε is a small constant, to find the optimal idealized protein backbone structure according to our scoring function. In practice, we observed that it is unnecessary to remember the entire dynamic programming table. Thus, with a filtering technique, the run-time complexity is further reduced to O(n m/ε), where m is a constant integer.

In our initial study on the protein structure idealization problem, side-chain structures are determined using an exhaustive search which assumes that side-chain structures of different residues are independent from each other. The scoring function is similar to the one we used for backbone structure idealization. In practice, we observe that it is fast to regenerate idealized structures that are similar to a given idealized structure. We also refine the idealized backbone and side-chain structures according to our scoring functions iteratively.

We use our algorithm to evaluate how accurately it is possible to model protein structures with dihedral angles. We idealize all the X-ray protein structures from PDB [17] which satisfy the high resolution and the low sequence identity constraints downloaded on June 6, 2008 [18, 19]. The results show that such idealized structures always exist and that they are very similar to the target structures in terms of the root mean square deviation (RMSD) of C_α or all atoms. Moreover, the idealized backbone structures tend to have dDFIRE free energy scores [20, 21], which are significantly better than the target structures. The results support our conclusion that it is possible to model protein structures accurately with dihedral angles on all high resolution protein backbone structures.

One application of the protein structure idealization algorithm is to refine protein pseudo-structures either determined in experiments or predicted by computers. We have demonstrated one such case to improve poor (Φ,Ψ) dihedral angles of protein structures determined by NMR. The experiment result is also consistent with the previous experiment showing that the idealized structure has a small RMSD and better backbone free energy. We discuss several potential applications for our protein structure idealization algorithm in the conclusion.

Protein backbone structure idealization

Given the target protein backbone structure, we would like to find the optimal idealized backbone structure. For an idealized protein backbone structure, the coordinates of O, H and C_β backbone atoms can be calculated from the coordinates of n, C_α and C backbone atoms. Thus, we specifically describe how to generate coordinates of n, C_α and C atoms in this section. For simplicity, a structure is always referred to as a protein backbone structure unless strictly specified.

Idealized backbone structure generation

Given the target structure, we would like to generate idealized structures fulfilling two generation goals. First, the idealized structures should be similar to the target structure. Second, each pair of idealized structures should be at least some distance away to avoid redundant computation. Furthermore, we are interested in generating as many of these idealized structures as possible.

Before describing how we fulfill the generation goals, we describe a simple distance metric to measure the distance between two sets of coordinates representing the target protein. Let P_i be a set of coordinates representing the target protein, and $P_{i}^{j} \in P_{i}$ be the coordinate of the j-th atom of the target protein. Thus, there is $P_{i} = {P_{i}^{1}, P_{i}^{2}, ..., P_{i}^{3 n}}$ , where n is the number of amino acids of the target protein. For simplicity, let P₀ always represent the target structure, and P_i represent a generated idealized structure for i>0. Let $D (P_{i}^{k}, P_{j}^{k})$ be the Euclidean distance between $P_{i}^{k}$ and $P_{j}^{k}$ . We describe the distance between P_i and P_j as the bottleneck distance:

D (P_{i}, P_{j}) = max_{k} D (P_{i}^{k}, P_{j}^{k}) .

(1)

Using this distance metric, we fulfill both generation goals by satisfying the following generation constraints:

\{\begin{array}{l} D (P_{0}, P_{i}) \leq r & \forall i > 0 \\ D (P_{i}, P_{j}) \geq ε & \forall i, j > 0 \end{array} .

(2)

The first generation constraint assumes that the accuracy of the coordinates of the target structure is reasonably good, and no-worse than r. If this constraint is satisfied, the distance between the target coordinate and any generated coordinate representing the same atom is upper bounded by r. Thus, it is reasonable for any generated idealized structure P_i to be considered similar to target structure P₀. If the second generation constraint is satisfied, for each pair of generated idealized structures, there exists a pair of coordinates, one from each structure representing the same atom, such that they are at least ε distance away from each other. Therefore, both generation goals are achieved.

These generation constraints suggest limiting the search space inside a sphere with radius r, and discreting the search space with grids of size ε. When ε=0.001Å, the accuracy of X-ray crystallography [22] and PDB (protein database) format [23] is reached. Thus, this method is capable of generating all possible idealized structures at the accuracy of X-ray crystallography and PDB format.

Given the limited and discrete search space of each atom, one can generate idealized structure coordinates from the first atom to the last atom. For the first atom, an idealized coordinate lies within a sphere. Thus, the number of generated coordinates is bounded by O(1/ε³). For each generated coordinate $P_{i}^{1}$ of the first atom, an idealized coordinate of the second atom lies on a ball surface with a constant distance to $P_{i}^{1}$ . Thus, the number of generated coordinates is bounded by O(1/ε²). For each generated coordinate pair $(P_{i}^{1}, P_{i}^{2})$ of the first two atoms, an idealized coordinate of the third atom lies on a circle with a constant distances to $P_{i}^{1}$ and $P_{i}^{2}$ . Thus, the number of generated coordinates is bounded by O(1/ε). Similarly, the number of generated coordinates for any of the following atoms is also bounded by O(1/ε). Moreover, since we round Ω dihedral angles to either 0° or 180°, the coordinate of any C_α atom is unique and can be calculated from the coordinates of the previous three atoms.

Therefore, the total number of coordinates generated for all atoms is bounded by O(1/ε²ⁿ⁺⁴) by induction. Here, it is acceptable to assume that r is a constant because it is only related to the first atom. For subsequent atoms, we did not limit the search space to be inside the sphere with radius r as described above, and thus the actual number of generated coordinates should be much smaller in practice.

Idealized backbone structure scoring function

Given the generated idealized structures {P_i}, we need a scoring function S_{B
B}(P_i) to find the optimal idealized structure. The scoring function should evaluate not only the similarity between generated idealized structure P_i and target structure P₀, but should also evaluate the free energy of P_i, to ensure that P_i is protein-like. Thus, we define our scoring function as follows:

\begin{array}{l} S_{BB} (P_{i}) = & S_{f} (P_{i}) - w_{1} D_{α} (P_{i}, P_{0}) - w_{2} D_{β} (P_{i}, P_{0}) \\ - w_{3} D_{H} (P_{i}, P_{0}) - w_{4} D_{Φ, Ψ} (P_{i}, P_{0}), \end{array}

(3)

where w_a are the weighting parameters, S_f(P_i) is the free energy score, D_α(P_i,P₀) is the root mean square divergence (RMSD) of C_α atoms, D_β(P_i,P₀) is the RMSD of C_β atoms, D_H(P_i,P₀) is the RMSD of the hydrogen and oxygen atoms participating in hydrogen bonds, and D_Φ,Ψ(P_i,P₀) is the RMSD of (Φ,Ψ) dihedral angles.

In our scoring function, the free energy is evaluated by a (Φ,Ψ) dihedral angle log-odd score as the free energy score S_f(P_i). Specifically, we discrete the Ramachandran plot into grids of 360 by 360, and draw one plot for each type of amino acid. Then, we calculate the log-odd score $S_{f} (P_{i}^{1, t})$ of idealized structure $P_{i}^{1, t}$ of the first t atoms:

S_{f} (P_{i}^{1, t}) = \sum_{5 \leq i \leq t, A_{i} = C_{α}} log \frac{P_{A A_{i - 3}} (Φ_{i - 3}, Ψ_{i - 3})}{P_{null} (Φ_{i - 3}, Ψ_{i - 3})},

(4)

where one log-odd score is calculated at each C_α atom (by checking that atom type A_i is C_α) for the previous amino acid (represented by the previous C_α atom at i−3), $P_{A A_{i - 3}} (Φ_{i - 3}, Ψ_{i - 3})$ is the probability of the grid containing (Φ_i−3,Ψ_i−3) on the Ramachandran plot of amino acid type A A_i−3, and P_{n
u
l
l}(Φ_i−3,Ψ_i−3) is the probability of the null model with a uniform distribution such that $P_{null} (Φ_{i - 3}, Ψ_{i - 3}) = \frac{1}{360} \frac{1}{360}$ .

Structure similarity is evaluated by other distance matrices in our scoring function. We use D_α(P_i,P₀) and D_Φ,Ψ(P_i,P₀) to serve as distance metrics to conserve the backbone structures, and D_β(P_i,P₀) to serve as a distance metric to conserve the side-chain structure compatibilities; we also use D_H(P_i,P₀) to serve as a distance metric to conserve the hydrogen bonds. Thus, some global dependencies are addressed implicitly by distance matrices $D_{β} (P_{i}^{1, t}, P_{0}^{1, t})$ and D_H(P_i,P₀).

Dynamic programming algorithm

Theoretically, one can calculate scores for all generated idealized structures and find the optimal one with the maximum score. This method works well as long as similar structures always have similar scores. More formally, the method requires the assumption that D(P_i,P_j)≤ε ⇒ |S_{B
B}(P_i)−S_{B
B}(P_j)|≤ε_s, which is reasonable for small ε. Note that, since the total number of generated idealized structures is bounded by O(1/ε²ⁿ⁺⁴), this method is computationally expensive. Thus, we introduce a dynamic programming algorithm with a filtering technique to find the optimal idealized structure efficiently.

The dynamic programming algorithm has two assumptions. One assumption is that given two generated idealized structures $P_{i}^{1, t - 1}$ and $P_{j}^{1, t - 1}$ of the first t−1 atoms, such that $D (P_{i}^{t - k, t - 1}, P_{j}^{t - k, t - 1}) \leq ε$ , for any generated coordinate $P_{i}^{t}$ of the t’th atom, there always exists a generated coordinate $P_{j}^{t}$ , such that $D (P_{i}^{t}, P_{j}^{t}) \leq ε$ . The other assumption is that the scoring function satisfies the additive property, such that $S_{BB} (P_{i}^{1, t}) = S_{BB} (P_{i}^{1, t - k}) \oplus S_{BB} (P_{i}^{t - k + 1, t})$ , under some addition operator ⊕.

We observed that counter examples of the first assumption when k≥5 are rare, though counter examples do exist theoretically. The second assumption holds for our scoring function. Distance matrices $D_{α} (P_{i}^{1, t}, P_{0}^{1, t})$ , $D_{β} (P_{i}^{1, t}, P_{0}^{1, t})$ , $D_{H} (P_{i}^{1, t}, P_{0}^{1, t})$ and $D_{Φ, Ψ} (P_{i}^{1, t}, P_{0}^{1, t})$ satisfy the additive property because RMSD $D_{RMS} (P_{i}^{1, t}, P_{0}^{1, t})$ satisfies the additive property:

\begin{array}{l} D_{RMS} (P_{i}^{1, t}, P_{0}^{1, t}) \\ = D_{RMS} (P_{i}^{1, t - k}, P_{0}^{1, t - k}) \oplus D_{RMS} (P_{i}^{t - k + 1, t}, P_{0}^{t - k + 1, t}) \\ = \sqrt{\frac{D_{RMS}^{2} (P_{i}^{1, t - k}, P_{0}^{1, t - k}) (t - k) + D_{RMS}^{2} (P_{i}^{t - k + 1, t}, P_{0}^{t - k + 1, t}) k}{t}} . \end{array}

(5)

Moreover, the free energy score $S_{f} (P_{i}^{1, t})$ satisfies the additive property as follows:

\begin{array}{l} S_{f} (P_{i}^{1, t}) & = S_{f} (P_{i}^{1, t - k}) \oplus S_{f} (P_{i}^{t - k + 1, t}) \\ = S_{f} (P_{i}^{1, t - k}) + S_{f} (P_{i}^{t - k + 1, t}) . \end{array}

(6)

The second assumption is fundamental to our dynamic programming algorithm. By induction, the first assumption implies that if $D (P_{i}^{t - k, t - 1}, P_{j}^{t - k, t - 1}) \leq ε$ , for any generated idealized structure $P_{i}^{t, n}$ , there always exists a generated idealized structure $P_{j}^{t, n}$ such that $D (P_{i}^{t, n}, P_{j}^{t, n}) \leq ε$ . Recall that the scoring function assumes that $D (P_{i}^{t, n}, P_{j}^{t, n}) \leq ε \Rightarrow | S_{BB} (P_{i}^{t, n}) - S_{BB} (P_{j}^{t, n}) | \leq ε_{s}$ , and thus there is $S_{BB} (P_{i}^{t, n}) \approx S_{BB} (P_{j}^{t, n})$ . If $S_{BB} (P_{i}^{1, t - 1}) \geq S_{BB} (P_{j}^{1, t - 1})$ , there is approximately $S_{BB} (P_{i}) = S_{BB} (P_{i}^{1, t - 1}) \oplus S_{BB} (P_{i}^{t, n}) \geq S_{BB} (P_{j}^{1, t - 1}) \oplus S_{BB} (P_{j}^{t, n}) = S_{BB} (P_{j})$ . Therefore, if $D (P_{i}^{t - k, t - 1}, P_{j}^{t - k, t - 1}) \leq ε$ and $S_{BB} (P_{i}^{1, t - 1}) \geq S_{BB} (P_{j}^{1, t - 1})$ , there is no need to generate $P_{j}^{t, n}$ to find an approximately optimal solution.

Based on this observation, we developed a novel dynamic programming algorithm. Idealized structures are still generated as previously described, but the generation process is stopped for some idealized structures if we know it cannot lead us to the optimal one. First, the search space for each atom of the target protein is discretized to grids of size ε. When generating coordinates for atom t, if $P_{i}^{t - k + 1, t}$ and $P_{j}^{t - k + 1, t}$ are located in the same grid set $G_{g}^{t - k + 1, t}$ , we know that there is no need to continue the generation process on the lower scoring one of $P_{i}^{1, t}$ and $P_{j}^{1, t}$ . Thus, we define the dynamic programming table $T_{BB} (t, G_{g}^{t - k + 1, t})$ to be the optimal idealized structure for each observed tail grid set $G_{g}^{t - k + 1, t}$ as follows:

\begin{array}{l} \{\begin{array}{l} T_{BB} (t, G_{g}^{t - k + 1, t}) = max_{i, j} T_{BB} (t - 1, G_{i}^{t - k, t - 1}) \\ \oplus S_{BB} (P_{j}^{t}) \\ T_{BB} (k, G_{g}^{1, k}) = max_{i} S_{BB} (P_{i}^{1, k}) \end{array}, \end{array}

(7)

where $G_{g}^{t - k + 1, t - 1} = G_{i}^{t - k + 1, t - 1}$ , $P_{j}^{t - k + 1, t} \in G_{g}^{t - k + 1, t}$ and $S_{BB} (P_{j}^{1, t - 1}) \oplus S_{BB} (P_{j}^{t}) = S_{BB} (P_{j}^{1, t})$ . Thus, the dynamic programming table can be calculated from the first atom to the last atom. Finally, the optimal idealized structure is the one with the highest score $max_{g} G_{g}^{3 n - k + 1, 3 n}$ .

The run-time complexity of our dynamic programming algorithm depends on the value of k. To keep all possible (Φ,Ψ) dihedral angles of the previous residue when generating C_α atoms, we have to choose k≥5. For speed, we choose k=5 in our implementation. In this case, the number of score calculations required to calculate $T_{BB} (t, G_{g}^{t - 4, t})$ is no more than the maximum number of coordinates sampled for six consecutive backbone atoms. Recall that there are exactly two C_α atoms in six consecutive backbone atoms, and the Ω dihedral angle is rounded. Thus, the coordinate of one C_α atom can be calculated from the coordinates of the other C_α atom and the two atoms between them. For this reason, the maximum number of sampled coordinates is bounded by O(1/ε⁸). Moreover, the number of score calculations required to calculate $T_{BB} (k, G_{g}^{1, k})$ is no more than the maximum number of possible coordinates sampled for five consecutive backbone atoms, which is also O(1/ε⁸). Therefore, the run-time complexity of our dynamic programming algorithm is O(n/ε⁸).

To increase the speed for the dynamic programming algorithm, we applied an additional filtering technique to remember only the highly scored idealized structures. Specifically, the algorithm only remembers the optimal idealized structure for the top m scored tail configurations instead of all possible conformations. Thus, the run-time complexity is reduced to O(n m/ε). This approach works well in practice because an optimal idealized structure with a long poorly scored fragment is rare. Thus, we assumed that the local quality of the idealized structure should be reasonably high (in the top m score list).

Protein side-chain structure idealization

After the backbone structure of the target protein has been idealized, we begin to idealize the side-chain structures. When doing this, the idealized backbone structure is considered to be rigid. This approach is widely accepted because previous research suggests that the backbone conformation is archived before the side-chain conformations are archived [14]. After the side-chain idealization, we should have a complete idealized protein structure with all of the backbone and the side-chain structures idealized.

Protein side-chains suffer from low quality problems when determining protein structures. This is mainly because side-chains are not as stable as backbones, and they are more likely to have disorder problems than are backbones in crystals [22]. Thus, the target side-chain structure might be a poor reference for defining the search space and for evaluating the structure similarity score for generated idealized side-chain structures. To address this, we perform an exhaustive search on the entire feasible torsion angle space, instead of the limited torsion angle space, around the target side-chain structure.

Our side-chain idealization method assumes that the side-chain conformations of different residues are independent of each other. Otherwise, all residues with dependencies have to be generated together and the run-time complexity increases exponentially to the number of atoms involved. Moreover, the N_{η 1}−C_ζ−N_ε−C_δ and the N_{η 2}−C_ζ−N_ε−C_δ torsion angles of arginine residues are rounded to be either 0° or 180°. Then, the degree of freedom of the search space for each residue is at most four and it is now practical to perform an exhaustive search for each residue independently.

To find the optimal idealized side-chain structure, we design a new scoring function involving the similarity among the generated idealized side-chain structures and the target side-chain structures, and the free energy of the generated idealized side-chain structures. Let P₀ be the target side-chain structure of some residue, and P_i for all i>0 be a generated idealized side-chain structure of the same residue. Then, the scoring function S_{S
C}(P_i) is defined:

S_{SC} (P_{i}) = S_{f} (P_{i}) - w_{1} D_{H^{'}} (P_{i}, P_{0}) - w_{2} D_{χ} (P_{i}, P_{0}),

where w_k are the weighting parameters, S_f(P_i) is the free energy score, $D_{H^{'}} (P_{i}, P_{0})$ is the root mean square divergence (RMSD) of all non-hydrogen atoms, and D_χ(P_i,P₀) is the RMSD of χ torsion angles.

In our scoring function, the free energy score S_f(P_i) is defined as a simple χ torsion angle log-odd score, which is similar to the free energy score of our backbone scoring function. Moreover, the log-odd score is based on the popular backbone dependent rotamer library downloaded from Dunbrack’s lab [24]. Certainly, other local free energy scores can be adopted here. Similar to the backbone scoring function, $D_{H^{'}} (P_{i}, P_{0})$ and D_χ(P_i,P₀) serve as distance metrics to conserve the side-chain structure.

Result

To study the protein structure idealization problem and its applications, we implemente our protein structure idealization algorithm. In our implementation, we use the mean bond lengths and angles that had been reported in [2] as the ideal bond lengths and angles, respectively. When idealizing the protein backbone structure, we set the search space radius of an atom as r=1.6Å and the discrete grid size as ε=r/5. We find that m=50,000 had a reasonable balance between speed and accuracy. When idealizing the protein side-chain structure, we set the search space of a rotamer dihedral angle to be within 3σ distance from the mean value, where σ is the STDEV of the rotamer dihedral angle, and we set the discrete grid size to be 10°. We also refine the idealized structure by iteratively reducing the search space and the discrete grid size by a constant factor of 0.5. Since finding the best scoring function for the protein structure idealization is out of the scope of this paper, we set all weights w_a=1.0 for all a in our scoring function.

PDB protein structure idealization

In this experiment, we addressed how accurately it is possible to model protein structures with dihedral angles. We idealized high resolution protein structures with low sequence identities of the CULLPDB_PC30_RES1.6_R0.25 data set [18, 19]. In fact, the CULLPDB_PC30_RES1.6_R0.25 data set is the complete set of X-ray protein structures in PDB [17] with a sequence identity cutoff of 30%, a resolution cutoff of 1.6Å, and an r factor cutoff of 0.25. In summary, the data set contains 1898 proteins with an average length of 227 residues, as downloaded on June 6, 2008.

To show that the idealized and the target backbone structures are very similar, we calculated the C_α-RMSD as shown in Figure 1. The C_α-RMSD is a popular distance metric to evaluate the backbone distance between two protein backbone structures. The result shows that most distances between the idealized and the target backbone structures are small with mean 0.53Å and STDEV 0.08Å. Specifically, the smallest C_α-RMSD reaches 0.16Å, and 90% of the C_α-RMSDs are smaller than 0.63Å. Moreover, the C_α-RMSD is upper bounded by 1.00Å, although the search space radius for each atom is set to be 1.6Å. This result is consistent with the result of checking (Φ,Ψ) dihedral angles, where the average difference between the idealized and the target (Φ,Ψ) dihedral angles is as small as 0.08°. Therefore, it is possible to model protein backbone structures in CULLPDB_PC30_RES1.6_R0.25 accurately using only Φ and Ψ dihedral angles.

We studied the C_α-RMSD further in different regions of the target protein structures. In Figures 2 and 3, we see that the C_α-RMSD of the α-helix and the β-sheet regions are smaller than that of the complete protein by 0.28Å and 0.12Å, respectively. Indeed, these regions are more restricted because of using D_H(P_i,P₀) to conserve hydrogen bonds of α-helices and β-sheets in our scoring function. We also observe that the C_α-RMSD of residues that are closer to the geometric center of a target protein structure is 0.13Å smaller on average than the C_α-RMSD of the other residues that are farther, as shown in Figure 4. Thus, the inner residues tend to be closer to the idealization state than are the outer residues. We did not observe any significant differences on the C_α-RMSD between the buried and the exposed regions.

We also calculated the all-atom RMSD to show that the idealized and the target structures are very similar. In Figure 5, we see that most distances between the idealized and the target structures are small, with mean 0.79Å and STDEV 0.13Å. Moreover, the smallest all-atom RMSD reaches 0.45Å, and 90% of the all-atom RMSDs are smaller than 0.94Å. Note that both the C_α-RMSD and the all-atom RMSD between the idealized and the target structures tend to be stable when the target protein is long. Therefore, it is possible to model protein structures accurately with only Φ, Ψ, and χ dihedral angles.

The idealized backbone structures are also favored in terms of free energy. This is shown by checking the free energy differences between the idealized and the target protein backbone structures in Figure 6. Here, we calculate the free energy using dDFIRE [20, 21], and observe that the dDFIRE free energy of most idealized backbone structures are significantly better than are those of the target backbone structures. For the rest without significant improvements, the difference is close to zero. This may be the result of some tight thereochemical restraints used in existing X-ray structure refinement programs [15, 16]. It is also interesting that the observed free energy improvements are clearly not independent from the protein length. The figure suggests that the free energy difference has a square dependence on the protein length.

After idealizing the side-chain structures, the free energy is either improved by a relatively biger amount or worsened by a relatively smaller amount as shown in Figure 7. Unfortunately, in most cases, the free energy is worsened slightly but is still in a stable state with negative values. Again, here we used dDFIRE [20, 21] to calculate the free energy. We observed that the dDFIRE free energy is improved for 90 or 4.74% of the idealized protein structures and is worsened slightly by 44 on average. Moreover, the dDFIRE free energy is improved by 1585 in the best case, and worsened by 293 in the worst case. The figure also suggests that the free energy difference has a linear dependence on the protein length.

Several side-chain prediction tools have been proven to predict accurate side-chain structures from native backbone structures [8, 9, 25, 26]. However, these tools does not perform well when predicting side-chain structures from predicted backbone structures. To address this, we compared the predicted side-chain structures given the native backbone structures and those given the predicted backbone structures in terms of free energy. Here, we treat the idealized backbone structures of the CULLPDB_PC30_RES1.6_R0.25 data set as those which are best possibly predicted. Moreover, we used SCWRL4 [9] to predict side-chain structures and dDFIRE [20, 21] to calculate free energies. The result shows that the free energy is worsened slightly by 43 if the predicted backbone structures are used. We do not think this difference is significant to side-chain prediction. Certainly, more experiments will show if this is conclusive.

Finally, we study the effects of idealization on hydrogen bonds. As shown in Table 1, we compare the number of hydrogen bonds detected by the DSSP program [27, 28]. Here, only differences of the most popular types of hydrogen bonds are included. We note that the total number of hydrogen bonds is increased by 1.59% or 0.012 per residue after idealization. Specifically, the effects of idealization on hydrogen bonds of β bridges is minor, and the loss of the hydrogen bonds on α helices is reasonably controlled under 1.48%. Interestingly, the idealized backbone structures have significantly more 2₇ ribbons. The reason behind this observation remains open.

Table 1 DSSP hydrogen bond differences before and after idealization

Full size table

In summary, we demonstrate that using dihedral angles with ideal bond lengths and angles is capable of modeling protein structures that are highly similar to the ones in CULLPDB_PC30_RES1.6_R0.25 [18, 19]. Since CULLPDB_PC30_RES1.6_R0.25 is the complete set of PDB protein structures satisfying the high resolution and the low sequence identity constraints, it is reasonable to extend the conclusion to all protein backbone structures. A positive side effect is that idealization improves backbone free energy, while most hydrogen bonds are conserved.

NMR protein structure refinement

In this experiment, we demonstrate an application of the protein structure idealization problem in NMR by idealizing 32 NMR protein structures. The NMR protein structures were randomly chosen from PDB [17] with a sequence identity cutoff of 30% and a gapless fragment length cutoff of 80 residues. In cases of multiple chains or models of some NMR protein structures, only the first chain from the first model is used in this experiment. This addition to the conclusion of the previous experiment shows that poor (Φ,Ψ) dihedral angles of the NMR protein structures are improved by idealizing them.

To demonstrate this, we compared the percentage of favored (Φ,Ψ) dihedral angles calculated by PROCHECK [29] in Table 2. After idealization, we see that 19 out of 32 NMR protein structures have more favored (Φ,Ψ) dihedral angles. Overall, the percentage is increased by 4.34% on average and 27.30% in the best case, which is closer to the minimum percentage of 90% expected in a good quality model [29].

Table 2 The percentages of the favored (Φ,Ψ) dihedral angles of 32 NMR protein structures before and after idealization

Full size table

Note that for those NMR protein structures that already have more than approximately 75% of favored (Φ,Ψ) dihedral angles, idealization harms the percentage by −0.85% on average. There are at least two reasons for this. First, our free energy score S_f(P_i) is calculated from a data set that is different from the one used by PROCHECK. In fact, we used 1898 protein structures of the CULLPDB_PC30_RES1.6_R0.25 data set [18, 19], while PROCHECK used 118 protein structures, with a resolution cutoff of 2.0Å and an r factor cutoff of 0.20 [29]. Although the percentages of favored (Φ,Ψ) dihedral angles are decreased in Table 2, our free energy scores of proteins 1WI5, 1NMW, and 2LBV are increased by 0.22, 1.35, and 0.31, respectively, after idealization. Second, our implementation is trying to optimize our scoring function S_{B
B}(P_i), instead of optimizing only the free energy score. Thus, it is possible to see decreased free energy scores after idealization, especially when the target protein structure has a high percentage of favored (Φ,Ψ) dihedral angles.

Our conclusion is further supported by the case study of the NMR structure with PDB ID 1WPI [30]. From the Ramachandran plots drawn by PROCHECK [29] in Figures 8 and 9, we find that (Φ,Ψ) dihedral angles tend to move towards favored regions. Specifically, the native structure contains only 64.4% of (Φ,Ψ) dihedral angles in favored regions, while the idealized structure contains a significantly improved percentage of 81.4% of (Φ,Ψ) dihedral angles in favored regions. Moreover, the native structure contains three (Φ,Ψ) dihedral angles that are not in any feasible areas of the Ramachandran plot. However, there is only one such case found in the idealized structure. Thus, two infeasible (Φ,Ψ) dihedral angles are fixed by the (Φ,Ψ) dihedral angle log-odd score. Here, we did not, but certainly can, implement a hard constraint to disallow any infeasible (Φ,Ψ) dihedral angles.

In summary, we have demonstrated that protein structure idealization can be used to improve poor (Φ,Ψ) dihedral angles of protein pseudo-structures. These protein pseudo-structures can either be predicted or be experimentally determined. More applications of the protein structure idealization problem will be studied.

Conclusion

We have introduced the protein structure idealization problem and performed our first attempt to solve it. The experiment results show that idealized structures always exist with small changes on the coordinates. Furthermore, the idealized backbone structures have significantly better free energy and (Φ,Ψ) dihedral angle distributions. Therefore, protein structures can be modeled accurately with dihedral angles and ideal bond lengths and angles, and it is feasible to predict protein backbone and side-chain structures by searching the dihedral angle space.

Our protein structure idealization algorithm may be improved in several ways. Since our scoring functions are very simple with all weights w_a=1.0 in the current implementation, there is space for improvements. We are also looking forward to adding protein-ligand interaction energy to our scoring function and to studying the effect of idealization on protein-ligand interactions. Moreover, since some atoms are more flexible than others, we can also set different search spaces for different atoms in our algorithm. For example, when idealizing X-ray protein structures, the search space of each atom could be selected according to its B-factor. We can also adopt a divide-and-conquer algorithm in our algorithm to find the global, rather than local, optimal idealized structure. Specifically, we can divide the protein structure into small fragments, idealize each fragment separately, and merge idealized fragments. The key is to divide the protein structure by a tree decomposition of the interaction graph and to remember the optimal idealized fragment for each possible configuration of atoms with interactions to external atoms. Similar ideas have already been used successfully to improve the speed and the accuracy of backbone and side-chain structure predictions [8, 9, 25, 26, 31, 32].

Our protein structure idealization algorithm can also correct modelling errors of protein structures in PDB [17]. In fact, previous research indicates that many bond conformations and side-chain rotamers are likely incorrect in PDB, and it is useful to have an automated mechanism to fix these problems [33, 34]. Thus, we can address these problems by idealizing all protein structures in PDB with our protein structure idealization algorithm and using our specially tuned scoring functions.

The idealized version of the PDB [17] provides new protein structure references to study protein structures and functions. For example, we can rebuild fragment and rotamer libraries based on the idealized PDB. It would then be more intuitive to use the idealized fragment or rotamer libraries in the protein backbone or side-chain structure prediction algorithms searching the dihedral angle space. Thus, we expect to see some improvements of the accuracy of these algorithms with the idealized fragment and rotamer libraries. Therefore, we also provide a new approach for discovering unusual atoms and bonds by comparing the idealized and the original PDB structures. Although most of these unusual atoms and bonds are due to errors, we expect to discover some biochemical insights that assist in understanding protein functions.

References

Engh RA, Huber R: Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallogr Sect A. 1991, 47: 392-400. 10.1107/S0108767391001071.
Article Google Scholar
Engh RA, Huber R: Structure quality and target parameters. Vol F Int Tables Crystallogr. 2006, 382-392. chap. 18.3.
Chapter Google Scholar
Improved efficiency of protein structure calculations from NMR data using the program DIANA with redundant dihedral angle constraints. J Biomol NMR. 1991, 1 (4): 447-456.
Article PubMed Google Scholar
Güntert P, Mumenthaler C, Wüthrich K: Torsion angle dynamics for NMR structure calculation with the new program DYANA. J of Mol Biol. 1997, 273: 283-98. 10.1006/jmbi.1997.1284. [http://www.ncbi.nlm.nih.gov/pubmed/9367762], []
Article Google Scholar
Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997, 268: 209-225.
Article CAS PubMed Google Scholar
Simons KT, Strauss C, Baker D: Prospects for ab initio protein structural genomics. J Mol Biol. 2001, 306: 1191-1199.
Article CAS PubMed Google Scholar
Li SC, Bu D, Xu J, Li M: Fragment-HMM: a new approach to protein structure prediction. Protein Sci. 2008, 17 (11): 1925-1934.
Article PubMed Central CAS PubMed Google Scholar
Canutescu AA, Shelenkov AA, Dunbrack RL: A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 2003, 12 (9): 2001-2014. [http://dx.doi.org/10.1110/ps.03154503], []
Article PubMed Central CAS PubMed Google Scholar
Krivov GG, Shapovalov MV, Dunbrack RL: Improved prediction of protein side-chain conformations with SCWRL4. Proteins: Struct Funct Bioinformatics. 2009, 77 (4): 778-795. 10.1002/prot.22488. [http://dx.doi.org/10.1002/prot.22488], []
Article CAS Google Scholar
Kuszewski J, Gronenborn AM, Clore GM: Improving the quality of NMR and crystallographic protein structures by means of a conformational database potential derived from structure databases. Protein Sci. 1996, 5 (6): 1067-1080.
Article PubMed Central CAS PubMed Google Scholar
Kuszewski J, Gronenborn AM, MClore G: Improvements and extensions in the conformational database potential for the refinement of NMR and X-ray structures of proteins and nucleic acids. J Magn Reson. 1997, 125: 171-177.
Article CAS PubMed Google Scholar
Rice LM, Brünger AT: Torsion angle dynamics: Reduced variable conformational sampling enhances crystallographic structure refinement. Proteins: Struct Funct Genet. 1994, 19 (4): 277-290. 10.1002/prot.340190403. [http://dx.doi.org/10.1002/prot.340190403], []
Article CAS Google Scholar
Stein EG, Rice LM, Brünger AT: Torsion-angle molecular dynamics as a new efficient tool for NMR structure calculation. J Magn Reson. 1997, 124: 154-164.
Article CAS PubMed Google Scholar
Dunbrack RL, Cohen FE: Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 1997, 6 (8): 1661-1681. [http://www.proteinscience.org/cgi/content/abstract/6/8/1661], []
Article PubMed Central CAS PubMed Google Scholar
Evans PR: An introduction to stereochemical restraints. Acta Crystallogr D. 2007, 63: 58-61.
Article PubMed Central CAS PubMed Google Scholar
Jaskolski M, Gilski M, Dauter Z, Wlodawer A: Stereochemical restraints revisited: how accurate are refinement targets and how much should protein structures be allowed to deviate from them?. Acta Crystallogr D. 2007, 63 (5): 611-620.
Article CAS PubMed Google Scholar
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242.
Article PubMed Central CAS PubMed Google Scholar
Wang G, Dunbrack RL: PISCES: a protein sequence culling server. Bioinformatics. 2003, 19 (12): 1589-1591.
Article CAS PubMed Google Scholar
Wang G, Dunbrack RL: PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005, 33 (Web-Server-Issue): 94-98.
Article Google Scholar
Yang Y, Zhou Y: Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci. 2008, 17 (7): 1212-1219.
Article PubMed Central CAS PubMed Google Scholar
Yang Y, Zhou Y: Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins. 2008, 72 (2): 793-803.
Article CAS PubMed Google Scholar
Lattman E, Loll PJ, Loll P: Protein Crystallography: A Concise Guide. Johns Hopkins University Press 2008.
Google Scholar
Rutgers UCSD: Protein data bank contents guide. 2011,http://www.wwpdb.org/documentation/format33/v3.3.html,
Google Scholar
Shapovalov MV, Dunbrack RL: A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure London England 1993. 2011, 19 (6): 844—858.[http://www.ncbi.nlm.nih.gov/pubmed/21645855], []
Google Scholar
Xu J: Rapid protein side-chain packing via tree decomposition. Springer 2005.
Book Google Scholar
Xu J, Jiao F, Berger B: A tree-decomposition approach to protein structure prediction. Fourth Int IEEE Comput Soc Comput Syst Bioinformatics Conf. 2005, 247-256. [http://view.ncbi.nlm.nih.gov/pubmed/16447982], []
Google Scholar
Kabsch W, Sander C: Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637.
Article CAS PubMed Google Scholar
Joosten RP, te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G: A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011, 39 (suppl 1): D411-D419.
Article PubMed Central CAS PubMed Google Scholar
Laskowski RA, MacArthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structures. J of Appl Crystallogr. 1993, 26 (2): 283-291. 10.1107/S0021889892009944. [http://dx.doi.org/10.1107/S0021889892009944], []
Article CAS Google Scholar
Jung JW, Yee A, Wu B, Arrowsmith CH, Lee W: Solution structure of YKR049C, a putative redox protein from Saccharomyces cerevisiae. J Biochem Mol Biol. 2005, 38 (5): 500-504.
Article Google Scholar
Xu Y, Xu D: Protein threading using PROSPECT: Design and evaluation. Proteins: Struct Funct Genet. 2000, 40 (3): 343-354. 10.1002/1097-0134(20000815)40:3<343::AID-PROT10>3.0.CO;2-S. [http://dx.doi.org/10.1002/1097-0134(20000815)40:3], []
Article CAS Google Scholar
Kim D, Xu D, Guo JT, Ellrott K, Xu Y: PROSPECT II: protein structure prediction program for genome-scale applications. Protein Eng. 2003, 16 (9): 641-650. [http://view.ncbi.nlm.nih.gov/pubmed/14560049], []
Article CAS PubMed Google Scholar
Hooft RWW, Vriend G, Sander C, Abola EE: Errors in protein structures. NATURE. 1996, 381 (6580): 272.
Article CAS PubMed Google Scholar
Joosten RP, Joosten K, Cohen SX, Vriend G, Perrakis A: Automatic rebuilding and optimization of crystallographic structures in the Protein Data Bank. Bioinformatics. 2011, 27 (24): 3392-3398. [http://dx.doi.org/10.1093/bioinformatics/btr590], []
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

This work was supported by a Startup Grant at City University of Hong Kong [7002731], National Basic Research Program of China [2012CB316500], an NSERC Grant [OGP0046506], Canada Research Chair program, an NSERC Collaborative Grant, OCRiT, Premier’s Discovery Award, Killam Prize and SHARCNET.

Author information

Authors and Affiliations

University of Waterloo, Ontario, Canada
Xuefeng Cui, Babak Alipanahi & Ming Li
City University of Hong Kong, Hong Kong, China
Shuai Cheng Li
Chinese Academy of Sciences, Beijing, China
Dongbo Bu

Authors

Xuefeng Cui
View author publications
You can also search for this author in PubMed Google Scholar
Shuai Cheng Li
View author publications
You can also search for this author in PubMed Google Scholar
Dongbo Bu
View author publications
You can also search for this author in PubMed Google Scholar
Babak Alipanahi
View author publications
You can also search for this author in PubMed Google Scholar
Ming Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ming Li.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The algorithm implementation and all experiments were done by XC; the protein structure idealization algorithm was designed by XC and SCL; the CULLPDB_PC30_RES1.6_R0.25 experiment was designed by XC, SCL and DB; the NMR experiment was designed by XC and BA; the project was directed by ML; All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Cui, X., Li, S.C., Bu, D. et al. Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?. Algorithms Mol Biol 8, 5 (2013). https://doi.org/10.1186/1748-7188-8-5

Download citation

Received: 20 December 2012
Accepted: 05 February 2013
Published: 25 February 2013
DOI: https://doi.org/10.1186/1748-7188-8-5

Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?