- Open Access
On the protein folding problem in 2D-triangular lattices
© Islam and Rahman; licensee BioMed Central Ltd. 2013
- Received: 6 March 2013
- Accepted: 20 November 2013
- Published: 26 November 2013
In this paper, we present a novel approximation algorithm to solve the protein folding problem in HP model. Our algorithm is polynomial in terms of the length of the given HP string. The expected approximation ratio of our algorithm is for n ≥ 6, where n2 is the total number of H’s in a given HP string. The expected approximation ratio tends to reach 1 for large values of n. Hence our algorithm is expected to perform very well for larger HP strings.
- Protein folding
- Approximation ratio
- HP model
A long standing problem in Molecular Biology and Biochemistry is to determine the three dimensional structure of a protein given only the sequence of amino acid residues that compose protein chains. This problem is known as the Holy Grail of Computational Molecular Biology, also termed as “cracking the second half of the genetic code”. There exist a variety of models attempting to simplify the problem by abstracting only the “essential physical properties” of real proteins. In these models, the three-dimensional space is often represented by a lattice. Residues which are adjacent (i.e., covalently linked) in the primary sequence must be placed at adjacent points in the lattice.
In this paper, we consider the Hydrophobic-Polar Model, HP Model for short, introduced by Dill . The HP model is based on the assumption that hydrophobicity is the dominant force in protein folding. This model simplifies a protein’s primary structure to a linear chain of beads. Each bead represents an amino acid, which can be one of two types: H (hydrophobic or nonpolar) or P (hydrophilic or polar). Conformations of proteins are embedded in either a two-dimensional or three-dimensional square/triangular/hexagonal lattice. A conformation of a protein is simply a self-avoiding walk along the lattice. The goal of the protein folding problem is to find a conformation of the protein sequence on the lattice such that the overall energy is minimized, for some reasonable definition of energy. Each amino acid in the chain is represented by occupying one lattice point, connected to its chain neighbour(s) on adjacent lattice points. An optimal embedding is one that maximizes the number of H-H contacts which are not adjacent in the amino acid chain. So, in effect, an input to the problem is a finite string over the alphabet (H, P)+. Often, in what follows, the input strings to our problem will be referred to as HP strings. For a more biological background and motivations the readers are referred to [1, 2].
A number of approximation algorithms have been developed for the HP model on the 2D square lattice, 3D cubic lattice, triangular lattice and the face-cantered-cubic (FCC) lattice [3–5]. The first approximation algorithm developed for this problem on the square lattice by Hart and Istrail has an approximation ratio of 1/4 . The approximation ratio for this problem was improved to 1/3 by Newman . The algorithm presented in  can be generalized to an approximation algorithm for the problem on the 3D cubic lattice. In , a general method for protein folding on the HP model was presented by Hart and Istrail. This method can be applied to a large class of lattice models. Hart and Istrail  provided the first approximation algorithn for the problem on the side-chain model which can be applied to 2D square, 3D cubic lattices, and FCC lattices. The approximation ratio achieved by  remains the best ratio for an approximation algorithm in any 3D HP-models to date. In , the authors also illustrate the transformation of approximation algorithm from lattice models to off-lattice models. Another approximation algorithm, based on different geometric ideas was presented in . Heun  presented a linear-time approximation algorithm for protein folding in the HP side chain model on extended cubic lattice having approximate ration 0.84. In , the authors presented an approximation algorithm with approximation ratio 0.17 that folds an arbitrary protein sequence in the 2D hexagonal lattice HP-model. Readers are referred to a survey of Istrail and Lam  for further reading.
In , the authors proposed a genetic algorithm for the protein folding problem in HP model in 2D square lattice. In [12, 13], hybrid genetic algorithm was presented for HP model in 2D triangular lattice and 3D FCC lattice. The authors in  first proposed the pull move set for rectangular lattices, which is used in the HP model under a variety of local search methods. They also showed the completeness and reversibility of the pull move set for rectangular grid lattices. In , the authors extended the idea of pull move set in local-search approach for finding an optimal embedding in 2D triangular grid and the FCC lattice in 3D.
In this paper, we present an approximation algorithm for protein folding in 2D-triangular lattice. To the best of our knowledge the best approximation ratio for this problem was obtained by Agarwalla et al. , which is . For our algorithm we do a probabilistic analysis and deduce that the expected approximation ratio of our algorithm is for n ≥ 6, where n2 is the total number of H in a given HP string. Clearly our approximation ratio depends on n, which in turn depends on the number of H in the HP string. For large values of n, this ratio tends to reach 1. So it can be expected that our algorithm would provide very close to optimal results for large values of n.
The rest of the paper is organized as follows. In Section ‘Preliminaries’, we define some notations and notions. Section ‘Our approach’ describes our approach to solve the problem. In Section ‘Expected approximation ratio ratio’ we deduce the expected approximation ratio. We briefly conclude in Section ‘Conclusion’.
In this section, we present some notions and definitions (mostly in relation to the underlying lattice) that we need to explain our algorithm. In a triangular lattice, each lattice point has six neighbouring lattice points . In the literature it is also called a hexagonal lattice. Note that, by definition, a lattice is infinite. However, in what follows, when we refer to a lattice we will refer to a finite part of it. This finite part of the lattice would essentially be a hexagon. We now define some notions related to a hexagon in the context of our approach. Note that a hexagon is said to be perfect (or regular) if it has six equal sides and six equal angles. Throughout the paper, when we refer to a hexagon we assume that the opposite sides of it are parallel having the same length. Also, when we consider a non-regular hexagon we assume that the sides of it can be grouped into two groups based on their length. In particular, two of its sides (that are parallel to each other) have a particular length, say p and the other four sides have a different length, say m. Clearly, when p = m, we have a regular hexagon. Following the above discussion, it would be useful to define the former couple of sides of the (non-regular) hexagon (i.e., that having a length of p each) as -sides and the latter four sides (i.e., that having a length of m each) as -sides.
The length of the hexagon (or lattice) is the total number of points along the -sides. Figure 1, shows a hexagon of length 6. A region in the hexagon is a group of the lattice points such that each point in it has at least two neighbours from within it. Similar to the boundary of a hexagon we also define the boundary of a region. The boundary of a region consists of those lattice points that have fewer than six neighbours within the region. A region must not contain any point such that deleting that point creates two separate regions. From a graph theoretic concept, the region cannot have a cut vertex. Also all the lattice points inside the boundary of a region are parts of the region. So, by definition, only the boundary itself cannot be considered as a region unless there are no points inside the boundary at all. In Figure 1, the black vertices comprise a region (which has only one point inside the boundary). The size of a region is the total number of lattice points inside it including the boundary points.
We also introduce a notion of a bend for a hexagon if it’s length is greater than it’s depth. A bend refers to the combined bent line along the 2 -sides to the right (a bend could be defined identically considering the two -sides to the left as well. However, for our purpose, we exclude that option from our definition). A bend is illustrated in Figure 1. Notice that if the depth of such a hexagon is x, then a bend contains 2x + 1 points. There is a total of ℓ bends in a hexagon, where ℓ is the length of the hexagon. Removing all bends from the hexagon leaves a total of x2 lattice points (see Figure 1).
We use the usual notion of a run in an HP string. In particular, a run in an HP string denotes the consecutive H’s or P’s. For example, in the HP string HHHPPHHPHHHH, we have a run of 3 H’s, followed by a run of 2 P’s and so on. Here the run-length of the first run of H (P) is 3 (2). We sometimes will use the term H-run (P-run) to indicate a run of H’s (P’s). The longest H-run (P-run) of a string denotes the run of H (P) which has the highest run-length among all the H-runs (P-runs) of the string. For the sake of conciseness, the HP strings shall often be represented as H’s and P’s with the corresponding run-lengths as their powers. So, the HP string will often be conveniently represented by . Further, we use S(i), 1 ≤vi ≤ |S| to denote the i th character of the HP string . Similarly, denotes the j th run of . For example, refers to H3, refers to P2 and so on. We will use SumH as the sum of the run-lengths of all the H-runs of a given string . We end this section with a formal definition of the problem we handle in this paper.
Problem 1. Given an HP string , the problem is to place the HP string on a triangular lattice such that the total number of bonds are maximized.
Our approach is a simple and intuitive one. Our idea is to identify the length and depth of a suitable hexagon and then try to put all the H’s of a particular H-run inside the hexagon and put the P’s of the following P-run (if any) outside that hexagon. The length and depth of the hexagon depend on SumH. The motivation here is to get the maximum number of bonds between H’s. Note carefully that if we can fully fill a hexagon with z lattice points with a single H-run and get a total of k edges, the number of total bonds will be k - z + 1. And if the number of H-run is n(H) then in this case the total number of bonds will be k - z + n(H). We illustrate the approach with an example below. In the figures throughout this paper the bonds and edges are not shown explicitly. A connection between 2 lattice points indicate the presence of 2 H’s that are adjacent in the input HP string.
In Figure 10, we have a hexagon of 37 points and after filling the hexagon fully we get a total of 90 edges. It is easy to verify that the total number of bonds are 62 which is equal to k - z + n(H). Notably, if two hexagons have the same number of lattice points and are filled up fully with H by a given HP string, the hexagon with higher number of total edges have the higher number of total bonds as the difference between the total number of the edges and that of bonds is a constant.
Now that we have discussed our main approach to fill up the hexagon, we can shift our focus to the question of whether we can accommodate all the H-runs of the input HP string within the current hexagon. Recall that our goal is to increase the number of edges (and hence the total number of bonds)as much as possible. We have the following useful lemmas.
If two hexagons have the same number of lattice points then the hexagon with the higher depth will not have lesser number of edges.
Suppose that the total number of points is z and the depth of the hexagon x and length ℓ. If we remove the top and bottom rows and put the corresponding points along the other rows of the hexagon, new bends to the either side of the hexagon will be created and each bend will have 2 × (x - 1) + 1 points. The new hexagon will have depth x - 1 and additional bends (see Figure 11).
Now deleting two rows decrease 2 × (2ℓ + ℓ - 1) or 6ℓ - 2 edges. On the other hand, increasing the bends will increase or or or edges. This is clearly less than or equal to 6ℓ - 2 as or. Hence the result follows. □
Lemma 2. Suppose we have a regular hexagon containing N points. Now we reduce the length of this hexagon to get another hexagon such that contains N points as well. Then will have lesser number of edges inside it than that of.
Now deleting a bend decreases 2 × 2x + 1 + 2x or 6x + 1 edges. Out of the 2x + 1 points of the bend, 2x - 2 points will create new sides. These points create additional 2 × (2 × (x - 1) + x - 2) or 6x - 8 edges. And the rest of the 3 points can contribute to at most 8 edges. So we get at most a total of 6x edges whereas we loose 6x + 1 edges. Hence the result follows. □
Assuming that we can fill up all the points of the hexagon, the total number of edges (as well as the total number of bonds) will be maximum if, and only if, the hexagon is a regular hexagon.
Lemma 3 follows readily from Lemmas 1 and 2. □
Notably, to fill up a regular hexagon with depth x at least one H-run having length 2x + 1 is needed. Besides, we need at least two H-runs of length 2x, three of length 2x - 2, three of length 2x - 4… three H-runs of length 1 or 2 (depending on the size of x) in the input. The string in Example 1 presented before meets this criteria assuming x = 3. To explain a bit more, note that, in the HP string of Example 1 we have one H-run with run-length 7, two H-runs with run-length 6, three H-runs with run length 4 and the rest of the H-runs have run-length 2. Another example is given below where we cannot put all H’s in a regular hexagon.
Steps of the algorithm
Now we describe the steps of our algorithm below.
- Step 1
Let SumH of the input HP string is z and the longest H-run is having run-length k. Compute. Set globalB = 0.
- Step 2
Set and construct a hexagon with length ℓ and depth x where each bend will contain 2x + 1 points. If z ≥ ℓ(2x + 1) + x2 then add new p bends such that (ℓ + p)(2x + 1) + x2 ≥ z ≥ (ℓ + p + 1)(2x + 1) + x2. If z > (ℓ + p)(2x + 1) + x2 then create a distorted hexagon having the last bend (i.e., on the boundary of the hexagon) containing z - (ℓ + p)(2x + 1) + x2 points.
- Step 3For each of the H-runs and P-runs, starting from and wrapping around the end (if applicable) execute the following steps. For an H-run, execute Step 3.a, Step 3.b and Step 3.c; for a P-run execute Step 3.d.
- Step a[for H-runs] If the run length of the H-run is less than 3 then we take lattice points on the boundary of the hexagon. Otherwise, we try to find a region from the remaining unoccupied points as follows. Here the total number of points in the region must be equal to the run-length of the current H-run and at least two of these points must be boundary points of the hexagon. We find the region executing the following steps (Steps 3.a.i to 3.a.iv). We ensure that the region property is maintained as we proceed by including the points one after another. In the following steps we will use to refer the the region we are constructing iteratively.
- Step i
Take two points on the boundary of the hexagon. These are the first two points of the region .
- Step ii
Identify the unoccupied points in the hexagon such that each of those has two neighbouring lattice points in . Find the point having the highest depth among these points (breaking ties arbitrarily) and add this point to . Thus we increase the size (by one).
- Step iii
If no such point is found then go to Step 3.b.
- Step iv
If the size of is still less than the run length of the current H-run, go to Step 3.a.ii.
- Step i
- Step b
[for H-runs]Fill up the lattice points of the identified region ( ) with the H-run.
- Step c
[for H-runs]Put the rest of the H-runs (if any) outside the hexagon in a single row.
- Step d
[for P-runs]Put the P-run outside the hexagon in two rows. The first P of the run will be a neighbour of the previous H-run’s last H, while the last P of the run will be a neighbour of the next H-run’s first H.
- Step a
- Step 4
Count the total number of bonds, B.
- Step 5
If globalB > B, return globalB.
- Step 6
Otherwise set globalB = B and x = x - 1.
- Step 7
If x = 1, return B; otherwise go to Step 2.
A brief discussion on Step 3.a.ii is in order. In Step 3.a.ii, our algorithm always chooses the points having the higher depths, with ties broken arbitralily. In some cases, some lattice points may remain unoccupied. Example 2 elaborates such cases. If some lattice points remain unoccupied we continue the algorithm. Some H-runs or part thereof may lie outside the hexagon (Step 3.c of the algorithm). We count the total number of bonds and compare it with previous/latter hexagon (increased or decreased depth) as applicable. So, if the algorithm fails to insert all the H-runs within a hexagon it does not mean it fails in total, as we put the rest of the H-runs or parts thereof (as appropriate) outside the hexagon. The algorithm is formally presented in Algorithm 1.
Algorithm 1 Finding the Folding
Given a region consisting of lattice points, a starting and an ending points such that those are boundary points of the hexagon, there always exists a path that starts at the starting point, ends at the ending point visiting each point in the region exactly once.
Our algorithm runs in polynomial time as discussed below. Firstly, the algorithm iterates over at most x times. Now we have, because, z = 3 × x × (x + 1) which is proved in Lemma 4 in the following section. In each iteration we have to find a region for the current H-run. If a H-run has run-length ℓ, then Step 3.a in the algorithm needs O(ℓ2) time as Step 3.a.ii needs at most O(ℓ) time. So total time needed to perform this operation in each iteration, is at most O(z2). As each of the other steps need constant time, the complete runtime of the algorithm is.
In this section, we are going to deduce the expected approximation ratio of our algorithm. As the total number of H-runs and run-lengths thereof may vary, in this analysis, we will find the expected number of H-runs and the expected run-length of each of those. These two values will depend only on SumH. Consider a regular hexagon with depth x. Assume that the total number of points in the hexagon is z. Then we have the following lemma.
Suppose we have a regular hexagon with depth x and z points. The total number of bonds, B, is 6x2 when all the points are filled with H’s of a single H-run.
This completes the proof. □
Now the following lemma considers non-regular hexagons as well.
Consider a hexagon (either regular or non regular) having n2 points. Then, the total number of bonds B is less than or equal 2 × n(n - 1).
Hence the result follows for regular hexagons. Clearly, by Lemma 3 the result applies for non-regular hexagons as well. □
We will now deduce the approximation ratio based on an expected value of the total number of bonds. We assume that all H-runs have equal length. This assumption is valid in the context of our analysis and does not lose generality as follows. In what follows, we will be working with the expected number of H-runs and the expected length (say k Ex ) of an H-run. Hence in our analysis, each H-run will be assumed to have length k Ex . We will now compute the expected values of the total number of edges (bonds), E Ex (B Ex ) under this assumption.
Note that according to our approach, the value of x is dependent on SumH. For this analysis, we now derive the expected run-length of H for a given HP string where SumH is n2. This problem can be mapped into the problem of Integer Partitioning as defined below.
Problem 2. Given an integer Y, the problem of Integer Partitioning aims to provide all possible ways of writing Y, as a sum of positive integers.
where γ is the famous Euler’s constant.
This is the final expected approximation ratio.
Expected approximation ratio for different values of n
z = n 2
In this paper, we have given a novel approximation algorithm to solve the protein folding problem in HP model introduced by Dill . Our algorithm is polynomial and the expected approximation ratio is for n ≥ 6 where n2 is total number of H in a given HP string. For larger HP strings it is expected that our algorithm will give better result than the algorithm provided in , which currently gives the best approximation ratio for 2D-triangular lattice. Additionally, our expected approximation ratio tends to reach one for large values of n. Hence our algorithm is expected to perform very well for larger HP strings.
The authors would like to thank Md. Mahbubul Hasan for fruitful discussions. The authors gratefully acknowledge the fruitful comments and suggestions of the anonymous reviewers which aided in improving the presentation of the paper. This research work was conducted as part of the M.Sc. Engg. thesis work of Islam under the supervision of Rahman.
- Dill KA: Theory for the folding and stability of globular-proteins. Biochemistry. 1985, 24 (6): 1501-1509.View ArticlePubMedGoogle Scholar
- Agarwala R, Batzoglou S, Dancík V, Decatur SE, Hannenhalli S, Farach M, Muthukrishnan S, Skiena S: Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. J Comput Biol. 1997, 4 (3): 275-296.View ArticlePubMedGoogle Scholar
- Hart WE, Istrail S: Fast protein folding in the hydrophobic-hydrophillic model within three-eights of optimal. J Comput Biol. 1996, 3: 53-96.View ArticlePubMedGoogle Scholar
- Newman A: A new algorithm for protein folding in the HP model. In SODA ACM/SIAM. 2002, 876-884.Google Scholar
- Newman A, Ruhl M: Combinatorial problems on strings with applications to protein folding. LATIN, Volume 2976 of Lecture Notes in Computer Science. 2004, 369-378. SpringerGoogle Scholar
- Hart WE, Istrail S: Invariant patterns in crystal lattices: implications for protein folding algorithms (Extended abstract). CPM, Volume 1075 of Lecture Notes in Computer Science. Springer,1996, 288-303.Google Scholar
- Hart WE, Istrail S: Lattice and off-lattice side chain models of protein folding: linear time structure prediction better than 86% of optimal. J Comput Biol. 1997, 4 (3): 241-259.View ArticlePubMedGoogle Scholar
- Heun V: Approximate Protein Folding in the HP Side Chain Model on Extended Cubic Lattices. Lecture Notes in Computer Science 1643: Proceedings of the 7th Annual European Symposium on Algorithms. 1998, 212-223. Springer-VerlagGoogle Scholar
- Jiang M, Zhu B: Protein folding on the hexagonal lattice in the Hp model. J. Bioinformatics Computat Biol. 2005, 3: 19-34. 10.1142/S0219720005000850.View ArticleGoogle Scholar
- Istrail S, Lam F: Combinatorial algorithms for protein folding in lattice models: a survey of mathematical results. Commun Inf Syst. 2009, 9 (4): 303-346.Google Scholar
- Unger R, Moult J: Genetic algorithms for protein folding simulations. J Mol Biol. 1993, 231: 75-81.View ArticlePubMedGoogle Scholar
- Hoque T, Chetty M, Dooley LS: A hybrid genetic algorithm for 2D FCC hydrophobic-hydrophilic lattice model to predict protein folding. Australian Conference on Artificial Intelligence, Volume 4304 of Lecture Notes in Computer Science. 2006, 867-876. SpringerGoogle Scholar
- Hoque T, Chetty M, Sattar A: Protein folding prediction in 3D FCC HP lattice model using genetic algorithm. IEEE Congress on Evolutionary Computation. 2007, 4138-4145. IEEEGoogle Scholar
- Lesh N, Mitzenmacher M, Whitesides S: A Complete and Effective Move Set for Simplified Protein Folding. 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB) 2003. 2003, 188-195. ACM PressGoogle Scholar
- Böckenhauer HJ, Ullah AZMD, Kapsokalivas L, Steinhöfel K: A local move set for protein folding in triangular lattice models. WABI, Volume 5251 of Lecture Notes in Computer Science. Springer,2008, 369-381.Google Scholar
- Kessler I, Livingston M: The expected number of parts in a partition of n. Monatshefte für Mathematik. 1976, 81 (3): 203-212. 10.1007/BF01303193.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.