Open Access

On the protein folding problem in 2D-triangular lattices

  • Abu Sayed Md Sohidull Islam1 and
  • Mohammad Sohel Rahman1Email author
Algorithms for Molecular Biology20138:30

https://doi.org/10.1186/1748-7188-8-30

Received: 6 March 2013

Accepted: 20 November 2013

Published: 26 November 2013

Abstract

Background

In this paper, we present a novel approximation algorithm to solve the protein folding problem in HP model. Our algorithm is polynomial in terms of the length of the given HP string. The expected approximation ratio of our algorithm is 1 - 2 log n n - 1 for n ≥ 6, where n2 is the total number of H’s in a given HP string. The expected approximation ratio tends to reach 1 for large values of n. Hence our algorithm is expected to perform very well for larger HP strings.

Keywords

Protein foldingApproximation ratioAlgorithmsHP model

Background

A long standing problem in Molecular Biology and Biochemistry is to determine the three dimensional structure of a protein given only the sequence of amino acid residues that compose protein chains. This problem is known as the Holy Grail of Computational Molecular Biology, also termed as “cracking the second half of the genetic code”. There exist a variety of models attempting to simplify the problem by abstracting only the “essential physical properties” of real proteins. In these models, the three-dimensional space is often represented by a lattice. Residues which are adjacent (i.e., covalently linked) in the primary sequence must be placed at adjacent points in the lattice.

In this paper, we consider the Hydrophobic-Polar Model, HP Model for short, introduced by Dill [1]. The HP model is based on the assumption that hydrophobicity is the dominant force in protein folding. This model simplifies a protein’s primary structure to a linear chain of beads. Each bead represents an amino acid, which can be one of two types: H (hydrophobic or nonpolar) or P (hydrophilic or polar). Conformations of proteins are embedded in either a two-dimensional or three-dimensional square/triangular/hexagonal lattice. A conformation of a protein is simply a self-avoiding walk along the lattice. The goal of the protein folding problem is to find a conformation of the protein sequence on the lattice such that the overall energy is minimized, for some reasonable definition of energy. Each amino acid in the chain is represented by occupying one lattice point, connected to its chain neighbour(s) on adjacent lattice points. An optimal embedding is one that maximizes the number of H-H contacts which are not adjacent in the amino acid chain. So, in effect, an input to the problem is a finite string over the alphabet (H, P)+. Often, in what follows, the input strings to our problem will be referred to as HP strings. For a more biological background and motivations the readers are referred to [1, 2].

A number of approximation algorithms have been developed for the HP model on the 2D square lattice, 3D cubic lattice, triangular lattice and the face-cantered-cubic (FCC) lattice [35]. The first approximation algorithm developed for this problem on the square lattice by Hart and Istrail has an approximation ratio of 1/4 [3]. The approximation ratio for this problem was improved to 1/3 by Newman [4]. The algorithm presented in [3] can be generalized to an approximation algorithm for the problem on the 3D cubic lattice. In [6], a general method for protein folding on the HP model was presented by Hart and Istrail. This method can be applied to a large class of lattice models. Hart and Istrail [7] provided the first approximation algorithn for the problem on the side-chain model which can be applied to 2D square, 3D cubic lattices, and FCC lattices. The approximation ratio achieved by [7] remains the best ratio for an approximation algorithm in any 3D HP-models to date. In [7], the authors also illustrate the transformation of approximation algorithm from lattice models to off-lattice models. Another approximation algorithm, based on different geometric ideas was presented in [5]. Heun [8] presented a linear-time approximation algorithm for protein folding in the HP side chain model on extended cubic lattice having approximate ration 0.84. In [9], the authors presented an approximation algorithm with approximation ratio 0.17 that folds an arbitrary protein sequence in the 2D hexagonal lattice HP-model. Readers are referred to a survey of Istrail and Lam [10] for further reading.

In [11], the authors proposed a genetic algorithm for the protein folding problem in HP model in 2D square lattice. In [12, 13], hybrid genetic algorithm was presented for HP model in 2D triangular lattice and 3D FCC lattice. The authors in [14] first proposed the pull move set for rectangular lattices, which is used in the HP model under a variety of local search methods. They also showed the completeness and reversibility of the pull move set for rectangular grid lattices. In [15], the authors extended the idea of pull move set in local-search approach for finding an optimal embedding in 2D triangular grid and the FCC lattice in 3D.

In this paper, we present an approximation algorithm for protein folding in 2D-triangular lattice. To the best of our knowledge the best approximation ratio for this problem was obtained by Agarwalla et al. [2], which is 6 11 . For our algorithm we do a probabilistic analysis and deduce that the expected approximation ratio of our algorithm is 1 - 2 log n n - 1 for n ≥ 6, where n2 is the total number of H in a given HP string. Clearly our approximation ratio depends on n, which in turn depends on the number of H in the HP string. For large values of n, this ratio tends to reach 1. So it can be expected that our algorithm would provide very close to optimal results for large values of n.

The rest of the paper is organized as follows. In Section ‘Preliminaries’, we define some notations and notions. Section ‘Our approach’ describes our approach to solve the problem. In Section ‘Expected approximation ratio ratio’ we deduce the expected approximation ratio. We briefly conclude in Section ‘Conclusion’.

Preliminaries

In this section, we present some notions and definitions (mostly in relation to the underlying lattice) that we need to explain our algorithm. In a triangular lattice, each lattice point has six neighbouring lattice points [2]. In the literature it is also called a hexagonal lattice. Note that, by definition, a lattice is infinite. However, in what follows, when we refer to a lattice we will refer to a finite part of it. This finite part of the lattice would essentially be a hexagon. We now define some notions related to a hexagon in the context of our approach. Note that a hexagon is said to be perfect (or regular) if it has six equal sides and six equal angles. Throughout the paper, when we refer to a hexagon we assume that the opposite sides of it are parallel having the same length. Also, when we consider a non-regular hexagon we assume that the sides of it can be grouped into two groups based on their length. In particular, two of its sides (that are parallel to each other) have a particular length, say p and the other four sides have a different length, say m. Clearly, when p = m, we have a regular hexagon. Following the above discussion, it would be useful to define the former couple of sides of the (non-regular) hexagon (i.e., that having a length of p each) as -sides and the latter four sides (i.e., that having a length of m each) as -sides.

The discussion that follows can be better understood with the help of Figure 1. As has been mentioned above the finite portion of the lattice of our interest can be seen as a hexagon, the boundary of which consists of those lattice points that have fewer than six neighbours within the hexagon. An edge is formed by two neighbouring lattice points. If the lattice points are filled by H, the two neighbouring H’s also form an edge. If two H’s are non-adjacent in an HP string and placed on neighbouring lattice points to form an edge, they form a bond. The points on the boundary are referred to as the boundary points. The depth of a point in a lattice is the minimum number of points it needs to traverse to reach any boundary point. Naturally, the depth of a boundary point is 0. The depth of a hexagon is the maximum depth of all points in the hexagon. In Figure 1, the depth of the hexagon is 2.
Figure 1

Lattice. In this figure, a lattice and some related notions are illustrated.

The length of the hexagon (or lattice) is the total number of points along the -sides. Figure 1, shows a hexagon of length 6. A region in the hexagon is a group of the lattice points such that each point in it has at least two neighbours from within it. Similar to the boundary of a hexagon we also define the boundary of a region. The boundary of a region consists of those lattice points that have fewer than six neighbours within the region. A region must not contain any point such that deleting that point creates two separate regions. From a graph theoretic concept, the region cannot have a cut vertex. Also all the lattice points inside the boundary of a region are parts of the region. So, by definition, only the boundary itself cannot be considered as a region unless there are no points inside the boundary at all. In Figure 1, the black vertices comprise a region (which has only one point inside the boundary). The size of a region is the total number of lattice points inside it including the boundary points.

We also introduce a notion of a bend for a hexagon if it’s length is greater than it’s depth. A bend refers to the combined bent line along the 2 -sides to the right (a bend could be defined identically considering the two -sides to the left as well. However, for our purpose, we exclude that option from our definition). A bend is illustrated in Figure 1. Notice that if the depth of such a hexagon is x, then a bend contains 2x + 1 points. There is a total of bends in a hexagon, where is the length of the hexagon. Removing all bends from the hexagon leaves a total of x2 lattice points (see Figure 1).

Now we define a new notion of a distorted hexagon as follows. Suppose we have a hexagon having length and depth x; so each bend contains 2x + 1 points. Now we can increase its length to get a new hexagon of length  + 1 by adding a new bend. Similarly by adding succesive bends we can continue to increase the length of a hexagon. If within such a process the last added bend has fewer than 2x + 1 points, then we refer to the hexagon as a distorted hexagon. An example of a distorted hexagon is shown in Figure 2.
Figure 2

Distorted hexagon. In this figure, a distorted hexagon is illustrated.

We use the usual notion of a run in an HP string. In particular, a run in an HP string denotes the consecutive H’s or P’s. For example, in the HP string HHHPPHHPHHHH, we have a run of 3 H’s, followed by a run of 2 P’s and so on. Here the run-length of the first run of H (P) is 3 (2). We sometimes will use the term H-run (P-run) to indicate a run of H’s (P’s). The longest H-run (P-run) of a string denotes the run of H (P) which has the highest run-length among all the H-runs (P-runs) of the string. For the sake of conciseness, the HP strings shall often be represented as H’s and P’s with the corresponding run-lengths as their powers. So, the HP string S = HHHPPHHPHHHH will often be conveniently represented by Ŝ = H 3 P 2 H 2 P 1 H 4 . Further, we use S(i), 1 ≤vi ≤ |S| to denote the i th character of the HP string . Similarly, Ŝ ( j ) denotes the j th run of . For example, Ŝ ( 1 ) refers to H3, Ŝ ( 2 ) refers to P2 and so on. We will use SumH as the sum of the run-lengths of all the H-runs of a given string . We end this section with a formal definition of the problem we handle in this paper.

Problem 1.

Problem 1. Given an HP string , the problem is to place the HP string on a triangular lattice such that the total number of bonds are maximized.

Our approach

Our approach is a simple and intuitive one. Our idea is to identify the length and depth of a suitable hexagon and then try to put all the H’s of a particular H-run inside the hexagon and put the P’s of the following P-run (if any) outside that hexagon. The length and depth of the hexagon depend on SumH. The motivation here is to get the maximum number of bonds between H’s. Note carefully that if we can fully fill a hexagon with z lattice points with a single H-run and get a total of k edges, the number of total bonds will be k - z + 1. And if the number of H-run is n(H) then in this case the total number of bonds will be k - z + n(H). We illustrate the approach with an example below. In the figures throughout this paper the bonds and edges are not shown explicitly. A connection between 2 lattice points indicate the presence of 2 H’s that are adjacent in the input HP string.

Example 1.

Example 1. Suppose we have an HP string as follows:
Ŝ = H 6 P 5 H 2 P 6 H 4 P 5 H 6 P 3 H 2 P 5 H 4 P H 7 P 6 H 2 P 2 H 4 .
Figure 3 gives us a suitable regular hexagon for on the underlying lattice. Our approach starts with the longest H-run of . In Figure 4 the longest H-run, i.e, Ŝ ( 13 ) = H 7 , is first positioned within the hexagon. Then, in Figure 5, the subsequent P-run is positioned outside the hexagon. Similarly the approach continues through Figures 6, 7 and 8 where we illustrate the positioning of H-runs and P-runs up to Ŝ ( 17 ) . Then we wrap around and start with Ŝ ( 1 ) in Figure 9. The final position of all the runs of is shown in Figure 10.
Figure 3

Lattice points, i.e., Hexagon.

Figure 4

Positioning S ̂ ( 13 ) = H 7 .

Figure 5

Positioning S ̂ ( 14 ) = P 6 .

Figure 6

Positioning S ̂ ( 15 ) = H 2 .

Figure 7

Positioning S ̂ ( 16 ) = P 2 .

Figure 8

Positioning S ̂ ( 13 ) = H 4 .

Figure 9

Positioning S ̂ ( 13 ) = H 6 .

Figure 10

Final state.

In Figure 10, we have a hexagon of 37 points and after filling the hexagon fully we get a total of 90 edges. It is easy to verify that the total number of bonds are 62 which is equal to k - z + n(H). Notably, if two hexagons have the same number of lattice points and are filled up fully with H by a given HP string, the hexagon with higher number of total edges have the higher number of total bonds as the difference between the total number of the edges and that of bonds is a constant.

Now that we have discussed our main approach to fill up the hexagon, we can shift our focus to the question of whether we can accommodate all the H-runs of the input HP string within the current hexagon. Recall that our goal is to increase the number of edges (and hence the total number of bonds)as much as possible. We have the following useful lemmas.

Lemma 1.

If two hexagons have the same number of lattice points then the hexagon with the higher depth will not have lesser number of edges.

Proof

We show this by considering a hexagon and adjusting its depth keeping the total number of points fixed. We illustrate this scenario in Figure 11. Note that, the adjustment discussed and shown here does not give us a hexagon and is only to facilitate better exposition of the calculation and arguments.
Figure 11

Figure for Lemma 1. This figure aids in understanding the proof of Lemma 1.

Suppose that the total number of points is z and the depth of the hexagon x and length . If we remove the top and bottom rows and put the corresponding points along the other rows of the hexagon, new bends to the either side of the hexagon will be created and each bend will have 2 × (x - 1) + 1 points. The new hexagon will have depth x - 1 and additional 2 2 x - 1 1 bends (see Figure 11).

Now deleting two rows decrease 2 × (2 +  - 1) or 6 - 2 edges. On the other hand, increasing the 2 2 x - 1 bends will increase 2 2 x - 1 × ( ( 2 x - 2 ) + ( 2 x - 1 ) + ( 2 x - 2 ) ) or 2 2 x - 1 × ( 6 x - 5 ) or 2 2 x - 1 × ( 3 ( 2 x - 1 ) - 2 ) or 6 - 4 2 x - 1 edges. This is clearly less than or equal to 6 - 2 as 4 2 x - 1 - 2 0 or 2 2 x - 1 1 . Hence the result follows. □

Lemma 2.

Lemma 2. Suppose we have a regular hexagon H 1 containing N points. Now we reduce the length of this hexagon to get another hexagon H 2 such that H 2 contains N points as well. Then H 2 will have lesser number of edges inside it than that of H 1 .

Proof.

Proof. Suppose that the depth of the regular hexagon H 1 is x. So the length is x + 1. To reduce the length of H 1 , while keeping the total number of points intact, we have to remove a bend of H 1 and distribute the points on that bend over the adjacent sides (See Figure 12). Hence H 2 will be a non-regular hexagon. Note that a bend in a regular hexagon contains 2x + 1 lattice points. After reducing the length, new length will become x-1 as follows. By deleting one bend we reduce the length of adjacent sides by one. So they will now have x lattices points. After placing the removed lattice points over these sides the new side will definitely have x - 1 lattice points.
Figure 12

Figure for Lemma 2. This figure aids in understanding the proof of Lemma 2.

Now deleting a bend decreases 2 × 2x + 1 + 2x or 6x + 1 edges. Out of the 2x + 1 points of the bend, 2x - 2 points will create new sides. These points create additional 2 × (2 × (x - 1) + x - 2) or 6x - 8 edges. And the rest of the 3 points can contribute to at most 8 edges. So we get at most a total of 6x edges whereas we loose 6x + 1 edges. Hence the result follows. □

Lemma 3

Assuming that we can fill up all the points of the hexagon, the total number of edges (as well as the total number of bonds) will be maximum if, and only if, the hexagon is a regular hexagon.

Proof.

Lemma 3 follows readily from Lemmas 1 and 2. □

As has been mentioned before, our algorithm proceeds in an iterative fashion in order to achieve the highest possible number of edges by iteratively changing the length and depth of the hexagon. We start with an appropriate regular hexagon. Note carefully that, by Lemma 3, if we can fill the points of a regular hexagon, we get the optimum number of edges. If we fail to fill up all the points of a regular hexagon we put the rest of the H-runs outside the hexagon in a single row (see Figure 13) and finally compute the total number of bonds. We reduce the depth of the hexagon and increase its length with the hope that the number of bonds will increase in the new hexagon. We continue the iteration (i.e., reducing the depth of the hexagon and filling it up) until we reach a case when the total number of bonds decreases than that of the previous iteration. In that case, we terminate our algorithm and return the result of the previous iteration.
Figure 13

S ̂ of Example 2 cannot properly fill the hexagon. This figure illustrates that S ̂ of Example 2 cannot properly fill the hexagon.

Notably, to fill up a regular hexagon with depth x at least one H-run having length 2x + 1 is needed. Besides, we need at least two H-runs of length 2x, three of length 2x - 2, three of length 2x - 4… three H-runs of length 1 or 2 (depending on the size of x) in the input. The string in Example 1 presented before meets this criteria assuming x = 3. To explain a bit more, note that, in the HP string of Example 1 we have one H-run with run-length 7, two H-runs with run-length 6, three H-runs with run length 4 and the rest of the H-runs have run-length 2. Another example is given below where we cannot put all H’s in a regular hexagon.

Example 2.

Consider an HP string
Ŝ = H 6 P 3 H 4 P 4 H 3 P 6 H 4 P 3 H 2 P 4 H 2 P 3 H 6 P 2 H 5 P 2 H 2 P H 3 .
For this string, the length of a suitable regular hexagon is 4 (i.e., depth is 3) as SumH is 37. But in Figure 13 we can see that we cannot properly fill the hexagon. So we put the rest of the H-runs outside the hexagon in a single row. In such a case we have to increase the length of the hexagon to 6 as well as decreasing its depth to 2. The new hexagon is shown in Figure 14. It is evident from Figure 14, that we can now fill the hexagon properly with and thus increase the total number of bonds.
Figure 14

S ̂ of Example 2 can fill the adjusted hexagon. This figure illustrates that S ̂ of Example 2 fill the adjusted hexagon.

Steps of the algorithm

Now we describe the steps of our algorithm below.

  1. Step 1

    Let SumH of the input HP string is z and the longest H-run is Ŝ ( i ) having run-length k. Compute x = 1 + 1 + 4 × ( z - 1 ) 3 2 . Set globalB = 0.

     
  2. Step 2

    Set = z - x 2 2 x + 1 and construct a hexagon with length and depth x where each bend will contain 2x + 1 points. If z ≥ (2x + 1) + x2 then add new p bends such that ( + p)(2x + 1) + x2 ≥ z ≥ ( + p + 1)(2x + 1) + x2. If z > ( + p)(2x + 1) + x2 then create a distorted hexagon having the last bend (i.e., on the boundary of the hexagon) containing z - ( + p)(2x + 1) + x2 points.

     
  3. Step 3
    For each of the H-runs and P-runs, starting from Ŝ ( i ) and wrapping around the end (if applicable) execute the following steps. For an H-run, execute Step 3.a, Step 3.b and Step 3.c; for a P-run execute Step 3.d.
    1. Step a
      [for H-runs] If the run length of the H-run is less than 3 then we take lattice points on the boundary of the hexagon. Otherwise, we try to find a region from the remaining unoccupied points as follows. Here the total number of points in the region must be equal to the run-length of the current H-run and at least two of these points must be boundary points of the hexagon. We find the region executing the following steps (Steps 3.a.i to 3.a.iv). We ensure that the region property is maintained as we proceed by including the points one after another. In the following steps we will use to refer the the region we are constructing iteratively.
      1. Step i

        Take two points on the boundary of the hexagon. These are the first two points of the region .

         
      2. Step ii

        Identify the unoccupied points in the hexagon such that each of those has two neighbouring lattice points in . Find the point having the highest depth among these points (breaking ties arbitrarily) and add this point to . Thus we increase the size (by one).

         
      3. Step iii

        If no such point is found then go to Step 3.b.

         
      4. Step iv

        If the size of is still less than the run length of the current H-run, go to Step 3.a.ii.

         
       
    2. Step b

      [for H-runs]Fill up the lattice points of the identified region ( ) with the H-run.

       
    3. Step c

      [for H-runs]Put the rest of the H-runs (if any) outside the hexagon in a single row.

       
    4. Step d

      [for P-runs]Put the P-run outside the hexagon in two rows. The first P of the run will be a neighbour of the previous H-run’s last H, while the last P of the run will be a neighbour of the next H-run’s first H.

       
     
  4. Step 4

    Count the total number of bonds, B.

     
  5. Step 5

    If globalB > B, return globalB.

     
  6. Step 6

    Otherwise set globalB = B and x = x - 1.

     
  7. Step 7

    If x = 1, return B; otherwise go to Step 2.

     

A brief discussion on Step 3.a.ii is in order. In Step 3.a.ii, our algorithm always chooses the points having the higher depths, with ties broken arbitralily. In some cases, some lattice points may remain unoccupied. Example 2 elaborates such cases. If some lattice points remain unoccupied we continue the algorithm. Some H-runs or part thereof may lie outside the hexagon (Step 3.c of the algorithm). We count the total number of bonds and compare it with previous/latter hexagon (increased or decreased depth) as applicable. So, if the algorithm fails to insert all the H-runs within a hexagon it does not mean it fails in total, as we put the rest of the H-runs or parts thereof (as appropriate) outside the hexagon. The algorithm is formally presented in Algorithm 1.

Algorithm 1 Finding the Folding

In Figure 15 a folding produced by Algorithm 1 for the HP string Ŝ 1 = H 3 P 3 H 3 P 4 H 3 P 6 H 3 P 3 H 20 P 4 H 3 P 3 H 4 P 2 H 2 P 2 H 5 P H 3 P H 3 P 2 H 6 is shown for a hexagon with depth 3. The folding is not optimal for a hexagon with depth 3 as is evident from Figure 16, which shows the optimal folding. Now Algorithm 1 will count the total number of bonds and continue to the next iteration by reducing the depth by 1. The new folding produced in the next iteration is shown in Figure 17. Since the number of bonds in this folding is less than that of Figure 15, Algorithm 1 will choose the folding of Figure 15. So, as expected, Algorithm 1 may not produce the optimal folding for a hexagon with a given depth but it compares the folding produced by different hexagon having different depths and choose the best folding among there. As will be proved later, the folding produced by Algorithm 1 is expected to be quite near to optimal for long HP string. Now we present and prove the following theorem which basically proves the correctness of our approach.
Figure 15

Folding produced by Algorithm 1 with a hexagon havig depth 3 for S ̂ 1 . This figure illustrates the folding produced by by Algorithm 1 of S ̂ 1 of Section ‘Steps of the algorithm’ for hexagon having depth 3.

Figure 16

Optimal folding for hexagon with depth 3 for S ̂ 1 . This figure illustrates the optimal filling of S ̂ 1 of Section ‘Steps of the algorithm’ for hexagon having depth 3.

Figure 17

Folding produced by Algorithm 1 with a hexgaon having depth 2 for S ̂ 1 . This figure illustrates the folding produced by by Algorithm 1 of S ̂ of Section ‘Steps of the algorithm’ for hexagon having depth 2.

Theorem 1.

Given a region consisting of lattice points, a starting and an ending points such that those are boundary points of the hexagon, there always exists a path that starts at the starting point, ends at the ending point visiting each point in the region exactly once.

Proof.

We can traverse the points row wise from left to right within the region starting from, say, Row i and then right to left in Row i + 1 and so on. If the number of rows are even, then, in this manner we can traverse all the points (see Figure 18). If it is odd then we traverse in a similar way except for the last two rows, where we simultaneously traverse those in a zigzag fashion (see Figure 19). So filling up a region appropriately can be done in linear time with respect to the run-length of the corresponding H-run. □
Figure 18

For even number of rows. This figure illustrates how to traverse all the points in a region with even number of rows.

Figure 19

For odd number of rows. This figure illustrates how to traverse all the points in a region with odd number of rows.

Our algorithm runs in polynomial time as discussed below. Firstly, the algorithm iterates over at most x times. Now we have x z , because, z = 3 × x × (x + 1) which is proved in Lemma 4 in the following section. In each iteration we have to find a region for the current H-run. If a H-run has run-length , then Step 3.a in the algorithm needs O(2) time as Step 3.a.ii needs at most O() time. So total time needed to perform this operation in each iteration, is at most O(z2). As each of the other steps need constant time, the complete runtime of the algorithm is O ( z 2 × z ) .

Expected approximation ratio

In this section, we are going to deduce the expected approximation ratio of our algorithm. As the total number of H-runs and run-lengths thereof may vary, in this analysis, we will find the expected number of H-runs and the expected run-length of each of those. These two values will depend only on SumH. Consider a regular hexagon with depth x. Assume that the total number of points in the hexagon is z. Then we have the following lemma.

Lemma 4.

Suppose we have a regular hexagon with depth x and z points. The total number of bonds, B, is 6x2 when all the points are filled with H’s of a single H-run.

Proof.

A regular hexagon with depth 1 have 1+6=7 points. We can increase its depth from x - 1 to x by adding 6x new points each having depth 0. Since we have z points in the hexagon, so we must have:
z = 1 + 6 × 1 + 6 × 2 + 6 × 3 + + 6 × x z = 1 + 6 × ( 1 + 2 + 3 + + x ) z = 1 + 6 × ( x + 1 ) x / 2 z = 1 + 3 × x ( x + 1 )
Now we will count the total number of possible edges, E. Note that each of the points except those in the perimeter can contribute to six edges. Among the points on the perimeter, the six corner points can only contribute to three edges whereas the others can contribute to four edges. Since each edge is formed by two points, to prevent double counting, we have to divide the total count by two. So we have the following:
2 × E = 6 × ( 3 × x × ( x + 1 ) + 1 ) - 6 × 3 - 6 × 2 × ( x - 1 ) 2 × E = 2 × 3 × ( 3 × x × ( x + 1 ) + 1 ) - 9 + 6 - 6 x E = 3 × ( 3 × x × ( x + 1 ) + 1 ) - 3 - 6 x .
Now we focus on calculating the total number of bonds B. Recall that according to our approach, only H’s are placed inside the hexagon. Since an H can have at most 2 H’s adjacent to it in an HP string, once placed inside the hexagon an H can only have at most 2 edges that would not be counted as bonds. So to compute B we simply need to deduct the total number of points from E. So we have:
B = E - z + 1 B = 3 × ( 3 × x ( x + 1 ) + 1 ) - 3 - 6 x - ( 3 × x ( x + 1 ) + 1 ) + 1 B = 2 × ( 3 × x ( x + 1 ) + 1 ) - 3 - 6 x + 1 B = 6 × x ( x + 1 ) + 2 - 3 - 6 x + 1 B = 6 x 2

This completes the proof. □

Now the following lemma considers non-regular hexagons as well.

Lemma 5.

Consider a hexagon (either regular or non regular) having n2 points. Then, the total number of bonds B is less than or equal 2 × n(n - 1).

Proof.

According to lemma 4 for a regular hexagon with 1 + 3 × x × (x + 1) points, the total number of bonds is 6x2. Or replacing n = x + 1 we get, for a regular hexagon with 1 + 3 × n × (n - 1) points, the total number of bonds is 6 × (n - 1)2. So, clearly we have:
B n 2 × ( 6 × ( n - 1 ) 2 ) / ( 3 × n ( n - 1 ) + 1 ) B n 2 × ( 6 × ( n - 1 ) 2 ) / ( 3 × n ( n - 1 ) ) B 2 × n × ( n - 1 )

Hence the result follows for regular hexagons. Clearly, by Lemma 3 the result applies for non-regular hexagons as well. □

We will now deduce the approximation ratio based on an expected value of the total number of bonds. We assume that all H-runs have equal length. This assumption is valid in the context of our analysis and does not lose generality as follows. In what follows, we will be working with the expected number of H-runs and the expected length (say k Ex ) of an H-run. Hence in our analysis, each H-run will be assumed to have length k Ex . We will now compute the expected values of the total number of edges (bonds), E Ex (B Ex ) under this assumption.

From Figure 1, we can see that, the length of the hexagon is and depth is x. So each bend contains 2x + 1 points and there are a total of bends. There are x2 remaining lattice points outside the bends. So if the total number of points are z (see Figure 1) then,
z = ( 2 x + 1 ) × + x 2
(1)
So for a given z and x we can get,
= ( z - x 2 ) 2 x + 1
(2)
To calculate the total number of edges, at first we have to identify how many edges can be formed by individual points. The arguments of Lemma 4 for calculating E and B also apply here. Note that, on the perimeter, aside from the corner points, total number of points are 2 × ( - 2) + 4 × (x - 1). So E Ex can be computed as follows:
2 E Ex = 6 × ( ( 2 x + 1 ) × + x 2 ) - 6 × 3 - 2 × ( 2 × ( - 2 ) + 4 × ( x - 1 ) ) 2 E Ex = 2 × 3 × ( ( 2 x + 1 ) × + x 2 ) - 9 - ( 2 - 4 + 4 x - 4 ) E Ex = 3 × ( ( 2 x + 1 ) × + x 2 ) - 1 - 2 × ( + 2 x )
And B Ex can be computed as follows:
B Ex = E Ex - z B Ex = 3 × ( ( 2 x + 1 ) × + x 2 ) - 1 - 2 × ( + 2 x ) - ( ( 2 x + 1 ) × + x 2 ) B Ex = 2 × ( ( 2 x + 1 ) × + x 2 ) - 2 × ( + 2 x ) - 1
Hence, we get the following equation.
B Ex = 2 z - 2 × ( + 2 x ) - 1
(3)

Note that according to our approach, the value of x is dependent on SumH. For this analysis, we now derive the expected run-length of H for a given HP string where SumH is n2. This problem can be mapped into the problem of Integer Partitioning as defined below.

Problem 2.

Problem 2. Given an integer Y, the problem of Integer Partitioning aims to provide all possible ways of writing Y, as a sum of positive integers.

Note that the ways that differ only in the order of their summands are considered to be the same partition. A summand in a partition is called a part. Now, if we consider SumH as the input of Problem 2 (i.e., Y) then each run-length can be viewed as parts of the partition. So at first, we have to find the expected number of partitions, i.e., the expected number of runs of H. Kessler and Livingston [16] showed that to get an integer partition of an integer Y, expected number of required parts is:
3 Y 2 π × ( log Y + 2 γ - 2 log π 6 ) ,

where γ is the famous Euler’s constant.

For our problem Y = SumH = n2. If we denote E[ P] as the expected number of H-runs then,
E [ P ] = 6 π × n × ( log n + γ - log π 6 ) .
Now, as ( log n + γ - log π 6 ) ( 2 π 3 × log n ) for n ≥ 5, we can say that
E [ P ] 2 n × log n .
Since SumH is n2, expected value of each part, i.e., expected length of each run is greater than or equal to n 2 2 n × log n = n 2 log n . Since all the H-runs are assumed to have the same length so each of them will construct a bend of 2x + 1 points in the lattice. So we must have 2 x + 1 n 2 log n . Hence we get the following equations:
x n 4 log n - 1 2
(4)
n 2 - ( n 2 16 ( log n ) 2 - n 4 log n + 1 4 ) n 2 log n
(5)
Now, let us consider a hexagon H 1 with length max = n 2 - ( n 2 16 ( log n ) 2 - n 4 log n + 1 4 ) n 2 log n and depth x min = n 4 log n - 1 2 . Now, in H 1 we also must have n2 points. So, from Lemma 1 and Equations 4 and 5, clearly the number of bonds in H 1 is less than or equal to than that in the hexagon having length and depth x. So from Equation 3 we have the following:
B Ex 2 n 2 - 2 × ( 2 n log n - n 8 log n - log n 2 n + 1 2 + n 2 log n - 1 ) - 1 B Ex 2 n 2 - 2 × ( 2 n log n + 3 n 8 log n - log n 2 n )
Now from Lemma 5, recall the upper bound for the total number of bonds, which is as follows: B ≤ 2 × n(n - 1). Hence we get the following expected approximation ratio:
B Ex B 2 n 2 - 2 × ( 2 n log n + 3 n 8 log n - log n 2 n ) 2 n × ( n - 1 ) B Ex B n - 2 log n - 3 8 log n + log n 2 n 2 n - 1
As the term log n 2 n 2 is very small we can ignore it from the final result. Hence we have:
B Ex B n - 2 log n - 3 8 log n n - 1
As, 3 8 log n 1 for n ≥ 2, so, B Ex B n - 2 log n - 1 n - 1 or
B Ex B 1 - 2 log n n - 1 for n 6 .

This is the final expected approximation ratio.

Note that the ratio increases significantly with the increase of the value of n as presented in Table 1. So we can see that for large values of n, expected approximation ratio tends to 1. So for large n it is expected that our algorithm will outperform the approximation algorithm presented in [2]. Recall that the approximation ratio of the algorithm of [2] is 6 11 , i.e., around 0.55.
Table 1

Expected approximation ratio for different values of n

log n

n

z = n 2

Ratio

3

8

64

0.142

4

16

256

0.466

5

32

1024

0.677

6

64

4096

0.809

This table illustrates the expected approximation ratio for different values of n.

Conclusion

In this paper, we have given a novel approximation algorithm to solve the protein folding problem in HP model introduced by Dill [1]. Our algorithm is polynomial and the expected approximation ratio is 1 - 2 log n n - 1 for n ≥ 6 where n2 is total number of H in a given HP string. For larger HP strings it is expected that our algorithm will give better result than the algorithm provided in [2], which currently gives the best approximation ratio for 2D-triangular lattice. Additionally, our expected approximation ratio tends to reach one for large values of n. Hence our algorithm is expected to perform very well for larger HP strings.

Declarations

Acknowledgements

The authors would like to thank Md. Mahbubul Hasan for fruitful discussions. The authors gratefully acknowledge the fruitful comments and suggestions of the anonymous reviewers which aided in improving the presentation of the paper. This research work was conducted as part of the M.Sc. Engg. thesis work of Islam under the supervision of Rahman.

Authors’ Affiliations

(1)
AℓEDA Group, Department of CSE, BUET

References

  1. Dill KA: Theory for the folding and stability of globular-proteins. Biochemistry. 1985, 24 (6): 1501-1509.View ArticlePubMedGoogle Scholar
  2. Agarwala R, Batzoglou S, Dancík V, Decatur SE, Hannenhalli S, Farach M, Muthukrishnan S, Skiena S: Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. J Comput Biol. 1997, 4 (3): 275-296.View ArticlePubMedGoogle Scholar
  3. Hart WE, Istrail S: Fast protein folding in the hydrophobic-hydrophillic model within three-eights of optimal. J Comput Biol. 1996, 3: 53-96.View ArticlePubMedGoogle Scholar
  4. Newman A: A new algorithm for protein folding in the HP model. In SODA ACM/SIAM. 2002, 876-884.Google Scholar
  5. Newman A, Ruhl M: Combinatorial problems on strings with applications to protein folding. LATIN, Volume 2976 of Lecture Notes in Computer Science. 2004, 369-378. SpringerGoogle Scholar
  6. Hart WE, Istrail S: Invariant patterns in crystal lattices: implications for protein folding algorithms (Extended abstract). CPM, Volume 1075 of Lecture Notes in Computer Science. Springer,1996, 288-303.Google Scholar
  7. Hart WE, Istrail S: Lattice and off-lattice side chain models of protein folding: linear time structure prediction better than 86% of optimal. J Comput Biol. 1997, 4 (3): 241-259.View ArticlePubMedGoogle Scholar
  8. Heun V: Approximate Protein Folding in the HP Side Chain Model on Extended Cubic Lattices. Lecture Notes in Computer Science 1643: Proceedings of the 7th Annual European Symposium on Algorithms. 1998, 212-223. Springer-VerlagGoogle Scholar
  9. Jiang M, Zhu B: Protein folding on the hexagonal lattice in the Hp model. J. Bioinformatics Computat Biol. 2005, 3: 19-34. 10.1142/S0219720005000850.View ArticleGoogle Scholar
  10. Istrail S, Lam F: Combinatorial algorithms for protein folding in lattice models: a survey of mathematical results. Commun Inf Syst. 2009, 9 (4): 303-346.Google Scholar
  11. Unger R, Moult J: Genetic algorithms for protein folding simulations. J Mol Biol. 1993, 231: 75-81.View ArticlePubMedGoogle Scholar
  12. Hoque T, Chetty M, Dooley LS: A hybrid genetic algorithm for 2D FCC hydrophobic-hydrophilic lattice model to predict protein folding. Australian Conference on Artificial Intelligence, Volume 4304 of Lecture Notes in Computer Science. 2006, 867-876. SpringerGoogle Scholar
  13. Hoque T, Chetty M, Sattar A: Protein folding prediction in 3D FCC HP lattice model using genetic algorithm. IEEE Congress on Evolutionary Computation. 2007, 4138-4145. IEEEGoogle Scholar
  14. Lesh N, Mitzenmacher M, Whitesides S: A Complete and Effective Move Set for Simplified Protein Folding. 7th Annual International Conference on Research in Computational Molecular Biology (RECOMB) 2003. 2003, 188-195. ACM PressGoogle Scholar
  15. Böckenhauer HJ, Ullah AZMD, Kapsokalivas L, Steinhöfel K: A local move set for protein folding in triangular lattice models. WABI, Volume 5251 of Lecture Notes in Computer Science. Springer,2008, 369-381.Google Scholar
  16. Kessler I, Livingston M: The expected number of parts in a partition of n. Monatshefte für Mathematik. 1976, 81 (3): 203-212. 10.1007/BF01303193.View ArticleGoogle Scholar

Copyright

© Islam and Rahman; licensee BioMed Central Ltd. 2013

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.