Given the target protein backbone structure, we would like to find the optimal idealized backbone structure. For an idealized protein backbone structure, the coordinates of the *O*, *H* and *C*_{β} backbone atoms can be calculated from the coordinates of the *N*, *C*_{α} and *C* backbone atoms. Thus, this section specifically describes how to generate the coordinates of the *N*, *C*_{α} and *C* atoms. For simplicity, "structure" always refers to a protein backbone structure unless otherwise specified.

### Idealized backbone structure generation

Given the target structure, we would like to generate idealized structures fulfilling two generation goals. First, the idealized structures should be similar to the target structure. Second, each pair of idealized structures should be separated by at least some minimum distance, to avoid redundant computation. Furthermore, we are interested in generating as many of these idealized structures as possible.

Before describing how we fulfill the generation goals, we describe a simple distance metric to measure the distance between two sets of coordinates representing the target protein. Let *P*_{i} be a set of coordinates representing the target protein, and ${P}_{i}^{j}\in {P}_{i}$ be the coordinate of the *j*-th atom of the target protein. Thus, ${P}_{i}=\{{P}_{i}^{1},{P}_{i}^{2},\ldots ,{P}_{i}^{3n}\}$, where *n* is the number of amino acids of the target protein. For simplicity, let *P*_{0} always represent the target structure, and *P*_{i} represent a generated idealized structure for *i*>0. Let $D({P}_{i}^{k},{P}_{j}^{k})$ be the Euclidean distance between ${P}_{i}^{k}$ and ${P}_{j}^{k}$. We define the distance between *P*_{i} and *P*_{j} as the bottleneck distance:

$D({P}_{i},{P}_{j})=\underset{k}{max}D({P}_{i}^{k},{P}_{j}^{k}).$

(1)
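As a minimal sketch of Equation 1, assuming each structure is stored as a list of (x, y, z) tuples in the same atom order, the bottleneck distance is the largest per-atom Euclidean distance. The function name `bottleneck_distance` and the example coordinates are illustrative, not taken from the original implementation.

```python
import math

def bottleneck_distance(p, q):
    """Largest Euclidean distance between corresponding atoms of p and q."""
    if len(p) != len(q):
        raise ValueError("structures must contain the same number of atoms")
    return max(math.dist(a, b) for a, b in zip(p, q))

# Illustrative three-atom structures (coordinates in angstroms).
target = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
candidate = [(0.0, 0.0, 0.3), (1.5, 0.4, 0.0), (2.0, 1.4, 0.0)]
print(bottleneck_distance(target, candidate))
```

Unlike RMSD, this metric is controlled by the single worst-placed atom, which is what allows a per-atom bound such as *r* to be enforced.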

Using this distance metric, we fulfill both generation goals by satisfying the following generation constraints:

$\left\{\begin{array}{ll}D({P}_{0},{P}_{i})\le r& \forall i>0\\ D({P}_{i},{P}_{j})\ge \epsilon & \forall i,j>0,\ i\ne j\end{array}\right..$

(2)

The first generation constraint assumes that the coordinates of the target structure are reasonably accurate, with an error no worse than *r*. If this constraint is satisfied, the distance between the target coordinate and any generated coordinate representing the same atom is bounded above by *r*. Thus, it is reasonable to consider any generated idealized structure *P*_{i} similar to the target structure *P*_{0}. If the second generation constraint is satisfied, then for each pair of generated idealized structures there exists a pair of coordinates, one from each structure and representing the same atom, that are at least *ε* apart. Therefore, both generation goals are achieved.
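The two generation constraints can be checked directly for a small set of candidates. The sketch below assumes structures are lists of (x, y, z) tuples and applies the second constraint only to distinct pairs; `satisfies_constraints` is a hypothetical helper name.

```python
import math
from itertools import combinations

def bottleneck(p, q):
    """Bottleneck distance: largest per-atom Euclidean distance."""
    return max(math.dist(a, b) for a, b in zip(p, q))

def satisfies_constraints(target, generated, r, eps):
    """True if every candidate is within r of the target and
    every pair of distinct candidates is at least eps apart."""
    close = all(bottleneck(target, p) <= r for p in generated)
    distinct = all(bottleneck(p, q) >= eps
                   for p, q in combinations(generated, 2))
    return close and distinct

target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
g1 = [(0.1, 0.0, 0.0), (1.0, 0.0, 0.0)]
g2 = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0)]
print(satisfies_constraints(target, [g1, g2], r=0.5, eps=0.15))
```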

These generation constraints suggest limiting the search space to a sphere of radius *r* and discretizing it into a grid with spacing *ε*. When *ε*=0.001Å, the accuracy of X-ray crystallography [22] and of the PDB (Protein Data Bank) format [23] is reached. Thus, this method is capable of generating all possible idealized structures at the accuracy of X-ray crystallography and the PDB format.
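To make the discretized search space concrete, the following sketch enumerates grid points with spacing `eps` that fall inside a sphere of radius `r` around one atom. It assumes a simple cubic grid centered on the target coordinate, which the text does not specify; the function name is illustrative.

```python
import math

def grid_points_in_sphere(center, r, eps):
    """Cubic-grid points (spacing eps) within distance r of center."""
    steps = int(math.floor(r / eps))
    limit = (r / eps) ** 2 + 1e-9  # squared radius in grid units
    points = []
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            for k in range(-steps, steps + 1):
                if i * i + j * j + k * k <= limit:
                    points.append((center[0] + i * eps,
                                   center[1] + j * eps,
                                   center[2] + k * eps))
    return points

coarse = grid_points_in_sphere((0.0, 0.0, 0.0), r=1.0, eps=1.0)
fine = grid_points_in_sphere((0.0, 0.0, 0.0), r=1.0, eps=0.5)
print(len(coarse), len(fine))
```

Halving `eps` multiplies the number of candidate positions by roughly eight, which is the cubic growth behind the *O*(1/*ε*^{3}) term for the first atom.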

Given the limited and discrete search space of each atom, one can generate idealized structure coordinates from the first atom to the last atom. For the first atom, an idealized coordinate lies within a sphere. Thus, the number of generated coordinates is bounded by *O*(1/*ε*^{3}). For each generated coordinate ${P}_{i}^{1}$ of the first atom, an idealized coordinate of the second atom lies on the surface of a sphere at a constant distance from ${P}_{i}^{1}$. Thus, the number of generated coordinates is bounded by *O*(1/*ε*^{2}). For each generated coordinate pair $({P}_{i}^{1},{P}_{i}^{2})$ of the first two atoms, an idealized coordinate of the third atom lies on a circle at constant distances from ${P}_{i}^{1}$ and ${P}_{i}^{2}$. Thus, the number of generated coordinates is bounded by *O*(1/*ε*). Similarly, the number of generated coordinates for any of the following atoms is also bounded by *O*(1/*ε*). Moreover, since we round *Ω* dihedral angles to either 0° or 180°, the coordinate of any *C*_{α} atom is unique and can be calculated from the coordinates of the previous three atoms.

Therefore, the total number of coordinates generated for all atoms is bounded by *O*(1/*ε*^{2n+4}) by induction. Here, it is acceptable to assume that *r* is a constant because it is only related to the first atom. This bound does not account for the requirement that each subsequent atom must also lie inside its own sphere of radius *r*, and thus the actual number of generated coordinates should be much smaller in practice.
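The exponent 2*n*+4 can be reproduced by walking along the backbone and adding the per-atom exponents stated above. This is a small bookkeeping sketch, assuming the atom order N, *C*_{α}, C for every residue; `bound_exponent` is a hypothetical name.

```python
def bound_exponent(n_residues):
    """Exponent e such that the number of structures is O(1/eps**e)."""
    atoms = ["N", "CA", "C"] * n_residues
    exponent = 0
    for t, name in enumerate(atoms):
        if t == 0:
            exponent += 3      # first atom: anywhere in a sphere
        elif t == 1:
            exponent += 2      # second atom: on a sphere surface
        elif t == 2:
            exponent += 1      # third atom: on a circle
        elif name == "CA":
            exponent += 0      # fixed by the rounded omega dihedral
        else:
            exponent += 1      # one free dihedral angle
    return exponent

for n in (1, 2, 10):
    print(n, bound_exponent(n))
```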

### Idealized backbone structure scoring function

Given the generated idealized structures {*P*_{i}}, we need a scoring function *S*_{BB}(*P*_{i}) to find the optimal idealized structure. The scoring function should evaluate not only the similarity between the generated idealized structure *P*_{i} and the target structure *P*_{0}, but also the free energy of *P*_{i}, to ensure that *P*_{i} is protein-like. Thus, we define our scoring function as follows:

$\begin{array}{ll}{S}_{\mathit{\text{BB}}}\left({P}_{i}\right)=& {S}_{f}\left({P}_{i}\right)-{w}_{1}{D}_{\alpha}({P}_{i},{P}_{0})-{w}_{2}{D}_{\beta}({P}_{i},{P}_{0})\\ -{w}_{3}{D}_{H}({P}_{i},{P}_{0})-{w}_{4}{D}_{\Phi ,\Psi}({P}_{i},{P}_{0}),\end{array}$

(3)

where *w*_{1}, …, *w*_{4} are the weighting parameters, *S*_{f}(*P*_{i}) is the free energy score, *D*_{α}(*P*_{i},*P*_{0}) is the root mean square deviation (RMSD) of *C*_{α} atoms, *D*_{β}(*P*_{i},*P*_{0}) is the RMSD of *C*_{β} atoms, *D*_{H}(*P*_{i},*P*_{0}) is the RMSD of the hydrogen and oxygen atoms participating in hydrogen bonds, and *D*_{Φ,Ψ}(*P*_{i},*P*_{0}) is the RMSD of (*Φ*,*Ψ*) dihedral angles.
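Equation 3 is a weighted combination of one reward term and four penalty terms. The sketch below simply evaluates it for precomputed terms; the weights and input values are placeholders, since the text does not give the values used in practice.

```python
def backbone_score(s_f, d_alpha, d_beta, d_hbond, d_phipsi, weights):
    """S_BB = S_f - w1*D_alpha - w2*D_beta - w3*D_H - w4*D_phipsi."""
    w1, w2, w3, w4 = weights
    return s_f - w1 * d_alpha - w2 * d_beta - w3 * d_hbond - w4 * d_phipsi

# Placeholder inputs: a log-odds score, three RMSDs in angstroms,
# and a dihedral RMSD in degrees that is down-weighted.
score = backbone_score(2.0, 0.5, 0.4, 0.3, 10.0, weights=(1.0, 1.0, 1.0, 0.1))
print(score)
```

A higher score is better: the free energy term rewards protein-like dihedral angles, while each RMSD term penalizes drift away from the target.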

In our scoring function, the free energy score *S*_{f}(*P*_{i}) is a (*Φ*,*Ψ*) dihedral angle log-odds score. Specifically, we discretize the Ramachandran plot into a grid of 360 by 360 cells, and build one plot for each type of amino acid. Then, we calculate the log-odds score ${S}_{f}\left({P}_{i}^{1,t}\right)$ of the idealized structure ${P}_{i}^{1,t}$ of the first *t* atoms:

${S}_{f}\left({P}_{i}^{1,t}\right)=\sum _{5\le i\le t,{A}_{i}={C}_{\alpha}}log\frac{{P}_{A{A}_{i-3}}({\Phi}_{i-3},{\Psi}_{i-3})}{{P}_{\mathit{\text{null}}}({\Phi}_{i-3},{\Psi}_{i-3})},$

(4)

where one log-odds score is calculated at each *C*_{α} atom (by checking that atom type *A*_{i} is *C*_{α}) for the previous amino acid (represented by the previous *C*_{α} atom at *i*−3), ${P}_{A{A}_{i-3}}({\Phi}_{i-3},{\Psi}_{i-3})$ is the probability of the grid cell containing (*Φ*_{i−3},*Ψ*_{i−3}) on the Ramachandran plot of amino acid type *AA*_{i−3}, and *P*_{null}(*Φ*_{i−3},*Ψ*_{i−3}) is the probability under the null model with a uniform distribution, such that ${P}_{\mathit{\text{null}}}({\Phi}_{i-3},{\Psi}_{i-3})=\frac{1}{360}\cdot \frac{1}{360}$.
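A minimal sketch of the log-odds term follows, assuming each amino acid type has a sparse table mapping a 1° by 1° grid cell to its observed probability. The table contents, the `floor` value used for unobserved cells, and the function names are assumptions made for illustration.

```python
import math

GRID = 360
P_NULL = 1.0 / (GRID * GRID)  # uniform null model over all cells

def cell(phi, psi):
    """Map dihedral angles in degrees, in [-180, 180), to a grid cell."""
    return (int(math.floor(phi + 180.0)) % GRID,
            int(math.floor(psi + 180.0)) % GRID)

def log_odds_score(residues, tables, floor=1e-12):
    """Sum of log(P_aa(cell) / P_null) over (aa, phi, psi) triples.

    floor is an assumed stand-in probability for unobserved cells.
    """
    total = 0.0
    for aa, phi, psi in residues:
        p = tables[aa].get(cell(phi, psi), floor)
        total += math.log(p / P_NULL)
    return total

# Toy table: one alanine cell is four times more likely than uniform.
tables = {"ALA": {cell(-60.0, -45.0): 4.0 * P_NULL}}
residues = [("ALA", -60.0, -45.0), ("ALA", -60.0, -45.0)]
print(log_odds_score(residues, tables))
```

Cells that are more populated than the uniform null model contribute positive terms, and rarely observed cells contribute negative ones.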

Structure similarity is evaluated by the other distance metrics in our scoring function. We use *D*_{α}(*P*_{i},*P*_{0}) and *D*_{Φ,Ψ}(*P*_{i},*P*_{0}) as distance metrics to conserve the backbone structure, *D*_{β}(*P*_{i},*P*_{0}) as a distance metric to conserve side-chain structure compatibility, and *D*_{H}(*P*_{i},*P*_{0}) as a distance metric to conserve the hydrogen bonds. Thus, some global dependencies are addressed implicitly by the distance metrics *D*_{β}(*P*_{i},*P*_{0}) and *D*_{H}(*P*_{i},*P*_{0}).

### Dynamic programming algorithm

Theoretically, one can calculate scores for all generated idealized structures and find the optimal one with the maximum score. This method works well as long as similar structures always have similar scores. More formally, the method requires the assumption that *D*(*P*_{i},*P*_{j})≤*ε* ⇒ |*S*_{BB}(*P*_{i})−*S*_{BB}(*P*_{j})|≤*ε*_{s}, which is reasonable for small *ε*. Note that, since the total number of generated idealized structures is bounded by *O*(1/*ε*^{2n+4}), this method is computationally expensive. Thus, we introduce a dynamic programming algorithm with a filtering technique to find the optimal idealized structure efficiently.

The dynamic programming algorithm has two assumptions. One assumption is that given two generated idealized structures ${P}_{i}^{1,t-1}$ and ${P}_{j}^{1,t-1}$ of the first *t*−1 atoms, such that $D({P}_{i}^{t-k,t-1},{P}_{j}^{t-k,t-1})\le \epsilon $, for any generated coordinate ${P}_{i}^{t}$ of the *t*’th atom, there always exists a generated coordinate ${P}_{j}^{t}$, such that $D({P}_{i}^{t},{P}_{j}^{t})\le \epsilon $. The other assumption is that the scoring function satisfies the additive property, such that ${S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,t}\right)={S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,t-k}\right)\oplus {S}_{\mathit{\text{BB}}}\left({P}_{i}^{t-k+1,t}\right)$, under some addition operator ⊕.

We observed that counterexamples to the first assumption are rare when *k*≥5, though they do exist theoretically. The second assumption holds for our scoring function. The distance metrics ${D}_{\alpha}({P}_{i}^{1,t},{P}_{0}^{1,t})$, ${D}_{\beta}({P}_{i}^{1,t},{P}_{0}^{1,t})$, ${D}_{H}({P}_{i}^{1,t},{P}_{0}^{1,t})$ and ${D}_{\Phi ,\Psi}({P}_{i}^{1,t},{P}_{0}^{1,t})$ satisfy the additive property because the RMSD ${D}_{\mathit{\text{RMS}}}({P}_{i}^{1,t},{P}_{0}^{1,t})$ satisfies the additive property:

$\begin{array}{l}\phantom{\rule{1.5em}{0ex}}{D}_{\mathit{\text{RMS}}}({P}_{i}^{1,t},{P}_{0}^{1,t})\\ ={D}_{\mathit{\text{RMS}}}({P}_{i}^{1,t-k},{P}_{0}^{1,t-k})\oplus {D}_{\mathit{\text{RMS}}}({P}_{i}^{t-k+1,t},{P}_{0}^{t-k+1,t})\\ =\sqrt{\frac{{D}_{\mathit{\text{RMS}}}^{2}({P}_{i}^{1,t-k},{P}_{0}^{1,t-k})(t-k)+{D}_{\mathit{\text{RMS}}}^{2}({P}_{i}^{t-k+1,t},{P}_{0}^{t-k+1,t})k}{t}}.\end{array}$

(5)
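Equation 5 states that the RMSD over *t* atoms can be recovered from the RMSDs of a head of *t*−*k* atoms and a tail of *k* atoms. The sketch below checks this numerically on made-up coordinates, assuming no superposition is applied between the two structures.

```python
import math

def rmsd(p, q):
    """Root mean square deviation between corresponding atoms."""
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(p, q)) / len(p))

def combine_rmsd(d_head, n_head, d_tail, n_tail):
    """The additive operator of Equation 5, with t = n_head + n_tail."""
    return math.sqrt((d_head ** 2 * n_head + d_tail ** 2 * n_tail)
                     / (n_head + n_tail))

p = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
q = [(0, 0, 1), (1, 2, 0), (2, 0, 0), (3, 0, 2), (4, 1, 0)]
whole = rmsd(p, q)
merged = combine_rmsd(rmsd(p[:3], q[:3]), 3, rmsd(p[3:], q[3:]), 2)
print(whole, merged)
```

Because only the partial RMSD and the atom count need to be stored, a partial structure's score can be extended without revisiting earlier atoms.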

Moreover, the free energy score ${S}_{f}\left({P}_{i}^{1,t}\right)$ satisfies the additive property as follows:

$\begin{array}{ll}{S}_{f}\left({P}_{i}^{1,t}\right)& ={S}_{f}\left({P}_{i}^{1,t-k}\right)\oplus {S}_{f}\left({P}_{i}^{t-k+1,t}\right)\\ & ={S}_{f}\left({P}_{i}^{1,t-k}\right)+{S}_{f}\left({P}_{i}^{t-k+1,t}\right).\end{array}$

(6)

The second assumption is fundamental to our dynamic programming algorithm. By induction, the first assumption implies that if $D({P}_{i}^{t-k,t-1},{P}_{j}^{t-k,t-1})\le \epsilon $, then for any generated idealized structure ${P}_{i}^{t,n}$ there always exists a generated idealized structure ${P}_{j}^{t,n}$ such that $D({P}_{i}^{t,n},{P}_{j}^{t,n})\le \epsilon $. Recall that the scoring function is assumed to satisfy $D({P}_{i}^{t,n},{P}_{j}^{t,n})\le \epsilon \phantom{\rule{2.77695pt}{0ex}}\Rightarrow \phantom{\rule{2.77695pt}{0ex}}\left|{S}_{\mathit{\text{BB}}}\left({P}_{i}^{t,n}\right)-{S}_{\mathit{\text{BB}}}\left({P}_{j}^{t,n}\right)\right|\le {\epsilon}_{s}$, and thus ${S}_{\mathit{\text{BB}}}\left({P}_{i}^{t,n}\right)\approx {S}_{\mathit{\text{BB}}}\left({P}_{j}^{t,n}\right)$. If ${S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,t-1}\right)\ge {S}_{\mathit{\text{BB}}}\left({P}_{j}^{1,t-1}\right)$, then approximately ${S}_{\mathit{\text{BB}}}\left({P}_{i}\right)={S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,t-1}\right)\oplus {S}_{\mathit{\text{BB}}}\left({P}_{i}^{t,n}\right)\ge {S}_{\mathit{\text{BB}}}\left({P}_{j}^{1,t-1}\right)\oplus {S}_{\mathit{\text{BB}}}\left({P}_{j}^{t,n}\right)={S}_{\mathit{\text{BB}}}\left({P}_{j}\right)$. Therefore, if $D({P}_{i}^{t-k,t-1},{P}_{j}^{t-k,t-1})\le \epsilon $ and ${S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,t-1}\right)\ge {S}_{\mathit{\text{BB}}}\left({P}_{j}^{1,t-1}\right)$, there is no need to generate ${P}_{j}^{t,n}$ to find an approximately optimal solution.

Based on this observation, we developed a novel dynamic programming algorithm. Idealized structures are still generated as previously described, but the generation process is stopped for partial structures that are known not to lead to the optimal one. First, the search space for each atom of the target protein is discretized into grids of size *ε*. When generating coordinates for atom *t*, if ${P}_{i}^{t-k+1,t}$ and ${P}_{j}^{t-k+1,t}$ are located in the same grid set ${G}_{g}^{t-k+1,t}$, there is no need to continue the generation process on the lower scoring one of ${P}_{i}^{1,t}$ and ${P}_{j}^{1,t}$. Thus, we define the dynamic programming table ${T}_{\mathit{\text{BB}}}(t,{G}_{g}^{t-k+1,t})$ to be the optimal idealized structure for each observed tail grid set ${G}_{g}^{t-k+1,t}$ as follows:

$\begin{array}{l}\left\{\begin{array}{l}{T}_{\mathit{\text{BB}}}(t,{G}_{g}^{t-k+1,t})=\underset{i,j}{max}{T}_{\mathit{\text{BB}}}(t-1,{G}_{i}^{t-k,t-1})\\ \phantom{\rule{1em}{0ex}}\oplus {S}_{\mathit{\text{BB}}}\left({P}_{j}^{t}\right)\\ {T}_{\mathit{\text{BB}}}(k,{G}_{g}^{1,k})=\underset{i}{max}{S}_{\mathit{\text{BB}}}\left({P}_{i}^{1,k}\right)\end{array}\right.,\end{array}$

(7)

where ${G}_{g}^{t-k+1,t-1}={G}_{i}^{t-k+1,t-1}$, ${P}_{j}^{t-k+1,t}\in {G}_{g}^{t-k+1,t}$ and ${S}_{\mathit{\text{BB}}}\left({P}_{j}^{1,t-1}\right)\oplus {S}_{\mathit{\text{BB}}}\left({P}_{j}^{t}\right)={S}_{\mathit{\text{BB}}}\left({P}_{j}^{1,t}\right)$. Thus, the dynamic programming table can be calculated from the first atom to the last atom. Finally, the optimal idealized structure is the entry with the highest score in the last column of the table, $\underset{g}{\max}\,{T}_{\mathit{\text{BB}}}(3n,{G}_{g}^{3n-k+1,3n})$.
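The recurrence in Equation 7 keeps, for every tail grid set, only the best-scoring partial structure that ends in it. The toy sketch below illustrates that idea on a one-dimensional grid: each "atom" occupies an integer cell, a step may move by −1, 0 or +1 cells, the score is the negative distance to a target cell, and plain addition stands in for ⊕. All names and the toy scoring are assumptions; this is not the authors' implementation.

```python
def dp_best_by_tail(target, moves, k, start_cells):
    """Best (score, path) keeping one survivor per tail of k cells."""
    table = {}
    for c in start_cells:
        path = (c,)
        table[path[-k:]] = (-abs(c - target[0]), path)
    for t in range(1, len(target)):
        new_table = {}
        for score, path in table.values():
            for m in moves:
                nxt = path[-1] + m
                cand = (score - abs(nxt - target[t]), path + (nxt,))
                key = cand[1][-k:]  # tail grid set of the extended path
                # Keep only the best-scoring path for each tail key.
                if key not in new_table or cand[0] > new_table[key][0]:
                    new_table[key] = cand
        table = new_table
    return max(table.values())

target = [0, 2, 1, 3]
best_score, best_path = dp_best_by_tail(target, moves=(-1, 0, 1), k=1,
                                        start_cells=(-1, 0, 1))
print(best_score, best_path)
```

In this toy problem the score and the allowed extensions depend only on the last cell, so *k*=1 is already exact. In the backbone setting the tail must be long enough to determine the next atom's geometry and score terms, which is why a larger *k* is needed there.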

The run-time complexity of our dynamic programming algorithm depends on the value of *k*. To keep all possible (*Φ*,*Ψ*) dihedral angles of the previous residue when generating *C*_{α} atoms, we have to choose *k*≥5. For speed, we choose *k*=5 in our implementation. In this case, the number of score calculations required to calculate ${T}_{\mathit{\text{BB}}}(t,{G}_{g}^{t-4,t})$ is no more than the maximum number of coordinates sampled for six consecutive backbone atoms. Recall that there are exactly two *C*_{α} atoms in six consecutive backbone atoms, and the *Ω* dihedral angle is rounded. Thus, the coordinate of one *C*_{α} atom can be calculated from the coordinates of the other *C*_{α} atom and the two atoms between them. For this reason, the maximum number of sampled coordinates is bounded by *O*(1/*ε*^{8}). Moreover, the number of score calculations required to calculate ${T}_{\mathit{\text{BB}}}(k,{G}_{g}^{1,k})$ is no more than the maximum number of possible coordinates sampled for five consecutive backbone atoms, which is also *O*(1/*ε*^{8}). Therefore, the run-time complexity of our dynamic programming algorithm is *O*(*n*/*ε*^{8}).

To speed up the dynamic programming algorithm, we applied an additional filtering technique that retains only the highest-scoring idealized structures. Specifically, the algorithm remembers the optimal idealized structure only for the top *m* scored tail configurations instead of all possible configurations. Thus, the run-time complexity is reduced to *O*(*nm*/*ε*). This approach works well in practice because an optimal idealized structure with a long, poorly scored fragment is rare. Thus, we assumed that the local quality of the idealized structure should be reasonably high (within the top *m* scores).
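The top-*m* filter amounts to pruning the dynamic programming table after each atom. A minimal sketch, assuming the table maps a tail key to a (score, path) pair as a plain dictionary; `keep_top_m` and the example entries are hypothetical.

```python
import heapq

def keep_top_m(table, m):
    """Keep the m table entries with the highest scores."""
    best = heapq.nlargest(m, table.items(), key=lambda item: item[1][0])
    return dict(best)

table = {
    ("g1",): (-0.5, ("a",)),
    ("g2",): (-2.0, ("b",)),
    ("g3",): (-0.1, ("c",)),
    ("g4",): (-3.5, ("d",)),
}
pruned = keep_top_m(table, m=2)
print(sorted(pruned))
```

Pruning makes the search a beam search: a structure whose prefix falls outside the top *m* at some atom can no longer be recovered, which is the trade-off accepted in exchange for the lower run time.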