We start by estimating how similar two differently threaded tube structures fulfilling typical distance inequalities of alpha carbon atoms in proteins can be when measured by Root Mean Square Deviation (RMSD) [10], Global Distance Test - Total Score (GDT-TS) [7], and TM-score [11].
A motivating example
Alpha carbon atom neighbors along the backbone are usually \(3.8\AA \) (Ångström) apart. Otherwise, alpha carbon pairs are rarely less than \(4.5\AA \) apart. Imagine a symmetrical crossing in the xy-plane where the over-crossing part of a chain has heights (measured on the z-axis) 0, 0, 1.25, 2.5, 2.5, 1.25, 0, 0Å. Similarly, the under-crossing part follows the y-axis at heights 0, 0, – 1.25, – 2.5, – 2.5, – 1.25, 0, 0Å. Consider two larger structures that are identical except that the above over- and under-crossings are interchanged. The Euclidean motion of the optimal superposition of two structures is close to the identity, which we assume for simplicity. The total motion is thus given by 4 residues traveling \(5\AA \) and 4 residues traveling \(2.5\AA \). For an n-residue structure we find a topology change (tc) for
$$\begin{aligned} RMSD_{n}^{{tc}} \; \le& \sqrt {\frac{{4(5{\AA})^{2} + 4(2.5{\AA})^{2} }}{n}} < \frac{{12{\AA}}}{{\sqrt n }} \hfill \\ & GDT - TS_{n}^{{tc}} \; \ge \frac{1}{4}\sum\limits_{{m = 1,2,4,8}} {\text{aligned\;residues\;within\;m{\AA}}} \hfill \\ &\frac{1}{4}\left( {\frac{{n - 8}}{n} + \frac{{n - 8}}{n} + \frac{{n - 4}}{n} + \frac{n}{n}} \right)\; =& \;1 - \frac{5}{n} \hfill \\ &TM_{n}^{{tc}} \ge \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\frac{1}{{1 + \left( {\frac{{d_{i} }}{{d_{0} }}} \right)^{2} }}} \hfill \\ =& \frac{{n - 8}}{n} + \frac{4}{n}\frac{1}{{1 + \left( {\frac{5}{{d_{0} }}} \right)^{2} }} \hfill \\ & > 1 - \frac{4}{n},for\;n \ge 100, \hfill \\ \end{aligned} $$
where \(d_0=1.24(n-15)^{1/3}-1.8\). For \(n=100\) we find \(RMSD^\text {tc}_{100}=1.2\AA \), \(GDT-TS^{\text {tc}}_{100}=0.95\) and \(TM^\text {tc}_{100}=0.96\). These scores are typical for very similar structures despite the constructed topology change. The persistence length of native proteins may require longer deformations than used in this example; but in computational modeling when, e.g., trying to minimize the distance between a model and a native structure one may get close to this example. Figure 1 shows a similar crossing change between two native 112 residue structures aligned with global RMSD \(5.6\AA \) and TM= 0.60. Before describing a method to detect such topological changes during a linear interpolation between structures we start with an algorithm for detecting steric clashes under a linear interpolation.
Measuring protein-protein overlap under a linear interpolation
Let two protein chains be represented by the coordinates of their alpha carbon atoms \(P_0=\bigl [p_0^1 \dots p_0^n\bigl ]\) and \(P_1=\bigl [p_1^1 \dots p_1^n\bigl ]\) each with index starting at the N-terminus. Assume they are placed in optimal superposition by RMSD or any other alignment method. We study the linear interpolation where each point \(p^i(t)=(1-t)p_0^i+tp_1^i\) traverses the straight line segment connecting the i’th alpha carbon in each of the two structures and the parameter \(t\in \bigl [0,1\bigl ]\) may be thought of as a time parameter. We want to check if typical protein distance constraints are violated during the interpolation and quantify how much the protein overlaps itself. For this, we first find the minimal distance between \(p^i\) and \(p^j\) during the interpolation. Let \(p^{i,j}_0=p_0^j-p_0^i\), \(p^{i,j}_1=p_1^j-p_1^i\) and denote \(\bigl \Vert p^{i,j}_0\bigl \Vert =a\), \(\bigl \Vert p^{i,j}_1\bigl \Vert =b\), and \(p^{i,j}_0\cdot p^{i,j}_1=ab\cos {\theta }\). For all t we have
$$\begin{aligned} d^2_{i,j}(t)= & {} \bigl \Vert p^j(t)-p^i(t)\bigl \Vert _2^2=\bigl \Vert (1-t)p^{i,j}_0+tp^{i,j}_1\bigl \Vert {_2^2}\\=\; & {} (1-t)^2a^2+t^2b^2+2t( 1-t)ab\cos {\theta }. \end{aligned}$$
The global minimum of \(d^2_{i,j}\) is found for \(t^*=\frac{a^2-ab\cos {\theta }}{a^2+b^2-2ab\cos {\theta }}\) and equals
$$\begin{aligned} m= & {} \frac{\sin {\theta }a^2b^2}{a^2+b^2-2ab\cos {\theta }}=\frac{\bigl \Vert p_1^{i,j}\times p_0^{i,j}\bigl \Vert _2^2}{\bigl \Vert p_1^{i,j}- p_0^{i,j}\bigl \Vert _2^2}\\= & {} \frac{\bigl \Vert \bigl (p_1^{i,j}- p_0^{i,j}\bigl )\times p_0^{i,j}\bigl \Vert _2^2}{\bigl \Vert p_1^{i,j}- p_0^{i,j}\bigl \Vert _2^2}. \end{aligned}$$
The last equation shows that the fraction is bounded by \(\bigl \Vert p_0^{i,j}\bigl \Vert _2^2\) and by symmetry also by \(\bigl \Vert p_1^{i,j}\bigl \Vert _2^2\). Setting \({\tilde{t}}=\bigl \{0\), if \(t^*<0\); \(t^*\), if \(0\le t^*\le 1\); and 1, if \(1<t^*\bigl \}\), the minimum of \(d^2_{i,j}\) for \(t\in [0,1]\) denoted \(d^\text {2,interpolation}_{i,j}=d^2_{i,j}({\tilde{t}})\). If \(\bigl \Vert p_1^{i,j}- p_0^{i,j}\bigl \Vert _2^2\) is vanishing \(t^*\) is ill-defined; but then \(d^2_{i,j}\) is constant and this value is returned. From a set of native protein structures, we estimate the shortest possible distances between alpha carbons i and j to be \(d_{min}(|i-j|)=\) 2.8, 4.5, 3.86, 3.47, 3.52, 3.48, 3.6Å for \(|i-j|=1,\dots ,7\) and \(d_{min}(|i-j|)=3.7\AA \) for \(|i-j|>7\). The overlap, \(\text {overlap}_{i,j}\), is thus the max of \(d_{min}(|i-j|)-d^\text {interpolation}_{i,j}\) and 0(zero). The mean overlap is given by \({\text {MeanOverlap}}\big ( P1,P2\big )=\frac{1}{n}\sum _{i<j} \text {overlap}_{i,j}\). Especially in \(\alpha \)-helices the distance between neighbouring alpha carbon atoms may be shortened substantially during the interpolation. Therefore \(d_{min}(1)=2.8\) is chosen small to avoid large contributions from local deformations. If both protein structure \(P_0\) and \(P_1\) are overlap free, \(\text {overlap}_{i,j}\) is given by the max of \(d_{min}(|i-j|)-\sqrt{m}\) and 0(zero) if \(0<t^*<1\) and 0(zero) otherwise. To be able to handle models not fulfilling basic distance constraints, we do not make this assumption.
It is a priory not clear how to penalize protein-protein overlap. Here, we have chosen to penalize pairwise overlap linearly.
Topology check of a protein under a linear interpolation
We now leave the volumetric point of view and turn to a purely topological point of view. Given two superimposed protein structures, we consider the linear interpolation of the piecewise linear curve connecting the alpha carbons. Let P denote the piecewise linear curve through all points \(p_i\), \(i=1,\dots ,n\). That is \(P_a=p_i\) for \(a=1,\dots ,n\), but \(P_a\) is defined for all \(a\in [1,n]\). The linear interpolation between two piecewise linear curves \(P^0\) and \(P^1\) is given by \(P_a(t)=(1-t)P^0_a +tP^1_a\) for \(t\in [0,1]\) and \(a\in [1,n ]\).
We want to detect all self-intersections of the backbone curve during the linear interpolation. Furthermore, we want to identify all self-intersections we can avoid by changing the initial morph locally by rearranging at most a user defined number of residues. The remaining self-intersections are expected to do essential changes to the fold, e.g., as avoiding them require larger rearrangements than allowed by the user. To do this, we borrow some notation from knot theory concerning knots on closed curves in 3-space. For a knot in 3-space a planar projection with over and under crossings indicated is called a knot diagram if all crossings are isolated. Hence, at most two points on the knot are allowed to be projected to the same point. When changing the plane of projection the crossings will move, crossings may disappear, new crossings may appear, and in the planar projection existing crossings may pass through each other. The same happens to the planar projection when deforming the knot through a self-avoiding morph. A fundamental result found in any text book on knot theory states that two closed curves share knot type, i.e., the one may be morphed into the other through imbeddings by a ambient isotopy if and only if their knot diagrams are connected by a finite series of Reidemeister moves, shown in Fig. 2, plus deformations not changing crossings. Crossings of a knot may be associated with a sign using the usual right-hand rule explained in the same figure. Note, that changing the direction of traversing a knot does not change the sign of the crossing. The sign of a crossing comes from the orientation of 3-space and is for line segments \(P_i P_{i+1}\) and \(P_j P_{j+1}\) given by the sign of the determinant \(\det _{ij}=\det \bigl ( P_{i+1}-P_{i}\, P_{j+1}-P_{j}\, P_{i}-P_{j} \bigl )\).
Detecting transversal self-intersections
We consider only so-called transversal self-intersections where two line segments actually go through each other and consequently the sign of their crossing changes at the time of the self-intersection. Algorithmically, we thus first find isolated real roots \(t^*\in [0,1]\) for the 3’rd order polynomial
$$\begin{aligned}&{\text {det}}_{ij}\bigl (t\bigl ) = \\& \det \bigl ( P_{i+1}(t)-P_{i}(t)\, \quad P_{j+1}(t)-P_{j}(t)\, P_{i}(t)-P_{j}(t)\bigl ). \end{aligned}$$
For each of these roots, \(t^*\in [0,1]\), the line segments \(P_i(t^*)P_{i+1}(t^*)\) and \(P_j(t^*)P_{j+1}(t^*)\) lie in a plane and intersect only if the planar problem
$$\begin{aligned}&(1-s_i)P_{i}(t^*)+s_iP_{i+1}(t^*)\\&\quad =(1-s_j)P_{j}(t^*)+s_jP_{j+1}(t^*) \end{aligned}$$
has a solution with \(s_i,s_j\in [0,1]\). If a self-intersection is found, we store the interpolation parameter \(t^*\), the curve parameters of the self-intersection \(a=i+s_i\) and \(b=j+s_j\), together with the crossings sign change given by the sign of the derivative of \(\det _{ij}(t)\) at \(t^*\). By checking all pairs of line segments we get a list of self-intersections with data \(( a_k,b_k,sign_k,t^*_k)\), where k is an arbitrary index.
A self-intersection between line segments \(P_i P_{i+1}\) and \(P_j P_{j+1}\) requires that at least some of the spheres centered at \(P_i\), \(P_{i+1}\), \(P_j\) and \(P_{j+1}\) are overlapping during the interpolation as calculated in the previous section. The self-intersection check may be skipped if the alpha carbon neighbors are less than \(4\AA \) apart and \( {\text{overlap}}_{{i,j}} + {\text{overlap}}_{{i,j + 1}} + {\text{overlap}}_{{i + 1,j}} + {\text{overlap}}_{{i + 1,j + 1}} \; < \;2.6{\AA} \). The pre-computed overlap gives an efficient filter before the self-intersection check, but one cannot tell a nearby passing from a self-intersection by the overlap values.
Locally avoidable self-intersections
Some morph self-intersections make a significant change to a proteins fold, while others do not. In the mathematical notion of topology however all protein chains are topologically equivalent as they all may be refolded e.g. into one long beta strand. Algorithmically such a morph may be constructed by a blow-up technique. A distinction between significant and insignificant fold changes can thus only be decided by human judgement and may differ depending on the application at hand. Here, we search for insignificant self-intersections that we can resolve by local changes of the morph and quantify how much longer the resulting morph is.
Imagine the left- and right-handed loops in Fig. 2 aligned on top of each other. The linear interpolation between the two structures will have a self-intersection. Loops are the most flexible parts of proteins and such a self-intersection is a minor structural change as we now explain. Mathematically this self-intersection can be removed from the morph as this structural change may be performed by two Reidemeister moves of type one, denoted \(\Omega _1\). We have used that the projection of the loop in Fig. 2 is isolated from the rest of the projected curve. In a projection of a globular protein structure, it is likely that the projection of other parts of the backbone pass through the projected loop. Hereby a longer series of Reidemeister moves may be needed to connect the projections of the two loops. To avoid this algorithmic complexity we choose to make a 3-dimensional analogue of Reidemeister moves. We want to know if we can remove a self-intersection by just changing the morph of the arc forming the loop while leaving the rest of the curve and its morph unchanged. We therefore say that a self-intersection \(( a_k,b_k,sign_k,t^*_k)\) is locally \(\Omega _1\)-removable if the topological disk given by all the triangles connecting two consecutive points of the loop
$$ P_{{a_{k} }} \;\left( {t_{k}^{*} } \right),P_{{ciel(a_{k} )}} \left( {t_{k}^{*} } \right), \ldots ,P_{{floor(b_{k} )}} \left( {t_{k}^{*} } \right),P_{{b_{k} }} \left( {t_{k}^{*} } \right)\; = \;P_{{a_{k} }} \left( {t_{k}^{*} } \right) $$
to the center of mass of all these points is disjoint from all the other line segments of the \(t^*_k\)-curve \(P_{1}(t^*_k),\dots , P_{floor(a_k)}(t^*_k)\) and \(P_{ciel(b_k)}(t^*_k),\dots , P_{n}(t^*_k)\). In this case, one can freely deform the loop in a neighborhood of the topological disk without intersecting the rest of the curve. One possible morph that avoids the self-intersection contracts the loop almost to its center of mass where there is room to throw one arc around the loop. However, for less complicated loops one would simply deform the loop by rotating it around the line through the point of self-intersection and the center of mass of the loop. We therefore penalize a locally \(\Omega _1\)-removable self-intersection by a price \({\mathscr {P}}1\) equal to twice the sum of the distances from the points of the loop to this line. The self-intersection shown in Fig. 1 may be removed by an \(\Omega _1\) move involving a large part of the protein. We therefore let the user decide the maximal accepted backbone length, denoted \({\text {MaxLength}}\) involved in the alternate morph. Only self-intersections with \(b_k-a_k\le {\text {MaxLength}}\) are tested for being \(\Omega _1\) removable. In a planer projection into a knot diagram the curve segment from \(a_k\) to \(b_k\) need neither be simple nor isolated from the remainder of the projected curve. The constructed \(\Omega _1\)-move is operating on a 3-dimensional curve. It is therefore not a real Reidemeister move that acts on 2-dimensional knot diagrams. The \(\Omega _1\)-move need not project to a Reidemeister 1 move for two main reasons. Firstly, the 3-dimensional loop need not project to a simple curve. For larger \({\text {MaxLength}}\) the loop may even be knotted. Secondly, in the knot diagram the projection of the loop is not likely to be isolated from the remaining projection. Thus the 3-dimensional \(\Omega _1\)-move may involve many Reidemeister moves in a knot diagram and can be seen as a meta move in the context of knot diagrams.
If a closed double stranded DNA undergoes the self-intersection considered here, it will change its writhe and hereby also the linking number between the two strands by plus or minus two [34]. To feel this coupling between writhe and twisting, hold a length of torsional stiff wire in the shape of a left-handed loop such that it is in self-contact almost forming a self-intersection. By rotating both ends of the wire, you can force the wire to morph into a right-handed loop. That is, a morph avoiding the self-intersection requires two full rotations around the backbone. Proteins are in principle free to do full rotations around the backbone. One may choose not to penalize the discontinuous writhing caused by \(\Omega _1\) loops but we offer the possibility.
Most self-intersections are not \(\Omega _1\)-removable. A significant speed up of the calculation time is obtained by: First, move the origin of the coordinate system to the center of mass of the loop, which then is a vertex in all the triangles of the topological disc. Then to sort all line segments by increasing distance from the origin to the center of the line-segments. Finally, to check all line-segment triangle pairs for intersection and stop either when an intersection is found or when the triangle inequality between the disk and line segments implies that an intersection is impossible.
The local zigzag shape of general alpha carbon curves often make self-intersections occur in pairs during a morph. Such a pair corresponds to interchanging over and under sliding in Reidemeister moves of type two, \(\Omega _2\), shown in the middle of Fig. 2. The two crossings, \(( a_j,b_j,sign_j,t^*_j)\) and \(( a_k,b_k,sign_k,t^*_k)\), in an \(\Omega _2\) move have opposite signs and both change sign during transversal self-intersections. Without loss of generality, we may assume that \(t^*_j\le t^*_k\). Consider first the case where \(t^*_j= t^*_k\). At time \(t^*_k\) the two arcs connecting the two points of self-intersection form a closed curve. If a topological disc spanned by this closed curve is disjoint from the remaining parts of the curve then the two self-intersections can be removed essentially by performing two Reidemeister moves of type two close to this disc. You may construct this such that one of the moves occurs before the time \(t^*_k\) and one after. In the general case where \(t^*_j\le t^*_k\) it is sufficient that a topological disk whose boundary includes the two arcs is isolated for some \(t\in \big [t^*_j; t^*_k\big ]\), see Fig. 3. To close the curve and to have room for a morph avoiding the two self-intersections we have to add the line segments traversed by the self-intersection points in the time interval \(t^*_j\le t\le t^*_k\). In practice we check if the topological disc connecting the loop for \(t^*_{average}=(t^*_j+t^*_k)/2\) to its center of mass is disjoint from the remainder of the curves for \(t=t^*_{average}\) and if the additional line segments traversed by the self-intersection points are disjoint from the remainder of the curve in the time interval \(t^*_j\le t\le t^*_k\). See Fig. 3. The first check is speeded up as in the case of the \(\Omega _1\) move. If the backbone length of the loop is shorter than the user-defined \({\text {MaxLength}}\), i.e., if \(b_j-a_j+b_k-a_k\le {\text {MaxLength}}\), and no obstructions are found we say that \(( a_j,b_j,sign_j,t^*_j)\) and \(( a_k,b_k,sign_k,t^*_k)\) form a locally \(\Omega _2\)-removeable loop and penalize it by price \({\mathscr {P}}2\) twice the sum of the distances from the points of the loop to the line connecting the two points \(P_{a_j}(t^*_{average})\) and \(P_{a_k}(t^*_{average})\).
As in the case of \(\Omega _1\) moves, an \(\Omega _2\) move acts on space curves. When projected to a knot diagram, an \(\Omega _2\) move may require several ordinary Reidemeister moves and may thus be thought of as a meta move of knot diagrams. The sum of the crossings signs is unchanged by an \(\Omega _2\) move and there is thus no net discontinuous contribution to the writhe of the curve. E.g., the linking number of double stranded DNA undergoing an \(\Omega _2\) move is unchanged and there is therefore no torsional penalty involved.
Essential self-intersections
Let \(\{v_i\}\) denote all the self-intersections. Some are \(\Omega _1\)-removable at price \({\mathscr {P}}1(v_i)>0\). Some pairs \([v_i,v_j]\) are \(\Omega _2\)-removable at price \({\mathscr {P}}2(v_i,v_j)>0\). We want to remove as many self-intersections as possible at the lowest possible cost when being restricted to removing each \(v_i\) at most once. These two problems may be solved simultaneously by one maximal weight matching problem as we now explain. We will denote the self-intersections left after this process a set of essential self-intersections of the morph. Consider the \(v_i\)’s as vertices in a graph and let \(\Omega _1\) denote the \(\Omega _1\)-removable subset. Let the constant \(\varepsilon \) equal one over twice the sum of the prices of all \(\Omega _1\) and \(\Omega _2\) moves. Each vertex is associated a weight
$$\begin{aligned} w_v(v_i)=\left\{ \begin{array}{ll} \varepsilon {\mathscr {P}}1(v_i), &{} \text {if} \; v_i\in \Omega _1 \\ 1, &{} \hbox {else.} \end{array} \right. \end{aligned}$$
We want to remove as many vertices as possible. Hence, each vertex in \(\Omega _1\) will be removed by an \(\Omega _1\) move unless it’s more favorable to let it take part in an \(\Omega _2\) move. Hence, we only need to consider how to use the \(\Omega _2\) moves. For this, let each \(\Omega _2\) move be represented by an edge, e, connecting vertices \(v_i\) and \(v_j\) and associate this edge with the weight
$$\begin{aligned} w(e)=w_v(v_i)+w_v(v_j) - \varepsilon {\mathscr {P}}2(v_i,v_j). \end{aligned}$$
Finally, let K be the number of vertices not in \(\Omega _1\), denote the total cost of the \(\Omega _1\) moves by \({\mathscr {P}}1_{all}=\sum _{v_i\in \Omega _1} {\mathscr {P}}1(v_i)\), and recall that a matching is a subset \(E'\) of all the edges that uses each vertex at most once. Let \(N(E')\) denote the number of vertices left after performing the \(\Omega _2\) moves specified by \(E'\) followed by all the possible \(\Omega _1\) moves and let \({\mathscr {P}}(E')\) be the sum of the prices of all the \(\Omega _1\) and \(\Omega _2\) moves used. We claim that for each matching \(E'\)
$$\begin{aligned} N(E')+\varepsilon {\mathscr {P}}\bigl (E'\bigl )=K+\varepsilon {\mathscr {P}}1_{all} -\sum _{e\in E'} w(e) \end{aligned}$$
and denote this quantity by \(A\bigl (E'\bigl )\). By this claim a maximal weighted matching \(E^*\) maximizes \(\sum _{e\in E'} w(e)\) and thus minimizes the left-hand side. By construction \(0\le \varepsilon {\mathscr {P}}(E')\le \frac{1}{2}\) and hereby \(N(E^*)\) is minimized and \({\mathscr {P}}(E^*)\) is minimized subject to the constraint that \(N(E^*)\) is minimal, solving both problems at once.
We find it most instructive to prove this claim by induction on the cardinality of \(E'\) and start with the case \(E'=\emptyset \) where all \(\Omega _1\) moves are performed and the claim is trivial. Consider a matching \(E''\) where \(e_k\in E''\) and the induction claim is true for \(E'=E''-\{e_k\}\). Then
$$\begin{aligned} A\bigl (E''\bigl )&=K+\varepsilon {\mathscr {P}}1_{all} -\sum _{e\in E''} w(e)\\&= K+\varepsilon {\mathscr {P}}1_{all} -\sum _{e\in E'} w(e)-w(e_k)\\&= N(E')+\varepsilon {\mathscr {P}}(E')-w(e_k), \end{aligned}$$
where the last equality follows by induction. We use the shorthand \({\mathscr {P}}2(e_k)\) for the price \({\mathscr {P}}2(v_i,v_j)\) of the edge \(e_k\) between the two vertices \(v_i\) and \(v_j\).
If \(v_i,v_j \notin \Omega _1\) then \(w(e_k)=2-\varepsilon {\mathscr {P}}2(e_k)\) and \(A(E'')=N(E')-2 +\varepsilon \bigl ( {\mathscr {P}}(E') +{\mathscr {P}}2(e_k)\bigl )\). As two additional vertices are removed at the cost \({\mathscr {P}}2(e_k)\) we find \(A(E'')=N(E'')+\varepsilon {\mathscr {P}}(E'')\).
If \(v_i\in \Omega _1\) and \(v_j \notin \Omega _1\) then \(w(e_k)=\varepsilon {\mathscr {P}}1(v_i) +1-\varepsilon {\mathscr {P}}2(e_k)\) and \(A(E'')=N(E')-1 +\varepsilon \bigl ( {\mathscr {P}}(E') -{\mathscr {P}}1(v_i)+{\mathscr {P}}2(e_k)\bigl )\). Here one additional vertex, \(v_j\), is removed and the change of cost is \({\mathscr {P}}2(e_k)-{\mathscr {P}}1(i)\) as \(v_i\) not is removed by a \(\Omega _1\) move but by the \(\Omega _2\) move \(e_k\). Thus \(A(E'')=N(E'')+\varepsilon {\mathscr {P}}(E'')\).
Finally, if \(v_i,v_j\in \Omega _1\) then \(w(e_k)=\varepsilon {\mathscr {P}}1(v_i) +\varepsilon {\mathscr {P}}1(v_i)-\varepsilon {\mathscr {P}}2(e_k)\) and \(A(E'')=N(E')+\varepsilon \bigl ( {\mathscr {P}}(E') -{\mathscr {P}}1(v_i)-{\mathscr {P}}1(v_j)+{\mathscr {P}}2(e_k)\bigl )=N(E'')+\varepsilon {\mathscr {P}}(E'')\) as no additional vertices are removed and the cost is changed by using the \(\Omega _2\) move \(e_k\) instead of the two \(\Omega _1\) moves. This completes the proof of the claim.
We use the Matlab implementation [35] based on [36]. In practice, this calculation uses only a small fraction of the computation time and may be considered efficient. We also implemented a greedy algorithm that searches for locally contractible \(\Omega _2\) moves sorted by increasing backbone length between the two points of self-intersection and removes the ones found. Usually considerable fewer candidate \(\Omega _2\) moves have to be checked. Hence, the greedy algorithm is faster but in some cases it cannot remove as many self-intersections as the optimal method based on maximum weighted matching in graphs. There may be several sets with a minimal number of essential self-intersections, but in most cases the real valued prices of \(\Omega _1\) and \(\Omega _2\) moves will result in a unique cheapest minimal set. However, for some morphs a slight change in these prices will change the selected minimal set. For more complicated morphs we therefore stress not to emphasise which self-intersections are denoted essential but solely to report their cardinality.
End-contractions
When a self-intersection is not locally removable, we construct a morph using a standard strategy for unknotting ropes or shoelaces namely: while keeping most of the rope fixed, pull the rope until a terminus gets free of the loop restraining it. That is, a self-intersection \(( {{a_j},\;{b_j},\;{sign_j},\;{t_j^*}}) \) may be avoided by N- or C-terminus contraction. An N-terminus contraction first moves all points upstream of \(P_{a_j}(0)\) to this point for interpolation parameter \(t=0\), then uses the corresponding restriction of the original morph from \(t=0\) to \(t=1\), and finally does a reversed end-contraction on the protein structure for \(t=1\). If the end-contraction is restricted to be performed along the backbone path, \(P_1\dot{(}0)\) is moved a total distance of approximately \(a_j 3.8\AA \), \(P_2\) is moved approximately \((a_j-1)3.8\AA \) etc. The total morph length is therefore quadratic in the distance to the terminus. In general, the following algorithm illustrated on an N-terminus contraction finds a much shorter end-contraction.
The end-contractions is performed on the original protein structures, and we omit writing the constant interpolation parameter t = 0 or 1. That is \(P{a_j}\) is shorthand for either \(P_{a_j}(0)\) or \(P_{a_j}(1)\). Let the first obstruction point be \(Po=Pa_j\) and let \(m=floor(a_j)\). We have to move the entire curve \(P_1,\dots ,P_m\) to the point \(P_{a_j}\). \(P_m\) may always be moved along the straight line to \(P_{a_j}\). If the triangle with corners \(Po, P_m, P_{m-1}\) is disjoint from the curve downstream of \(P(a_j)\) the shortcut from \(P_{m-1}\) to Po may be used. Otherwise a new obstruction point \(P{{\tilde{o}}}\) inside the triangle is chosen such that the length of the path Po to \(P{{\tilde{o}}}\) to \(P_{m-1}\) is minimized and all downstream line segments are avoided. See Fig. 4. The algorithm is greedy and accepts the path from \(P{{\tilde{o}}}\) to \(P_{a_j}\) as the shortest found and all upstream points will use this as part of their path to \(P_{a_j}\). The last encountered obstruction point is moved to \(P{\tilde{o}}\). Then this shortening of the end-contraction is continued for all the upstream line segments. The cost of avoiding the self-intersection \(( a_j,b_j,sign_j,t^*_j)\) is thus the total distance traveled by the points in the end-contraction both for \(t=0\) and for \(t=1\) together with m times the distance traveled by \(P(a_j)\) in the original linear interpolation, because all the m contracted points travel this distance.
If \(a_1\le a_2\le ,\dots ,\le a_n\) and \(b_1,b_2,\dots ,b_n\) are the parameter values of the essential self-intersections the combined smallest N and C terminus end-contractions removing them all solve:
$$\begin{aligned} \min _{{1\; \le \;a\; < \;b\; \le \;L}} {\text{ N - Contraction}}(a) + {\text{C - Contraction}}(b) \\ \quad{\text{s.t.}}\,a_{i} \; \le a\;or\;b \le b_{i} \;{\text{for}}\;{\text{all}}\;i=1,\dots ,n. \end{aligned}$$
If no \({\text {N-contraction}}\) is used \(a=1\) and \(b=\min (b_1,b_2,\dots ,b_n,L)\) in order to remove all the self-intersections. If \(a=a_1\) the first self-intersection is removed from the N-terminus end and \(b=\min (b_2,\dots ,b_n,L)\), removes the remaining self-intersections. Thus in general \(a=a_j\) and \(b=\min (b_{j+1},\dots ,b_n,L)\) fulfill the constraints and we only have to check \(n+1\) cases to find this minimum. To speed this calculation up, we use an estimate of the end-contraction based on the observation that the distance traveled by residues up to 17 residues away from the self-intersection point grow roughly linearly to 25Å and is fluctuating around 25Å when further apart along the backbone. This gives a reasonable estimate of the sizes of the chosen end-contractions as shown on Fig. 4. The end-contractions are the computationally most costly part of all the calculations. As they only serve to quantify the distance between essentially different folds, we offer the possibility to omit this calculation and instead use the estimated size of the end-contractions. Finally, we offer the possibility to consider a self-intersection as \(\Omega _1\)-removable if it is sufficiently close to a protein terminus. One natural choice is to set the longest allowed distance from the point of self-intersection to an end to half of the user-defined \({\text {MaxLength}}\). In case a self-intersection also is \(\Omega _1\)-removable the price is set to the minimum of the \(\Omega _1\) price and the end-contraction price. Figure 5 illustrates the full self-avoiding path that lies in a tubular neighborhood of the original linear interpolation if essential self-intersection are absent.
Treating alignments with gaps
Optimal structural superposition of two protein chains includes gaps in the alignments of the chains to compensate for the naturally occurring insertions and deletions in homologous protein chains. A canonical example is that a loop may be longer in one structure than in the other; but such that the long loop can be contracted into the shorter loop without causing self-intersections of the entire structure. A sequence or structural alignment determines a paring of only a part of the residues of the two proteins. We choose to fill out the alignment gaps using linear interpolation as explained by the following example involving a 12 residue long Chain 0 and a 10 residue long Chain 1 using the notation of the program TM-align:
Backbone 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | – | – | – | 10 | 11 | 12 |
TM align | | | . | : | | : | : | | | | | | : | : | |
Backbone 1 | | | 1 | 2 | – | 3 | 4 | – | – | 5 | 6 | 7 | 8 | 9 | 10 |
The first aligned residue pairs between Chain 0 and Chain 1 are (3, 1), (4, 2) and (6, 3). We treat weakly aligned pairs, indicated with “.” by TM-align, as aligned pairs indicated by “:”. Residue 5 of Chain 0 is not aligned and we choose to align it with the midpoint \(P^1_{2.5}\) between Chain 1’s second and third residue. The next two aligned pairs (7, 4) and (10, 8) leave 3 intervals open on Chain 0 and 4 intervals on Chain 1. By linear interpolation, Curve 0 needs to be traversed at 3/4 of the speed Curve 1 is traversed resulting in the re-parameterizations:
New param. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Re-param. 0 | 3 | 4 | 5 | 6 | 7 | 7.75 | 8.5 | 9.25 | 10 | 11 |
IsAligned | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
Re-param. 1 | 1 | 2 | 2.5 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
The re-parameterized Curve 1 is the piecewise linear curve connecting the points:
\({\tilde{P}}^1_1=P^1_1\) | \({\tilde{P}}^1_2=P^1_2 \) | \({\tilde{P}}^1_3=P^1_{2.5}\) | \({\tilde{P}}^1_4=P^1_3\) | \(\dots \) | \({\tilde{P}}^1_{10}=P^1_9\) |
The re-parameterized Curve 0 connects the points:
\({\tilde{P}}^0_1=P^0_3\) | \({\tilde{P}}^0_2=P^0_4\) | \({\tilde{P}}^0_3=P^0_5\) | \({\tilde{P}}^0_4=P^0_6\) | \({\tilde{P}}^0_5=P^0_{7}\) |
\({\tilde{P}}^0_6=P^0_{7.75}\) | \({\tilde{P}}^0_{7}=P^0_{8.5}\) | \({\tilde{P}}^0_{8}=P^0_{9.25}\) | \({\tilde{P}}^0_{9}=P^0_{10}\) | \({\tilde{P}}^0_{10}=P^0_{11} \) |
The piecewise linear curve connecting the \({\tilde{P}}_i^0\)’s differs slightly form the original curve as e.g. a corner around \(P_8^0\) is cut when connecting \({\tilde{P}}^0_6=P^0_{7.75}\) to \({\tilde{P}}^0_{7}=P^0_{8.5}\). However, it is trivial to morph the one curve into the other and the thickness of the protein backbone guaranties that no self-intersections can occur when doing this. The structural alignment thus gives rise to studying the self-intersections of the linear interpolation
$$\begin{aligned} P(t,a)=(1-t)P^0_{{\text {RePar}}^0(a)}+tP^1_{{\text {RePar}}^1(a)}, \end{aligned}$$
for \(t\in [0,1]\) and \(a\in [1;m]\), where m is determined by the alignment On each line segment the function IsAligned(a) is given by linear interpolation of the values zero (for not aligned) and one (for aligned) at the endpoints of the line segments. In particular, it is one on line segments connecting aligned residues and zero when connecting non-aligned residues.
A self-intersection for parameter values \((a_k,b_k)\) gives the sum \({\text {AlignedSum}}(k)={\text {IsAligned}}(a_k)+{\text {IsAligned}}(b_k)\). We call the self-intersection aligned-aligned if \( {\text{AlignedSum}}(k)\; \ge \;1.5\), aligned-gap if \(1.5> {\text {AlignedSum}}(k) > 0.5\) and gap-gap if \( {\text{AlignedSum}}(k)\; \le \;0.5 \). The minimal allowed distance, \(d_{min}\), between points along the backbone curve used to define the overlap are given by linear interpolation of the discrete \(d_{min}\) values after the re-parametrization. E.g., \(d_{min}\) for two points 4.80 line segments apart along the backbone curve is set to \(3.51{\AA }\) found by interpolating \(d_{min}= 3.47{\AA }\) and \(3.52{\AA }\) for points being 4 and 5 line segments apart respectively.
A loop-contraction combined with other deformation between two structures likely lead to a pair of self-intersections of type \(\Omega _2\). If just one of these self-intersections involves the aligned parts and the other not, then the aligned-aligned self-intersection is probably essential when only considering aligned parts of the morph whereas it is removable if all self-intersections are considered. Beside the goal to tell if alignment gaps are topologically similar, this is a reason not to restrict the morph to the aligned parts.
Applying curve smoothening
We represent each protein structure by either the positions of its alpha carbon atoms \(C_i^\alpha \) or by a smoothened curve. In the smoothening all alpha carbons, except for the first two and the last two are replaced by the fixed convex combination \(P_i=\frac{C_{i-2}^\alpha +a C_{i-1}^\alpha +b C_{i}^\alpha +a C_{i+1}^\alpha +C_{i+2}^\alpha }{2+2a+b}\). The constants \(a=2.4\) and \(b=2.1\) are chosen to minimize the total curvature of a collection of protein structures [18]. All linear interpolations from an alpha carbon curve to its smoothened curve are self-intersection free for the protein structures used here. The smoothening straightens both alpha helices and beta-strands, see Fig. 6. The smoothened representation resembles typical cartoon pictures of protein structures and has the advantage often to make it possible to follow an interpolation visually, see Fig. 6. Furthermore, it results in fewer interpolation self-intersections. For the smoothened representation, the overlap is based on the minimal distances 1.0, 2.1, 3.0, 3.4, 3.6Å for residues with indices \(|i-j|=1,\dots ,5\) apart and 3.7Å otherwise. A self-intersection check may be avoided if the line-segments are shorter than 3.5Å and the sum of the endpoints overlap is \(<2.1\)Å.