FlexSnap: Flexible Non-sequential Protein Structure Alignment

Background: Proteins have evolved subject to energetic selection pressure for stability and flexibility. Structural similarity between proteins that have gone through conformational changes can be captured effectively if flexibility is considered. Topologically unrelated proteins that preserve secondary structure packing interactions can be detected if both flexibility and sequential permutations are considered. We propose the FlexSnap algorithm for flexible non-topological protein structural alignment.
Results: The effectiveness of FlexSnap is demonstrated by measuring the agreement of its alignments with manually curated non-sequential structural alignments. FlexSnap showed competitive results against state-of-the-art algorithms such as DALI, SARF2, MultiProt, FlexProt, and FATCAT. Moreover, on the DynDom dataset, FlexSnap reported longer alignments with smaller rmsd.
Conclusions: We have introduced FlexSnap, a greedy chaining algorithm that reports both sequential and non-sequential alignments and allows twists (hinges). We assessed the quality of the FlexSnap alignments by measuring their agreement with manually curated non-sequential alignments. On the FlexProt dataset, FlexSnap was competitive with state-of-the-art flexible alignment methods. Moreover, we demonstrated the benefits of introducing hinges by showing significant improvements in the alignments reported by FlexSnap for the structure pairs for which rigid alignment methods reported alignments with either low coverage or large rmsd.
Availability: An implementation of the FlexSnap algorithm will be made available online at http://www.cs.rpi.edu/~zaki/software/flexsnap.


Background
The wide spectrum of functions performed by proteins is enabled by their intrinsic flexibility [1]. It is known that proteins go through conformational changes to perform their functions. Homologous proteins have evolved to adopt conformational changes in their structure. Therefore, the similarity between two proteins with similar structures, one of which has undergone a conformational change, will not be captured unless flexibility is considered.
The problem of flexible protein structural alignment has not received much attention. Even though there is a plethora of methods for protein structure comparison [2][3][4][5][6][7][8], the majority of the existing methods report only sequential alignments and thus cannot capture non-sequential alignments. Non-sequential similarity can occur naturally due to circular permutations [9] or convergent evolution [10]. The case is even harder for flexible alignment, since only two methods, FlexProt [11] and FATCAT [12], report flexible alignments. Moreover, both methods are inherently limited to sequential flexible structural alignment because they employ sequential chaining techniques. The complexity of protein structural alignment depends on how the similarity is assessed. Kolodny and Linial [13] showed that the problem is NP-hard if the similarity score is distance matrix based. Therefore, over the years, a number of heuristic approaches have been proposed, which fall into two main categories: dynamic programming and clustering.
Dynamic Programming (DP) is a general paradigm to solve problems that exhibit the optimal substructure property [14]. DP-based methods, such as Structal [15] and SSAP [16], construct a scoring matrix S, where each entry, $S_{ij}$, corresponds to the score of matching the i-th residue in protein A and the j-th residue in protein B. Given a scoring scheme between residues in the two proteins, dynamic programming finds the global alignment that maximizes the score. DP-based methods suffer from two main limitations: first, the alignment is sequential and thus non-topological similarity cannot be detected, and second, it is difficult to design a scoring function that is globally optimal [13]. In fact, structure alignment does not have the optimal substructure property; therefore, DP-based methods can find only a suboptimal solution [17]. The other category of alignment methods, the clustering-based methods, such as DALI [2], SARF2 [4], CE [5], SCALI [7], and FATCAT [12], seek to assemble the alignment out of smaller compatible (similar) element pairs such that the score of the alignment is as high as possible [18]. Two compatible element pairs are consistent (can be assembled together) if the substructures obtained by elements of the pairs are similar. The clustering problem is NP-hard [19], thus several heuristics have been proposed. The approaches differ in how the set of compatible element pairs is constructed and how the consistency is measured. Both SARF2 and SCALI produce non-sequential alignments.
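The DP formulation described above can be illustrated with a minimal sketch. It assumes a precomputed residue-residue scoring matrix and a constant gap penalty; the function name `dp_align` and the gap value are illustrative assumptions and do not reproduce the scoring of Structal or SSAP.

```python
def dp_align(S, gap=-1.0):
    """Global sequential alignment that maximizes the summed scores S[i][j].

    S[i][j] is the score of matching residue i of A with residue j of B.
    The constant gap penalty is an assumption made for this sketch.
    """
    n, m = len(S), len(S[0])
    # best[i][j]: best score for the first i residues of A and first j of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + S[i - 1][j - 1],
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
    # Trace back to recover the matched residue pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + S[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

Because every matched pair must follow the previous one in both proteins, the returned pairs are always monotonically increasing, which is why a formulation of this kind cannot report a circular permutation.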
The two main flexible alignment methods, FlexProt [11] and FATCAT [12], work by clustering (chaining) aligned fragment pairs (AFPs) and allowing flexibility while chaining, by introducing hinges (twists). FlexProt searches for the longest set of AFPs that allow different numbers of hinges. It then reports different alignments with different numbers of hinges. The FATCAT method works by chaining AFPs using dynamic programming. The score of an alignment ending with a given AFP is computed as the maximum score of connecting the AFP with any of the alignments that end before the AFP. A penalty is applied to the score to compensate for gaps, root mean squared deviation (rmsd), and hinges. A third method that can handle flexible alignments is HingeProt [20]. HingeProt first partitions one of the two proteins into rigid parts using a Gaussian-Network-Model-based (GNM) approach and then aligns each rigid region with the other protein using the MultiProt [6] method. HingeProt uses the MultiProt algorithm in the sequential mode and thus does not report flexible non-sequential alignments. Moreover, the accuracy of the HingeProt approach depends on the accuracy of identifying the rigid domains, which is a hard problem: the best known method, HingeMaster [21], has a sensitivity of only 50%.
To address the limitations of existing algorithms, we propose FlexSnap, a greedy algorithm for flexible sequential and non-sequential protein structural alignment (the name of the algorithm is a non-sequential permutation of the bold letters in Flexible non-Sequential protein alignment). The algorithm assembles the alignment from the set of AFPs and allows non-sequential alignments and hinges. We demonstrate the effectiveness of FlexSnap by evaluating the agreement of its alignments with manually curated non-sequential alignments. Moreover, FlexSnap shows competitive results on the FlexProt dataset when compared to the main flexible alignment methods, FlexProt and FATCAT.

Methods
The main idea of the FlexSnap approach is to assemble the alignment from short well-aligned fragment pairs, which are called AFPs. As we assemble the alignment by adding AFPs, we introduce hinges when necessary. Figure 1 shows how the alignment is constructed from smaller aligned fragment pairs. When chaining a fragment pair to the alignment, we choose the fragment that has the highest score when joined with the last rigid region in the alignment. The score rewards longer alignments with small rmsd and penalizes large rmsd, gaps, and the introduction of hinges. In the next subsections, we provide a detailed discussion of the FlexSnap algorithm.

AFPs Extraction
Let $A = \{A_1, A_2, \ldots, A_n\}$ and $B = \{B_1, B_2, \ldots, B_m\}$ be two proteins with n and m residues, respectively, where $A_i \in \mathbb{R}^{3 \times 1}$ (similarly $B_i$) represents the 3D coordinates of the Cα atom of the i-th residue in protein A. The first step in FlexSnap is to generate a list of aligned fragment pairs (AFPs). Each AFP, $(i, j, l)$, is a fragment that starts at the i-th residue in A and the j-th residue in B and has a length of l residues. An AFP is formally represented as a set of l equivalenced pairs between the two proteins, given as:
$$(i, j, l) = \{(A_i, B_j), (A_{i+1}, B_{j+1}), \ldots, (A_{i+l-1}, B_{j+l-1})\}$$
where $(A_i, B_j)$ indicates that the i-th residue of protein A is paired with the j-th residue of protein B, and l is the AFP's length. Each AFP must satisfy a user-defined similarity constraint. In FlexSnap, we employ the root mean square deviation as the similarity measure, i.e., $\mathrm{rmsd}(i, j, l) \le \varepsilon$. Moreover, we require that the length of the AFP be at least L, i.e., $3 \le L \le l$. Furthermore, we define $b_k^A$ and $e_k^A$ (similarly $b_k^B$ and $e_k^B$) to be the beginning and end of AFP k along the backbone of protein A (similarly protein B). For example, for a triplet AFP $k = (i, j, l)$ and protein A, $b_k^A = i$ and $e_k^A = i + l - 1$.
Figure 1 (partial caption): [...] is not able to align the blue fragment, but a flexible alignment (bottom right) can do this easily by introducing a hinge between the rigid block (the black and green fragments) and the blue fragment. As we assemble the alignment from well-aligned pairs, we introduce hinges to get a longer alignment and smaller rmsd.
The number of possible AFPs can be as large as $O(n^3)$. The set of all AFPs can be obtained by iterating over all the triplets $(i, j, l)$, and for each triplet checking if $\mathrm{rmsd}(i, j, l) \le \varepsilon$. The rmsd of a fragment of length l can be obtained in $O(l)$ time [22]. A naive implementation that iterates over all the triplets $(i, j, l)$ to obtain the set of all the AFPs would therefore have $O(n^4)$ time complexity. However, by observing that the rmsd of the AFP $(i, j, l + 1)$ can be computed incrementally from the rmsd of the AFP $(i, j, l)$ in constant time, the set of AFPs can be obtained in $O(n^3)$ time [11].
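The main idea behind the incremental computation is to simplify the rmsd formula. Given two sets, A and B, of N points each, the root mean square deviation (rmsd) is calculated as [23]:
$$\mathrm{rmsd}(A, B) = \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N}\left(\|A'_i\|^2 + \|B'_i\|^2\right) - 2\,(d_1 + d_2 + d_3)\right)} \qquad (1)$$
where A' and B' denote the points after recentering, i.e., $A'_i = A_i - \bar{A}$ with $\bar{A} = \frac{1}{N}\sum_{k=1}^{N} A_k$ (similarly for B'), and the $d_i$'s are the singular values of $C = A'B'^{T}$, which is a 3 × 3 covariance matrix given as:
$$C = \sum_{i=1}^{N} A'_i\,{B'_i}^{T} \qquad (2)$$
In the rare cases when the determinant of C is negative, $d_3$ is replaced by $-d_3$. Equation (1) can be simplified as:
$$\mathrm{rmsd}(A, B) = \sqrt{\frac{1}{N}\left(\sum_{i=1}^{N}\|A_i\|^2 + \sum_{i=1}^{N}\|B_i\|^2 - N\left(\|\bar{A}\|^2 + \|\bar{B}\|^2\right) - 2\,(d_1 + d_2 + d_3)\right)} \qquad (3)$$
where the covariance matrix can likewise be written in terms of the uncentered points as $C = \sum_{i=1}^{N} A_i B_i^{T} - N\,\bar{A}\,\bar{B}^{T}$. It is clear that all the terms used in equation (3) can be updated in constant time, and thus computing the rmsd for N + 1 points requires constant time if we have all the terms evaluated for the first N points. Therefore, computing the rmsd of the AFP $(i, j, l)$ for all values of l (for a given i and j) requires only $O(n)$ time. Thus, the total time complexity for the seeds extraction step is $O(n^3)$ ...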
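The incremental update can be sketched as follows. This is an illustration under stated assumptions rather than the FlexSnap implementation: the names `RunningRmsd`, `afps_from`, and `singular_values_3x3` are hypothetical, residues are 0-indexed, and the closed-form eigenvalue routine stands in for whatever linear-algebra library a real implementation would use.

```python
import math

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def singular_values_3x3(C):
    """Singular values of C, from the eigenvalues of the symmetric matrix C^T C."""
    M = [[sum(C[k][r] * C[k][c] for k in range(3)) for c in range(3)]
         for r in range(3)]
    p1 = M[0][1] ** 2 + M[0][2] ** 2 + M[1][2] ** 2
    if p1 == 0.0:                      # M is already diagonal
        eigs = [M[0][0], M[1][1], M[2][2]]
    else:                              # trigonometric closed form
        q = (M[0][0] + M[1][1] + M[2][2]) / 3.0
        p2 = sum((M[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
        p = math.sqrt(p2 / 6.0)
        B = [[(M[r][c] - (q if r == c else 0.0)) / p for c in range(3)]
             for r in range(3)]
        r = max(-1.0, min(1.0, det3(B) / 2.0))
        phi = math.acos(r) / 3.0
        e1 = q + 2.0 * p * math.cos(phi)
        e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
        eigs = [e1, 3.0 * q - e1 - e3, e3]
    return sorted((math.sqrt(max(e, 0.0)) for e in eigs), reverse=True)

class RunningRmsd:
    """Keeps the sums needed for the rmsd; adding a pair costs O(1)."""
    def __init__(self):
        self.n = 0
        self.sum_a = [0.0] * 3
        self.sum_b = [0.0] * 3
        self.sum_sq = 0.0                               # sum |A_i|^2 + |B_i|^2
        self.sum_ab = [[0.0] * 3 for _ in range(3)]     # sum A_i B_i^T

    def add(self, a, b):
        self.n += 1
        for r in range(3):
            self.sum_a[r] += a[r]
            self.sum_b[r] += b[r]
            self.sum_sq += a[r] * a[r] + b[r] * b[r]
            for c in range(3):
                self.sum_ab[r][c] += a[r] * b[c]

    def rmsd(self):
        n = self.n
        mean_a = [s / n for s in self.sum_a]
        mean_b = [s / n for s in self.sum_b]
        e0 = self.sum_sq - n * sum(x * x for x in mean_a + mean_b)
        C = [[self.sum_ab[r][c] - n * mean_a[r] * mean_b[c] for c in range(3)]
             for r in range(3)]
        d1, d2, d3 = singular_values_3x3(C)
        if det3(C) < 0:                # reflection case: flip the smallest value
            d3 = -d3
        return math.sqrt(max(e0 - 2.0 * (d1 + d2 + d3), 0.0) / n)

def afps_from(A, B, i, j, eps, min_len):
    """All AFPs (i, j, l) with l >= min_len and rmsd <= eps, for fixed i and j."""
    acc, found = RunningRmsd(), []
    for l in range(1, min(len(A) - i, len(B) - j) + 1):
        acc.add(A[i + l - 1], B[j + l - 1])
        if l >= min_len and acc.rmsd() <= eps:
            found.append((i, j, l))
    return found
```

Calling `afps_from` for every start pair (i, j) gives the cubic-time extraction described above, since each extension by one residue pair touches only a fixed number of running sums and a 3 × 3 matrix.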

Flexible Chaining
The second step in FlexSnap is to construct the alignment by selecting a subset of the AFPs. Given a set of AFPs, P, obtained in the AFPs extraction step, we are interested in finding a subset of AFPs, $R \subseteq P$, such that all the AFPs in R are mutually non-overlapping and the score of the selected AFPs in R is as large as possible. On the one hand, we want to get as large an alignment as possible, while on the other hand, we want to minimize the number of hinges and gaps. Therefore, our goal is to optimize a score that rewards long alignments with small rmsd, and penalizes the introduction of hinges and gaps.
The set of AFPs can be thought of as runs in an n × m matrix S, where n and m are the sizes of proteins A and B, respectively (see Figure 2). We define a precedence relation, $\prec$, between two AFPs such that $P_i \prec P_j$ if $P_i$ appears either in the upper or lower left quadrant of $P_j$, i.e., $e_i^A < b_j^A$ and $e_i^B < b_j^B$, or $e_i^A < b_j^A$ and $b_i^B > e_j^B$ (recall that $b_i^A$ and $e_i^A$ denote the beginning and end, respectively, of AFP $P_i$ in protein A). Generally speaking, we say that two AFPs, $P_i$ and $P_j$, can be chained if they do not overlap, i.e., $P_i \prec P_j$ or $P_j \prec P_i$. As depicted in Figure 2, $P_7$ and $P_8$ can be chained to $P_1$. For sequential chaining, we define a sequential precedence relation, $\prec_s$, such that $P_i$ precedes $P_j$ (written as $P_i \prec_s P_j$) if $P_i$ appears strictly in the upper left quadrant with respect to $P_j$, i.e., $e_i^A < b_j^A$ and $e_i^B < b_j^B$. Two AFPs $P_i$ and $P_j$ can be sequentially chained together if $P_i \prec_s P_j$ or $P_j \prec_s P_i$. In Figure 2, $P_7$ and $P_2$ can be sequentially chained to $P_1$.
Figure 2: Flexible structural alignment by AFP chaining. When extending the alignment $R = \{P_1, P_2, P_3\}$, the score of extending R with each AFP is computed and we extend the alignment with the AFP that gives the best score. The score $S(P_4, (P_2, P_3))$ indicates the score of adding $P_4$ to the region composed of $P_2$ and $P_3$.
An AFP, $P_i$, can be chained to an alignment R, denoted as $(R \,\mathrm{T}\, P_i)$, if it does not overlap with any AFP in R. In Figure 2, $P_7$, $P_4$, and $P_5$ can be sequentially chained to R, which consists of the AFPs $\{P_1, P_2, P_3\}$; and both $P_6$ and $P_8$ can be non-sequentially chained to R. Next, we shall introduce our solution for the general flexible chaining problem.
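The precedence and chaining tests can be written down compactly. In this sketch the `AFP` class and the function names are hypothetical, residues are 0-indexed, and "left" is taken to mean earlier along protein A, which is an assumption about how the matrix in Figure 2 is drawn; the symmetric test `can_chain` does not depend on that choice.

```python
from typing import NamedTuple

class AFP(NamedTuple):
    i: int  # start residue in protein A
    j: int  # start residue in protein B
    l: int  # length in residues

    @property
    def end_a(self):
        return self.i + self.l - 1

    @property
    def end_b(self):
        return self.j + self.l - 1

def precedes_seq(p, q):
    """p lies strictly in the upper-left quadrant of q."""
    return p.end_a < q.i and p.end_b < q.j

def precedes(p, q):
    """p lies in the upper-left or lower-left quadrant of q (assumed axes)."""
    return p.end_a < q.i and (p.end_b < q.j or p.j > q.end_b)

def can_chain(p, q):
    """True when the two AFPs overlap in neither protein."""
    return precedes(p, q) or precedes(q, p)

def can_chain_seq(p, q):
    return precedes_seq(p, q) or precedes_seq(q, p)

def chainable_to(R, p):
    """An AFP can join alignment R if it overlaps no AFP already in R."""
    return all(can_chain(r, p) for r in R)
```

For example, `AFP(10, 10, 5)` and `AFP(20, 5, 4)` can be chained non-sequentially but not sequentially, because the second fragment comes later in A and earlier in B.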

The FlexSnap Approach
The goal of chaining is to find the highest scoring subset of AFPs, i.e., $R \subseteq P$, such that all the AFPs in R are mutually consistent and non-overlapping. The problem of finding the highest scoring subset of AFPs is essentially the same as finding the maximum weighted clique in a graph $G = (V, E, w)$, where the set of vertices V represents the set of AFPs and each vertex $v_i$ has a weight equal to the score of the AFP, $w(v_i) = S(P_i)$, where the score of an AFP $P_i$, $S(P_i)$, could be its length or some other combination of length and rmsd.
There is an edge $(v_i, v_j) \in E$ if the AFPs $P_i$ and $P_j$ do not overlap and are consistent (can be joined with small rmsd or have similar rotation matrices). The problem of finding the maximum weighted clique in a graph is computationally expensive; it is NP-hard [19]. Thus, we propose a greedy algorithm to find an approximate solution for the chaining problem. The main idea is to start building the alignment from an initial AFP and to add AFPs to the alignment. We start the alignment by selecting the longest AFP, then we iteratively add new AFPs to the alignment as long as the newly added AFP improves the score of the alignment. Given an alignment, R, we add to it the AFP that contributes most. We keep growing the alignment until no more AFPs can be added. The contribution of an AFP to the alignment is scored by how consistent the AFP is with the alignment and how good the AFP is. When adding an AFP to an alignment, we reward longer AFPs with smaller rmsd, and we penalize gaps, inconsistency, and hinges. The penalty takes into consideration: 1) the number of gaps introduced; 2) the increase in rmsd when combining two or more AFPs; 3) the introduction of new hinges.
As depicted in Figure 2, the scores of extending the alignment, R, with $P_4$, $P_5$, $P_6$, $P_7$, or $P_8$ are computed and the AFP with the best score is added to the alignment. When measuring the score of adding an AFP to the alignment, we actually measure the score of adding the AFP to the last rigid region, and not just to the last fragment, in the alignment. In Figure 2, the score of adding $P_4$ to R is the score of adding $P_4$ to the region composed of $P_2$ and $P_3$, since $P_2$ and $P_3$ together form a rigid sub-alignment (there is no hinge between them). When adding $P_7$ to R, the score of adding $P_7$ to the region composed only of $P_1$ is computed. Figure 3 shows the pseudo-code for the greedy chaining algorithm used in FlexSnap. Since the chaining is a greedy algorithm, we run the algorithm K times starting from the K highest scoring non-overlapping AFPs and we report the alignment with the best score.
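The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the pseudo-code of Figure 3: AFPs are plain `(i, j, l)` tuples, the function names are hypothetical, seeds are ranked by length as a stand-in for their score, and the caller supplies `extension_score(R, p)`, which is assumed to encapsulate the rigid-region and hinge bookkeeping.

```python
def overlaps(p, q):
    """True if AFPs p and q share residues in protein A or in protein B."""
    in_a = p[0] <= q[0] + q[2] - 1 and q[0] <= p[0] + p[2] - 1
    in_b = p[1] <= q[1] + q[2] - 1 and q[1] <= p[1] + p[2] - 1
    return in_a or in_b

def greedy_chain(afps, seed, extension_score):
    """Grow an alignment from `seed`, always adding the best-scoring AFP."""
    R = [seed]
    candidates = [p for p in afps if not overlaps(p, seed)]
    while candidates:
        best = max(candidates, key=lambda p: extension_score(R, p))
        if extension_score(R, best) < 0:
            break                       # no remaining AFP improves the score
        R.append(best)
        candidates = [p for p in candidates if not overlaps(p, best)]
    return R

def best_of_k(afps, k, extension_score, total_score):
    """Restart from the k longest mutually non-overlapping AFPs; keep the best."""
    seeds = []
    for p in sorted(afps, key=lambda p: p[2], reverse=True):
        if all(not overlaps(p, s) for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            break
    chains = [greedy_chain(afps, s, extension_score) for s in seeds]
    return max(chains, key=total_score)
```

Restarting from several seeds is a cheap guard against a poor first choice, since a greedy chain can never undo an AFP once it has been added.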

Alignment Extension Score
Next, we discuss how we extend a partial alignment with the next best AFP. More specifically, given an alignment R, the next AFP to chain to the alignment is the one that maximizes the following scoring function:
$$P^{*} = \arg\max_{P_i \in P,\; R \,\mathrm{T}\, P_i} S(R, P_i)$$
where $R \,\mathrm{T}\, P_i$ indicates that $P_i$ does not overlap with R, and $S(R, P_i)$ is the score of chaining $P_i$ to R. The score, $S(R, P_i)$, is a combination of the weight of the AFP, $W(P_i)$, and the penalty of extending R with $P_i$, $C(R \,\mathrm{T}\, P_i)$. The score is defined as follows:
$$S(R, P_i) = W(P_i) + C(R \,\mathrm{T}\, P_i)$$
where $C(R \,\mathrm{T}\, P_i)$ is the penalty incurred when connecting $P_i$ to R, and $W(P_i)$ is the score of the AFP itself. The scoring function rewards longer AFPs with small rmsd and penalizes gaps and hinges. If the addition of an AFP $P_i$ to the alignment results in a large rmsd, then we introduce a hinge only if $W(P_i)$ is large enough to compensate for the penalty incurred. A similar approach for penalizing gaps and hinges was used in the FATCAT method [12], although their score and cost functions are different, and they do not consider rigid regions as we do in FlexSnap when connecting an AFP to the alignment. The penalty of connecting $P_i$ to R is defined in terms of $M_g$, the penalty for a gap, $M_r$, the maximum penalty for a hinge, and the connection rmsd, i.e., the rmsd of connecting $P_i$ to the last rigid region in R. If the connection rmsd increases above a user-defined threshold, $D_c$, we introduce a hinge and the penalty is maximum; if not, the penalty is proportional to how far the rmsd value is from ε (the maximum rmsd for an AFP). Moreover, we allow only a maximum number of H hinges. The score for an AFP is a function of its length and rmsd: it is the length of the AFP, $L(P_i)$, plus a contribution of the rmsd of the AFP, $\mathrm{rmsd}(P_i)$.
Figure 3: A greedy AFP chaining algorithm. The algorithm iteratively chooses an AFP to add to R (lines 4-7) until no more AFPs can be added, or the best score of adding an AFP to R is negative.
The complexity of the chaining algorithm depends on the number of AFPs, M, that the two structures have. In the worst case, M could be close to $n^3$, but in practice it is much less, i.e., $M \le n^2$. The complexity of the algorithm is $O(M \log M + k \cdot M \cdot n)$, where k is the number of AFPs in the final solution and n is the size of the larger protein.
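To make the shape of this score concrete, the sketch below uses assumed functional forms. The text only states that the AFP weight is its length plus an rmsd contribution, that the hinge penalty is maximal above $D_c$ and otherwise proportional to the distance from ε, and that gaps are penalized; the linear forms, the role given to α, and all function names here are illustrative assumptions, with default values taken from the parameters reported in the experiments.

```python
def afp_weight(length, rmsd, eps=2.0, alpha=0.5):
    # Assumed form: length plus an alpha-weighted bonus for staying below eps.
    return length + alpha * (eps - rmsd)

def connection_penalty(gaps, conn_rmsd, eps=2.0, d_c=3.0, m_g=-1.0, m_r=-10.0):
    """Return (penalty, hinge_introduced) for joining an AFP to a rigid region.

    Assumed form: a per-gap penalty m_g, plus the full hinge penalty m_r when
    the connection rmsd exceeds d_c, or a linearly scaled share of m_r when
    the connection rmsd lies between eps and d_c.
    """
    gap_penalty = m_g * gaps
    if conn_rmsd > d_c:
        return gap_penalty + m_r, True
    fraction = max(0.0, conn_rmsd - eps) / (d_c - eps)
    return gap_penalty + m_r * fraction, False

def extension_score(length, rmsd, gaps, conn_rmsd, hinges_so_far, max_hinges=3):
    """Score of adding one AFP; -inf if it would need a hinge beyond the limit."""
    penalty, hinge = connection_penalty(gaps, conn_rmsd)
    if hinge and hinges_so_far >= max_hinges:
        return float("-inf")
    return afp_weight(length, rmsd) + penalty
```

Under these assumptions a hinge costs 10 score units, so a hinge is accepted only when the AFP it admits is long enough to outweigh that cost, which matches the behaviour described in the text.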

Sequential Flexible Chaining
The above general chaining algorithm reports both sequential and non-sequential alignments. In the results section, we demonstrate that the quality of its non-sequential alignments is competitive with state-of-the-art non-sequential alignment methods. However, for sequential flexible alignment, there are more efficient chaining algorithms, namely the approach proposed by the FATCAT algorithm. The FATCAT algorithm follows a dynamic programming approach for chaining the AFPs. In FATCAT, the score of an alignment ending with an AFP, $P_i$, is defined in terms of the scores of the $P_j$'s and the connection cost of $P_i$ with these $P_j$'s such that $P_j$ precedes $P_i$ ($P_j \prec_s P_i$). More specifically, FATCAT defines the score of the alignment that ends with $P_i$ as follows:
$$S(P_i) = W(P_i) + \max\left\{\max_{P_j \prec_s P_i}\left[S(P_j) + C(P_j \,\mathrm{T}\, P_i)\right],\; 0\right\}$$
where $C(P_j \,\mathrm{T}\, P_i)$ is the penalty incurred when connecting $P_i$ to the alignment that ends with $P_j$, which is similar to the penalty function used in the general chaining, and $W(P_i)$ is the score of the AFP itself.
Figure 4: A greedy sequential AFP chaining algorithm. When encountering the beginning of an AFP, the algorithm computes the scores of adding the AFP to the alignments in the upper left corner and the AFP is chained to the alignment with which it gives the highest score.
We propose an approach that is similar in spirit to FATCAT; however, it differs in two important aspects. The first aspect is the optimality of the alignment reported by FATCAT. The main issue here is that the scoring function has an rmsd term, since $W(P_i)$ is a function of the length of $P_i$ and its rmsd. Therefore, $S(P_i)$ cannot be optimal because we do not know of a scoring function that involves the rmsd value that is additive and optimal (the rmsd score is not a metric since it does not satisfy the triangle inequality). Consequently, the optimality of FATCAT alignments is not guaranteed, since the optimal substructure property required by dynamic programming does not hold if the score incorporates an rmsd term. In Figure 4, let the optimal alignment be $\{P_1, P_4, P_5, P_6\}$; the optimal substructure property requires that $\{P_1, P_4, P_5\}$ is also optimal, i.e., that it is the best alignment that ends with $P_5$. This is not necessarily true in structural alignment, because it could happen that the alignment $\{P_1, P_2, P_3, P_5\}$ is better than $\{P_1, P_4, P_5\}$. In general, flexible structural alignment does not exhibit the optimal substructure that would justify the use of dynamic programming. In FlexSnap, we follow an approach similar to the one presented in [24] for chaining substrings. In the original algorithm, once we reach the end of a substring (segment), $P_i$, we delete all the solutions that end with $P_j$'s whose ends are lower and to the left of the endpoint of $P_i$ and for which $S(P_j) < S(P_i)$. For the segments shown in Figure 4, let $S(P_3) > S(P_4)$; once we encounter the end of $P_3$, we should delete the solution that ends with $P_4$. When we encounter $P_5$, we know that the best solution it can be chained to ends with $P_3$. This approach works well for regular chaining problems (like strings). However, for the structural alignment problem, this approach does not yield the optimal alignment since the problem does not exhibit the optimal substructure property.
Therefore, in FlexSnap, once we reach the end of an AFP, P i , we do not delete all solutions that end with P j 's which are to the left and below P i ; instead we only delete the ones that have very low scores as compared to S(P i ). Though not optimal, this approach gave better results for sequential chaining than the pure greedy approach presented in the previous section.
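For reference, the plain dynamic-programming recurrence described for FATCAT-style sequential chaining can be sketched as below. It is a generic illustration that assumes a non-empty list of `(i, j, l)` AFPs and caller-supplied `weight` and `connect` functions; it does not implement FlexSnap's soft pruning of low-scoring partial solutions or its rigid-region connection cost, and the function name is hypothetical.

```python
def chain_sequential(afps, weight, connect):
    """Best sequential chain under S(k) = W(k) + max(max_m [S(m) + C(m, k)], 0).

    weight(p) returns the score of AFP p; connect(p, q) returns the (negative)
    cost of appending q after p. Both are supplied by the caller.
    """
    # Process AFPs by their end in protein A, so predecessors are scored first.
    order = sorted(range(len(afps)), key=lambda k: afps[k][0] + afps[k][2])
    score, prev = {}, {}
    for k in order:
        i, j, _ = afps[k]
        best, arg = 0.0, None
        for m in score:
            mi, mj, ml = afps[m]
            # m must end before k begins in both proteins
            if mi + ml - 1 < i and mj + ml - 1 < j:
                cand = score[m] + connect(afps[m], afps[k])
                if cand > best:
                    best, arg = cand, m
        score[k] = weight(afps[k]) + best
        prev[k] = arg
    # Trace back from the best-scoring end point
    k = max(score, key=score.get)
    top = score[k]
    chain = []
    while k is not None:
        chain.append(afps[k])
        k = prev[k]
    return chain[::-1], top
```

Because each AFP remembers only its single best predecessor, this recurrence is exactly where the optimal substructure assumption enters; keeping several competitive partial solutions alive, as FlexSnap does, relaxes that assumption at the cost of extra work.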
The second aspect in which FlexSnap differs from FATCAT is that in FATCAT $C(P_j \,\mathrm{T}\, P_i)$ is the connection cost of $P_i$ and $P_j$, while in FlexSnap it is the connection cost of $P_i$ to the rigid region that contains $P_j$. In FATCAT, if $P_j$ belongs to a rigid region and the connection cost of $P_i$ with $P_j$ is small, $P_i$ will be added to the same rigid region as $P_j$ even though $P_i$ might not be consistent with other AFPs in the same region. In Figure 4, if we were connecting $P_5$ to $R_2$, which ends with $P_4$, FATCAT would compute the connection cost $C(P_5, P_4)$ but FlexSnap would compute $C((P_1, P_4) \,\mathrm{T}\, P_5)$ since $P_4$ belongs to the rigid region that contains $P_1$. In FATCAT, when connecting $P_5$ to $R_2$, we might conclude that there is no need to introduce a hinge and thus that $P_5$ belongs to the same rigid region as $P_1$ and $P_4$. This may lead to a large rmsd when we report the alignment, since we did not check whether $P_5$ is consistent with $P_1$. However, when FlexSnap adds $P_5$ to the same rigid region as $P_4$ and $P_1$, it will not harm the final rmsd when we report the alignment, as FlexSnap ensures that all the segments in the same rigid region are consistent. In the results section, we investigate how computing the connection cost with the whole rigid region, as opposed to the last segment in the rigid region, affects the quality of the alignment. For some structure pairs, considering the whole rigid region in computing the connection cost resulted in significant improvements.

Results and Discussion
To assess the quality of FlexSnap alignments compared to other structural alignment methods, we evaluated the agreement of the methods' alignments with manually curated reference alignments. We compared FlexSnap against sequential methods (DALI [2] and CE [5]), non-sequential methods (SARF2 [4], MultiProt [6], and SCALI [7]), and flexible sequential alignment methods (FlexProt [11] and FATCAT [12]). Finally, we analyzed the flexibility on the DynDom dataset [25], which is a comprehensive and non-redundant dataset of protein domain movements.
All the experiments were run on a 1.66 GHz Intel Core Duo machine with 1 GB of main memory running Ubuntu Linux. The chaining algorithm is efficient and its running time varies from 1 second to a minute depending on the size of the proteins. We used the corresponding web server for most of the other alignment methods. The optimal values for the different parameters were found empirically such that they give the best agreement with manually curated alignments; we used $L = 8$, $\varepsilon = 2$ Å, $D_c = 3$ Å, $\alpha = 0.5$, $M_r = -10$, $M_g = -1$, and $H = 3$ (see Figure 3).

Non-Sequential Alignments
We used the reference alignments for the structure pairs that have a circular permutation in the RIPC dataset [26]. The RIPC set contains 40 structurally related protein pairs that are challenging to align because they have indels, repetitions, circular permutations, and show conformational flexibility [26]. There are 10 pairs in the RIPC dataset that have a circular permutation. Since these structure pairs have non-sequential alignments, to be fair, we only compare with algorithms that can handle non-sequentiality. However, we report the average agreement for some sequential methods as well. The agreement of a given alignment, S, with the reference alignment, R, is defined as the percentage of the residue pairs in the alignment that are aligned identically to the reference alignment ($I_S$) relative to the reference alignment's length ($L_R$), i.e., $A(S, R) = (I_S / L_R) \times 100$.
Table 1: Three values are reported for each alignment: its length, its rmsd, and A, which is its agreement with the reference alignment in the RIPC dataset.
http://www.almob.org/content/5/1/12
Table 1 shows the agreements of four different methods with the reference alignments in the RIPC dataset. The results show that FlexSnap is competitive with state-of-the-art methods for non-sequential alignment. In fact, it has the highest average agreement (79%) among the methods shown. The average agreement of most of the sequential alignment methods we compared with was drastically lower: DALI [2] (40%), CE [5] (36%), FATCAT [12] (28%), and LGA [27] (38%). FlexSnap alignments have 100 percent agreement on four structure pairs. One such pair is the alignment of NK-lysin (1nkl, 78 residues) with prophytepsin (1qdm, chain A, 77 residues). On this pair, all the sequential alignment methods (CE, DALI, FATCAT, and LGA) returned zero agreement. For the non-sequential ones, SARF2 returned 92%, MultiProt 68%, and SCALI 69%. The reference alignment had 72 aligned pairs.
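The agreement measure defined above is simple to compute once both alignments are given as residue pairs. In this sketch the function name and the representation of an alignment as a list of `(residue in A, residue in B)` index pairs are assumptions made for illustration.

```python
def agreement(alignment, reference):
    """A(S, R): percentage of reference pairs reproduced exactly by the alignment."""
    ref = set(reference)
    identical = sum(1 for pair in set(alignment) if pair in ref)
    return 100.0 * identical / len(ref)
```

For instance, an alignment that reproduces two of four reference pairs and misplaces a third scores 50, regardless of how many extra pairs it reports, because the measure is normalized by the reference length.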
As shown in Figure 5, the sequential alignment methods (only DALI and FATCAT shown) have their alignment paths along the diagonal and do not agree with the reference alignment (shown as circles).

Sequential Flexible Alignments
Table 2 shows the alignments of different methods on the FlexProt dataset [11], which is obtained from the database of macromolecular motions [28]. We have implemented two versions of FlexSnap. In the first, FlexSnap$_F$, $C(P_j \,\mathrm{T}\, P_i)$ is the cost of connecting $P_i$ with the rigid region to which $P_j$ belongs. In the second version, $C(P_j \,\mathrm{T}\, P_i)$ is the connection cost of $P_i$ with only $P_j$. We observed that considering the entire rigid region, as in FlexSnap$_F$, gives much better alignments, i.e., they have lower rmsd and fewer hinges. Moreover, FlexSnap$_F$ gives results comparable to the FATCAT method. In a few cases, it got slightly shorter alignments with

Flexibility in the DynDom Dataset
The DynDom dataset [25] is a comprehensive and non-redundant dataset of protein domain movements; it has been compiled by an exhaustive analysis of protein domain movements on all available protein structures using the DynDom program [29]. The protein conformations are first grouped into families based on sequence similarity, resulting in 1825 families with an average of 11.5 family members. Then a clustering procedure is applied to members of the same family to remove dynamic redundancy (same motion), and finally the DynDom program is run to analyze domain movements in each family. There are currently 2035 representative pairs belonging to 1578 families in the DynDom dataset. Since these representative pairs involve domain movements, rigid alignment methods would not be able to align these pairs effectively, while flexible alignment methods will be able to introduce hinges and align the pairs more effectively. We define the coverage of the alignment as the percentage of the number of residues in the alignment to the length of the smaller protein, i.e., coverage = (number of aligned residues / min(n, m)) × 100.
Table 3: Two values are reported for the alignments of each method: average coverage for the method (aC in %), and average rmsd (aR in Å). For FlexSnap and FlexProt, we also report the average number of hinges introduced (aH). FlexSnap$_R$ is FlexSnap in rigid mode with the maximum number of allowed hinges set to zero.
Table 3 shows the average coverage, rmsd, and hinges reported by different methods on the DynDom dataset. For the same structure pair, FlexProt reports different solutions with different numbers of hinges, ranging from 0 to 5. For the sake of fair comparison, we choose the FlexProt alignment with the same number of hinges as the solution reported by FlexSnap. Moreover, we also ran FlexSnap in rigid mode (FlexSnap$_R$), with the number of allowed hinges set to 0, to investigate how it compares to rigid alignment methods. DALI has the highest coverage, followed by FlexSnap. However, the average rmsd of FlexSnap alignments is much smaller than the average rmsd of DALI alignments. On average, FlexSnap introduced 0.59 hinges in the alignments. By introducing flexibility in the alignments, FlexSnap reported alignments with significantly smaller rmsd while maintaining high alignment coverage. Also, when run in the rigid mode, FlexSnap$_R$ is competitive with state-of-the-art methods like DALI, Structal, and MultiProt.
Table caption: Two values are reported for the alignments of each method: average coverage (aC in %), and average rmsd (aR in Å). For FlexSnap and FlexProt, we also report the average number of hinges introduced (aH).

DynDom Pairs with Low Coverage
Rigid alignment methods try to optimize a score that is usually dependent on the length and rmsd of the alignment. Therefore, they might prefer a shorter alignment with small rmsd over a longer alignment with significantly larger rmsd. In some cases, such as when there is a movement in one of the proteins, they have no choice but to report a shorter alignment with an acceptable rmsd value. We analyze the alignments of structure pairs for which rigid alignment methods returned short alignments as compared to the length of the smaller protein. We ran three different rigid alignment methods, DALI, Structal, and MultiProt, and collected the pairs for which any of the methods returned a coverage less than or equal to 60%. The list has 30 pairs for DALI, 282 for Structal, and 164 for MultiProt. An example of a rigid alignment with low coverage is shown in Figure 6. For this DynDom pair, Structal reported an alignment of 52 residues with rmsd 0.40 Å; MultiProt's alignment had 54 residues with rmsd 0.52 Å. Table 4 shows the average coverage, rmsd, and hinges reported by FlexSnap on these structure pairs. For fair comparison, we choose the FlexProt alignment with the same number of hinges as the FlexSnap solution. FlexSnap significantly improves the coverage of the alignments of these hard pairs. Moreover, it does so while maintaining good rmsd values and introducing on average about 1.5 hinges. In FlexSnap's scoring function, hinges are penalized and we only introduce a hinge if there is a significant increase in the alignment score. That explains why the number of hinges introduced is not large. DALI optimizes a score that incorporates the length and rmsd of the alignment. Thus, for these 30 pairs, the score of longer alignments is too low, and DALI chooses to report shorter alignments with good rmsd and therefore low coverage. The Structal method reported low coverage alignments on many more structure pairs than DALI.
The reason behind that is the fact that the Structal method depends on the initial alignments for its initial transformations and it might miss the true alignment if the initial alignments are not good starting points.

DynDom Pairs with Large rmsd
In some cases rigid alignment methods seek to optimize a score that favors longer alignments with acceptable rmsd values, and thus they may have good coverage on some pairs, but the rmsd values may be too large. Flexible alignments can be employed for these cases to get similar alignments but with much better rmsd values. For each of our test methods, namely DALI, Structal, and MultiProt, we compiled a list of the structure pairs for which the method reported an alignment with rmsd ≥ 4.0 Å, and we ran FlexSnap on these pairs. An example of a rigid alignment with large rmsd is shown in Figure 7. FlexSnap reported an alignment with 100% coverage and an rmsd of 0.71 Å by introducing only one hinge in the alignment. Table 5 shows the average coverage and rmsd as reported by the native rigid method and by FlexSnap. Under this criterion, DALI reported alignments with rmsd ≥ 4.0 Å on 295 pairs, many more than the other methods. MultiProt did not report any alignment with large rmsd. In fact, all of the MultiProt alignments had rmsd ≤ 2.3 Å; this can be explained by noting that MultiProt includes in the alignment only residue pairs that are closely aligned, and thus the overall rmsd will not be large.
FlexSnap significantly improved the average rmsd of the alignments of these pairs. For the 295 pairs for which DALI reported an average rmsd of 5.89 Å, FlexSnap reported an average rmsd of 1.61 Å. For the 16 pairs reported by Structal, FlexSnap average rmsd is 1.97 Å as opposed to 5.03 Å reported by Structal.

Conclusions
We have introduced FlexSnap, a greedy chaining algorithm that reports both sequential and non-sequential alignments and allows twists (hinges). We assessed the quality of the FlexSnap alignments by measuring their agreement with manually curated non-sequential alignments (on the RIPC dataset). On the FlexProt dataset, FlexSnap was competitive with state-of-the-art flexible alignment methods. Moreover, we demonstrated the benefits of introducing hinges by showing the significant improvement in the alignments reported by FlexSnap for the structure pairs for which rigid