Backtranslation for discovering distant protein homologies in the presence of frameshift mutations
 Marta Gîrdea^{1, 2}Email author,
 Laurent Noé^{1, 2}Email author and
 Gregory Kucherov^{1, 2, 3}Email author
https://doi.org/10.1186/1748718856
© Gîrdea et al; licensee BioMed Central Ltd. 2010
Received: 11 August 2009
Accepted: 4 January 2010
Published: 4 January 2010
Abstract
Background
Frameshift mutations in proteincoding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, the homology detection becomes difficult even at the DNA level.
Results
We developed a novel method to infer distant homology relations of two proteins, that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memoryefficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences which have the best scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/.
Conclusions
Our approach allows to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.
Keywords
Background
Context and motivation
In proteincoding DNA sequences, frameshift mutations (insertions or deletions of one or more bases) can alter the translation reading frame, affecting all the amino acids encoded from that point forward. Thus, frameshifts produce a drastic change in the resulting protein sequence, preventing any similarity to be visible at the amino acid level. For that reason, classic protein alignment methods, that rely on amino acid comparisons, fail to reveal the proteins' common origins in the case of divergence by frameshift.
Consequently, it is natural to handle frameshift mutations at the DNA level, by DNA sequence comparisons. Several papers, including [1–4] reported functional frameshifts discovered using classic alignment tools from the BLAST [5, 6] family. In all cases, the DNA sequences were relatively well conserved, which allowed the similarity to remain detectable at the DNA level.
However, the divergence may also involve additional base substitutions, that can reduce the similarity of the diverged DNA sequences. It has been shown [7–9] that, in coding DNA, there is a base compositional bias among codon positions, that no longer applies after a reading frame change. A frameshifted coding sequence can be affected by base substitutions leading to a composition that complies with this bias. If, in a long evolutionary time, a large number of codons in one or both sequences undergo such changes, they may be altered to such an extent that the common origin becomes difficult to observe by direct DNA comparison.
In this paper, we address the problem of finding distant protein homologies, in particular when the primary cause of the divergence is a frameshift. We aim at being able to detect the common origins of sequences even if they were affected by an important number of point mutations in addition to the frameshift. Also, when dealing with sequences that have little similarity, we wish to distinguish between sequences that are indeed (distantly) related, and sequences that resemble by chance. We achieve this by computing the best alignment of DNA sequences that encode the target proteins, with respect to a powerful scoring system that evaluates point mutations in their context, based on codon substitution patterns. Our approach implicitly explores all the pairs of DNA sequences that can be translated into these proteins, which allows a wider vision on the match possibilities at the DNA level.
We designed and implemented an efficient method for aligning putative coding DNA sequences, which builds expressive alignments between hypothetical nucleotide sequences obtained by backtranslating the proteins, that can provide some information about the common ancestral sequence, if such a sequence exists. We perform the analysis on memoryefficient graph representations of the complete set of putative DNA sequences of each protein. The proposed method consists of a dynamic programming alignment algorithm that computes the two putative DNA sequences that have the best scoring alignment under an appropriate scoring system.
Protein backtranslation
Backtranslation or reverse translation of a protein usually refers to obtaining one of the DNA sequences that encodes the given protein. Several methods for achieving this exist [10, 11], aiming at finding the DNA sequence that is most likely to encode that protein. Several programs use multiple protein alignments to improve the backtranslation [12, 13]. This can be considered to be the opposite to the "translation way", where translation is used to improve coding DNA alignments or assess new coding DNA [14–17].
In this paper, we are not interested in just one of the coding sequences, but aim at exploring them exhaustively and aligning them with potential frameshifts. Thus, in the context of our work, the backtranslation will refer to all the putative DNA sequences, as explained further in the Methods section.
Similar approaches
The idea of using knowledge about coding DNA when aligning amino acid sequences has been explored in several papers.
A nonstatistical approach for analyzing the homology and the "genetic semihomology" in protein sequences was presented in [18, 19]. Instead of using a statistically computed scoring matrix, amino acid similarities are scored according to the complexity of the substitution process at the DNA level, depending on the number and type (transition/transversion) of nucleotide changes that are necessary for replacing one amino acid by the other. This ensures a differentiated treatment of amino acid substitutions at different positions of the protein sequence, thus avoiding possible rough approximations resulting from scoring them equally, based on a classic scoring matrix. The main drawback of this approach is that it was not designed to cope with frameshift mutations.
Regarding frameshift mutation discovery, many studies [1–4] preferred the plain BLAST [5, 6] alignment approach: BLASTN on DNA and mRNA, or BLASTX on mRNA and proteins, applicable only when the DNA sequences are sufficiently similar. BLASTX programs, although capable of insightful results thanks to the six frame translations, have the limitation of not being able to transparently manage frameshifts that occur inside the sequence, for example by reconstructing an alignment from pieces obtained on different reading frames.
For handling frameshifts at the protein level, [20] and [21] propose the use of 5 substitution matrices for aligning amino acids encoded on different reading frames, based on nucleotide pair matches between respective codons and amino acid substitution probabilities. One of the main differences between this scoring scheme and the one we present further in this paper is that our scores target nucleotide symbols explicitly, and are computed by taking into account the changes that occur at the DNA level directly. Also, our alignment method allows more flexibility with respect to frameshift gap placement within the alignment.
On the subject of aligning coding DNA in presence of frameshift errors, some related ideas were presented in [22, 23]. The author proposed to search for protein homologies by aligning their sequence graphs (data structures similar to the ones we use in our method). The algorithm tries to align pairs of codons, possibly incomplete since gaps of size 1 or 2 can be inserted at arbitrary positions. The score for aligning two such codons is computed as the maximum substitution score of two amino acids that can be obtained by translating them. This results in a complex, time costly dynamic programming method that basically explores all the possible translations. Our algorithm addresses the same problem, by employing an approach that is more efficient, since it aligns nucleotides instead of codons, and works with simpler data structures thanks to the IUPAC ambiguity code, without any loss of information, as we will show further in the paper. Also, our alignment algorithm is more generic and is not restricted to a certain scoring function. Additionally, the scoring scheme we propose relies on codon evolution patterns, since we believe that, in frameshift mutation scenarios, the information provided by DNA sequence dynamics provides valuable information in addition to amino acid similarities.
Methods
The problem of inferring homologies between distantly related proteins, whose divergence is the result of frameshifts and point mutations, is approached in this paper by determining the best pairwise alignment between two DNA sequences that encode the proteins.
Given two proteins P_{ A }and P_{ B }, the objective is to find a pair of DNA sequences, D_{ A }and D_{ B }, such that translation(D_{ A }) = P_{ A }and translation(D_{ B }) = P_{ B }, which produce the best pairwise alignment under a given scoring system.
The alignment algorithm incorporates a gap penalty that limits the number of frameshifts allowed in an alignment, to comply with the observed frequency of frameshifts in a coding sequence's evolution. The scoring system is based on possible mutational patterns of the sequences. This leads to reducing the false positive rate and focusing on alignments that are more likely to be biologically significant.
Data structures: backtranslation graphs
For identical nucleotides that appear at the same position of different codons for the same amino acid, and are preceded by different nucleotides within their respective codon, (as it is the case for bases C and T at the second position of the codons corresponding to amino acids S and L respectively), different nodes are introduced into the graph in order to avoid the creation of paths that do not correspond to actual putative DNA sequences for the given protein. Also, as the scoring system we propose in this paper requires to differentiate identical symbols by their context, identical nucleotides appearing at the third position of different codons for amino acids L, S and R will have different corresponding nodes in the backtranslation graph. Basically, we can consider that each nucleotide symbol α from a putative coding DNA sequence, belonging to some codon c, is labeled with a word l which is its prefix in the codon c. Depending on the position of α in c, l will consist of 0, 1 or 2 letters. Here we denote such a labeled symbol by α^{ l }. Further in the paper we will drop the l for notation simplicity, and consider this differentiation implicit. Two symbols that appear at the same position of two putative DNA sequences encoding the same protein are identical (and are represented by the same node) if and only if they represent the same nucleotide and their labels are identical.
Two nodes at consecutive positions are linked by an arc if and only if they are either consecutive nucleotides of the same codon, or they are respectively the third and the first base of two consecutive codons. No other arcs exist in the graph.
are arcs between nodes corresponding to symbols that are consecutive in one of the protein's putative coding sequences.
Note that, in the implementation, the number of nodes is reduced by using the IUPAC nucleotide codes [24]. For backtranslating an amino acid, only 4 extra nucleotide symbols  R, Y, H and N, representing the sets {A, G}, {C, T}, {A, C, T} and {A, C, G, T} respectively  are necessary. In this condensed representation, the number of ramifications in the graph is substantially reduced, as illustrated by Figure 1. More precisely, the only amino acids with ramifications in their backtranslation are amino acids R, L and S, each encoded by 6 codons with different prefixes, while the backtranslations of all other amino acids are simple sequences of 3 symbols. As we will show below, there is no information loss regarding the actual pair of nonambiguous symbols aligned.
Alignment algorithm
When aligning two backtranslated protein sequences, we are interested in finding the two putative DNA sequences (one for each protein) that are most similar. To achieve this, we use a dynamic programming method, similar to the SmithWaterman algorithm [25], extended to backtranslation graphs, and equipped with gap related restrictions.
 (a)
0 (similarly to the classic SmithWaterman algorithm, only nonnegative scores are considered for local alignments).
 (b)
the substitution score of symbols (α _{ i }, β _{ j }), denoted score(α _{ i }, β _{ j }), added to the score of the best partial alignment ending in M [i  1, j  1], provided that the partially aligned paths contain α _{ i }at position i and β _{ j }on position j respectively; this condition is ensured by restricting the entries of M [i  1, j  1] to those labeled with symbols that precede α _{ i }and β _{ j }in the graphs, and is expressed in (3) by α _{i1}∈ (α _{ i }), β _{j1}∈ (β _{ j }).
 (c)
the cost singleGapPenalty of a frameshift (gap of size 1 or extension of a gap of size 1) in the first sequence, added to the score of the best partial alignment that ends in a cell M [i, j  1, (α _{ i }, β _{j1})], provided that β _{j1}precedes β _{ j }in the second graph (β _{j1}∈ (β _{ j })); this case is considered only if the number of allowed frameshifts on the current path is not exceeded, or a gap of size 1 is extended.
 (d)
the cost of a frameshift in the second sequence, added to a partial alignment score defined as above.
 (e)
the cost tripleGapPenalty of removing an entire codon from the first sequence, added to the score of the best partial alignment ending in a cell M [i, j  3, (α _{ i }, β _{j3})].
 (f)
the cost of removing an entire codon from the second sequence, added to the score of the best partial alignment ending in a cell M [i  3, j, (α _{i3}, β _{ j })]
Although the algorithm is defined for backtranslated protein alignment, it can also be used for aligning two DNA sequences or a DNA sequence to a protein. The graph corresponding to a DNA sequence has only one node at each position. Thus, the method can be used for aligning proteins to longer DNA sequences containing coding regions. However, when long sequences are aligned by dynamic programming methods, time and space complexity issues need to be addressed.
Complexity and improvements
In this section we discuss the time and space complexity of our method and show how we can improve the latter using an approach inspired by [26].
Space complexity of the backtranslation graphs
The space necessary for storing the backtranslation graph of a protein sequence P of size n depends linearly on n. Basically, as mentioned in the section dedicated to data structures, the backtranslation graph G_{ P }= (V_{ P }, E_{ P }) consists of 3·n groups of nodes { } (as each of the n aminoacids are encoded by sequences of 3 nucleotides). Every group i contains the nodes corresponding to the nucleotides that can appear at position i in at least one of the putative coding sequences (see (1)). The number of nodes in a group is limited by the number of codons that encode an amino acid, that we denote (6 in the worst case scenario for nonambiguous symbols) and thus does not depend on the protein's length.
Arcs exist only between nodes in consecutive groups (equation (2)), therefore each node can have a limited number of neighbors. Consequently, the overall memory consumption for storing the backtranslation graph of a protein sequence P of size n is (n). The worst case scenario is a protein sequence composed only of the amino acids L, S, R, which are encoded by 6 codons each, and hence have the most complex backtranslation. For each such amino acid, 10 nodes and 20 arcs are necessary, yielding a maximum memory size of 30n for the entire graph. For the ambiguous nucleotide symbol encoding though, 6 nodes and 6 arcs are necessary in the worst case for each amino acid, while most amino acids only require 3 nodes and 3 arcs for their backtranslated representation.
Complexity of the alignment algorithm
Let G_{ A }and G_{ B }be graphs obtained by backtranslating proteins P_{ A }and P_{ B }, of lengths n_{ A }and n_{ B }respectively. The dynamic programming matrix M computed by the alignment algorithm will have 3·n_{ A }+ 1 rows and 3·n_{ B }+ 1 columns. Each cell of the matrix M [i, j] has several entries corresponding to the possible pairs of nodes from each sequence. The number of entries is bounded by the square of the number of nodes that can appear on each position in the graph ( ^{2}). Consequently, the total number of entries in the matrix is at most ^{2}·(3·n_{ A }+ 1)·(3·n_{ B }+ 1), hence (n_{ A }n_{ B }).
Each entry holds the score of the partial alignment ending at the corresponding positions, as well as the number of frameshifts that occurred on the path so far (to ensure the established limit in the complete alignment) and a reference to the previous matrix entry of the alignment path, to facilitate the traceback. The storage space requirements for this supplementary information are bounded by a constant.
For computing each score in the matrix, the expressions that need to be evaluated are given by equation (3), by querying some of the entries from 5 other cells in the matrix. Since the number of entries in each cell is bounded by ^{2}, this operation is considered to be performed in constant time. Consequently, the overall time complexity of the algorithm is (n_{ A }n_{ B }). To recover the best alignment and the two actual sequences that produce it, a classic traceback algorithm is used, with an execution time depending linearly on the alignment length, which cannot be larger than (3·n_{ A }+ 3·n_{ B }+ 1).
Improving the memory usage
To overcome the memory issues caused by aligning very large sequences with our dynamic programming method, which requires quadratic space, we used an approach inspired from the linear space algorithm for the LCS problem [26]. Our aim is to decrease the space consumption, not necessarily to linear space, with a less prominent increase of the computation time, i.e. the number of recursive matrix recomputations that are necessary for retrieving the actual alignment in this reduced space.
As a compromise, we choose to split the alignment according to some preestablished cutpoints, in submatrices that are small enough to fit into memory, and that are recomputed only once for retrieving the corresponding alignment fragments. In our implementation, the cutpoints delimit submatrices that are, by default, 128 columns wide. In this setup, we use a twostep approach: first, we compute the score of the best local alignment in linear space, using a sliding window, while also identifying the intersections of the corresponding path with the established cutpoints; in the second step, we recompute separately the submatrices containing parts of the best alignment (restricted to the rows that intersect it), and then rebuild the alignment by pasting the obtained alignment fragments together.
For the first pass, we use a sliding window of 4 columns instead of the original 2, because each partial score depends on the scores that are 3 cells to the left or 3 cells above (see equation (3), items (e) and (f)). Each cell of the sliding window memorizes the matrix entry where the alignment path started (identified by the coordinates within the matrix and actual pair of aligned nodes), as well as the intersections of this path with the cutpoints. This information is propagated from the previous cell contributing to the computation of the score, and completed in each cell from the cutpoint columns by storing the line number and the node pair that help identify an actual entry in the matrix which belongs to the alignment path. The best scoring entry encountered so far is memorized and updated at each step of the alignment algorithm. When the first pass is completed, the best scoring cell will provide all the necessary information for reconstructing the alignment: the start of the alignment, the intersection with each cutpoint, and its end, which is the cell itself. According to these coordinates, subgraphs of the two backtranslation graphs are extracted and aligned globally (ensuring that the start and end node pair of each fragment are preserved). The obtained global alignments, combined, will give the best local alignment of the two large sequences.
Translationdependent scoring function
In this section, we present a new translationdependent scoring system suitable for our alignment algorithm. Our scoring scheme incorporates information about possible mutational patterns for coding sequences, based on a codon substitution model, with the aim of filtering out alignments between sequences that are unlikely to have common origins.
Mutation rates have been shown to vary within genomes, under the influence of different factors, including neighbor bases [27]. Consequently, a model where all base mismatches are equally penalized is oversimplified, and ignores possibly precious information about the context of the substitution.
With the aim of retracing the sequence's evolution and revealing which base substitutions are more likely to occur within a given codon, our scoring system targets pairs of triplets (α, p, a), were α is a nucleotide, p is its position in the codon, and α is the amino acid encoded by that codon, thus differentiating various contexts of a substitution. There are 99 valid triplets out of the total of 240 hypothetical combinations. Pairwise alignment scores are computed for all possible pairs of valid triplets
where is the frequency of the t_{ i }↔ t_{ j }substitution in related sequences, and = p(t_{ i })p(t_{ j }) is the background probability. This scoring function is used in the algorithm as shown by equation (3)(b), where we refer to it as score(α_{ A }, α_{ B }), without explicitly mentioning the context  amino acid and position in the corresponding codon  of the paired nucleotides. These details were omitted in equation (3) for generality (other scoring functions, that do not depend on the translation, can be used by the algorithm too) and for notation simplicity.
The insignificant amount of available real data that fits our hypothesis does not allow classical, statistical computation of the foreground and background probabilities. Therefore, instead of doing statistics on real data directly, we will rely on codon frequency tables and codon substitution models, either mechanistic or empirically constructed.
Codon substitution models
Mechanistic codon substitution models
for all i ≠ j. Here, the parameter ω represents the nonsynonymoussynonymous rate ratio, κ the transitiontransversion rate ratio, and π_{ j }the equilibrium frequency of codon j. As in all Markov models of sequence evolution, absolute rates are found by normalizing the relative rates to a mean rate of 1 at equilibrium, that is, by enforcing and completing the instantaneous rate matrix Q by defining to give a form in which the transition probability matrix is calculated as P (θ) = e^{ θQ }[29]. Evolutionary times θ are measured in expected number of nucleotide substitutions per codon.
Note that there exist some more advanced codon substitution models, targeting sequences with overlapping reading frames [30]. However, such models do not fit our scenario, because they are designed for overlapping reading frames, where a mutation affects both translated sequences, while in our case the sequences become at one point independent and undergo mutations independently.
Empirical codon substitution model
The mechanistic codon substitution model presented above simulates substitutions with accurate parameters, but does not take into account the selective pressure and the resulting effects on the final codon conservation.
One of these effects, most commonly known and most observable in alignments of coding sequences, is the "third base mutation": in most cases, the encoded amino acid is not changed by a transition mutation of the codon third base; this is true in some cases of transversion mutations as well.
There are several other specific conservation families for groups of amino acids, as the aliphatic conservation (amino acids L, I, V) where corresponding amino acid codons share T at their second base. The last base is, within this group, almost a free choice, while the first has a large degree of freedom. It is thus expected to frequently observe the second T conserved on such codons when aligned with the aliphatic group. A similar phenomenon (however with a weaker frequency) appears for the subset (A, S, T) of the "small" amino acids, where the codons have in common the second base C.
In other chemically related amino acid groups, the succession of nucleotide substitutions at the codon level follows more complex paths, as it is the case for positively charged amino acids (R, K), aromatic amino acids (F, Y, W), etc.
Such different and complex conservation patterns are difficult to express and model with simple rules. As most of the matrices built for proteins, an empirical estimation gives a very good global approximation. In [31], the first empirical codon substitution matrix entirely built from alignments of coding sequences from vertebrate DNA is presented. A set of 17,502 alignments of orthologous sequences from five vertebrate genomes yielded 8.3 million aligned codons from which the number of substitutions between codons were counted. From this data, 64 × 64 probability matrices and similarity score matrices ("1codon PAM") were computed. One can use these probability matrices as an alternative to the ones obtained using the mechanistic model.
Foreground probabilities
Once the codon substitution probabilities are obtained, can be deduced in several steps. Basically, we first need to identify all pairs of codons with a common subsequence, that have a perfect semiglobal alignment (for instance, codons CAT and ATG satisfy this condition, having the common subsequence AT; this example is further explained below). We then assume that the codons from each pair undergo independent evolution, according to the codon substitution model. For the resulting codons, we compute, based on all possible original codon pairs, p((α_{ i }, p_{ i }, c_{ i }), (α_{ j }, p_{ j }, c_{ j }))  the probability that nucleotide α_{ i }, located at position p_{ i }of codon c_{ i }, and nucleotide α_{ j }, situated on position p_{ j }of codon c_{ j }have a common origin (equation (7)). From these, we can immediately compute, as shown by equation (8) below, p((α_{ i }, p_{ i }, a_{ i }), (α_{ j }, p_{ j }, a_{ j })), corresponding to the foreground probabilities , where t_{ i }= (α_{ i }, p_{ i }, a_{ i }) and t_{ j }= (α_{ j }, p_{ j }, a_{ j }).
In the following, stands for the probability of the event codon c_{ i }mutates into codon c_{ j }in evolutionary time θ, and is given by a codon substitution probability matrix (θ).
The notation c_{i} [interval_{i}] ≡ c_{j} [interval_{j}] states that codon c_{ i }restricted to the positions given by interval_{ i }is a sequence identical to c_{ j }restricted to interval_{ j }. This is equivalent to having a word w obtained by "merging" the two codons. For instance, if c_{ i }= CAT and c_{ j }= ATG, with their common substring being placed in interval_{ i }= [2..3] and interval_{ j }= [1..2] respectively, w is CATG.
Background probabilities
Substitution matrix for ambiguous symbols
where is an ambiguous nucleotide symbol representing the possible nucleotides that can appear on position p_{ i }of the codons that encode the amino acid a_{ i }, and set denotes the set of nonambiguous nucleotide symbols represented by . Basically, the score of pairing two ambiguous symbols is the maximum over all substitution scores for all pairs of nucleotides from the respective sets.
By using ambiguous symbols, less triplets are formed for each amino acid when compared with the nonambiguous symbol case. 17 amino acids can be antitranslated as trimers with just one ambiguous symbol per position, while the others have two alternatives each of the three positions. Therefore, there are 69 different triplets with ambiguity codes to be paired (as opposed to 99), which means more than twice less storage space necessary for the score matrix.
Parametrization
In this section we have presented a general framework that helps to compute a translation dependent scoring function for DNA sequence pairs, parametrized by a codon substitution model and an evolutionary time measured in expected number of mutations per codon. We consider that the sequences evolve independently, and the distance is relative to the original sequence.
Score evaluation
The score significance is estimated according to the Gumbel distribution, where the parameters λ and K are computed with the method described in [32, 33]. In the future, we aim at improving our estimation by using a computation method more suited for gapped alignments, such as [34].
We use two different score evaluation parameter sets for the forward alignment (where the two backtranslated graphs that are aligned have the same translation sense) and the reverse complementary alignment (where one of the graphs is aligned with the reverse complementary of the other), because these are two independent cases with different score distributions.
In order to obtain a more refined evaluation of the alignments, we introduce (λ, K) parameters for estimating the score significance of alignment fragments inside which the reading frame difference is preserved. Therefore, there are eight (λ, K) parameters that help to evaluate the alignments (four for the forward alignment sense and four for the reverse complementary alignment sense):

(λ_{ FW }, K_{ FW }) for the forward sense and (λ_{ RC }, K_{ RC }) for the reverse complementary sense respectively, that are used for evaluating the score of the whole alignment.

(λ_{+i}, K_{+i}) for the forward sense and (λ_{i}, K_{i}) for the reverse complementary sense respectively, with i ∈ {0, 1, 2} that are used for evaluating the scores of each alignment fragment within which the reading frame difference is preserved. This second evaluation aims at providing a measure of the actual contribution of each such fragment to the score of the alignment.
The parameters (λ_{±i}, K_{±i}) are estimated on alignments restricted to the respective reading frame difference, where further frameshifts are not allowed, while (λ_{ FW }, K_{ FW }) and (λ_{ RC }, K_{ RC }) are computed in a more flexible setup, where a limited number of frameshifts is accepted.
Behavior in the nonframeshifted case
In this section we discuss the behavior of the proposed scoring system when aligning protein sequences without a frameshift. Given their construction method, we expect the scores to reflect the amino acid similarities, but also to be influenced by similarities at the DNA level.
Then, considering each amino acid pair as an observation, we compute the correlation coefficient of these expected scores and the BLOSUM matrices as given by [35].
Correlation coefficients of the translation dependent scores used on nonframeshifted amino acids, with BLOSUM scores and classic DNA scores
DNA  BLOSUM 62  

TDS _{ M } (0.1)  0.86  0.52 
TDS _{ M } (0.3)  0.81  0.50 
TDS _{ M } (0.5)  0.77  0.50 
TDS _{ M } (0.7)  0.75  0.50 
TDS _{ M } (1.0)  0.71  0.47 
TDS _{ E }  0.59  0.88 
They suggest that the obtained translation dependent score matrices, either obtained from mechanistic or empirical codon substitution models, are a compromise between the "fully selective" BLOSUM matrices and the nonselective DNA scores.
On the one hand, the scores obtained using the mechanistic model do not make use of the selective pressure, and for this reason are more likely to be correlated with the classic DNA scores. On the other hand, the scores based on empirical codon substitution models reflect the constraints imposed by the similarity of the amino acids encoded by the codons. Hence, they show a strong correlation with the BLOSUM matrices when used without a frameshift.
Results and Discussion
We have proposed a method for aligning protein sequences with frameshifts, by backtranslating the proteins into graphs that implicitly contain all the putative DNA sequences, and aligning them with a dynamic programming algorithm that uses a scoring system designed for this particular purpose.
Implementation and availability
A Java implementation of our method is available at http://bioinfo.lifl.fr/path/. The files containing translation dependent score matrices computed for several evolutionary distances can be downloaded at the same address.
Experimental results
We will further discuss several significant frameshifted alignments obtained with our method. The experimental results presented here were obtained in the following experimental setup: a search for frameshifted forward alignments was launched on samples from the full NCBI protein databases for several species, using a 00.50 base per codon divergence scoring matrix; we selected only the alignments with an Evalue < 10^{9}, presenting at least one significant frameshift.
Yersinia pestis: Frameshifted transposases
Xylella fastidiosa: Frameshifted βglucosidases
Interestingly, βglucosidase frameshifts have already been studied in [37] on several bacteria including such γproteobacteria as Erwinia herbicola and Escherichia coli of Enterobacteriales but not directly observed in Xanthomonadales.
Venom neurotoxins
Diversification of venom toxins has been studied in [38] for advanced snakes: frameshifts were one of the most significant mechanisms involved in the "evolution of the arsenal" and its diversification toward specialized prey capture, sometime with a loss of neurotoxicity [39]. We thus studied neurotoxins from several higher snakes.
Plateletderived growth factor
It is interesting to notice that this double frameshift (if confirmed) may have little influence on the protein (only the beginning of the receptor binding interface is modified). It is also interesting to notice that both the "inducing" and "correcting" frameshifts are located on two different exons.
Conclusions
In this paper, we addressed the problem of finding distant protein homologies, in particular affected by frameshift events, from a codon evolution perspective. We search for protein common origins by implicitly aligning all their putative coding DNA sequences, stored in efficient data structures called backtranslation graphs. Our approach relies on a dynamic programming alignment algorithm for these graphs, which involves a nonmonotonic gap penalty that handles frameshifts and full codon indels differently. We designed a powerful translationdependent scoring function for nucleotide pairs, based on codon substitution models, whose purpose is to reflect the expected dynamics of coding DNA sequences.
We illustrated our approach with several alignment examples of known and hypothetical frameshifted proteins, some of which are not detectable via classic alignment methods because of the low coding sequence similarity. Such examples support our method's applicability in the discovery of distant protein homologies and functional frameshifts, without any explicit information about the coding sequences. Future work includes further improvements of the scoring system, for example in order to focus on the detection of short double correcting frameshifts (two frameshifts separated by a small number of bases, where the second corrects the reading frame disrupted by the first). Such cases have been shown to occur frequently [37], but are often highly penalized by sequence comparison methods, that discard the correct alignment in favor of an ungapped one with a higher score.
Some natural extensions of our work include the support for multiple alignments of backtranslation graphs. This feature can be useful for confirming frameshifts by similarity of the frameshifted subsequence with the corresponding fragments from several members of a family. Also, for boosting the efficiency, seeding techniques for backtranslation graphs could be explored, possibly inspired by the BLASTP scorebased seeds.
Declarations
Acknowledgements
Laurent Noé was supported by the ANR project CoCoGen (BLAN071 185484).
The authors would like to thank the WABI'09 anonymous reviewers, where a preliminary version of this work [40] was submitted, for their interesting questions and useful suggestions, that helped to improve some aspects of the work, as well as its presentation.
Authors’ Affiliations
References
 Raes J, Peer Van de Y: Functional divergence of proteins through frameshift mutations. Trends in Genetics. 2005, 21 (8): 428431. 10.1016/j.tig.2005.05.013PubMedView ArticleGoogle Scholar
 Okamura K, Feuk L, MarquèsBonet T, Navarro A, Scherer S: Frequent appearance of novel proteincoding sequences by frameshift translation. Genomics. 2006, 88 (6): 690697. 10.1016/j.ygeno.2006.06.009PubMedView ArticleGoogle Scholar
 Harrison P, Yu Z: Frame disruptions in human mRNA transcripts, and their relationship with splicing and protein structures. BMC Genomics. 2007, 8: 37110.1186/147121648371PubMedPubMed CentralView ArticleGoogle Scholar
 Hahn Y, Lee B: Identification of nine humanspecific frameshift mutations by comparative analysis of the human and the chimpanzee genome sequences. Bioinformatics. 2005, 21 (Suppl 1): i186i194. 10.1093/bioinformatics/bti1000PubMedView ArticleGoogle Scholar
 Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. JMB. 1990, 215 (3): 403410.View ArticleGoogle Scholar
 Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389PubMedPubMed CentralView ArticleGoogle Scholar
 Grantham R, Gautier C, Gouy M, Mercier R, Pave A: Codon catalog usage and the genome hypothesis. Nucleic Acids Research. 1980, 4962. 8 10.1093/nar/25.17.3389Google Scholar
 Shepherd JC: Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proceedings National Academy Sciences USA. 1981, 15961600. 10.1073/pnas.78.3.1596. 78 10.1073/pnas.78.3.1596View ArticleGoogle Scholar
 Guigó R: DNA composition, codon usage and exon prediction. Genetic databases. Edited by: Bishop M. 1999, 5380.Google Scholar
 Gonnet GH: Back Translation (protein to DNA) in an optimal way. Tech Rep 505. 2005, Informatik, ETH, Zurich, http://www.biorecipes.com/BackTranslate/code.htmlGoogle Scholar
 Stothard P: The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques. 2000, 28 (6): 11021104.PubMedGoogle Scholar
 Moreira A, Maass A: TIP: protein backtranslation aided by genetic algorithms. Bioinformatics. 2004, 20 (13): 21482149. 10.1093/bioinformatics/bth204PubMedView ArticleGoogle Scholar
 Giugno R, Pulvirenti A, Ragusa M, Facciola L, Patelmo L, Di Pietro V, Di Pietro C, Purrello M, Ferro A: Locally sensitive backtranslation based on multiple sequence alignment. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, (CIBCB). 2004, 231237.View ArticleGoogle Scholar
 Suyama M, Torrents D, Bork P: PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research. 2006, W609W612. 34 Web ServerView ArticleGoogle Scholar
 BinindaEmonds O: transAlign: using amino acids to facilitate the multiple alignment of proteincoding DNA sequences. BMC Bioinformatics. 2005, 6: 15610.1186/147121056156PubMedPubMed CentralView ArticleGoogle Scholar
 Wernersson R, Pedersen A: RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Research. 2003, 31 (13): 35373539. 10.1093/nar/gkg609PubMedPubMed CentralView ArticleGoogle Scholar
 Fontaine A, Touzet H: Computational identification of proteincoding sequences by comparative analysis. Proceedings of the 1st IEEE international conference on Bioinformatics and Biomedecine (BIBM), Silicon Valley, California. 2007, 95102.Google Scholar
 Leluk J: A new algorithm for analysis of the homology in protein primary structure. Computers and Chemistry. 1998, 22: 123131. 10.1016/S00978485(97)000351PubMedView ArticleGoogle Scholar
 Leluk J: A nonstatistical approach to protein mutational variability. BioSystems. 2000, 56 (23): 8393. 10.1016/S03032647(00)000745PubMedView ArticleGoogle Scholar
 Claverie J: Detecting frame shifts by amino acid sequence comparison. Journal of molecular biology. 1993, 234 (4): 11401157. 10.1006/jmbi.1993.1666PubMedView ArticleGoogle Scholar
 Pellegrini M, Yeates T: Searching for Frameshift Evolutionary Relationships Between Protein Sequence Families. Proteins. 1999, 37: 278283. 10.1002/(SICI)10970134(19991101)37:2<278::AIDPROT12>3.0.CO;2XPubMedView ArticleGoogle Scholar
 Arvestad L: Aligning coding DNA in the presence of frameshift errors. Proceedings of the 8th Annual CPM Symposium. 1997, 1264: 180190.Google Scholar
 Arvestad L: Algorithms for biological sequence alignment. PhD thesis. 2000, Royal Institute of Technology, Stocholm, Numerical Analysis and Computer ScienceGoogle Scholar
 CornishBowden A: IUPACIUB symbols for nucleotide nomenclature. Nucleic Acids Res. 1985, 13: 30213030. 10.1093/nar/13.9.3021PubMedPubMed CentralView ArticleGoogle Scholar
 Smith T, Waterman M: Identification of common molecular subsequences. J Mol Bwl. 1981, 147: 195197. 10.1016/00222836(81)900875. 10.1016/00222836(81)900875View ArticleGoogle Scholar
 Hirschberg D: A linear space algorithm for computing maximal common subsequences. Communications of the ACM. 1975, 18 (6): 341343. 10.1145/360825.360861.View ArticleGoogle Scholar
 Blake R, Hess S, NicholsonTuell J: The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. JME. 1992, 34 (3): 189200. 10.1007/BF00162968. 10.1007/BF00162968View ArticleGoogle Scholar
 Kosiol C, Holmes I, Goldman N: An Empirical Codon Model for Protein Sequence Evolution. Molecular Biology and Evolution. 2007, 24 (7): 14641479. 10.1093/molbev/msm064PubMedView ArticleGoogle Scholar
 Lio P, Goldman N: Models of Molecular Evolution and Phylogeny. Genome Research. 1998, 8 (12): 12331244.PubMedGoogle Scholar
 Pedersen A, Jensen J: A dependentrates model and an MCMCbased methodology for the maximumlikelihood analysis of sequences with overlapping reading frames. Molecular Biology and Evolution. 2001, 18: 763776.PubMedView ArticleGoogle Scholar
 Schneider A, Cannarozzi G, Gonnet G: Empirical codon substitution matrix. BMC bioinformatics. 2005, 6: 13410.1186/147121056134PubMedPubMed CentralView ArticleGoogle Scholar
 Altschul S, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research. 2001, 29 (2): 351361. 10.1093/nar/29.2.351PubMedPubMed CentralView ArticleGoogle Scholar
 Olsen R, Bundschuh R, Hwa T: Rapid assessment of extremal statistics for gapped local alignment. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. 1999, 211222. AAAI pressGoogle Scholar
 Rastas P: A General Framework for Local Pairwise Alignment Statistics with Gaps. Proceedings of the 9th International Workshop in Algorithms in Bioinformatics (WABI), Philadelphia (USA), of Lecture Notes in Computer Science. Edited by: Salzberg S, Warnow T. 2009, 5724: 233245. Springer VerlagGoogle Scholar
 Henikoff S, Henikoff J: Amino Acid Substitution Matrices from Protein Blocks. Proc of the National Academy of Sciences. 1992, 89 (22): 1091510919. 10.1073/pnas.89.22.10915. 10.1073/pnas.89.22.10915View ArticleGoogle Scholar
 Licznar P, Bertrand C, Canal I, Prère MF, Fayet O: Genetic variability of the frameshift region in IS911 transposable elements from Escherichia coli clinical isolates. FEMS Microbiology Letters. 2006, 218 (2): 231237. 10.1111/j.15746968.2003.tb11522.x. 10.1111/j.15746968.2003.tb11522.xView ArticleGoogle Scholar
 Rojas A, GarciaVallvé S, Montero MA, Arola L, Romeu A: Frameshift mutation events in betaglucosidases. Gene. 2003, 314: 191199. 10.1016/S03781119(03)00828XPubMedView ArticleGoogle Scholar
 Fry BG, Scheib H, Weerd van der L, Young B, McNaughtan J, Ryan Ramjan SF, Vidal N, Poelmann RE, Norman JA: Evolution of an Arsenal: Structural and Functional Diversification of the Venom System in the Advanced Snakes (Caenophidia). Molecular and Cellular Proteomics. 2008, 7: 215246. 10.1074/mcp.M700094MCP200PubMedView ArticleGoogle Scholar
 Li M, Fry BG, Kini RM: EggsOnly Diet: Its Implications for the Toxin Profile Changes and Ecology of the Marbled Sea Snake (Aipysurus eydouxii). Journal of Molecular Evolution. 2005, 60: 8189.10.1007/s0023900401380PubMedView ArticleGoogle Scholar
 Gîrdea M, Kucherov G, Noé L: Backtranslation for discovering distant protein homologies. Proceedings of the 9th International Workshop in Algorithms in Bioinformatics (WABI), Philadelphia (USA), of Lecture Notes in Computer Science. Edited by: Salzberg S, Warnow T. 2009, 5724: 108120. Springer VerlagGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.