# Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation

- Rodrigo Gouveia-Oliveira
^{1}Email author and### Affiliated with

- Anders G Pedersen
^{1}### Affiliated with

**2**:12

**DOI: **10.1186/1748-7188-2-12

© Gouveia-Oliveira and Pedersen. 2007

**Received: **25 May 2007

**Accepted: **03 October 2007

**Published: **03 October 2007

## Abstract

### Background

Some amino acid residues functionally interact with each other. This interaction will result in an evolutionary co-variation between these residues – coevolution. Our goal is to find these coevolving residues.

### Results

We present six new methods for detecting coevolving residues. Among other things, we suggest measures that are variants of Mutual Information, and measures that use a multidimensional representation of each residue in order to capture the physico-chemical similarities between amino acids. We created a benchmarking system, in silico, able to evaluate these methods through a wide range of realistic conditions. Finally, we use the combination of different methods as a way of improving performance.

### Conclusion

Our best method (Row and Column Weighed Mutual Information) has an estimated accuracy increase of 63% over Mutual Information. Furthermore, we show that the combination of different methods is efficient, and that the methods are quite sensitive to the different conditions tested.

## Background

As an example from comparative biology, where this problem was first identified, lets imagine we wished to compare mammals and birds in many characters, looking for the coevolution in these characters. Their simultaneous absence or presence would be taken as a proof of interaction. We would rightly conclude that having wings has an influence on the presence of light bones, but we would unfortunately also conclude that feathers interacted with laying eggs. In fact, any two characters that segregate according to the phylogeny would be perceived to be coevolving. A pioneering method in extracting the phylogenetic signal in such situations was made by Ridley et al.[4], followed by the one of Maddison[5]. These methods estimate the ancestral state for the characters, and count the co-occurrence of transitions between states. The later method of Harvey and Pagel[6] further incorporated branch length information and was framed as a statistical model. Coming back to sequence analysis, most characters (sites) indeed segregate according to the phylogeny (thus enabling us to reconstruct it), and are therefore hard to separate from truly coevolving sites. One way of going around the problem has been to use highly divergent sequences, in which the phylogenetic signal is weak. However, it is desirable to construct methods that effectively counter this obstacle. The work of Pollock et al.[7] presented one such method, based on likelihood ratio tests. The only obstacle to this methodology is that it is computationally intensive and dependent on the knowledge of the phylogeny. On the other hand, it uses that knowledge to make better predictions. The recent work of Dimmic et al.[8] follows a similar path, but under the Bayesian framework. Most other authors have used simpler approaches, which could eventually be used on large datasets. A majority of these approaches consists of calculating some statistic between all pairs of sites, resulting in a matrix, and these are therefore called "matrix-based methods"[9–17]. Some of these statistics are variations of correlations between some chemico-physical properties of the aminoacids normalized by some quantity, usually their variances[9–11, 14, 16], others are based on Mutual Information[12, 15, 18], or in some heuristics not very different from correlation estimation[13]. Exceptions are the methods of Dekker et al.[19] and Suel et al.[20], which are "perturbation-based" methods, in which sub-alignments are build and analyzed. Only a few of these methods, however, tried to tackle the phylogenetic confounding that makes this problem difficult. That was done either by confirming the coevolution predictions in subclades of the tree[17] or by weighting each pair of sites by their phylogenetic dependency[12]. Another main obstacle to accurate coevolution detection has stemmed from biases on site conservation, as shown by Fodor et al.[21]. Recently, however, the conservation bias problem has been addressed [12, 15, 17, 18]. There is the additional problem of not having a good biological control. As we haven't yet answered "what causes residues to coevolve?" we are not able to establish a proper benchmark based on real data. That need for a proper benchmarking platform has caused many artefacts to be published as real findings. We have thus built a complete benchmarking system for this problem, which we use to compare our methods. It has some differences to other benchmarking systems recently made available[13, 22]. In this work we simulate alignments of protein sequences including coevolving pairs of sites with differing rates of evolution, differing rates of coevolution, various numbers of taxa and using different methods to simulate coevolution. Simulating these variations is needed to properly evaluate the presented coevolution detecting statistics, as the variations occur in biological datasets and the statistics are sensitive to their effects. It would be more biologically realistic to simulate evolution considering the disturbance to the protein 3D structure, an idea first developed by Parisi and Echave[23] and whose further developments are well described in Rodrigue et al.[24]. We have opted not to do so because the statistics under evaluation do not account for all the extra information available in datasets simulated in this way, but that would definitely be an advantage for more complex methods for detecting coevolution, specially for methods performing n-way comparisons, Such methods would overcome the limitation of detecting n-way coevolution only through its pair wise components, which prevents the detection of networks of sites having many weak pair wise interactions.

## Methods

### Simulated Datasets

Our goal was to compare the performance of the coevolution measures presented in this paper on a number of datasets that covered a wide range of assumptions about how residues evolve and coevolve in real biological data. To do so, we simulated many and varied protein alignments including coevolving residues, applied the above-mentioned methods to find these residues, and then compared their performances. The alignments varied in the number of taxa, in the evolutionary and coevolutionary rates, and in the method of simulating co-evolution. Specifically, each simulated alignment was 300 residues long; 260 residues were independently evolved, while the remaining 40 residues consisted of 20 pairs of coevolving sites. The sequences were simulated along strictly bifurcating trees, of 32, 64, 128 or 256 taxa, with same-size branch lengths. Evolution along the branches followed a BLOSUM62 matrix of transition probabilities[25] – which can be derived from the more common log-odds format – raised to the power r, thus specifying the evolutionary rate. Each alignment had residues evolving at 4 different rates, with values of r of 1 (quickest), 1/5, 1/20 and 1/50 (slowest).

The pairs of coevolving sites were done as described by Pollock et al.[7]. One site was the "driver", evolving as an independent site (also at one of the 4 different rates). The other one was the dependent site. Evolution at the dependent site is, to a level, dependent on the state at the "driver" site. For every state at the "driver" site there is a favoured state at the dependent site. Evolution at the dependent site tends to follow the one at the "driver " site, and this tendency is proportional to a "coevolution factor", c. Coevolution is an easy-to-understand but hard-to-define concept; as we wanted our results to be as general as possible, we modelled coevolution using three different approaches. In all approaches, the evolution at the dependent site was determined by a BLOSUM matrix, transformed by the coevolution factor. The BLOSUM matrix of transition probabilities represents the probability of change from any amino acid (one per row) to any other (one per column). So if we have, for example, a tryptophan as an ancestor, we will use the tryptophan row of the BLOSUM matrix.

*P*(*aa*
_{
i
}|*w*) = *BLOSUM*
_{
wi
} (1)

and thus the dependent site had its evolution conditioned by the "driver" site. The "driver" site "chooses" the amino acids substitutions that will be favoured by increasing the value on the respective columns. The way the subset of columns is defined is what differs between the three coevolution models explained next:

#### 1) Model cluster

Amino acid clusters used in model "Cluster"

Cluster number: | Amino acids |
---|---|

1 | A P S T |

2 | I L M V |

3 | F W Y |

4 | N D |

5 | G |

6 | R Q E H K |

7 | C |

These clusters were the states of the model, defined in equation 2. Each site will evolve according to the row in the BLOSUM matrix corresponding to its own previous/ancestral amino acid. This row, however, will have the columns corresponding to amino acids belonging to group j multiplied by the coevolution factor. That is, if the "driver" site has any amino acid belonging to group j, the dependent site will be pushed into the corresponding state (which for simplicity is also amino acids belonging to group j).

#### 2) Model one-on-one

If Nature wanted amino acids to be clustered into 7 groups, there would only be 7 amino acids, some may say (including ourselves). We therefore also used the one-on-one model of coevolution, where the states are instead the individual amino acids. That is, for each amino acid in the "driver" site there is a favoured one in the dependent site. As we have seen, and for simplicity, we chose it to be the same amino acid, and therefore evolution on the dependent site follows the row of the BLOSUM matrix of the previous/ancestral amino acid, having the column corresponding to the favourite amino acid multiplied by the coevolution factor, c.

#### 3) Model BLOSUM

This model is very similar to the one-on-one. The only difference is the row of the BLOSUM matrix used for determining the probabilities of change. In the previous model, the row was the one of the ancestral amino acid. That is, if the dependent site had an alanine as a residue, and the "driver" site had changed to a proline, the evolution at the end of the branch would follow the BLOSUM row corresponding to the alanine, with an increase probability of changing to a proline. Therefore, it would have high chances of changing into a proline and of staying as an alanine, and mild chances of changing into amino acids similar to alanine. In the BLOSUM model, the BLOSUM row taken is instead the one corresponding to the "driver" site's new amino acid, being given thus more importance to the new physico-chemical demands of that site. In this example, it would be the proline row, having thus increased chances of changing into proline and amino acids similar to proline.

*P*(*Dep*
_{
state
}|*Dep* = *j* &*Driver* = *i*
_{{i∈state}}) = *BLOSUM*
_{
i, state
} × *coev.f* (3)

This effect, however, should be quite small, as it proved to be. For each model, we tried three different values in the coevolution factor c, yielding a strong, medium or weakly coevolving version. We made 10 replicates of each alignment through all combination of coevolving model, c, and number of taxa. Each alignment had 20 coevolving pairs.

### Analysis

After the alignments had been simulated, we applied the different methods for detecting coevolution to analyze them. All of these methods calculate a statistic for each possible pair of sites in the sequence. Each site is in fact an ordered column of residues. The calculation of a statistic for each pair yields an n by n-1 upper triangular matrix, n being the length of the protein sequence. The cells in the matrix are then ranked and a rank list is presented as the output. The statistics used for each pair wise comparison between sites were:

#### 1) Mutual Information

where A and B are the two sites being compared, and *i* and *j* run through all the occurring amino acids in each site. The base for the logarithm is 20, the number of letters in the protein alphabet. This causes the MI statistic to scale to a maximum of 1. To estimate MI, here and in the following sections, we have estimated all the probabilities involved by their correspondent observed frequencies.

#### 2) Mutual Information of adaptable logarithmic base

where #i represents the number of different amino acids occurring in site A, and #j in site B.

By using a logarithmic base, which is the geometric average of the number of different amino acids present in each site, we make it possible for every score to range from 0 to 1, independently of the number of different amino acids present in the two original sites.

#### 3) Simple correlation

Which can also be seen as MI without the logarithmization. This measure has the desirable property of being only dependent on the correlation between the sites. A possible problem is that by neglecting the diversity of amino acid composition, it will score very conserved pairs highly, including the totally conserved sites. Thus, for totally conserved sites we, a posteriori, assigned the value 0 to the estimator, as these conserved sites are obviously not coevolving.

#### 4) Row-column weighting

Let us think of every site in the alignment as a column-vector. If the sequences had no evolutionary signal and there were no constraints, the vector would approach n random realizations from a multinomial distribution with probabilities reflecting the amino acid composition. For a sufficiently high n (number of sequences), the chance of a vector of length n having the same pattern of conservation as another site would then be very small, and no false positives would result.

However, when there is evolutionary signal, sequences that are more closely related will tend to have the same or similar amino acids when compared with more distantly related sequences. This means that some patterns of amino acid conservation are much more common in the alignment columns than others. This is, in fact, the reasoning behind the construction of phylogenetic trees: sites are grouped into patterns, and the resulting tree is the one which is compatible with the majority of observed patterns.

where MI.j denotes the sum of the Mutual Information matrix over all lines in column j. We call this measure RCW (Row-Column-Weighting). Performing RCW on figure 2a, yields figure 2c. It can be seen that the row-column dependency is now weaker, albeit still present, and that the highest values of RCW MI are not predominantly present in the quickest evolving sites, as they were in the figure 2a. The use of other functions besides the arithmetic average might improve the removal of the row-column dependency, but most of the remainder is due to the fact that it is a simultaneous bi-dimensional optimization. The weighting can also be performed excluding the top hits of every row/column, to accommodate for more than two-way coevolution.

As RCW is a weighting of an existing matrix, it is compatible with all matrix-based methods. We therefore tested RCW with all the remaining statistics presented in this work.

#### 5) Multi-dimensional amino acid representation (MDAR)

In previous work addressing the question of how to detect coevolution, there have been two main approaches to looking at the sequence data, namely one where the focus is on individual amino acids, and another where the focus is instead on groups of amino acids. In our view, both approaches have drawbacks. The first approach treats all the 20 × 19 possible amino acid substitutions as different. A huge amount of evidence in phylogenetic science points to this not being the case. For instance, serine is physico-chemically similar to threonine, and substitution with the former is in some sense similar to substitution with the latter. The second approach has two problems: it relies on just one out of the many possible ways of grouping amino acids, and it treats amino acids within any group as completely identical while amino acids in different groups are seen as entirely different.

We have tried to accommodate for the similarities between amino acids by representing each amino acid by a vector of length 20, each dimension being a measure of the similarity between that amino acid and each other amino acid. This vector is, in fact, a row from the BLOSUM62 matrix, which represents similarities between amino acids based on empirically observed substitution patterns. Therefore, each site in the alignment becomes a matrix instead of a vector. By representing amino acids in this multidimensional fashion we solve the above-mentioned problem on how to categorize amino acids, but we are left with the problem of measuring coevolution. Mutual Information and similar statistics have no equivalent in this framework. So to compare two sites we start by calculating the correlation between them. We do that by calculating the Mantel correlation between the two matrices that represent each site.

#### 6) Multi-dimensional amino acid representation vs tree

where "corr" stands for Mantel Correlation. The real calculation was a transformation of the above one with the denominator scaled between 1 and 0 and then multiplied by the numerator, to prevent divisions by 0 or extremely low values.

### Evaluation of performance

To evaluate the performance of the methods on each alignment, we used the Area Under the ROC Curve (AUC) using[29]. The AUC is calculated from a ranked list, being threshold independent. It has the value 1 when all the positives are ranked higher than the negatives. In our case, the list is the output of each method, and with alignments of length 300 the list can extend up to 300 × 299/2 = 44850 pairs of sites. For simplicity, we considered only the 200 highest ranked pairs. In case some of the coevolving pairs were not present in the best 200, they were placed at the bottom of the list. This procedure means that the expected value of a random classifier is ≈ 0 (in most instances where AUC is applied, the whole list is used, the expected value of a random classifier is 0.5).

## Results and discussion

### General performance

As mentioned in the previous section, we simulated alignments containing 20 pairs of coevolving residues. This was done for 4 different numbers of taxa, for 3 different levels of coevolution and with data being simulated by 3 different methods. For each combination of these conditions, we generated 10 alignments (replicates) and analyzed each one of them by the different methods presented in Materials and Methods.

One can see that all non-Row-and-Column-Weighting methods outperform plain Mutual Information. One can also see that Row and Column Weighting improves all those methods which do not incorporate a multidimensional amino acid representation, in which case it totally ruins their performance. The most striking improvement is when Row and Column Weighting is done on a Mutual Information matrix, in which case the performance is 63% better than Mutual Information is on its own.

### Performance under different evolution models

### Combining different methods

*e.g*., MDRA vs. Tree) can give a positive contribution when combined with a sufficiently different statistic.

Error bars, for an α = 0.05 and with Bonferroni correction, have also been added to the plot. These provide an indication of the level of randomness affecting our results, which we consider quite low. In addition to AUC, which is generally considered to be the best measure of overall performance, we have also used an alternative measure to analyze our results. Specifically, we counted the fraction of positives present in the best-ranked 20 pairs of sites (recall that there are 20 positives in each dataset). The average values are plotted as black bars, also on figure 6. We can see that this measure does not follow AUC closely, but the correlation is good enough for our purpose (RCW MI is the best statistic under both measures, and from the above 11 statistics, 5 will stay in the best 11 group). The major difference is that, under the fraction of positives in the best 20 pairs of sites, the best methods are not the intersections containing RCW MI.

### The effect of conservation on performance

## Conclusion

Driven by the lack of a credible biological benchmarking system in which to test coevolution detection methods, we established a realistic and broad platform for benchmarking 2-way-coevolution at the protein sequence level. In our system, one can simulate coevolving sequences varying in the model of coevolution used, the strength of that coevolution, the evolutionary rate and the number of taxa. The output is then evaluated by comparing the Area Under the ROC Curve (AUC) or by the ratio between true positives and total positives.

We have shown how small differences in the benchmarking system can lead to disparate results and how strict and consensual one has to be in defining coevolution and the methods of its evaluation.

We have studied the effect of different rates of evolution between positions in one alignment, and verified that it is a major player in uncovering coevolution, and shown that, in this context, Mutual Information suffers from an "attraction" to quickly evolving sites that prevents it from becoming an effective coevolution detection measure.

To counter this effect, we have proposed several measures, including Row and Column Weighting of output matrices, which proved to be a strong improvement, increasing the accuracy of Mutual Information (measured by the Area Under the ROC Curve and in the conditions tested) by 63%. We have also included the concept of Multi-dimensional Amino acid Representation, which we believe has potential to be improved.

Finally, we have also compared the intersection of several measures, showing that this, in general, leads to an improvement in accuracy.

## Declarations

### Acknowledgements

The authors would like to acknowledge Ole Lund for giving us the idea on Row and Column Weighting and Peter W Sackett for his helpfulness. The authors would also like to thank one anonymous reviewer and Jeff L. Thorne for their comments that greatly improved the manuscript. RGO thankfully acknowledges grant SFRH/BD/12448/2003 from F.C.T., Portuguese Ministério da Ciência e Ensino Superior. The authors would like to thank the Danish Center for Scientific Computing.

## Authors’ Affiliations

## References

- Pollock DD:
**Genomic biodiversity, phylogenetics and coevolution in proteins.***Appl Bioinformatics*2002,**1**(2)**:**81–92.PubMed - Pazos F, Helmer-Citterich M, Ausiello G, Valencia A:
**Correlated mutations contain information about protein-protein interaction.***J Mol Biol*1997,**271**(4)**:**511–523.View ArticlePubMed - Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R:
**Evolutionary information for specifying a protein fold.***Nature*2005,**437**(7058)**:**512–518.View ArticlePubMed - Ridley M:
**The explanation of organic diversity: the comparative method and adaptations for mating.**Oxford University Press 1983. - Maddison WP:
**A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of a phylogenetic tree.***Evolution*1990,**44:**539–557.View Article - Harvey PA, Pagel MD:
**The comparative method in evolutionary biology.**Oxford University Press 1991. - Pollock DD, Taylor WR, Goldman N:
**Coevolving protein residues: maximum likelihood identification and relationship to structure.***J Mol Biol*1999,**287**(1)**:**187–198.View ArticlePubMed - Dimmic MW, Hubisz MJ, Bustamante CD, Nielsen R:
**Detecting coevolving amino acid sites using Bayesian mutational mapping.***Bioinformatics*2005,**21**(Suppl 1)**:**i126-i135.View ArticlePubMed - McLachlan AD:
**Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551.***J Mol Biol*1971,**61**(2)**:**409–424.View ArticlePubMed - Gobel U, Sander C, Schneider R, Valencia A:
**Correlated mutations and residue contacts in proteins.***Proteins*1994,**18**(4)**:**309–317.View ArticlePubMed - Kass I, Horovitz A:
**Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations.***Proteins*2002,**48**(4)**:**611–617.View ArticlePubMed - Tillier ER, Lui TW:
**Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments.***Bioinformatics*2003,**19**(6)**:**750–755.View ArticlePubMed - Pritchard L, Bladon P, J MOM, M JD:
**Evaluation of a novel method for the identification of coevolving protein residues.***Protein Eng*2001,**14**(8)**:**549–555.View ArticlePubMed - Noivirt O, Eisenstein M, Horovitz A:
**Detection and reduction of evolutionary noise in correlated mutation analysis.***Protein Eng Des Sel*2005,**18**(5)**:**247–253.View ArticlePubMed - Martin LC, Gloor GB, Dunn SD, Wahl LM:
**Using information theory to search for co-evolving residues in proteins.***Bioinformatics*2005,**21**(22)**:**4116–4124.View ArticlePubMed - Kundrotas PJ, Alexov EG:
**Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives.***BMC Bioinformatics*2006,**7:**503.View ArticlePubMed - Fares MA, Travers SA:
**A novel method for detecting intramolecular coevolution: adding a further dimension to selective constraints analyses.***Genetics*2006,**173**(1)**:**9–23.View ArticlePubMed - Gloor GB, Martin LC, Wahl LM, Dunn SD:
**Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions.***Biochemistry*2005,**44**(19)**:**7156–7165.View ArticlePubMed - Dekker JP, Fodor A, Aldrich RW, Yellen G:
**A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments.***Bioinformatics*2004,**20**(10)**:**1565–1572.View ArticlePubMed - Suel GM, Lockless SW, Wall MA, Ranganathan R:
**Evolutionarily conserved networks of residues mediate allosteric communication in proteins.***Nat Struct Biol*2003,**10**(1)**:**59–69.View ArticlePubMed - Fodor AA, Aldrich RW:
**Influence of conservation on calculations of amino acid covariance in multiple sequence alignments.***Proteins*2004,**56**(2)**:**211–221.View ArticlePubMed - Gesell T, von Haeseler A:
**In silico sequence evolution with site-specific interactions along phylogenetic trees.***Bioinformatics*2006,**22**(6)**:**716–722.View ArticlePubMed - Parisi G, Echave J:
**Structural constraints and emergence of sequence patterns in protein evolution.***Mol Biol Evol*2001,**18**(5)**:**750–756.PubMed - Rodrigue N, Philippe H, Lartillot N:
**Assessing site-interdependent phylogenetic models of sequence evolution.***Mol Biol Evol*2006,**23**(9)**:**1762–1775.View ArticlePubMed - Henikoff S, Henikoff JG:
**Amino acid substitution matrices from protein blocks.***Proc Natl Acad Sci USA*1992,**89**(22)**:**10915–10919.View ArticlePubMed - Gorodkin J, Staerfeldt HH, Lund O, Brunak S:
**MatrixPlot: visualizing sequence constraints.***Bioinformatics*1999,**15**(9)**:**769–770.View ArticlePubMed - Shannon CE:
**A mathematical theory of communication.***Bell System Technical Journal*1948,**27:**379–423. - Buck MJ, Atchley WR:
**Networks of coevolving sites in structural and functional domains of serpin proteins.***Mol Biol Evol*2005,**22**(7)**:**1627–1634.View ArticlePubMed - Fawcett T:
**ROC graphs: Notes and practical considerations for researchers.***Technical report**(Edited by: Laboratories H).*Palo Alto: HP Laboratories 2004.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.