Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail
- Stefan Wolfsheimer^{1, 2},
- Bernd Burghardt^{1} and
- Alexander K Hartmann^{1, 2}
DOI: 10.1186/1748-7188-2-9
© Wolfsheimer et al; licensee BioMed Central Ltd. 2007
Received: 05 October 2006
Accepted: 11 July 2007
Published: 11 July 2007
Abstract
Background
The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant.
Results
We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was just applied to one single example case. Therefore, we consider here a large set of parameters:
We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor, is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we also study the distribution of the sum statistics of the k best alignments.
Conclusion
Our results show that the statistics of gapped and ungapped local alignments deviate significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein databases. The case of sum statistics of k best alignments is included.
Background
Sequence alignment is a powerful tool in bioinformatics [1, 2] to detect evolutionarily related proteins by comparing their sequences of amino acids. Basically one wants to determine the "similarity" of the sequences. For example, given a protein in a database like PDB [3], such similarity analysis can be used to detect other proteins which are evolutionarily close to it. Related approaches are also used for the comparison of DNA sequences, e.g. shotgun DNA sequencing [4], but the application to DNA is not considered in this article.
Alignment algorithms find optimum alignments and maximum alignment scores S of two or more sequences for a given scoring system. Needleman and Wunsch suggested a method to compute global alignments [5], whereas the Smith-Waterman algorithm [6] aims at finding local similarities. Insertions and deletions of residues are taken into account by allowing for gaps in the alignment. Gaps yield a negative contribution to the alignment score and are usually modeled by a score function g (l) that depends on the gap length l. Affine gap costs are widely used because, for two given sequences of lengths L and M, fast algorithms with running time $\mathcal{O}$ (LM) are available for this case [7]. Note that for database queries even this is too slow, hence fast heuristics like BLAST [8] are used there.
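The $\mathcal{O}$ (LM) computation for affine gap costs can be illustrated with a minimal sketch of the usual three-state dynamic programming recursion. The gap convention g(l) = α + β(l − 1), the function name and the toy scoring function in the usage note are assumptions made for illustration; they are not taken from the text.

```python
def local_alignment_score(x, y, score, alpha, beta):
    """Optimum local alignment score with affine gap costs in O(len(x) * len(y)) time.

    Assumed gap convention (not stated in the text): a gap of length l
    costs g(l) = alpha + beta * (l - 1).
    """
    neg_inf = float("-inf")
    n_cols = len(y) + 1
    h_prev = [0] * n_cols          # best score of an alignment ending at (i - 1, j)
    f_prev = [neg_inf] * n_cols    # same, but ending with a gap that consumes x[i - 1]
    best = 0
    for a in x:
        h_cur = [0] * n_cols
        f_cur = [neg_inf] * n_cols
        e = neg_inf                # ending with a gap that consumes y[j - 1]
        for j in range(1, n_cols):
            e = max(h_cur[j - 1] - alpha, e - beta)              # open or extend
            f_cur[j] = max(h_prev[j] - alpha, f_prev[j] - beta)  # open or extend
            diag = h_prev[j - 1] + score(a, y[j - 1])            # pair two letters
            h_cur[j] = max(0, diag, e, f_cur[j])                 # 0 restarts the alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

With a hypothetical toy scoring function `toy = lambda a, b: 2 if a == b else -3`, the call `local_alignment_score("AAAATTTT", "AAAACTTTT", toy, 3, 1)` bridges the extra letter with one gap. Passing `alpha=float("inf")` forbids gaps and gives the ungapped score.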
By itself, the alignment score, which measures the similarity of two given sequences, does not contain any information about the statistical significance of an alignment. One approach to quantify the statistical significance is to compute the p-value for a given score S. This means that, under a random sequence model, one wants to know the probability for the occurrence of at least one hit with a score S greater than or equal to some given threshold value b, i.e. ℙ(S ≥ b). Often E-values are used instead. They describe the expected number of hits with a score greater than or equal to some threshold value. One way to assess the statistical significance is to use the null model of random sequences. Then the optimal alignment score S becomes a random variable, and the probability of occurrence of S under this model, P (s) = ℙ (S = s), provides estimates for p-values. Analytic expressions for P (s) are only known asymptotically in the case of gapless alignments of long sequences, where an extreme value distribution (also called Gumbel distribution) [9, 10] was found. For alignments with gaps, such analytical expressions are not available. Approximations for scenarios with gaps based on probabilistic alignment [11–13], large deviations [14] and a Poisson model [15] have been developed. Altschul and Gish [16] investigated the score statistics of random sequences for a number of scoring systems and gap parameters by computer simulations: they obtained histograms of optimum scores for randomly sampled pairs of sequences by simple sampling. By curve fitting, they showed that in the region of high probability the extreme value distribution describes the data well, also for gapped alignments of finite sequences. Additionally, they found that the theoretical predictions for the relation between the scoring system on one side and the Gumbel parameters on the other side hold approximately for gapped alignments.
In this context they obtained two improvements: using a correction to account for finite sequence lengths and sum statistics of the k-best alignments, theoretical predictions for ungapped alignments could be applied more accurately to gapped alignments. Recently Olsen et al. introduced the "island method" [17, 18], which reduces sampling time. BLAST [8] uses precomputed data, generated with the island method, to estimate E-values. In any case, as already pointed out, the studies in Refs. [16] and [18] give reliable data only in the region where P (s) is large. This is outside the region of biological interest because pairs of biologically related sequences have a higher similarity than pairs of purely randomly drawn sequences.
To overcome this drawback a rare-event sampling technique was proposed recently [19], which is based on methods from statistical physics. This general approach makes it possible to obtain the distribution over a wide range, in the present case down to P (s) = 10^{-40}. So far this method has been applied to one relevant case only, namely protein alignment with the BLOSUM62 score matrix [7] and affine gap costs with α = 12 opening and β = 1 extension costs. It turned out that, at least for this one scoring matrix and set of gap-cost parameters, the distribution deviates from the Gumbel form in the biologically relevant rare-event tail, where simple sampling methods fail. Empirically, a Gaussian correction to the original distribution was proposed for this case.
Results as in Ref. [19] are only useful if one obtains the distribution for a large range of parameter values which are commonly used in bioinformatics. It is the purpose of this work to study the distribution of S for other relevant cases. Here we consider the BLOSUM62 and the PAM250 score matrices in connection with various parameters α, β of affine gap costs.
The paper is organized as follows. In the second section we define alignments formally and state a few main results on the statistics of local sequence alignment. Next, we state the rare-event approach used here, and in the fourth section we explain our approach in detail. We introduce some toy examples which are also used to evaluate the convergence properties of the algorithm. In the fifth section, we present our results for BLOSUM62 and PAM250 matrices in conjunction with different affine gap costs. We also show our results for the sum statistics of the k largest alignments. In the last section, we summarize and discuss our results.
Statistics of local sequence alignment
In this section, we define sequence alignment, and state some analytical results for the distribution of the optimum scores S over pairs of random sequences.
which can be obtained in $\mathcal{O}$(LM) time [7].
In the case of gapless optimum local alignments of two random sequences of L and M independent letters from Σ with frequencies {f_{ a }} with a ∈ Σ and ∑_{ a }f_{ a }= 1, referred to as the null model, the score statistics can be calculated analytically in the asymptotic regime of long sequences [9, 10].
In this case one obtains the Gumbel distribution (Karlin-Altschul statistics) [23]
ℙ(S ≥ b) = 1 - exp [- KLM e^{-λb}]
or
P_{Gumbel} (s) = ℙ(S = s) = λ KLM exp [-λ s - KLM e^{-λ s}]
The parameters λ and K of Eq. (3) can be derived directly from the score matrix σ (a, b) and frequencies f_{ a }[9, 10].
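The two expressions above translate directly into code. The following sketch evaluates the Gumbel tail probability and the corresponding density; the function names are illustrative, and the parameter values in the usage note (including the sequence lengths) are placeholders rather than fitted values.

```python
import math

def gumbel_tail(b, lam, K, L, M):
    """P(S >= b) = 1 - exp(-K L M exp(-lam * b))."""
    # expm1 keeps precision when the probability is tiny (rare-event tail)
    return -math.expm1(-K * L * M * math.exp(-lam * b))

def gumbel_density(s, lam, K, L, M):
    """P(s) = lam * K L M * exp(-lam * s - K L M exp(-lam * s))."""
    x = K * L * M * math.exp(-lam * s)
    return lam * x * math.exp(-x)
```

For example, `gumbel_tail(60, 0.267, 0.041, 300, 300)` gives the p-value of a score of 60 for two sequences of length 300 under these placeholder parameters. In the far tail the result approaches K L M exp(-λ b), which is the regime where the deviations discussed in this paper matter.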
As pointed out by Altschul and Gish [16], in finite systems there occur edge effects: An alignment may extend to the end of either sequence and the score will be distorted towards lower values and high scores become less probable. Since this effect vanishes in the limit of infinite sequences, the tail of Eq. (3) can be understood as an upper bound for finite sequences.
Arratia and Waterman [24] predicted a phase transition between a linear phase and a logarithmic phase, i.e. a linear growth of the expected score as a function of the sequence length, changing to a logarithmic growth with increasing gap costs. In the linear phase an optimum alignment may spread over a large range of the sequences and the statistical theory breaks down. However, only the logarithmic phase is of interest in biological questions because the alignment algorithm becomes more sensitive in this phase, especially near the threshold [25].
In the asymptotic theory the score can be seen as a continuous variable and the probabilities Eq. (4) and Eq. (5) become probability densities. Then the probability of finding a normalized score b or larger is given by the integral $\mathbb{P}(S\ge b)={\displaystyle {\int}_{b}^{\infty}f(t)dt}$. However, in computer simulations the score is a discrete variable and therefore the normalization constants in Eq. (5) differ from those for continuous scoring. Below we will compare the results of our numerical studies to this distribution in the tail of the data for values k = 2, ..., 5.
Sampling of rare-events
Metropolis Hastings Algorithm
As already pointed out, the main purpose of this paper is to calculate the tail of the distribution of optimum scores of gapped local alignments over pairs of randomly and independently drawn sequences of finite lengths. The basic idea of our approach is to generate the sequences from different distributions, which are biased towards higher scores.
In order to be more precise let us denote the state space of all possible pairs of sequences (x, y) as $\mathcal{X}$ and an element in this space as a configuration. We write X = (x, y).
Since the region of biological interest is located in the rare-event tail, a huge number of samples would be needed to achieve an acceptable accuracy. In practice the rare-event tail becomes inaccessible.
where ${\tilde{q}}_{T}$ is the unnormalized weight of a configuration, Z_{ T }is a (usually unknown) normalization constant and T an adjustable parameter, which we will call "temperature". (In the framework of statistical mechanics, which is closely related to our method, the parameter T describes the temperature of a physical system. The pair of sequences can be seen as a configuration of a physical system and the negative score as the energy function. Then exp [S (X)/T] refers to the so-called Gibbs-Boltzmann distribution.) The close-to-Gumbel form of the distribution is also directly related to the so-called "large deviation rate function", which basically describes the decay rate of the tail of the distribution. Note that, if the score distribution is an exact Gumbel distribution Eq. (3), i.e. the rate function is a known constant λ, then setting T = 1/λ in Eq. (7) yields a "flat score histogram" for sufficiently large s. Hence, in this case, a simulation at a single carefully chosen value of T would be sufficient to obtain the full result. Since P (s) does not follow the Gumbel form exactly, importance sampling has to be applied. Each value of T selects one region of the distribution around which a high accuracy is obtained.
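The flat-histogram remark can be checked numerically. The sketch below multiplies the Gumbel density by the bias factor exp (s/T) and is evaluated at large scores; for T = 1/λ the product tends to the constant λ KLM. The function name and the parameter values used in the usage note are illustrative assumptions.

```python
import math

def biased_score_weight(s, lam, K, L, M, T):
    """Unnormalized biased score distribution P_Gumbel(s) * exp(s / T).

    Evaluated in log space so that very large scores do not underflow.
    """
    log_x = math.log(K * L * M) - lam * s              # log of K L M exp(-lam s)
    log_p = math.log(lam) + log_x - math.exp(log_x)    # log of the Gumbel density
    return math.exp(log_p + s / T)
```

With placeholder values λ = 0.267, K = 0.041 and L = M = 300, the weights at s = 80 and s = 150 are practically identical for T = 1/λ, whereas for T = ∞ (`math.inf`, the unbiased case) they differ by many orders of magnitude.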
This importance sampling approach is conceptually related to the method of "measure change" in large deviation theory. For example Siegmund and Yakir [14] approximated the p-value for local sequence alignment by considering the log-likelihood ratio between an alternative measure and the measure of the null model. Under the new measure a rare event is more likely to occur than under the original null measure and approximations become possible. Another example can be found in Ref. [29], where techniques from large deviation theory were applied to prove "asymptotic efficiency" of rare-event simulations.
with ΔS = S (X*) - S (X). If the trial configuration is not accepted, the previous configuration X is kept for the next time step t + 1. In this way, the Markov chain fulfills the detailed balance condition P (X*, X)$\tilde{p}$(X* → X)·q_{ T }(X*) = P (X, X*)$\tilde{p}$(X → X*)·q_{ T }(X). In this case it has been proven that an ergodic Markov chain converges to the stationary distribution q_{ T }. Ergodicity means that there is a non-zero probability for a path between any pair (X_{ 1 }, X_{ 2 }) of configurations.
We used a simple way to define the neighborhood of a configuration and constructed the trial configuration as follows: First a letter a is drawn from the alphabet Σ according to the letter weights f_{ a }and next one of the sequences (x or y) and a position i is chosen randomly. Finally, the letter at position i is replaced by a.
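The trial move described above can be sketched as follows. Two assumptions are made here. First, the acceptance probability is taken to be min(1, exp(ΔS/T)), the standard Metropolis choice when the trial letter is drawn from the null frequencies f_a. Second, an ungapped toy score with match +1 and mismatch -1 replaces the full alignment score to keep the example short. All function names are hypothetical.

```python
import math
import random

def ungapped_local_score(x, y, match=1, mismatch=-1):
    """Toy stand-in for S: best ungapped local alignment score."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            cur[j] = max(0, prev[j - 1] + (match if a == b else mismatch))
            best = max(best, cur[j])
        prev = cur
    return best

def metropolis_step(x, y, s_old, T, letters, freqs, rng):
    """Redraw one letter of x or y; accept with min(1, exp(dS / T)) (assumed rule)."""
    seq = x if rng.random() < 0.5 else y
    i = rng.randrange(len(seq))
    old = seq[i]
    seq[i] = rng.choices(letters, weights=freqs)[0]   # letter drawn from f_a
    s_new = ungapped_local_score(x, y)
    d_s = s_new - s_old
    if d_s >= 0 or rng.random() < math.exp(d_s / T):
        return s_new                                  # trial configuration accepted
    seq[i] = old                                      # rejected: keep old configuration
    return s_old

def mean_score(T, length=12, n_steps=6000, seed=1):
    """Average score over the second half of a short chain at temperature T."""
    rng = random.Random(seed)
    letters, freqs = "ACGT", [0.25] * 4
    x = rng.choices(letters, weights=freqs, k=length)
    y = rng.choices(letters, weights=freqs, k=length)
    s = ungapped_local_score(x, y)
    total = 0
    for t in range(n_steps):
        s = metropolis_step(x, y, s, T, letters, freqs, rng)
        if t >= n_steps // 2:
            total += s
    return total / (n_steps - n_steps // 2)
```

Comparing `mean_score(0.4)` with `mean_score(math.inf)` shows the intended effect: at low temperature the chain spends its time at scores that simple sampling (T = ∞) essentially never reaches.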
and estimate Z from the sample $Z={\displaystyle {\sum}_{k=1}^{n}{\tilde{q}}_{{T}^{\prime}}}({X}_{k})/{\tilde{q}}_{T}({X}_{k})$. A detailed discussion about this issue can be found in Ref. [31, 32]. In practice this may work badly as soon as the parameter ranges of the given distribution and the target distribution do not overlap sufficiently. In this case q_{ T' }(X_{ i }) is very small, but the configurations where q_{ T' }(X)/q_{ T }(X) is sufficiently large are not generated because q_{ T }(X) is relatively small for those. Therefore we sampled a mixture of many coupled Monte Carlo chains and reweighted the mixture, which is explained in detail in the next section. This allows for a large overlap between neighboring distributions and makes it possible to determine the normalization constants, up to an irrelevant global constant.
Metropolis Coupled MCMC
Metropolis Coupled Markov Chain Monte Carlo (MCMCMC) was first introduced by Charles Geyer [33] and then reinvented by Hukushima and Nemoto [34] under the term exchange Monte Carlo. In the physics literature MCMCMC is often referred to as parallel tempering. The method has become a standard tool in disordered systems with a rough (free) energy landscape [35]. These rough energy landscapes are characterized by high energy barriers and can be found for problems like protein folding [36–40], nucleation [41], spin-glasses [42, 43] and other models characterized by rare events [19, 44]. In the last decade it turned out that MCMCMC accelerates equilibration and mixing remarkably.
In the framework of MCMCMC m copies X^{(1)}, ..., X^{(m)} of the system held at different temperatures T_{1} < T_{2} < ... < T_{ m } are simulated in parallel. This means one samples from the product of the state space $\mathcal{X}$^{ m } weighted with the joint distribution with weights $\prod_{j=1}^{m} q_{T_{j}}$. Since the different copies are allowed to exchange temperatures during the simulation, let us define the space of all possible mappings from the m configurations to the m temperatures as temperature space.
where $\Delta {\beta}_{k}=\frac{1}{{T}_{k+1}}-\frac{1}{{T}_{k}}$, ΔS = S (X^{(k + 1)}) - S (X^{(k)}) and all weights are calculated with the configurations before the flip. This leads to a "random walk in temperature space" of the configurations.
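A minimal sketch of the exchange step, using Δβ_k and ΔS as defined above. The acceptance probability min(1, exp(-Δβ_k ΔS)) follows from the ratio of the joint weights exp [S/T] before and after the flip; it is written out here as an assumption, and the function names are illustrative.

```python
import math
import random

def swap_probability(t_k, t_k1, s_k, s_k1):
    """Acceptance probability for exchanging the copies at T_k < T_{k+1}."""
    d_beta = 1.0 / t_k1 - 1.0 / t_k      # 1 / inf evaluates to 0.0 for T = infinity
    d_s = s_k1 - s_k
    exponent = -d_beta * d_s
    return 1.0 if exponent >= 0 else math.exp(exponent)

def exchange_sweep(temps, configs, scores, rng):
    """Attempt one exchange between every pair of neighbouring temperatures."""
    for k in range(len(temps) - 1):
        p = swap_probability(temps[k], temps[k + 1], scores[k], scores[k + 1])
        if rng.random() < p:
            configs[k], configs[k + 1] = configs[k + 1], configs[k]
            scores[k], scores[k + 1] = scores[k + 1], scores[k]
```

An exchange that moves the higher-scoring configuration to the lower temperature is always accepted; the reverse move is accepted only with an exponentially small probability. In a full simulation such sweeps would alternate with ordinary Metropolis updates of each copy.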
Note that another possible approach based on Markov chains to compute p-values ℙ [X > b] of a random model with a random variable X was introduced by Wilbur [45]. The first step is to sample from an unbiased Markov chain based on the model of interest and compute the median of the (high probability) distribution. In the second iteration the random walk is truncated such that only values larger than the median of the first iteration occur. This corresponds to choosing a lower temperature T in Eq. (7). The third iteration uses the median of the second iteration and so forth. This is repeated until a fraction of 1/4 of all events lies beyond a certain threshold value, leading to a non-decreasing sequence of splitting intervals defined by the medians of each iteration. This sequence is used in the second stage of the algorithm, where p-values are computed explicitly by multiplying the p-values of the truncated distribution in each iteration.
Although this method is easy to implement and errors can be estimated relatively simply, the MCMCMC approach has the advantage that the different configurations are not subjected to a sequence of decreasing temperatures, but perform a random walk in temperature space, i.e. visit all temperatures several times. Thus, mixing is accelerated and hence fewer Monte Carlo steps are required.
Reweighting the mixture
with n = ∑_{ j }n_{ j }. The unknown constants c ≡ (c_{1}, ..., c_{ m }) may be estimated by reverse logistic regression introduced by Geyer [46]. Here we used an alternative approach to obtain the constants c, developed by Meng and Wong [47], which we explain next.
Since the global normalization constant Z in Eq. (11) is trivial, the problem is reduced to the estimation of (m - 1) ratios of normalization constants to some reference value. One possible choice is to fix the normalization constant of q_{1} and estimate the ratios r_{ i }= c_{1}/c_{ i }(i = 2, ..., m).
with a_{ ii }= ∑_{j ≠ i}b_{ ij }and a_{ ij }= -b_{ ij }for i ≠ j. These equations cannot be solved directly, because the coefficients a_{ ij }depend on the unknown ratios. However, it is possible to solve Eq. (13) self-consistently. Using $\widehat{b}$ = (b_{11}, b_{21},..., b_{m 1}) and including explicitly the dependence on r = (r_{1}, r_{2},..., r_{ m }) we obtain
A (r^{(t)})·r^{(t + 1)}= b(r^{(t)}).
This equation can be solved by starting with r^{(1)} = (1, 1, ..., 1) and iteratively solving for r^{(t + 1)} until convergence. Following the paper of Meng and Wong [47], Eq. (14) with the choice ${\alpha}_{ij}(X)=\frac{{n}_{i}{n}_{j}}{{\left|n\right|}^{2}}\cdot {q}_{\text{mix}}(X)$ converges to the same estimator as proposed by Geyer [46], which is based on maximization of a quasi-loglikelihood. The desired probability P (s) can be obtained by setting q_{ T' }to the unbiased weight q_{∞} = 1 and estimating the expectation values of the indicator functions h_{ S }in Eq. (11).
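The idea of a self-consistent solution can be illustrated with a small toy example. The sketch below does not implement the matrix form A (r)·r = b(r). Instead it iterates the fixed-point equation Z_i = ∑_k q̃_i(X_k) / ∑_j n_j q̃_j(X_k)/Z_j over the pooled sample, which is assumed here to lead to the same estimator as the quasi-likelihood approach. The toy null distribution, the temperatures and all names are hypothetical.

```python
import math
import random

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def estimate_log_z(log_w, counts, n_iter=200, tol=1e-9):
    """Self-consistent estimate of log Z_j relative to the first distribution.

    log_w[k][j] is the log unnormalized weight of pooled sample k under
    distribution j; counts[j] is the number of samples drawn from j.
    """
    m = len(counts)
    log_n = [math.log(c) for c in counts]
    log_z = [0.0] * m
    for _ in range(n_iter):
        # log of the mixture weight sum_j n_j q_j(X_k) / Z_j for every sample
        log_mix = [logsumexp([log_n[j] + row[j] - log_z[j] for j in range(m)])
                   for row in log_w]
        new = [logsumexp([row[i] - mix for row, mix in zip(log_w, log_mix)])
               for i in range(m)]
        new = [v - new[0] for v in new]        # fix the first constant as reference
        shift = max(abs(a - b) for a, b in zip(new, log_z))
        log_z = new
        if shift < tol:
            break
    return log_z

def demo(seed=0, n_per_chain=2000):
    """Toy check: scores 0..10 with null weight 2^-s, biased by exp(s / T)."""
    rng = random.Random(seed)
    states = list(range(11))
    null = [2.0 ** -s for s in states]
    inv_temps = [0.0, 1.0 / 3.0, 2.0 / 3.0]    # T = infinity, 3, 1.5
    pooled = []
    for beta in inv_temps:
        weights = [p * math.exp(beta * s) for p, s in zip(null, states)]
        pooled += rng.choices(states, weights=weights, k=n_per_chain)
    log_w = [[beta * s for beta in inv_temps] for s in pooled]
    estimate = estimate_log_z(log_w, [n_per_chain] * len(inv_temps))
    norm = sum(null)
    exact = [math.log(sum(p * math.exp(beta * s) for p, s in zip(null, states)) / norm)
             for beta in inv_temps]
    return estimate, exact
```

Calling `demo()` returns the estimated and the exactly enumerated log-ratios for the toy model. Because neighbouring distributions overlap strongly in this example, the estimates agree with the exact values within the statistical error of a few thousand samples.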
Illustration and convergence diagnostics
In order to guarantee that start configurations are taken from the stationary distribution, the first few iterations of the chains have to be discarded. The number of iterations to be discarded is called the burn-in or equilibration period. Usually one starts from a random (i.e. disordered) configuration and equilibrates the system. At the beginning of the simulation the system has a low score and hence it can reach in principle most regions of the score landscape. If the temperature is low, one sees when looking at Eq. (7) that configurations with large score dominate. Hence, typically the score increases or stays the same during the simulation with only few score-decreasing fluctuations.
Note that if "ground states" are also known, i.e. the maxima of the score landscape, the reverse process is possible, i.e. starting from a high maximum and sampling its local environment. One can use this fact to verify whether a system has equilibrated on a larger scale, i.e. whether it is able to overcome the typical barriers in the score landscape. This is the case when the average behavior for two runs, one starting with a disordered configuration and one starting with a "ground-state" configuration, is the same (within fluctuations). If the temperature is too small, this is usually not possible.
and affine gap costs with α = 4 and β = 2.
A more quantitative method was introduced by Raftery and Lewis [48, 49]; it estimates equilibration and sample times for a set of quantiles. Raftery and Lewis's program, which is available from StatLib [50] or in the CODA package [51], estimates a thinning interval n_{thin} as well. This means that only every n_{thin}th step is used for inference in order to avoid correlations between the scores at time t and t + Δt, which occur in MCMC in contrast to directly generating random sequences. The program requires three parameters: the desired accuracy r, the required probability s of attaining the specified accuracy and a less relevant tolerance parameter ε.
We compared the result of the estimate of the equilibration time with the simple visual approach: For the example given in Fig. 2 we maximized the numerical estimate of the equilibration time over a set of quantiles between 0.1 and 0.95 (for r = 0.0125, s = 0.95, ε = 0.001). The results for the equilibration time obtained by this approach are always much smaller than those obtained by visual inspection. For example for L = 20, the Raftery-Lewis approach gives an equilibration time of 800 steps for the lowest temperature, whereas Fig. 2 suggests 20000 steps. Therefore equilibrium might not be guaranteed with the Raftery-Lewis approach and the visual inspection seems to be more conservative.
Once the equilibration period is estimated one may check the convergence of the remaining parts of the chains to the equilibrium distributions. This was done by computing the Gelman and Rubin shrink factors R [49, 52, 53]. This diagnostic compares the "within-chain" and the "inter-chain variance" of a set of multiple Monte Carlo chains. When the factor R approaches 1 the within-chain variance dominates and the sampler has forgotten its starting point. For the lowest temperature in our toy model L = 20 we found R = 1.03 for the 99.995% quantile, which appears to be reasonable.
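For orientation, the basic form of the shrink factor can be computed in a few lines. This is a simplified sketch for a scalar time series from several chains of equal length; it omits the degrees-of-freedom correction used in some implementations and is not the quantile-based variant applied in this work. The function name is illustrative.

```python
import math
import random
import statistics

def shrink_factor(chains):
    """Basic Gelman-Rubin factor R for equally long chains of scalar values."""
    n = len(chains[0])
    means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(means)       # variance of the chain means
    pooled = (n - 1) / n * within + between_over_n    # pooled variance estimate
    return math.sqrt(pooled / within)
```

Chains that sample the same distribution give values close to 1, while chains that are stuck in different regions of the score landscape give values clearly above 1.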
which has a finite overlap between all pairs. Note that in general a weaker condition must be fulfilled, namely that a connected path from the lowest to the highest temperature must be possible, as outlined before. In more complex models only this condition might be fulfilled.
For small systems one may enumerate all possible configurations and compare the complete distribution with the Monte Carlo data. The empirical probability distribution for L = 10 in Fig. 5 coincides with the exact result, such that the difference is not visible in the plot. L = 10 is a very small system in contrast to real biological sequences, which are considered in section "Results", yet even for this size exact enumeration is only possible on a modern computer cluster. Hence only for L = 10 the relative error $\epsilon (s)=\frac{\left|{P}_{\text{sample}}(s)-{P}_{\text{exact}}(s)\right|}{{P}_{\text{exact}}(s)}$ (see inset of Fig. 6) can be computed on the full support. In principle one is able to reduce variance on the low score end of the distribution by introducing negative temperature values, but this is beyond the scope of this article.
Error estimation
For example the relative errors ${\sigma}_{{r}_{j}^{J}}/{r}_{j}$ of the normalization constant ratios increase from 8.6 × 10^{-4} for r_{2} to 1.29 × 10^{-2} for r_{5}. This indicates that the method is able to capture the error propagation of the relative normalization constants due to weak overlaps of distant distributions (see also Eq. (17)). Similar errors for the probabilities P (s) can be estimated by applying this approach.
Results
Optimal alignment statistics
Next, we show the results from the application of the method to biologically relevant systems: local sequence alignment of protein sequences using BLOSUM62 [20] and PAM250 [21, 22] matrices. We apply amino acid background frequencies by Robinson and Robinson [55]. We consider different affine gap cost with 10 ≤ α ≤ 16, β = 1 for the BLOSUM62 matrix and 11 ≤ α ≤ 17, β = 3 when using the PAM250 matrix, as well as infinite gap costs. We study ten different sequence lengths between M = L = 40 and M = L = 400, in detail L = 40, 60, 80, 100, 150, 200, 250, 300, 350, 400.
Thus the overlap graph is sufficiently connected. For L = 40 we obtained relative errors of the normalization constants between 10^{-4} (highest temperature) and 0.4 (lowest temperature) and similar values for L = 400.
Fit parameters of the modified Gumbel distribution Eq. (18) using the BLOSUM62 scoring matrix and affine gap costs with α = 10, β = 1. The column 10^{4} ${\lambda}_{2}^{\text{extra}}$ gives the estimated value of λ_{2} using the scaling relation Eq. (19). Fit parameters for other scoring systems are provided as supplementary material to this article [see additional file 1].
L, M | λ | 10^{4} λ_{2} | K | S_{0} | ${\chi}_{\ast}^{2}$ | 10^{4} ${\lambda}_{2}^{\text{extra}}$ |
---|---|---|---|---|---|---|
40 | 0.3272 ± 0.108% | 8.6347 ± 0.412% | 0.1028 ± 0.65% | 15.597 ± 0.0676% | 79.05 | 8.1560 ± 12.485% |
60 | 0.3034 ± 0.086% | 6.2007 ± 0.285% | 0.0751 ± 0.60% | 18.455 ± 0.0645% | 49.40 | 6.1711 ± 12.907% |
80 | 0.2892 ± 0.070% | 4.8781 ± 0.222% | 0.0612 ± 0.53% | 20.644 ± 0.0540% | 21.67 | 5.0458 ± 13.280% |
100 | 0.2747 ± 0.072% | 4.3187 ± 0.330% | 0.0472 ± 0.58% | 22.413 ± 0.0611% | 39.42 | 4.3056 ± 13.627% |
150 | 0.2541 ± 0.083% | 3.2974 ± 0.529% | 0.0303 ± 0.61% | 25.682 ± 0.0422% | 39.46 | 3.2047 ± 14.437% |
200 | 0.2432 ± 0.063% | 2.6343 ± 0.344% | 0.0241 ± 0.52% | 28.257 ± 0.0412% | 10.47 | 2.5806 ± 15.214% |
250 | 0.2359 ± 0.071% | 2.1999 ± 0.454% | 0.0198 ± 0.60% | 30.196 ± 0.0459% | 9.40 | 2.1701 ± 15.984% |
300 | 0.2303 ± 0.061% | 1.9101 ± 0.348% | 0.0174 ± 0.54% | 31.934 ± 0.0408% | 2.00 | 1.8758 ± 16.758% |
350 | 0.2261 ± 0.046% | 1.6404 ± 0.239% | 0.0153 ± 0.41% | 33.334 ± 0.0300% | 1.27 | 1.6525 ± 17.544% |
400 | 0.2224 ± 0.052% | 1.4806 ± 0.266% | 0.0136 ± 0.49% | 34.556 ± 0.0369% | 1.36 | 1.4762 ± 18.347% |
600 | 0.2140 ± 0.062% | 1.0206 ± 0.384% | 0.0106 ± 0.64% | 38.561 ± 0.0472% | 2.15 | 1.0250 ± 21.787% |
800 | 0.2090 ± 0.063% | 0.7660 ± 0.419% | 0.0088 ± 0.67% | 41.320 ± 0.0457% | 1.82 | 0.7691 ± 25.697% |
All estimated standard errors in this paper are written behind the values and separated by "±".
Note that ${\chi}_{\ast}^{2}$ is of the order of one only for sequences that are not too short. This means that Eq. (18) describes the data better for longer sequences. However, biologically relevant sequence lengths (L > 200) lie in the range where the fit works well. Moreover the results for shorter sequences are still several orders of magnitude below the naive Gumbel result, which yields a ${\chi}_{\ast}^{2}$ value of about 10^{4} for the L = 40 system.
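The fit parameters in the table can be used as in the following sketch. It assumes that the modified Gumbel distribution has the form of the Gumbel density multiplied by a Gaussian factor exp [-λ_{2}(s - S_{0})^{2}] with S_{0} = ln (KLM)/λ, and it omits the normalization constant. This form is an assumption based on the verbal description of a "Gumbel distribution with an additional Gaussian factor"; the function names are illustrative.

```python
import math

def gumbel_centre(lam, K, L, M):
    """S_0 = ln(K L M) / lam, the position of the maximum of the Gumbel density."""
    return math.log(K * L * M) / lam

def modified_gumbel_log_density(s, lam, lam2, s0):
    """log of [Gumbel density * exp(-lam2 (s - s0)^2)], normalization omitted.

    Assumed functional form; lam2 = 0 recovers the plain Gumbel density.
    """
    z = lam * (s - s0)
    return math.log(lam) - z - math.exp(-z) - lam2 * (s - s0) ** 2
```

For the first table row (L = M = 40, λ = 0.3272, K = 0.1028), `gumbel_centre` reproduces the tabulated S_{0} to within rounding. The difference between the plain and the modified log-density grows quadratically with the distance from S_{0}, which is why the correction matters mainly in the rare-event tail.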
We also tried smaller gap costs, α < 10 (β = 1, BLOSUM62) and α < 11 (β = 3, PAM250), but in these cases the distributions deviate from Gumbel not only in the tail but even in the high-probability region. The reason is presumably that the values of the parameters are close to the critical value of the linear-logarithmic phase transition [24], i.e. the alignment is not really local any more.
for the smallest gap costs and faster than a power law for larger gap costs.
Fitting parameters of the scaling relation Eq. (19).
Parameter | BLOSUM62 α = 10, β = 1 | BLOSUM62 α = 12, β = 1 |
---|---|---|
a | 0.00928 ± 0.0001 | 0.0309 ± 0.01 |
b | 0.643 ± 0.027 | 0.971 ± 0.08 |
10^{-5} ${\lambda}_{2}^{\ast}$ | 4.9 ± 1.2 | 3.2 ± 2.0 |

Parameter | PAM250 α = 11, β = 3 | PAM250 α = 13, β = 3 |
---|---|---|
a | 0.0049 ± 0.0008 | 0.0053 ± 0.0005 |
b | 0.575 ± 0.046 | 0.591 ± 0.023 |
10^{-5} ${\lambda}_{2}^{\ast}$ | 3.015 ± 2.0 | 6.1 ± 1.1 |
Note that these arguments are purely heuristic attempts to look at the scaling behaviour and its upper bound. It is hard to decide whether the extrapolation is valid for L = M → ∞. However, an important range of biologically interesting sequence lengths is covered by this scaling analysis.
where L is the query length and N the total number of amino acids of the entire database, with parameters K = 0.0410 and λ = 0.267. Using the suggested E-value of 10 [58], we find a cut-off of b_{cut} = 64.8 above which a result is considered to be significant, with ℙ [S > b_{cut}] = 4.75 × 10^{-5}. Our cumulative distribution achieves this probability at b_{cut} = 54, i.e. significantly below the BLAST value. Hence, using the true distributions of the scores, a considerable number of queries, namely those with a score between 54 and 64, are significant, in contrast to the result of the significance estimation within the Gumbel approximation. Hence, using the data provided in this work, one is able to estimate the significance of protein database queries for the most commonly used parameter sets with much higher precision than when applying the approximation of the Gumbel distribution.
Sum statistics of the k-best alignments
Temperature parameters for sum-statistics.
L | k = 2 | k = 3 | k = 4 | k = 5 |
---|---|---|---|---|
40 | 2.75, 3, 3.5, 4, 7, ∞ | |||
60 | 2.75, 3, 3.5, 4, 7, ∞ | |||
80 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
100 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
150 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
200 | 3.25, 3.5, 4, 7, ∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 4.75, 5, 5.25, 5.5, 6, 8, ∞ | 5.75, 6, 6.25, 6.5, 7, 8, 12, ∞ |
300 | 3.25, 3.5, 4, 7, ∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 4.75, 5, 5.25, 5.5, 6, 8, ∞ | 5.75, 6, 6.25, 6.5, 7, 8, 12, ∞ |
400 | 3.25, 3.5, 3.75, 4, 4.25, 5, 8, ∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 5.25, 5, 5.75, 6, 8, 10, ∞ | 6, 6.25, 6.5, 7, 9, 11, ∞ |
Note that for k > 3 the systems could not be equilibrated in the very low temperature regime T < 5. Therefore, for these cases, the tail could only be obtained in an intermediate range of probabilities (~10^{-20}), which is nevertheless low enough to obtain much better significance estimates than with a simple-sampling approach.
Correction parameter λ_{2} for the sum statistics k = 2 and k = 3. λ_{2} is estimated by a fit to Eq. (21) using the optimal Gumbel parameters λ and S_{0} from the optimal score statistics (k = 1). BLOSUM62 with affine gap costs (α = 12, β = 1) was used as scoring system.
L | 10^{4} ${\lambda}_{2}^{(k=2)}$ | 10^{4} ${\lambda}_{2}^{(k=3)}$ |
---|---|---|
60 | 2.692 ± 0.30% | |
80 | 1.631 ± 0.63% | 1.074 ± 2.59% |
100 | 1.488 ± 0.23% | 0.649 ± 2.06% |
150 | 1.056 ± 0.06% | 0.344 ± 1.90% |
200 | 0.749 ± 0.13% | 0.280 ± 1.14% |
300 | 0.463 ± 0.15% | 0.189 ± 0.70% |
400 | 0.338 ± 0.29% | 0.139 ± 0.92% |
Discussion and summary
We have studied the distribution of optimum alignment scores over a wide range using a rare-event sampling method. First, by comparing the results for a small 4-letter test system, we illustrated how the method works and provided some evidence for its convergence. In the main part, we considered protein alignment for two types of substitution matrices, i.e. BLOSUM and PAM matrices. We also studied many different sets of biologically relevant parameters by varying gap costs and sequence lengths.
For large enough gap costs it was previously assumed that the distribution follows the Gumbel extreme-value distribution, even when aligning finite sequences and allowing for gaps. Hence, the Gumbel distribution has so far been used for calculating p-values in protein databases. We observe clear deviations from the Gumbel distribution in the biologically relevant rare-event tail, which is out of reach of the simple sampling methods used so far.
An analysis of the scaling behavior of the correction parameter λ_{2} gives evidence that the Gumbel distribution correctly describes the data only in the limit of infinite sequence lengths, even for gapped sequence alignments. For finite protein lengths of biological relevance, we observed that the distributions can be fitted well by a Gumbel distribution with a Gaussian correction. Therefore, for databases like BLAST [8, 18, 58], we recommend using distribution functions determined by the empirical fitting parameters provided in this work, because the critical value S_{cut}, above which a result is considered to be significant, changes considerably, as we have seen.
We have also studied the sum statistics of the k best alignments. Again, a Gaussian correction to the assumed form of the distribution was found empirically. Extrapolation to infinitely long sequences gives good evidence that the ungapped statistical theory describes the gapped case for L = M → ∞ as well.
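The shift in S_{cut} can be illustrated with a small numerical sketch. It assumes that the modified Gumbel distribution has the form P(S) ∝ λ exp[-λ(S - S_{0}) - exp(-λ(S - S_{0}))] · exp[-λ_{2}(S - S_{0})^{2}], normalized over integer scores. The parameter values below are hypothetical placeholders, not fitted values from this work, and the function names are invented for the example.

```python
import math


def log_weight(s, lam, s0, lam2):
    """Log of the unnormalized modified Gumbel weight at score s.

    Assumed form: Gumbel density times a Gaussian factor
    exp(-lam2 * (s - s0)**2); lam2 = 0 gives the plain Gumbel case.
    """
    x = lam * (s - s0)
    return math.log(lam) - x - math.exp(-x) - lam2 * (s - s0) ** 2


def weights(lam, s0, lam2, s_max):
    """Unnormalized weights for the integer scores 0..s_max."""
    return [math.exp(log_weight(s, lam, s0, lam2)) for s in range(s_max + 1)]


def tail_probability(s_cut, lam, s0, lam2, s_max=2000):
    """P(S >= s_cut) under the distribution normalized on 0..s_max."""
    w = weights(lam, s0, lam2, s_max)
    return sum(w[s_cut:]) / sum(w)


def critical_score(alpha, lam, s0, lam2, s_max=2000):
    """Smallest integer score whose tail probability is <= alpha."""
    w = weights(lam, s0, lam2, s_max)
    total = sum(w)
    tail = 0.0
    for s in range(s_max, -1, -1):      # accumulate the tail from the top
        tail += w[s]
        if tail / total > alpha:
            return s + 1
    return 0


if __name__ == "__main__":
    LAM, S0, LAM2 = 0.27, 30.0, 1e-4    # hypothetical parameters
    ALPHA = 1e-10                       # hypothetical significance level
    print("Gumbel S_cut:         ", critical_score(ALPHA, LAM, S0, 0.0))
    print("modified Gumbel S_cut:", critical_score(ALPHA, LAM, S0, LAM2))
```

Because the Gaussian factor suppresses the far tail, the modified form reaches a given significance level at a lower score than the pure Gumbel form with the same λ and S_{0}; the size of the shift depends on the parameter values actually fitted.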
Declarations
Acknowledgements
We thank B. Morgenstern and P. Müller for critically reading the manuscript. The authors have received financial support from the VolkswagenStiftung (Germany) within the program "Nachwuchsgruppen an Universitäten", and from the European Community via the DYGLAGEMEM program.
References
- Brown S: Bioinformatics. 2000, Natick (MA): Eaton Publishing
- Rashidi S, Buehler L: Bioinformatics Basics. 2000, Boca Raton (FL): CRC Press
- The Protein Data Bank. http://www.pdb.org
- Fraser C, Gocayne J: The Minimal Gene Complement of Mycoplasma Genitalium. Science. 1995, 270: 397-
- Needleman SB, Wunsch CD: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J Mol Biol. 1970, 48: 443-453.
- Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol. 1981, 147: 195-197.
- Gotoh O: An Improved Algorithm for Matching Biological Sequences. J Mol Biol. 1982, 162: 705-
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool. J Mol Biol. 1990, 215: 403-410.
- Karlin S, Altschul S: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990, 87: 2264-
- Dembo A, Karlin S, Zeitouni O: Limit Distribution of Maximal Non-Aligned Two-Sequence Segmental Score. Ann Prob. 1994, 22: 2022-2039.
- Yu Y, Hwa T: Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models. J Comp Biol. 2001, 8 (3): 249-282.
- Yu Y, Bundschuh R, Hwa T: Statistical Significance and Extreme Ensemble of Gapped Local Hybrid Alignment. Biological Evolution and Statistical Physics. Edited by: Lässig M, Valeriani A. 2002, 3-22. Berlin: Springer-Verlag
- Kschischo M, Lässig M, Yu Y: Toward an accurate statistics of gapped alignments. Bull Math Biol. 2004, 67: 169-191.
- Siegmund D, Yakir B: Approximate p-Values for Local Sequence Alignments. Annals of Statistics. 2000, 28: 657-680.
- Metzler D, Grossmann S, Wakolbinger A: A Poisson model for gapped local alignments. Stat Prob Letters. 2002, 60: 91-100.
- Altschul S, Gish W: Local Alignment Statistics. Meth Enzym. 1996, 266: 460-
- Olsen R, Bundschuh R, Hwa T: Rapid Assessment of Extremal Statistics for Local Alignment with Gaps. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. Edited by: Lengauer T, Schneider R, Bork P, Brutlag D, Glasgow J, Mewes HW, Zimmer R. 1999, 270: 211-222. Menlo Park, CA: AAAI Press
- Altschul S, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucl Acid Res. 2001, 29 (2): 351-361.
- Hartmann A: Sampling rare events: Statistics of local sequence alignments. Phys Rev E. 2002, 65 (5 Pt 2): 056102-
- Henikoff S, Henikoff J: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919.
- Dayhoff M, Schwartz R, Orcutt B: A Model of Evolutionary Change in Proteins. Atlas of Protein Sequence and Structure. Edited by: Dayhoff M. 1978, 5 (Suppl 3): 345-352. Washington, D.C.: National Biomedical Research Foundation
- Schwartz R, Dayhoff M: Matrices for Detecting Distant Relationships. Atlas of Protein Sequence and Structure. Edited by: Dayhoff M. 1978, 5 (Suppl 3): 353-358. Washington, D.C.: National Biomedical Research Foundation
- Gumbel E: Statistics of Extremes. 1958, New York: Columbia University Press
- Arratia R, Waterman M: A Phase Transition for the Score in Matching Random Sequences Allowing Deletions. Ann Appl Prob. 1994, 4: 200-225.
- Hwa T, Lässig M: Optimal Detection of Sequence Similarity by Local Alignment. Proceedings of the Second Annual International Conference on Computational Molecular Biology (RECOMB98). Edited by: Istrail S, Pevzner P, Waterman M. 1998, 109-
- Sellers P: Pattern recognition in genetic sequences by mismatch density. Bull Math Biol. 1984, 46: 501-514.
- Altschul S, Erickson B: Locally optimal subalignments using nonlinear similarity functions. Bull Math Biol. 1986, 48: 633-660.
- Karlin S, Altschul S: Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA. 1993, 90: 5873-5877.
- Dieker A, Mandjes M: On Asymptotically efficient simulation of large deviation probabilities. Adv Appl Prob. 2005, 37: 539-552.
- Hastings WK: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970, 57: 97-109.
- Liu J: Monte Carlo Strategies in Scientific Computing. 2002, New York: Springer
- Liu J: Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statist Comput. 1996, 6: 113-119.
- Geyer C: Monte Carlo Maximum Likelihood for Dependent Data. Proceedings of the 23rd Symposium on the Interface. 1991, 156-163.
- Hukushima K, Nemoto K: Exchange Monte Carlo Method and Application to Spin Glass Simulations. J Phys Soc Jpn. 1996, 65: 1604-1608.
- Earl D, Deem M: Parallel tempering: Theory, applications, and new perspectives. Phys Chem Chem Phys. 2005, 7: 3910-3916.
- Zhou R: Exploring the protein folding free energy landscape: Coupling replica exchange method with P3ME/RESPA algorithm. J Molec Graph Mod. 2004, 22 (5): 451-463.
- Zhou R, Berne B: Can a continuum solvent model reproduce the free energy landscape of a β-hairpin folding in water? Proc Natl Acad Sci USA. 2002, 99: 12777-12782.
- Zhou R, Berne B: Trp-cage: Folding free energy landscape in explicit water. Proc Natl Acad Sci USA. 2002, 100 (23): 13280-13285.
- García A, Onuchic J: Folding a protein in a computer: An atomic description of the folding/unfolding of protein. Proc Natl Acad Sci USA. 2003, 100: 13898-13903.
- Zhou R, Berne B, Germain R: The free energy landscape for β hairpin folding in explicit water. Proc Natl Acad Sci USA. 2001, 98: 14931-14936.
- Auer S, Frenkel D: Prediction of absolute crystal-nucleation rate in hard-sphere colloids. Nature. 2001, 409: 1020-1023.
- Marinari E, Parisi G, Ruiz-Lorenzo J: Numerical Simulations of Spin Glass Systems. Spin Glasses and Random Fields, Directions in Condensed Matter Physics. Edited by: Young A. 1998, 12: 109-. World Scientific
- Katzgraber H, Palassini M, Young A: Monte Carlo simulations of spin glasses at low temperatures. Phys Rev B. 2001, 63: 1844221-18442210.
- Körner M, Katzgraber H, Hartmann A: Probing tails of energy distributions using importance-sampling in the disorder with a guiding function. Stat Mech. 2006, P04005-
- Wilbur W: Accurate Monte Carlo Estimation of Very Small P-Values In Markov Chains. Comp Stat. 1998, 13: 153-168.
- Geyer C: Estimating Normalization Constants and Reweighting Mixtures in Markov Chain Monte Carlo. Tech Rep 568. 1994, School of Statistics, University of Minnesota
- Meng X, Wong W: Simulating Ratios of Normalization Constants via a Simple Identity: A Theoretical Exploration. Statistica Sinica. 1996, 6: 831-860.
- Raftery A, Lewis S: How Many Iterations in the Gibbs Sampler. Bayesian Statistics 4. Edited by: Bernardo J, Berger J, Dawid A, Smith A. 1992, 763-773. Oxford University Press
- Cowles M, Carlin B: Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review. JASA. 1996, 91 (434): 883-904.
- StatLib. http://lib.stat.cmu.edu/
- Coda R package. http://www.r-project.org/
- Gelman A, Rubin D: Inference from iterative simulation using multiple sequences. Stat Sci. 1992, 7: 457-472.
- Brooks S, Gelman A: General methods for monitoring convergence of iterative simulations. J Comput Graph Stat. 1998, 7: 434-455.
- Efron B: The Jackknife, the Bootstrap and Other Resampling Plans. 1982, New York: SIAM
- Robinson A, Robinson L: Distribution of glutamine and asparagine residues and their near neighbours in peptides and proteins. Proc Natl Acad Sci USA. 1991, 88: 8880-8884.
- gnuplot. http://www.gnuplot.info/
- SWISSPROT. http://www.expasy.org/
- NCBI BLAST. http://www.ncbi.nlm.nih.gov/BLAST
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.