A simple dataadaptive probabilistic variant calling model
 Steve Hoffmann^{1, 2, 3}Email author,
 Peter F Stadler^{2, 3, 4, 5, 6, 7, 8} and
 Korbinian Strimmer^{9, 10}
https://doi.org/10.1186/s1301501500375
© Hoffmann et al.; licensee BioMed Central. 2015
Received: 21 May 2014
Accepted: 11 January 2015
Published: 4 March 2015
Abstract
Background
Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments.
Results
We introduce a simple dataadaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical loglikelihoods. The likelihoods are then combined to a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background.
Conclusions
In simulations we show that our simple model is competitive with frequently used much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs specifically well in cases with low allele frequencies. The application to nextgeneration sequencing data reveals stark differences of the score distributions indicating a strong influence of data specific sources of noise. The proposed model is specifically designed to adjust to these differences.
Background
Recent studies report a strikingly low concordance of currently available methods and pipelines for identification of single nucleotide variation (SNV), both somatic and germline, indicating that computational methods as well as sequencing protocols have a major impact on the sensitivity and specificity of the variation calling tool [1]. Specifically, the allelic fraction as well as the coverage of the variant allele are crucial determinants for the statistical benchmarks [2,3]. Practical guidelines of SNV callers such as GATK [4] or SAMtools [5] suggest to apply rigorous postprocessing filters to reduce the number of false positive calls. Other studies indicate that the application of these filters lead to a substantial improvement of the concordance of the callers [6]. Nevertheless, applying stringent thresholds for variables such as the strand bias, the coverage or read start variation bears the risk of losing important information [7]. These authors emphasize that the different algorithmic and statistical components of a variant caller have to be evaluated as a whole and cannot not be meaningfully judged as single components.
is computed, and a variant is called if Pr{c>1} exceeds a certain specified threshold.
where Pr{A is trueD _{ j } is miscalled} is a precomputed, sequencer specific lookup table. Using prior information based on precomputed heterozygosity estimations GATK evaluates the posterior probabilities of a site to be variant. As with SAMtools calls are determined using fixed preselected thresholds.
Here, we propose a simple probabilistic model for variant calling using a data adaptive threshold on the scale of logoddratios computed from empirical distributions of certain site characteristics. Our approach allows to optimally separate simulated SNVs from the noisy background without specification of a threshold for posterior probabilities. In brief, our model starts out by evaluating the mismatch frequencies δ in a data set. Subsequently, we sample several characteristics of the sites with small δ to serve as empirical reference model. The fundamental idea used here is that the vast majority of sites is invariant and thus allows to capture the features of the data specific error model. These characteristics are then used to form empirical loglikelihoods that are combined to a logodds type score. Typically, we observe a mixture distribution of two score populations, which we may then separate by a decision threshold.
Next, we discuss the details of our approach and the proposed dataadaptive variant calling algorithm. Subsequently, we apply our method to both synthetic and next generation sequencing data from various species. A reference implementation in C99 of our method called haarz is available at http://www.bioinf.unileipzig.de/software.html.
Methods
Notation
We denote the position of a site in the reference genome by i∈[1,…,n] where n is the genome length. After aligning the reads to a genome each reference base typically has several nucleotides aligned to it. We refer to the set of all aligned nucleotides as the cross section C ^{ i } at position i. The coverage at position i is the size C ^{ i } of this set. We use the index j∈[1,…,C ^{ i }] to refer to a specific read. The length of read j aligned to site i is denoted by \({m^{i}_{j}}\), and the position of a nucleotide in a read is denoted by \({k^{i}_{j}} \left \{1, \ldots, {m^{i}_{j}}\right \}\). For simplicity, we occasionally leave out the index i when there is no danger of ambiguity.
The nucleotide in a cross section can be partitioned into sets of match (M) and mismatch (\(\overline {M}\)) nucleotides so that \(C^{i} = M^{i} \cup \overline {M}^{i}\). The variant calling algorithm described below uses the partition \(\{ M^{i}, \overline {M}^{i} \}\) at each position i to compute an overall score for this particular site.
Biological versus technical variation
The mismatch rate δ ^{ i }=M ^{ i }/C ^{ i } is the observed number of mismatches divided by the coverage. The mismatch rate δ ^{ i }=β ^{ i }+ε ^{ i } may be decomposed into biological and technical variation, where ε ^{ i } denotes the technical error that accumulated during the preparation, sequencing and alignment steps and β ^{ i } denotes the biological nucleotide variation at site i.
We aim to distinguish biologically variant positions (β ^{ i }>0) from nonvariant positions (β ^{ i }=0), based on the observed mismatch rates δ ^{ i } and site characteristic scores derived from sequence data or produced during sequencing.
We assume that cross sections with high mismatch rates are indicative of biological variation in the sample, whereas in cross sections with small mismatch rate the mismatches are more likely due to technical errors. Conversely, in the overwhelming majority of cross sections C ^{ i } we may assume that there is no biological variation present, i.e. β ^{ i }=0, and thus mismatches are only caused by technical errors.
For use in the variant calling score we estimate for each δ ^{ i } the corresponding empirical quantile q(δ ^{ i }). The motivation for using the quantile rather than the actual value is that it implements a simple normalization of the error. The empirical quantile q(δ ^{ i }) is estimated by tabulating the cumulative frequencies of δ ^{ i } across the genome and then reading off the quantile from the resulting empirical distribution function (ECDF).
To ascertain the probabilities of certain site characteristics, discussed further below, we uniformly sample from sites with 0<δ ^{ i }<0.5. Informally, these characteristics then reflect “background distributions” of nonvariant sites and thus are estimated from sites with less than 50% of mismatches.
The degree of biological variation depends on the type of genome. For heterozygous genomes one expects to find predominantly SNP alleles with β ^{ i }=0.5 or β ^{ i }=1.0, whereas cancer tissues may show mutations with 0<β ^{ i }≤1 depending on the heterogeneity and cancer cell content of the sample. Similarly, arbitrary values of β _{ i } will appear in whole population sequencing data. Accordingly, we expect different values of β ^{ i } for mixtures of sequencing data from different individuals. The variant calling algorithm introduced in the following makes no assumptions concerning the presence of diploid genomes, knowledge about the ploidy, homo or heterozygosity.
Site characteristics

the nucleotide qualities (Q),

relative read position (P),

errors in the alignment (R), and

the number of multiple hits (H).
The nucleotide qualities take on values between 0 and 1 and are given as Q=1−ζ, i.e., as probability of a base being correct, with values close to 1 corresponding to optimally accurate sequencing. We directly use Q in computing the variant calling score.
The relative read position is given by \(P = {k^{i}_{j}} /{m^{i}_{j}}\). For the construction of our variant calling score we employ the probability Pr(MP) of a match at a given read position, along with the maximum P _{ M }= maxP Pr(MP). The probability of a mismatch is then given by \(\Pr (\overline {M}P) = 1  \Pr (MP)\), and its maximum \(P_{\overline {M}}= \max _{P} \Pr (\overline {M}P)\). We estimate the probability Pr(MP) empirically, i.e., by appropriately counting matches and mismatches over all sites and reads.
The number of errors in the alignment is an integer value greater or equal to zero, and denoted here by R. Finally, the number of multiple hits H describes the number of alignments for each read. The multiplicity of an alignment yields information on the repetitiveness of a genomic region. As above for the relative read position, we tabulate the occurrence of matches for each value of R and and H and correspondingly obtain estimates of the probabilities Pr(MR) and Pr(MH).
Scores for distinguishing variant and nonvariant sites
Informally, in a nonvariant cross section (β ^{ i }=0) we expect that the probability of a match base increases with high nucleotide qualities (good sequencing), low read error rates, few multiple hits and good read positions. Conversely, the probability of mismatching bases in nonvariant cross sections increases with low nucleotide qualities (poor sequencing), high read error rates, multiple hits and errorprone read positions. For variant sites with β>0 we expect to have high nucleotide qualities, good read positions and few multiple hits also for the mismatch bases. Consequently, for distinguishing variant from nonvariant sites only the mismatching bases are relevant.
Variant calling with adaptive threshold
This score comprises the four summaries of the site characteristics, as well as the logquantile of the observed mismatch rate δ ^{ i }, i.e. the observed number of changes at position normalized by coverage. A low quantile for δ ^{ i } thus strongly penalizes the overall score.
For variant calling we now proceed as follows. We assume that the majority of the sites are nonvariant, and only a smaller part is variant, with S _{ i }>0. Thus, the observed distribution of S _{ i } will be a mixture distribution, consisting of a null distribution corresponding to the invariant sites and an alternative “contamination” component corresponding to the variant sites.
In order to find an optimal adaptive cutoff separating the background from potential variants we estimate the densities by fitting a natural spline using Poissonregression to the histogram, following the procedure described by [10].
Subsequently, we numerically find the location with the minimum density and use it as threshold for separating the two score populations. Specifically, we fit natural splines to the histogram for S _{ i }>0 and numerically determine the local minimum. If there are multiple minima the leftmost minimum is used. The corresponding threshold is denoted by S ^{∗}.
In some cases there is no minimum in the histogram of the empirical scores. In this case we use as fallback solution the upper 95% quantile as threshold. A missing minimum might indicate that the score model does not suffice to reliably call the variants.
We note that by construction of the score S _{ i } we assume independence of the site characteristics. However, in practice there will be correlation, and as alternative one may also consider a fully multivariate construction of the score S _{ i }. However, this is not without its own drawbacks, as the correlation among site characteristics may be hard to estimate reliably. Moreover, as is well known from classification and “naive Bayes” analysis, independence models are typically rather robust and often even outperform more complex parameterrich multivariate models.
Results and discussion
Simulation study
To evaluate the reference implementation “haarz” of our adaptive model we compared it with the two frequently used SNV callers GATK [4] and SAMtools [5]. The precise command line settings are summarized in the Appendix.
Application to data sets
The good overlap between the different methods in our simulation study as well as the small number of false positives is in stark contrast to the experience of greatly differing variant calls in real life data (e.g. [1]). We therefore applied our model to diverse real data from both diploid and haploid organisms.
Paired end next generation sequencing data for Arabidopsis thaliana (SRR519713), Escherichia coli (ERR163894) and Drosophila melanogaster (SRR1177123) was downloaded under the respective accession numbers from the Short Read Archive (www.ncbi.nlm.nih.gov/sra). The Arabidopsis data was aligned to the reference genome version 10.5. With the data set SRR519713 we obtained a coverage of ∼30fold. The E. coli data set was aligned to the reference genome E. coli k12 assembly v1.16. With ERR163894 we obtained a coverage of ∼60fold. Finally, SRR1177123 was aligned to the D. melanogaster reference version dm3. The coverage was ∼25fold. For the alignments, calling and filtering we used standard parameters. Precise settings are given in the Appendix.
In the lower part of Figure 6 we show the concordance of variant calls of three investigated methods for the three data sets. For the well separable cases, the number of calls made by the data adaptive model is equal or higher as compared to the two competing callers. For the D. melanogaster data set our approach is more conservative and reports fewer variants. Most of these, however, are also found by SAMtools and GATK. About 92% percent of the calls from our model are also supported by both of the other callers and only 4% are not supported by any of the two alternative approaches. From the score distributions for D. melanogaster it is clear that there is a large overlap of the two score populations and hence the choice of S ^{∗} necessarily depends on the desired specificity and/or sensitivity. In the simulated data (Figure 4 and Figure 5) we see that haarz generally achieve a high recall (sensitivity), and at the same time offers a high positive predictive value, (PPV) i.e. low false discovery rate. Thus, for the D. melanogaster data many sites may be ambiguous to call, and our tool will err on the conservative side to maintain a high PPV.
Conclusions
We have presented a data adaptive model for variant calling based on easily accessible read characteristics, namely the loglikelihoods of nucleotide qualities, relative read positions, alignment errors, multiple hits and the mismatch rate at a position to obtain a score. With the exception of nucleotide qualities, which are provided as input by the sequencing method, all loglikelihoods are sampled from the data itself. We show that in simulated as well as in real data sets this score gives rise to a mixture distribution that distinguishes between true variants and noise. A spline fit to the overall marginal density allows us to determine a decision threshold that optimally separates these score populations. In the simulated data we demonstrated that this simple model is at par with two of the most commonly used probabilistic models for SNV calling methods in terms of both sensitivity and specificity.
When applying our model to actual nextgeneration sequencing data, we observe that the distributions of the scores vary significantly among the different data sets. As expected, the clearest separation of the mixture was obtained for the haploid E. coli data set. In addition, the small size of the genome and the absence of repetitive elements probably improves the separability of the scores. The situation for the two diploid genomes A. thaliana and D. melanogaster is different. While both genomes have comparable sizes, the separability of the score distributions varies strongly among these two data sets. While a clear minimum can be found for the plant, the mixture in D. melanogaster appears to be more complicated. In this case, by construction our model selects a conservative decision threshold. While the number of calls is similar to the other probabilistic SNV callers in the simulations as well as the nextgenerations data sets of the plant and the bacteria, it is significantly reduced in the fruit fly data set. These differences indicate that the characteristics of nextgeneration data sets have a strong impact on the success of variant calling. Furthermore, we observe that, at least for the score proposed here, a significant difference of the separability of the mixture distribution can be found between simulated and real data. Thus, we argue that data adaptive components could help to balance the tradeoff between sensitivity and specificity.
Appendix
Read simulation
For the simulation of reads and allele contents we used the program GemSIM (v. 1.6). We simulated reads for the human chromosome 21 (hg19) with different coverages using an Illumina specific error model (ill100v5_s).
Benchmarks and command line parameters
For the benchmarks we have aligned the simulated as well as the real reads with bwa and called the variants with SAMtools and GATK. For our own model the reads were aligned using segemehl. The command line parameters and version numbers are given below. a) BWA v 0.6.2
b) GATK v 2.8.1 (GenomeAnalysisTK2.81g932cd3a) calling:
filtering:
c) SAMtools v 0.1.19calling:
filtering:
d) segemehl v 0.1.7
obtaining site characteristics (written to sorted.haarz.idx):
obtaining site characteristics for low variant allel frequencies:
calling:
Declarations
Acknowledgements
This research was supported by LIFE (Leipzig Research Center for Civilization Diseases), Leipzig University and the Deutsche Forschungsgemeinschaft under the auspicies of SPP 1590 “Probabilistic Structures in Evolution”, proj. no. STA 850/141. LIFE is funded by the European Union, by the European Regional Development Fund (ERDF), the European Social Fund (ESF) and by the Free State of Saxony within the excellence initiative. We thank the referees for their very helpful comments.
Authors’ Affiliations
References
 O’Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, et al.Low concordance of multiple variantcalling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013; 5(3):28.View ArticlePubMed CentralPubMedGoogle Scholar
 Xu H, DiCarlo J, Satya R, Peng Q, Wang Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics. 2014; 15:244.View ArticlePubMed CentralPubMedGoogle Scholar
 Yu X, Sun S. Comparing a few SNP calling algorithms using lowcoverage sequencing data. BMC Bioinformatics. 2013; 14:274.View ArticlePubMed CentralPubMedGoogle Scholar
 McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing nextgeneration DNA sequencing data. Genome Res. 2010; 20:1297–303.View ArticlePubMed CentralPubMedGoogle Scholar
 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Horner N, et al.The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25:2078–9.View ArticlePubMed CentralPubMedGoogle Scholar
 Liu X, Han S, Wang Z, Gelernter J, Yang BZ. Variant Callers for NextGeneration Sequencing Data: A Comparison Study. PLoS ONE. 2013; 8(9):e75619+.View ArticlePubMed CentralPubMedGoogle Scholar
 Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. A survey of tools for variant analysis of nextgeneration genome sequencing data. Brief Bioinformatics. 2014; 15:256–78.View ArticlePubMed CentralPubMedGoogle Scholar
 Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27(21):2987–93.View ArticlePubMed CentralPubMedGoogle Scholar
 DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Nat Genet. 2011; 43(5):491–8.View ArticlePubMed CentralPubMedGoogle Scholar
 Efron B, Tibshirani R. Using specially designed exponential families for density estimation. Ann Stat. 1996; 24:2431–61.View ArticleGoogle Scholar
 McElroy KE, Luciani F, Thomas T. GemSIM: general, errormodel based simulator of nextgeneration sequencing data. BMC Genomics. 2012; 13:74.View ArticlePubMed CentralPubMedGoogle Scholar
 Li H, Durbin R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics. 2009; 25:1754–60.View ArticlePubMed CentralPubMedGoogle Scholar
 Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, et al.Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLOS Comp Biol. 2009; 5:e1000502.View ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.