- Software article
- Open Access
Estimation of genetic diversity in viral populations from next generation sequencing data with extremely deep coverage
https://doi.org/10.1186/s13015-016-0064-x
© Zukurov et al. 2016
- Received: 25 March 2015
- Accepted: 25 February 2016
- Published: 11 March 2016
Abstract
Background
In this paper we propose a method and discuss its computational implementation as an integrated tool for the analysis of viral genetic diversity on data generated by high-throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, as HIV-1 and Influenza. Most methods for viral diversity estimation proposed so far are intended to take benefit of the longer reads produced by some next-generation sequencing platforms in order to estimate a population of haplotypes which represent the diversity of the original population. The method proposed here is custom-made to take advantage of the very low error rate and extremely deep coverage per site, which are the main features of some neglected technologies that have not received much attention due to the short length of its reads, which precludes haplotype estimation. This approach allowed us to avoid some hard problems related to haplotype reconstruction (need of long reads, preliminary error filtering and assembly).
Results
We propose to measure genetic diversity of a viral population through a family of multinomial probability distributions indexed by the sites of the virus genome, each one representing the distribution of nucleic bases per site. Moreover, the implementation of the method focuses on two main optimization strategies: a read mapping/alignment procedure that aims at the recovery of the maximum possible number of short-reads; the inference of the multinomial parameters in a Bayesian framework with smoothed Dirichlet estimation. The Bayesian approach provides conditional probability distributions for the multinomial parameters allowing one to take into account the prior information of the control experiment and providing a natural way to separate signal from noise, since it automatically furnishes Bayesian confidence intervals and thus avoids the drawbacks of preliminary error filtering.
Conclusions
The methods described in this paper have been implemented as an integrated tool called Tanden (Tool for Analysis of Diversity in Viral Populations) and successfully tested on samples obtained from HIV-1 strain NL4-3 (group M, subtype B) cultivations on primary human cell cultures in many distinct viral propagation conditions. Tanden is written in C# (Microsoft), runs on the Windows operating system, and can be downloaded from: http://tanden.url.ph/.
Keywords
- Viral diversity
- Bayesian inference
- Dirichlet distribution
Background
Viruses with RNA genomes are recognized to generate particularly mutant-rich populations called quasispecies. The genetic heterogeneity characteristic of viral quasispecies is largely due to high mutational rates combined with an elevated population size [1]. Human immunodeficiency virus 1 (HIV-1), as an example, has a mean substitution rate of order 10−5 per nucleotide position [2]; that is by far higher than those of cellular organisms [3, 4] and assures a constant viral mutant production.
Next-generation sequencing (NGS) platforms have been used mainly for the de novo sequence assembly of viruses. However, more recently, new interest arose in re-sequencing known virus genomes using NGS to study the diversity of viral populations. All NGS platforms produce short segments of DNA, called reads, which provide only imperfect and incomplete information about the structure of the viral population. Sequencing errors and length of reads are factors that must be taken into account in the analysis of data obtained from NGS viral quasi-species. In addition, reverse transcription and PCR amplification are procedures prone to errors. The impact of these errors on studies of viral diversity could be huge (see below), therefore one wants to separate true genetic variation from methodological noise and if both are of the same order of magnitude the task becomes virtually impossible.
Regarding the development of tools to estimate genetic diversity of viral populations (total number of genetic characteristics in a viral ensemble), the most commonly used NGS platforms are the 454™ (Life Sciences/Roche)—since Roche’s announced in October 2013 it will shutdown the 454™ platform [5] its use has been fading—and the Illumina™ (Solexa), mainly due to their capacity to produce longer reads. The ability to produce relatively long sequences favors the development of methods aiming at the haplotype (the genetic variants in a viral population) reconstruction of the representative viral particles in population [6, 7]. However, the propagation of sequencing errors is a serious problem in these methods, requiring the development of procedures for error correction, which my introduce unwanted biases. In general, the fraction of wrong reads increases with the error rate per base and the average length. The expected proportion of reads with at least one sequencing error as a function of the error rate per base ε and the average length L of reads is given by the relation 1 − (1 − ε) L [8]. As the estimated error rate of 454™ is about 0.1–0.5 % and Illumina™ error rates are in the range of 0.1–1 % [9], with an average length of reads from 400 bp up to 1000 bp, the proportion of reads with at least one error is in the range 35–90 %. The platform SOLiD™ (Life Technologies), for instance, is at the other end of the spectrum. With reads of short length, of at most 50 bases (the main limitation for the construction of haplotypes) and estimated error rate of 0.06 % [9], the proportion of reads with at least one error is around 2 %. Recently, a different solution to the problem of sequencing errors has been proposed [10], based on the development of high-fidelity sequencing protocols [11].
A more serious challenge associated with the assembly of all possible haplotypes is the NP-hardness of the corresponding combinatorial optimization problems [12]. In fact, some approximate solution must be employed and a crucial hindering factor is the ratio between the size of the reads and the size of the genomic region being reconstructed. For instance, it has been reported [10] that short read lengths (less than 100 base pairs) dramatically inhibit reconstruction of genomes with more than 3400 bp, evidenced by the failing to produce any complete genome. Another major shortcoming of all existing methods for haplotype reconstruction is that they are unable to handle large insertions or deletions (indels), only very recently this problem seems to have been overcome [13].
As mentioned before, the ability of the other NGS platforms to produce relatively long sequences have been a great stimulus to the development of methods for building haplotype representatives of viral particles in the population and the vast majority of softwares for viral diversity estimation that have been proposed until very recently adopt this perspective [6]. The aim of this work is to propose a different approach to measure genetic diversity that does not demand any kind of length assumption on the short reads, but takes advantage of the low error rate and the high depth of coverage per site inherent to some NGS platforms. Therefore, we shall considerably depart from the most traditional developments aiming at haplotype reconstruction, since not every one has access to the NGS platforms appropriate for that purpose. Indeed, although the short length of the reads produced by these platforms essentially hinders haplotype reconstruction, it is possible to measure genetic diversity through probability distributions along the genome (one per site) and this approach is enhanced by the highly deep coverage provided by these NGS platforms.
A recent study [14] comparatively assessed the performance of some NGS platforms (including 454™ and Illumina™) and reported an average (range) coverage of ~23,000 reads (5000–47,000) for the Illumina™ and ~7000 reads (2000–22,000) for the 454™. We used the SOLiD™ platform and were able to achieve an average (range) coverage of ~50,000 reads (10,000–150,000), for instance (see Fig. 2). In addition, the low error rate of 0.06 % provided by the SOLiD™ platform virtually eliminates the necessity of any error correction procedure. Instead, we use the estimated probability distributions to separate signal from noise.
The first step in nucleotide sequence analysis is read mapping/alignment. This is important for many bioinformatics applications, as exemplified by nucleic acid conformational structure prediction and phylogeny studies [15, 16]. As expected, this is also an important aspect for NGS data analysis involving all the different platforms as Ion Torrent™ (Life Technologies), SOLiD™, 454™ and Illumina™ [17, 18] and others. Nowadays, users can choose from a panoply of tools for mapping and indexing NGS reads, available on-line and for download. MAQ (Mapping and Assembly with Qualities) [19], BWA (Burrows–Wheeler alignment tool) [20], BFAST (Blat-like fast accurate search tool) [21], Bowtie [22] and MOSAIK [23] are examples of such alternatives. Those tools allow the fast mapping and alignment of reads belonging to genomes up to 109 bp in length [20, 24, 25].
After the read mapping is finished, the following step consists in the choice of a strategy for statistical inference. There is a wide variety of methods depending on the scope and the goals of the analysis: (1) consensus generation, (2) single nucleotide variant (SNV), also called single position diversity estimation, (3) local diversity estimation and (4) read graph-based haplotype reconstruction, also known as global diversity estimation, see [8, 26] for a thorough explanation of these concepts. Existing tools for genetic diversity evaluation of viral NGS sequences, intended for 454™ and Illumina™ platforms [8, 26–33], are based on several techniques aiming at haplotype reconstruction [30, 34–39].
In order to estimate the genetic diversity without resorting to haplotype reconstruction, we propose to represent the genetic diversity of a sample population through a family of multinomial probability distributions indexed by the sites of the virus genome, each one representing the distribution of nucleic bases per site. Moreover, the inference of the multinomial parameters is done in a Bayesian framework using smoothed Dirichlet estimation inspired by a method for modeling text data [40].
Inference of multinomial parameters is a challenging problem in statistics. For the simplest case, i.e., the Bernoulli model or binomial estimation, the history traces back to Thomas Bayes [41]. Karl Pearson [42] called this seemingly simple problem the “fundamental problem of practical statistics”. In the frequentist context the problem is called “interval estimation of a binomial proportion” and there is a textbook solution based on a confidence interval for this problem, which however has several drawbacks [43]. In both frameworks the frequency of occurrence of a category plays a crucial role, leading to the “sufficient statistics” in the Bayesian context and the “estimator of proportion in a sample” in the frequentist context.
The choice of a Bayesian framework is motivated by two features that are not present in the frequentist framework: (1) it allows one to obtain conditional (posterior) probability distributions for the multinomial parameters and thus interpret the point estimates as probabilities—this interpretation conceptually incorrect when applied to pure relative frequencies (which is not the same thing as adopting frequentist framework)—even though the law of large numbers implies that they converge to the point estimates obtained from the Dirichlet distributions when the number of observations goes to infinity; (2) on may take into account the prior information of the control experiment (whose genome sequence is known) within the inference of a posterior experimental condition by means of Bayes’ formula and thus relate two temporally connected events. Finally, it provides a natural way to separate signal from noise through credible (or Bayesian confidence) intervals—another problem with the use of pure relative frequencies is that it is not possible to associate “error bars” to them. Therefore, in our aproach, the errors introduced during the sequencing process are not filtered before the inference, but after it, when we identify the relevant signal—this allows us to avoid the drawbacks of preliminary error filtering [10, 44].
In summary, we sought to build an analysis platform suitable to address the problem of estimation of the populational genetic diversity of RNA viruses. Due to high mutational rates and accelerated replicative kinetics, RNA viruses constitute ensembles of variants, known as quasispecies, which, instead of a collection of viral particles, behave as a single and coherent organism which is act on by the host’s pressures [45]. Furthermore, our mathematical and computational approach allows for a better understanding of virtually all the viral diversity present in a clinical sample. By saying this, we mean that our method gives the user an idea about the real structure of a viral population contained in a clinical sample at a given time. When a quasispecies population is challenged by selective pressures, it responds as a sole organism due to the mutational link between the genetic variants it contains. Following this train of thoughts, at any given moment this distribution of mutants may bear variants with resistance mutations (which could minimize therapeutic success) or virus with genomic compositions that were sufficiently close to therapeutical resistance. In this manner, as far as clinical applications are concerned, our method offers an opportunity to the clinician to observe the entire viral mutational landscape in a clinical sample. This comprehensive view could help the clinician when deciding the best therapeutic approach.
Based on the aforementioned assumptions and that a NGS platform generates an extremely high number of reads of short length allowing for a deep and extensive coverage of the data, and with a very low error rate, we propose an approach to the estimation of genetic diversity of viral populations that does not make requirements on the form of the sequenced data (such as [10], which works only with Illumina™) and does not assume any statistical model for filtering errors [8]. Despite its apparent mathematical and computational involvement the approach proposed here is one of the simplest conceptually correct possible choices.
Unfortunately, we could not find any other method or software in the literature, which uses a similar form to represent the viral genetic diversity as a family of distributions indexed by the genome and does not need sufficiently long reads—all proposals in the literature that we were able to find are aimed at haplotype reconstruction and require longer reads (more than 100 bases). Any attempt to make comparison between such different approaches would be misleading, therefore it is not our aim here to make the point if the method presented here is an improvement over (non-)existing similar ones. In fact, we believe that the approach proposed here should not be considered alternative or rival, but complementary, to haplotype reconstruction.
Implementation
Flowchart containing the main steps of our method
Experimental procedure and preparation
Summary of sequence data analyzed
Sequenced data | Reads | Nucl. | Mapped (%) | Time (hour) |
---|---|---|---|---|
HIV1—non-stimulated CD4/R5 | 1.11 × 107 | 5.53 × 108 | 91.81 | 34.88 |
HIV2—stimulated CD4/R5 | 1.04 × 107 | 5.19 × 108 | 90.84 | 34.18 |
HIV3—non-stimulated CD4/X4 | 1.10 × 107 | 5.50 × 108 | 91.22 | 29.05 |
HIV4—stimulated PBMC/X4 | 9.41 × 106 | 4.71 × 108 | 91.14 | 13.63 |
HIV5—stimulated PBMC/R5 | 1.14 × 107 | 5.73 × 108 | 90.49 | 24.09 |
HIV6—non-stimulated PBMC/X4 | 1.12 × 107 | 5.62 × 108 | 91.42 | 27.54 |
HIV7—non-stimulated PBMC/R5 | 1.19 × 107 | 5.94 × 108 | 93.27 | 15.34 |
HIV8—control (pNL4-3kfs) | 1.01 × 107 | 5.05 × 108 | 93.10 | 30.14 |
Total | 8.65 × 107 | 4.33 × 109 | 91.67 | 208.85 |
Raw sequence data from the sequencing must be treated according to the standard procedures of the specific NGS platform [46] up to the generation of FASTA files, which are the standard type of input file adopted in our implementation.
The data used in this paper for testing the method is the subject of another publication [47], where the details of the experiments and the biological implications to the HIV replication are discussed. The method and the results described here do not depend on the experimental details and the results reported in [47].
Read mapping/alignment
The main goal of this step is the mapping of reads with 50 nucleotides or more originating from the NGS platform to a database of reference sequences. The database may contain several sequences, which must be aligned amongst themselves. The read mapping is performed using a local executable of BLAST [48]. The criteria for retaining the reads are the following: (1) it must align at least 45 nucleotides and (2) have the lowest e value score. A first alignment attempt is made with sequences from reads in the forward sense; in case of no match, a second attempt with the reverse complementary sequence is performed. Moreover, since we are using several references, the output can, in principle, display the same number of matches as there are reference sequences. The criteria for the selection of the most suitable alignment option are the following (in this order): (1) the lowest e value score and (2) the lowest Hamming distance from the consensus sequence obtained from of the sequencing of the control experiment. The alignment strategy described above is set as default, but the user, according to some specific purposes or simply for increasing processing speed, can change some of its parameters. Finally, it is possible to create suitable reference databases for specific research purposes.
Nucleic base estimation
The probability distributions of nucleic bases (A, T, C, G) at each position of the genome are estimated from the aligned data. In this respect, our approach may be classified as a diversity estimation in single positions. The idea is that at each position in the genome the probability distribution is given by a multinomial distribution, determined by four probabilities (p A, p T, p C, p G) satisfying p A + p T + p C + p G = 1. These conditional probabilities represent the fraction of the population that has each of the four associated nucleic bases at the corresponding site given the observed sequence data. Thus, one has a family of multinomial probability distributions indexed by the sites of the genome, where at each position the four probabilities (p A, p T, p C, p G) should be estimated from the data.
The Bayesian framework for the inference of categorical data is based on the notion of conjugate prior, which in the case of categorical data is given by the Dirichlet distributions [49, 50]. A Dirichlet distribution is characterized by a n-tuple of positive numbers α = (α 1,…, α n ) called hyper-parameters—however, unlike the multinomial parameters that must sum to one, the hyper-parameters are unconstrained. In our case, the Dirichlet distribution of each site is parametrized by the quadruple (α A, α T, α C, α G). The fact that the Dirichlet distribution is the conjugate prior of the multinomial distribution amounts to saying that the Bayes’ formula for the posterior distribution takes a very simple form in terms if the hyper-parameters: if (α 1,…,α n ) is a vector of hyper-parameters of a Dirichlet prior distribution and the counts of each of the k categories in an experiment are (c 1,…, c n ) then the posterior distribution is also a Dirichlet distribution with hyper-parameters (α 1 + c 1,…, α n + c n ). Within this context, the first step in the estimation of the probability distributions of nucleic bases consists in using the sequenced data from the control experiment as the input for the determination of prior hyper-parameters. Then, in the second step, one considers this distribution together with the sequenced data form the experimental conditions one uses Bayes’ formula to compute the posterior hyper-parameters.
Since we are dealing with a sparse estimation problem in the sense that one of the categories occur with much higher frequency that the other categories, we shall employ the smoothed sufficient statistics defined by introducing a small parameter η and setting p jk = M jk /M, where M jk is the number of occurrences of the kth category at the jth sample, M is total number of observations at the jth sample and p jk = η if there is no occurrence of the kth category at the jth sample. The smoothing parameter η acts as “background noise” representing sequencing and PCR errors that can not be removed. However it can be suitably tuned in order to account for the true variability of the data. When this procedure is applied to the sequenced data from the control experiment (a “clonal” population) one would expect no diversity at all. However, that is not completely true and, in fact, even the sequenced data from the control experiment should display some variability (mainly due to sequencing errors). The smoothing parameter η should be at same (or smaller than) of the order of magnitude of expected error rate ε. In the smoothed version of the Newton–Raphson iteration scheme, Ronning’s initialization step is given by setting α old = (η,…, η).
For instance, for the default value of z, which is 80 %, one has a sample of size N ≈ 0.7C, each sample vector computed from 0.8 M nucleotides. On the other hand, the value z = 50 % gives a sample of size N ≈ C, each sample vector computed from 0.5 M nucleotides.
Once the hyper-parameters of the prior distribution are estimated, they must be used together with the sequenced data of the other experimental conditions in order to compute the hyper-parameters of the posterior distributions by Bayes’ formula, as a result, one obtains a family of Dirichlet probability distributions indexed by the genome of the organism for every sequenced experimental condition, including the control experiment.
This is much simpler than the contrived expectation–maximization (EM) approximate schemes usually employed to obtain the MAP estimate from a log-likelihood function, in which case approximations are unavoidable, since this function is non-convex.
Note that both the mean ‹x k › and the mode ‹‹x k ›› converge to the same value when the number of observations goes to infinity. In particular, when the number of observed nucleic bases at certain site is very large then the relative frequencies of the different nucleic bases are very close to the values of the Dirichlet mean and mode.
Since the marginal distribution of each x k is a one-dimensional Dirichlet distribution, also known as Beta distribution, the standard deviation of the mean σ(x) = square-root(Var(x)) may be used to construct Bayesian credible intervals about the expectation value.
The the standard deviation of the mean σ(x k ) may be used to define credible intervals about the mode as well. Since the Beta distribution is unimodal, when all α k > 1 (k = 1,…, n), and has finite variance, a 3-sigma interval around the mean or the mode would provide about 95 % of confidence in the prediction (this is a general consequence of the Gauss-Vysochanskij-Petunin inequality, see [56]).
Finally, we should note that the inference procedure explained above is clearly not restricted to the case of four nucleotides (A, T, C, G), that is n = 4. It is trivial to modify it in order to account for insertions and deletions, or to work with codons and amino-acids.
Selection criteria and error filtering
Once the inference has been completed it is desirable to filter the errors and extract some subset of the data—for instance, most conserved sites, most variable sites, etc. In order to do so we have implemented two selection criteria based on simple quantities: (1) complementary probability per site and (2) variational distance per site.
The complementary probability per site is defined as p comp = 1 − max{p A, p T, p C, p G} and it depends only on the probability distribution of each site. It provides a measure of how much the distribution is concentrated in one state. For instance, if the complementary probability at a site is high it means that there was variation in the site prior to the experiment.
The variational distance per site is a positive number between 0 and 2 defined by \(vd = \left| {p_{A} - p_{A}^{\prime } } \right| \,+\, \left| {p_{AT} - p_{T}^{\prime } } \right| \,+\, \left| {p_{C} - p_{C}^{\prime } } \right| \,+\, \left| {p_{G} - p_{G}^{\prime } } \right|\), where \(\left( {p_{A} ,p_{T} ,p_{C} ,p_{G} } \right)\) is the probability distribution per site of the control experiment and \(\left( {p_{A}^{\prime } ,p_{T}^{\prime } ,p_{C}^{\prime } ,p_{G}^{\prime } } \right)\) is the probability distribution at the corresponding site of the experimental condition. It is a measure of the relative variation per site from the clonal population before and after the infection. If it is very low at a site it means that the site did not undergo significant changes in relation to the sequenced data from the control experiment.
The complementary probabilities and the variational distance can work as filters and the user must specify the thresholds for them. By using these two criteria in combination one may easily obtain some qualitative information about the behavior at a site.
An example of how to combine our method with any haplotype reconstruction procedure is the following. In a haplotype reconstructed population the fraction of a nucleic base X at a given position could be computed by summing the proportions of all haplotypes that have the nucleic base X at the given position. Using these proportions one can construct a family of distributions (f A, f T, f C, f G), with f A + f T + f C + f G = 1, per position. Since in practice it is very difficult to obtain the low-frequent variants the variational distance between the distribution (f A, f T, f C, f G) and the distribution (p A, p T, p C, p G) can be used to estimate how far one is from having obtained all the variants up to the lowest-frequent ones.
Another application, is to use the distributions (p A, p T, p C, p G) to generate a population of “random haplotypes” with the correct nucleic base distribution and compare with a population of reconstructed haplotypes in order to study the correlations between the sites.
Results and discussion
The method presented here was tested on samples obtained after the HIV-1 strain NL4-3 (group M, subtype B) cultivation on primary human cell cultures. Different viral propagation conditions were used—varying the cellular activation status, the co-receptor usage and the target cells. The pseudo-typed viruses produced in these experiments were able to perform exactly one round of the replicative cycle. As a whole, there were seven experimental conditions in addition to the control experiment (Table 1).
Experimental procedure and preparation
The experimental procedure was performed in accordance with the standard procedures of the NGS platform SOLiD™ [46], up to the generation of FASTA files, which are the input data of our computational tool. Standard Life Technologies guidelines were used during sample preparation and sequencing while using the SOLiD™ platform v. 3.0. The size of the FASTA files containing the reads of each condition is around 700 Mb, consisting of about 107 reads.
Read mapping/alignment
The use of BLAST to perform the read mapping has two reasons: first the multiple reference sequences allowed by BLAST makes it advantageous for analysis of viral populations classically described as quasispecies, since we can include several variant genomes belonging to the same phylogenetic branch, and second, it is easier to control the parameters of the alignment and multi-threading inside our program.
We used a reference database composed by 1258 sequences, properly aligned, all representatives of HIV-1 epidemics, most of them belonging to major group (group M) and its subtypes A, B, C, D, F, G, H, J, K, and some of their recombinants (http://www.hiv.lanl.gov/content/sequence/HelpDocs/subtypes-more.html). Some of the sequences are of complete HIV-1 genomes while others represented virtually (more than 90 % complete) complete genomes. All sequences are available at NCBI/Genbank. The reference database also contains sequences from HIV-1 group O, SIVcpz and HIV-1 strain NL4-3 and is packaged together with the software.
Even though it is known that BLAST is not the fastest aligner when compared with next-generation aligners, we were able to achieve a reasonable speed and control by setting the parameters and implementing some optimizations. For instance, even though BLAST is capable of identifying alignments in both the forward and the reverse complementary senses, we have found that manually doing this significantly increases the retrieval of reads—we had 15 % increase in the retrieval of reads. BLAST can be sensitive if the right parameters are chosen (small word size, in particular). It can find an alignment of a 42-mer with a multiple mismatches and gaps. On the other hand, some next generation aligners may fail to find an alignment if a mismatch or gap (or more than one of these) occurs within the beginning of the read, as this portion is used as a seed. An important feature of BLAST is that all alignments are returned. If a read has 1000 alignments, 1000 alignments are reported. Another advantage is the ability to perform sub-string alignments. Next generation aligners tend to be focused on aligning the entire read length. BLAST will find an alignment and report what position within the read that the alignment start and ends. Finally, BLAST is a more sensible treatment of N’s. Some of the next-generation aligners store bases in 2-bit format. Meaning they can only internally represent A, T, C, G. The solution is to randomly assign N’s to one of the other bases, a solution that some may find imperfect.
For each experimental condition, comprising the alignment of around 107 reads of 50 bases against 103 HIV-reference sequences with 104 sites each, we were able to map around 90 % of the reads, since the estimated fraction of reads with at least one error is around 2 %, we have achieved an almost optimal retrieval of reads.
Depth and coverage of one SOLiD™ sequencing of the HIV-1 genome. The major peaks in the middle representing the most deeply covered regions coincide with the overlapping primers from the PCR step, an evidence that there is in fact some influence of pre-sequencing phases on the frequency of the short-reads retained in the alignment. The major peak in the beginning is related to the difficulties in mapping the LTR region. Other significant peaks maybe due to PCR artifacts, as well
Inference
An important difficulty that should be overcome in order to implement the inference procedure for Dirichlet hyper-parameters in the context of nucleotides is due to the sparsity. Even with the high mutation rate displayed by viruses, there is a fair amount of nucleotide conservation. From a populational point of view, most of individuals will present the same nucleotide at a specific genomic position, and only the less representative subgroups, if any, will present one of the three remaining possibilities.
The standard Bayesian method outlined in most textbooks, where one usually chooses an uninformative (uniform) prior distribution is appropriate for the general task of multinomial estimation [57], but generally provides poor results when used for sparse multinomial distributions. This is primarily a consequence of the erroneous assumption that all categories should be considered as equally possible values for each site. Indeed, sparse multinomial distributions are characterized by the fact that only a few symbols actually occur (site conservation). In such cases, applying the standard method will give too much weight to symbols that never occur and consequently give a poor estimate of the true distribution. This issue becomes critical in our case when treating data obtained from the control experiment, which, in principle, is a clonal population, where one expects a uniquely well-defined nucleotide at each site and thus the Dirichlet likelihood function would be identically zero.
The sparsity problem, namely, the fact that one of the categories occur with much higher frequency that the other categories, is usually solved in the literature of text modeling (see [40]) by introducing a smoothing parameter η and modifying the Newton–Raphson method in such a way that the sufficient statistics does not have any zero entry. In practice, this may result in an over-smoothed distribution, but one can choose a small enough value for η in such a way that all the rare events do not have the same probability of appearing in all states.
Check points and validation
The method described here contains some heuristic decisions that should be supervised and properly validated. We have added a checkpoint at the read mapping stage and performed a validation of the smoothed Newton–Raphson method, attaining very good concordance with the expected results.
Quality values (QV) of the short-reads before (blue) and after (red) the alignment. The histogram contains the quality values on the horizontal axis and the proportion of short-reads in the vertical axis, displaying a large concentration of values around the average (approximately 23) and the majority (more than 90 %) of short-reads in the range 16–32. Short-reads with quality values in this range are considered to have excellent fidelity
The checking of the read mapping procedure was done by computing the distributions of all quality values of each condition prior and after the alignment (see Fig. 3). The mean value of the QV distributions remained unchanged after alignment. Likewise, at both steps of the process more than 80 % of reads had QV comprised between 20 and 32, assuring that the quality of the retrieved sequences was preserved and no bias was introduced.
The validation of the nucleotide inference step is performed at two points. The re-sampling procedure has been validated by comparing at each site, the nucleotide frequencies obtained from all the reads that cover the site, with the nucleotide frequencies obtained from the sampled reads that cover the site. It is observed that both frequencies agree with high precision (up to order 10−4). This ensures that the sufficient statistics obtained with re-sampling is the correct one. The validation of the implementation of the Newton–Raphson scheme for the Dirichlet maximum likelihood is performed by computing the MLE for a standard data set that is not sparse. The data set for pollen counts analyzed in Mosiman is often used for testing Dirichlet maximum likelihood implementations (see [49]). Since our implementation has a smoothing parameter, it is expected that the obtained values converge to the known values when the parameter approaches zero. It is indeed observed that this convergence occurs, with perfect agreement occurring above the order of magnitude of the smoothing parameter.
We have included in the software the appropriate options for the user to perform validation procedures. In particular, it is possible to run the Dirichlet MLE on any data set (with 4 categories) given as a list of multinomial observations.
Setting the smoothing parameter and separating the signal from the noise
The complementary probability of an ideal clonal population is expected to be identically zero. However, in the smoothed inferential scheme proposed here it is expected that p comp ≈ η. The choice of smoothing parameter does not affect the running time of the Newton–Raphson method; it affects the values of the estimated hyper-parameters. The effect is of the same order of magnitude of the smoothing parameter. This suggest the following guidelines for setting the smoothing parameter: η must be smaller that the error rate of the sequencing. Since the expected error rate ε in our case is around 6 × 10−4 a value of η = 10−5 is a reasonable choice for the smoothing parameter.
The concentration parameter s of the Dirichlet distributions for the sequenced data from the control experiment is a measure of the quality of the inference: when s > 1 the inference may be considered meaningful. Sites with s ≤ 1 may be excluded from further analysis. Sites with low value of s may happen due poor coverage and total conservation (all reads with the same nucleotide at that position).
Histogram of the complementary probabilities of the control data. The complementary probability per site is defined as p comp = 1 − max{p A, p T, p C, p G} and it depends only on the probability distribution of each site. The horizontal axis shows the values of complementary probabilities and the vertical axis the proportions of sites. The histogram contains the sites with p comp < 0.02, which comprises 98.5 % of all genome. These are the sites that have a unique dominant nucleotide with probability greater or equal than 0.98. The remaining 1.5 % sites are the ones displaying some variability in the distribution of nucleotides
Summary statistics of the complementary probability (p comp) and the concentration parameter (s) of the control experiment, after removal of the genomic positions with concentration parameter below 1
Statistics | p comp | s |
---|---|---|
Mean | 0.00260 | 3.43 |
Deviation | 0.01824 | 0.50 |
Median | 0.00066 | 3.60 |
Mode | 0.00001 | 3.74 |
Minimum | 0.00001 | 1.01 |
Maximum | 0.49242 | 6.26 |
The expectation MEAN(p comp), which is very sensitive to the long tail, is a reasonable conservative choice of a cut off value for noise filtering, a more conservative choice would be MEAN(p comp) + SD − MEAN(p comp). While these choices provide uniform cut off along the genome it is possible to use the individual Dirichlet distributions at each site to construct a more refined cut off function. Finally, the cut off value for p comp can be used to obtain a cut off value for the variational distance, since the cut off value of vd is twice the cut off value of p comp (vd is a piece-wise linear function of the probabilities).
Variational distance (vd) between the control data and an experimental condition along the genome. The variational distance per site is defined by \(vd = \left| {p_{A} - p_{A}^{\prime } } \right| + \left| {p_{T} - p_{T}^{\prime } } \right| + \left| {p_{C} - p_{C}^{\prime } } \right| + \left| {p_{G} - p_{G}^{\prime } } \right|\), where \(\left( {p_{A} ,p_{T} ,p_{C} ,p_{G} } \right)\) is the probability distribution per site in the control data and \(\left( {p_{A}^{\prime } ,p_{T}^{\prime } ,p_{C}^{\prime } ,p_{G}^{\prime } } \right)\) is the probability distribution of the corresponding site in the experimental condition. The horizontal axis shows the sites of the genome (with the LTR regions removed) and the vertical axis shows the corresponding variational distances. Applying the conservative cut-off value of 0.04 for vd one obtains the sites with significant variation
Histograms of the relative frequencies of the variational distances (vd). The horizontal axis shows the variational distances and the vertical axis shows the proportions of sites. The main panel (red) shows the sites with vd < 0.04, which comprises 98 % of all sites and the upper panel (blue) shows the sites with vd > 0.04, which comprises 2 % of all sites
Conclusions
High throughput sequencing technologies are constantly evolving and new platforms and refinements in the chemistry and base calling algorithms are constantly improving. Recently the PacBio™ sequencer has been gaining space as it produces long reads, but with a large number of randomly generated sequencing errors [59]. New approaches to sequencing using known technologies have been proposed, such as circle sequencing for Illumina [60]. We expect that the proposed approach, with slight modifications can be adopted for other technologies such as Ion Torrent™, Illumina™ (HiSeq, MiSeq and NextSeq) and PacBio™.
We have described a platform suitable to address the problem of estimation of populational diversity of RNA viruses. Based on the fact that the SOLiD™ sequencing platforms generate an extremely high number of reads allowing for a deep and extensive coverage of the data with very low error rate, we propose to measure the populational genetic diversity through a family of probability distributions indexed by the sites of the genome, each one representing the populational distribution of the diversity. This approach allowed us to avoid some very hard problems related to haplotype reconstruction (need of long reads, preliminary error filtering and assembly) and emphasize the main features of the sequencing technology used in this work, the SOLiD™ platform.
We have tested the method proposed here on samples obtained after the HIV-1 strain NL4-3 (group M, subtype B) cultivation on primary human cell cultures in many distinct viral propagation conditions, thus successfully demonstrating the capability of the method in handling large data-sets and delivering very clean results, suggesting that the software is a valuable tool for investigating the genetic diversity in viral populations. We have successfully demonstrated Tanden’s capability of handling large data-sets and delivering very clean results, suggesting that the software is a valuable tool for investigating the genetic diversity in viral populations as a complementary to some haplotype reconstruction method.
Availability and requirements
Project name: Tanden
Project web site: http://tanden.url.ph/
Operating systems: Windows
Programming language: Microsoft-C#
License: free to all users under the LGPL license
Minimum requirements: 4 GB RAM (16 GB recommended), 500 GB disk space
Third party software used: BLAST + standalone for windows.
(http://www.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
Declarations
Authors’ contributions
JPZ: contributed to the statistical analysis, developed and implemented the software, drafted the manuscript; SNB: carried out the biological analysis, drafted the manuscript; ACV: contributed to NGS and bioinformatics analysis, drafted the manuscript; GCO: contributed to NGS bioinformatics analysis, drafted the manuscript; LMRJ: conceived the statistical analysis, contributed to NGS and biological analysis, drafted the manuscript; FA: conceived the statistical analysis, developed and implemented the software, contributed to NGS and biological analysis, drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The ideas presented in this paper were developed in collaboration with Francisco A. R. Bosco, who suddenly passed away in December of 2012.
Competing interests
The authors declare that they have no competing interests.
Funding
Support was provided to: JPZ and SNB by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) [http://www.capes.gov.br]; ACV by Fogarty International Center National Institutes of Health [http://www.fic.nih.gov] (TW007012); GCO by Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) [http://www.fapemig.br] (309,312/2012-4, 306362/2012-0), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [http://www.cnpq.br], Fogarty International Center National Institutes of Health [http://www.fic.nih.gov] (TW007012); LMRJ by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) [http://www.fapesp.br] (2009/14543-0); FA Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [http://www.cnpq.br] (PQ-306362/2012-0). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 2008;9:267–76.View ArticlePubMedGoogle Scholar
- Mansky LM, Temin HM. Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J Virol. 1995;69:5087–94.PubMedPubMed CentralGoogle Scholar
- Fu Q, Mittnik A, Johnson PLF, Bos K, Lari M, Bollongino R, Sun C, Giemsch L, Schmitz R, Burger J, Ronchitelli AM, Martini F, Cremonesi RG, Svoboda J, Bauer P, Caramelli D, Castellano S, Reich D, Pääbo S, Krause J. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr Biol. 2013;23:553–9.View ArticlePubMedGoogle Scholar
- Okoro CK, Kingsley RA, Connor TR, Harris SR, Parry CM, Al-Mashhadani N, Kariuki S, Musefula CL, Gordon MA, de Pinna E, Wain J, Heyderman RS, Obaro S, Alonso PL, Mandomando I, MacLennon CA, Tapia MD, Levine MM, Tennant SM, Parkhill J, Dougan G. Intracontinental spread of human invasive Salmonella Typhimurium pathovariants in sub-Saharan Africa. Nat Genet. 2012;44:1215–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Nederbragt AJ. On the middle ground between open source and commercial software—the case of the Newbler program. Genome Biol. 2014;15:113.View ArticlePubMedPubMed CentralGoogle Scholar
- Beerenwinkel N, Zagordi O. Ultra-deep sequencing for the analysis of viral populations. Curr Opin Virol. 2011;1:413–8.View ArticlePubMedGoogle Scholar
- Schopman NCT, Willemsen M, Liu YP, Bradley T, van Kampen A, Baas F, Berkhout B, Haasnoot J. Deep sequencing of virus-infected cells reveals HIV-encoded small RNAs. Nucleic Acids Res. 2012;40:414–27.View ArticlePubMedPubMed CentralGoogle Scholar
- Beerenwinkel N, Günthard HF, Roth V, Metzner KJ. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol. 2012;3:16.View ArticleGoogle Scholar
- Mardis E. Next-generation sequencing technologies. In: Current topics in genome analysis. National Human Genome Research Institute. 2014. p. 1–26. http://www.genome.gov/12514288.
- Mangul S, Wu NC, Mancuso N, Zelikovsky A, Sun R, Eskin E. Accurate viral population assembly from ultra-deep sequencing data. Bioinforma Oxf Engl. 2014;30:i329–37.View ArticleGoogle Scholar
- Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA. 2011;108:9530–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Huang A, Kantor R, DeLong A, Schreier L, Istrail S. QColors: an algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads. In Silico Biol. 2011;11:193–201.PubMedGoogle Scholar
- Töpfer A, Marschall T, Bull RA, Luciani F, Schönhuth A, Beerenwinkel N. Viral Quasispecies Assembly via Maximal Clique Enumeration. PLoS Comput Biol. 2014;10:e1003515.View ArticlePubMedPubMed CentralGoogle Scholar
- Giallonardo FD, Töpfer A, Rey M, Prabhakaran S, Duport Y, Leemann C, Schmutz S, Campbell NK, Joos B, Lecca MR, Patrignani A, Däumer M, Beisel C, Rusert P, Trkola A, Günthard HF, Roth V, Beerenwinkel N, Metzner KJ. Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations. Nucleic Acids Res. 2014;48:e115.View ArticleGoogle Scholar
- Kumar S, Tamura K, Nei M. MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform. 2004;5:150–63.View ArticlePubMedGoogle Scholar
- Wallace IM, O’Sullivan O, Higgins DG. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics. 2005;21:1408–14.View ArticlePubMedGoogle Scholar
- Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–45.View ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. Subgroup 1000 genome project data processing: the sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS One. 2009;11:e7767.View ArticleGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Wan-Ping Lee MPS. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One. 2014;9:e90581.View ArticleGoogle Scholar
- Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011;56:406–14.View ArticlePubMedGoogle Scholar
- Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009;6:6–12.View ArticleGoogle Scholar
- Goya R, Sun MG, Morin RD, Leung G, Ha G, Wiegand KC, Senz J, Crisan A, Marra MA, Hirst M, Huntsman D, Murphy KP, Aparicio S, Shah SP. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010;26:730–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Prosperi MCF, Prosperi L, Bruselles A, Abbate I, Rozera G, Vincenti D, Solmone MC, Capobianchi MR, Ulivi G. Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing. BMC Bioinform. 2011;12:5.View ArticleGoogle Scholar
- Willerth SM, Pedro HAM, Pachter L, Humeau LM, Arkin AP, Schaffer DV. Development of a low bias method for characterizing viral populations using next generation sequencing technology. PLoS One. 2010;5:e13564.View ArticlePubMedPubMed CentralGoogle Scholar
- Zagordi O, Däumer M, Beisel C, Beerenwinkel N. Read length versus depth of coverage for viral quasispecies reconstruction. PLoS One. 2012;7:e47046.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW. Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res. 2007;17:1195–201.View ArticlePubMedPubMed CentralGoogle Scholar
- Fischer W, Ganusov VV, Giorgi EE, Hraber PT, Keele BF, Leitner T, Han CS, Gleasner CD, Green L, Lo CC, Nag A, Wallstrom TC, Wang S, McMichael AJ, Haynes BF, Hahn BH, Perelson AS, Borrow P, Shaw GM, Bhattacharya T, Korber BT. Transmission of single HIV-1 genomes and dynamics of early immune escape revealed by ultra-deep sequencing. PLoS One. 2010;5:e12303.View ArticlePubMedPubMed CentralGoogle Scholar
- Lataillade M, Chiarella J, Yang R, Schnittman S, Writz V, Uy J, Seekins D, Krystal M, Mancini M, McGrath D, Simen B, Egholm M, Kozal M. Prevalence and clinical significance of HIV drug resistance mutations by ultra-deep sequencing in antiretroviral-naive subjects in the CASTLE study. PLoS One. 2010;5:e10952.View ArticlePubMedPubMed CentralGoogle Scholar
- Tsibris AMN, Korber B, Arnaout R, Russ C, Lo C-C, Leitner T, Gaschen B, Theiler J, Paredes R, Su Z, Hughes MD, Gulick RM, Greaves W, Coakley E, Flexner C, Nusbaum C, Kuritzkes DR. Quantitative deep sequencing reveals dynamic HIV-1 escape and large population shifts during CCR5 antagonist therapy in vivo. PLoS One. 2009;4:e5683.View ArticlePubMedPubMed CentralGoogle Scholar
- Bruselles A, Rozera G, Bartolini B, Prosperi M, Del Nonno F, Narciso P, Capobianchi MR, Abbate I. Use of massive parallel pyrosequencing for near full-length characterization of a unique HIV Type 1 BF recombinant associated with a fatal primary infection. AIDS Res Hum Retroviruses. 2009;25:937–42.View ArticlePubMedGoogle Scholar
- Eshleman SH, Hudelson SE, Redd AD, Wang L, Debes R, Chen YQ, Martens CA, Ricklefs SM, Selig EJ, Porcella SF, Munshaw S, Ray SC, Piwowar-Manning E, McCauley M, Hosseinipour MC, Kumwenda J, Hakim JG, Chariyalertsak S, de Bruyn G, Grinsztejn B, Kumarasamy N, Makhema J, Mayer KH, Pilotto J, Santos BR, Quinn TC, Cohen MS, Hughes JP. Analysis of genetic linkage of HIV from couples enrolled in the HIV prevention trials network 052 trial. J Infect Dis. 2011;204:1918–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, Ryan EM, Boutwell CL, Power KA, Brackney DE, Pesko KN, Levin JZ, Ebel GD, Allen TM, Birren BW, Henn MR. Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Comput Biol. 2012;8:e1002417.View ArticlePubMedPubMed CentralGoogle Scholar
- Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Eriksson N, Pachter L, Mitsuya Y, Rhee S-Y, Wang C, Gharizadeh B, Ronaghi M, Shafer RW, Beerenwinkel N. Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008;4:e1000074.View ArticlePubMedPubMed CentralGoogle Scholar
- Westbrooks K, Astrovskayaa I, Campob D, Khudyakovb Y, Bermanc P, Zelikovsky A. HCV quasispecies assembly using network flows. In: Bioinformatics research and applications. Berlin-Heidelberg: Springer-Verlag; 2008:159–70.Google Scholar
- Madsen R, Kauchak D, Elkan C. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd international conference on machine learning. Bonn; 2005:545–52.Google Scholar
- Bayes M, Price M. An essay towards solving a problem in the doctrine of chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philos Trans. 1763,53:370–418.Google Scholar
- Pearson K. The fundamental problem of practical statistics. Biometrika. 1920;13:1–16.View ArticleGoogle Scholar
- Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Stat Sci. 2001;16:101–33.Google Scholar
- Verbist B, Clement L, Reumers J, Thys K, Vapirev A, Talloen W, Wetzels Y, Meys J, Aerssens J, Bijnens L, Thas O. ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering. BMC Bioinform. 2015;16:59.View ArticleGoogle Scholar
- Domingo E, Sheldon J, Perales C. Viral quasispecies evolution. Microbiol Mol Biol Rev. 2012;76:159–216.View ArticlePubMedPubMed CentralGoogle Scholar
- Biosystems A. Applied biosystems SOLiD(TM) 3 system: instrument operation guide. 2009.Google Scholar
- Nascimento-Brito S, Paulo Zukurov J, Maricato JT, Volpini AC, Salim ACM, Araújo FMG, Coimbra RS, Oliveira GC, Antoneli F, Janini LMR. HIV-1 tropism determines different mutation profiles in proviral DNA. PLoS One. 2015;10:e0139037.View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipmanl DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.View ArticlePubMedGoogle Scholar
- Ng KW, Tian G-L, Tang M-L. Dirichlet and related distributions: theory methods and applications, vol. 889. New York: Wiley; 2011.View ArticleGoogle Scholar
- Wicker N, Muller J, Kalathura RK, Pocha O. A maximum likelihood approximation method for Dirichlet’s parameter estimation. Comput Stat Data Anal. 2008;52:1315–22.View ArticleGoogle Scholar
- Press WHH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes: the art of scientific computing. 3rd ed. Cambridge: Cambridge University Press; 2007.Google Scholar
- Narayanan A. A note on parameter estimation in the multivariate beta distribution. Comput Math Appl. 1992;24:11–7.View ArticleGoogle Scholar
- Ronning G. Maximum likelihood estimation of dirichlet distributions. J Stat Comput Simul. 1989;32:215–21.View ArticleGoogle Scholar
- Efron B, Gong G. A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation. Am Stat. 1983;37:36–48.Google Scholar
- Efron B. The jackknife, the bootstrap, and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics; 1987.Google Scholar
- Pukelsheim F. The three sigma rule. Am Stat. 1994;48:88.Google Scholar
- Kendall MG, O'Hagan A, Foster, J. Kendall’s advanced theory of statistics, vol 2B-Bayesian Inference, 2nd ed. London: Edward Arnold Press; 2004.Google Scholar
- Peckham HE, McLaughlin SF, Ni JN, Rhodes MD, Malek JA, McKernan KJ, Blanchard AP. SOLiD sequencing and 2-base encoding. Poster. USA: Applied Biosystems; 2007.Google Scholar
- Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Genome Biol. 2013;14:405.View ArticlePubMedGoogle Scholar
- Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH, Sawyer SL. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci. 2013;110:19872–7.View ArticlePubMedPubMed CentralGoogle Scholar