We propose a subtle but instructive change to the usual computational protocol for evaluating transcript abundance estimates. Specifically, we propose a quantify-then-validate approach that evaluates the quality of transcript abundance estimates directly on read-sets, analogous to the train-then-test approaches for evaluating probabilistic predictors common in natural language processing (NLP) and other fields [29, Ch. 1.3]. Instead of quantifying all available fragments and then performing evaluation with respect to complementary measurements downstream, the quantify-then-validate approach evaluates the quality of a given abundance estimate directly on a set of validation fragments withheld from inference.
We derive and adapt from NLP the notion of perplexity for transcript abundance estimation under this quantify-then-validate approach [26, 27]. Perplexity is computed given only an abundance estimate and a held-out validation set of fragments as input. Thus, perplexity evaluates the quality of abundance estimates on fragments directly, and can evaluate estimates from experimental data in the absence of ground truth. Most importantly, evaluating perplexity with the quantify-then-validate approach enables quantitative, evidence-based, cross-validated selection of hyperparameters for transcript abundance estimation methods that use them.
Perplexity for transcript abundance estimation quantifies the intuition that an abundance estimate for a given sample ought, with high probability, explain and generate the set of fragments of a technical replicate. The key observation is that the likelihood \(\mathcal {P}(\mathcal {F}\mid \mathbf {\theta })\) is simply a value that can be computed for any fragment set \(\mathcal {F}\) and any abundance estimate \(\mathbf {\theta }\) (model parameters), irrespective of whether \(\mathbf {\theta }\) is inferred from \(\mathcal {F}\). It is the context and application of the likelihood, \(\mathcal {P}(\mathcal {F}\mid \mathbf {\theta })\), that yield semantic meaning.
Given a fragment set, \({\mathbf {F}}\), over which one seeks to infer and evaluate abundance estimates, the quantify-then-validate procedure is as follows. First, partition the input set into a quantified set, \(\mathcal {F}\), and a validation set, \({\widehat{\mathcal {F}}}\). Second, quantify and infer abundance estimates (model parameters) \(\mathbf {\theta }\) given the quantified set \(\mathcal {F}\). Third, validate and compute the perplexity, \(PP({\widehat{\mathcal {F}}}, \mathbf {\theta })\)—the inverse of the geometric mean per-fragment likelihood of observing the validation set \({\widehat{\mathcal {F}}}\)—given the model parameters \(\mathbf {\theta }\). The lower the perplexity, the better the parameters \(\mathbf {\theta }\) describe the held-out fragments \({\widehat{\mathcal {F}}}\), and the better the abundance estimate parameterized by \(\mathbf {\theta }\) ought to be. In fact, if we believe that the generative model is truly descriptive of the distributions that arise from the underlying biological and technical phenomena, perplexity is, in expectation, minimized when the “true” latent parameters are inferred.
Formally, given an abundance estimate \(\mathbf {\theta }\), and a validation fragment-set \({\widehat{\mathcal {F}}}= \{\hat{f}_1, \dots , \hat{f}_{{\widehat{N}}}\}\), the perplexity for transcript abundance estimation is:
$$\begin{aligned} \begin{aligned} PP({\widehat{\mathcal {F}}}, \mathbf {\theta })&= \exp \left\{ -\frac{1}{{\widehat{N}}}\log \mathcal {P}({\widehat{\mathcal {F}}}\mid \mathbf {\theta })\right\} \\&= \exp \left\{ -\frac{1}{{\widehat{N}}}\sum ^{{\widehat{N}}}_{j=1}\log \mathcal {P}({\hat{f}}_j \mid \mathbf {\theta })\right\} , \end{aligned} \end{aligned}$$
(3)
with per-fragment likelihood,
$$\begin{aligned} \begin{aligned} \mathcal {P}({\hat{f}}_j \mid \mathbf {\theta }) = \sum ^M_{i=1} \mathcal {P}(t_i \mid \mathbf {\theta }) \cdot \mathcal {P}(\hat{f}_j \mid t_i). \end{aligned} \end{aligned}$$
(4)
Crucially, the probability \(\mathcal {P}({\hat{f}}_j \mid \mathbf {\theta })\) of observing each held-out fragment given \(\mathbf {\theta }\) is computed by marginalizing, over transcripts, the product of two terms: \(\mathcal {P}({\hat{f}}_j \mid t_i)\), which depends only on the validation set of held-out fragments, and \(\mathcal {P}(t_i \mid \mathbf {\theta })\), which depends only on the given abundance estimate.
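To make the computation concrete, the perplexity of Eqs. 3 and 4 can be sketched in a few lines. This is an illustrative sketch, not the Rust implementation described later; the dictionary-based layout of \(\mathbf {\theta }\) and the conditional probabilities is a hypothetical simplification.

```python
import math

def perplexity(val_frags, theta, cond_prob):
    """Perplexity of a validation fragment set given abundance estimate theta.

    val_frags : list of fragment identifiers (the held-out set)
    theta     : dict transcript id -> P(t_i | theta)
    cond_prob : dict fragment id -> {transcript id: P(f_j | t_i)}
    """
    log_lik = 0.0
    for f in val_frags:
        # Marginal likelihood of one fragment (Eq. 4); transcripts with
        # zero abundance under theta contribute nothing to the sum.
        p = sum(theta.get(t, 0.0) * p_ft for t, p_ft in cond_prob[f].items())
        log_lik += math.log(p)  # p == 0 here would mean an "impossible" fragment
    # Inverse geometric mean per-fragment likelihood (Eq. 3).
    return math.exp(-log_lik / len(val_frags))
```

For instance, a uniform two-transcript estimate that explains each of two held-out fragments with probability 0.5 has perplexity exactly 2.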
One particular application of the perplexity metric, which we explore here, is to select the best abundance estimate out of many candidate estimates arising from different hyperparameter settings for quantifiers. Thus, in this work, we use the range-factorized equivalence class approximation for perplexity (as in Eq. 2) throughout [28]. Given the range-factorized equivalence classes \({\widehat{\mathcal {C}}}\) induced by the validation set \({\widehat{\mathcal {F}}}\) (where \({\widehat{N}}^q\) is the number of fragments in an equivalence class \({\widehat{\mathcal {F}}}^q \in {\widehat{\mathcal {C}}}\)), the approximation is:
$$\begin{aligned} PP({\widehat{\mathcal {F}}}, \mathbf {\theta }) \approx \exp \left\{ -\frac{1}{\widehat{N}} \sum _{{\widehat{\mathcal {F}}}^q \in {\widehat{\mathcal {C}}}} {\widehat{N}}^q \cdot \log \mathcal {P}({\hat{f}}_j \mid {\widehat{\mathcal {F}}}^q, \mathbf {\theta }) \right\} , \end{aligned}$$
(5)
with approximate per-fragment likelihood,
$$\begin{aligned} \begin{aligned} \mathcal {P}({\hat{f}}_j \mid {\widehat{\mathcal {F}}}^q, \mathbf {\theta }) = \sum _{t_i \in \Omega ({\widehat{\mathcal {F}}}^q)} \mathcal {P}(t_i \mid \mathbf {\theta }) \cdot \mathcal {P}(\hat{f}_j \mid {\widehat{\mathcal {F}}}^q, t_i). \end{aligned} \end{aligned}$$
(6)
We use salmon’s selective-alignment based probabilistic model for the conditional probabilities \(\mathcal {P}({\hat{f}}_j\mid {\widehat{\mathcal {F}}}^q, t_i)\) and for the effective lengths of transcripts, since the model and equivalence class approximation salmon uses have proven to be a fast and effective way to approximate the full likelihood [17, 28]. For the scope of this work, salmon’s format for storing range-factorized equivalence classes conveniently contains all information needed to compute perplexity, with vastly smaller space requirements than storing per-fragment alignment probabilities \(\mathcal {P}({\hat{f}}_j \mid t_i)\) would require.
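Under the range-factorized approximation (Eqs. 5 and 6), the same quantity is computed per equivalence class rather than per fragment, weighting each class's log-likelihood by its fragment count. A minimal sketch follows, again with a hypothetical in-memory layout rather than salmon's on-disk format:

```python
import math

def perplexity_eqc(eq_classes, theta, cond_prob):
    """Equivalence-class approximation of perplexity (Eqs. 5-6).

    eq_classes : list of (N_q, labels) pairs, where N_q is the number of
                 validation fragments in class q and labels is Omega(F^q)
    theta      : dict transcript id -> P(t_i | theta)
    cond_prob  : dict (q, transcript id) -> P(f | F^q, t_i)
    """
    n_total, log_lik = 0, 0.0
    for q, (n_q, labels) in enumerate(eq_classes):
        # Approximate per-fragment likelihood shared by all fragments in class q.
        p = sum(theta.get(t, 0.0) * cond_prob[(q, t)] for t in labels)
        log_lik += n_q * math.log(p)
        n_total += n_q
    return math.exp(-log_lik / n_total)
```

Because the sum runs over classes instead of individual fragments, the cost scales with the number of distinct equivalence classes rather than with \({\widehat{N}}\).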
“Impossible” fragments given parameter estimates \(\mathbf {\theta }\)
We now address a perplexity-related issue that is unique to evaluating transcript abundance estimates—that an observed event in the validation set may be deemed “impossible” given model parameters \(\mathbf {\theta }\). The marginal probability, \(\mathcal {P}({\hat{f}}_j \mid \mathbf {\theta })\), of observing a fragment \({\hat{f}}_j\) in the validation set given some abundance estimate, \(\mathbf {\theta }\), may actually be zero, even if said validation fragment aligns to the reference transcriptome. This occurs exactly when all transcripts, \(t_i\), to which the validation fragment \({\hat{f}}_j\) maps are deemed unexpressed by \(\mathbf {\theta }\) (i.e. \(\mathcal {P}(t_i \mid \mathbf {\theta }) = 0\) for all such transcripts). Here, we say that \({\hat{f}}_j\) is an impossible fragment given \(\mathbf {\theta }\), and that \(\mathbf {\theta }\) calls \({\hat{f}}_j\) impossible. When impossible fragments are observed in the validation set, perplexity is not a meaningful measurement.
To illustrate how impossible fragments come to be, consider a toy example in which all fragments in a quantified set that align to transcripts A, B, or C map ambiguously to \(\{A, B\}\) or to \(\{A, C\}\). That is, no such fragments map uniquely—a phenomenon observed rather frequently for groups of similar isoforms expressed at low to moderate levels. Now, suppose that an abundance estimation model assigns all such fragments to transcript A and produces an estimate \(\mathbf {\theta }\). The quantifier may do so to satisfy a prior that prefers sparsity, or because transcript A is considerably shorter than transcripts B and C, which gives it a higher conditional probability under a length-normalized model. In this case, the marginal probability, \(\mathcal {P}({\hat{f}}_j \mid \mathbf {\theta })\), of observing a validation fragment \({\hat{f}}_j\) that maps to \(\{B,C\}\) is exactly zero given the parameters \(\mathbf {\theta }\).
As an example, we randomly withhold varying percentages of fragments from one sample (SRR1265495) as validation sets and use all remaining fragments to estimate transcript abundances with salmon’s default model (i.e. the VBEM model using a prior size of 0.01 reads-per-transcript). Figure 1 shows that at all partitioned percentages, impossible fragments in the validation set are prevalent with respect to estimated abundances. In fact, due to the prevalence of impossible fragments, perplexity as written in Eq. 5 is undefined (or infinite) for all estimates and all validation sets in the experiments below. An important observation in both the toy and experimental examples is that there likely exist better abundance estimates that would call fewer fragments impossible, while still assigning high likelihood to the rest of the (possible) fragments. For example, an abundance estimate that reserves even some small probability mass for transcript B in the toy example would not call the validation fragments in question impossible.
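The condition that makes a fragment impossible is purely set-theoretic: every transcript it maps to has zero estimated abundance. Counting such fragments over validation equivalence classes can be sketched as follows; the data layout is a hypothetical simplification:

```python
def count_impossible(eq_classes, theta):
    """Count validation fragments called "impossible" by the estimate theta.

    eq_classes : list of (N_q, labels) pairs from the validation set
    theta      : dict transcript id -> estimated probability P(t_i | theta)

    A class is impossible when every transcript in its label set is deemed
    unexpressed, i.e. P(t_i | theta) == 0 for all of them.
    """
    return sum(n_q for n_q, labels in eq_classes
               if all(theta.get(t, 0.0) == 0.0 for t in labels))
```

In the toy example above, an estimate concentrating all mass on transcript A calls every fragment mapping to \(\{B, C\}\) impossible.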
Why perplexities need to be smoothed
The problem with impossible fragments is not only that they exist. The problem is that, for a fixed validation fragment set, perplexity deems an abundance estimate that calls even one fragment impossible equally as bad as an abundance estimate that calls all fragments impossible. Here, both estimates would have unbounded perplexity since the validation set has zero likelihood given each estimate. However, the former ought be preferred over the latter.
Other fields that have adopted perplexity (e.g. natural language processing) usually sidestep the issue of impossible events entirely, either by construction, working only with smoothed probabilistic models in which no event has probability zero, or by pre-processing, removing rare words from input language corpora. However, neither strategy is available or appropriate for evaluating transcript abundance estimates. It is neither reasonable nor useful to modify each of the many modern quantifiers to produce smooth outputs (outputs in which no transcript has truly zero abundance), and fragments and transcripts cannot be pre-processed away since the set of expressed transcripts cannot be identified a priori. One may also be tempted to simply remove impossible fragments from a validation set, \({\widehat{\mathcal {F}}}\), before computing perplexity, but this too is not a valid strategy. Two different abundance estimates \(\mathbf {\theta }\) and \(\mathbf {\theta }^\prime\) may call different validation fragments in \({\widehat{\mathcal {F}}}\) impossible, and comparisons of likelihoods \(\mathcal {P}({\widehat{\mathcal {F}}}^\prime \mid \mathbf {\theta }^\prime )\) and \(\mathcal {P}({\widehat{\mathcal {F}}}\mid \mathbf {\theta })\) are only meaningful if the validation sets are the same (i.e. \({\widehat{\mathcal {F}}}= {\widehat{\mathcal {F}}}^\prime\)). Furthermore, there is no straightforward strategy to sample and hold out validation fragments so that no fragments are impossible: most validation fragments cannot be determined to be impossible prior to abundance estimation, and any non-uniform sampling strategy would alter the underlying distributions that estimators aim to infer. To compare estimates that may call different validation fragments impossible, the proposed perplexity metric (as in Eq. 5) must be smoothed. Strategies that smooth perplexities ought penalize estimates that call fragments impossible.
That is, under such smoothing strategies, impossible fragments ought result in a finite penalty rather than collapsing \(\mathcal {P}({\widehat{\mathcal {F}}}\mid \mathbf {\theta })\) to zero. Below, we detail two such smoothing strategies for computing perplexities: (a) Laplacian smoothed perplexity and (b) Good-Turing smoothed perplexity.
We schematically illustrate how a smoothed perplexity measure, using the proposed quantify-then-validate protocol, can be computed to evaluate the quality of transcript abundance estimates in Fig. 2.
Laplacian smoothed perplexity
We define Laplacian smoothed perplexity given abundance estimate \(\mathbf {\theta }\) to be the perplexity evaluated with the smoothed distribution \({\widetilde{\mathcal {P}}}_\beta (t_i \mid \mathbf {\theta })\) in place of \(\mathcal {P}(t_i \mid \mathbf {\theta })\). The Laplacian smoothing scheme smooths input abundance estimates by redistributing a small constant probability mass across the reference transcriptome. Let \(\mathcal {P}(t_i \mid \mathbf {\theta }) = \eta _i\) and M be the number of transcripts in the reference. The smoothed distribution parameterized by \(\beta\) is defined to be:
$$\begin{aligned} {\widetilde{\mathcal {P}}}_\beta (t_i \mid \mathbf {\theta }) = \frac{\eta _i + \beta }{1 + M\beta }. \end{aligned}$$
(7)
Laplacian smoothed perplexity is flexible and easy to implement, but requires the user to set a value (preferably small, e.g. \(1\times 10^{-8}\)) for the smoothing parameter \(\beta\). At the cost of not being parameter-free, Laplacian smoothed perplexity allows the user to tune the degree to which impossible fragments are penalized. The smaller the value of \(\beta\), the larger the penalty an estimate incurs for each validation fragment it calls impossible.
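Eq. 7 amounts to adding a constant pseudo-mass \(\beta\) to every transcript and renormalizing; a minimal sketch:

```python
def laplace_smooth(eta, beta):
    """Laplacian smoothing (Eq. 7) of a per-transcript distribution.

    eta  : list of probabilities P(t_i | theta), summing to 1 over M transcripts
    beta : smoothing parameter (small in practice, e.g. 1e-8)

    Each transcript gains mass beta; dividing by 1 + M*beta renormalizes.
    """
    denom = 1.0 + len(eta) * beta
    return [(e + beta) / denom for e in eta]
```

With \(\beta\) this small, formerly impossible fragments receive a tiny but non-zero likelihood, so each one contributes a large yet finite log-penalty instead of an infinite one.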
Good-Turing smoothed perplexity—an adaptive, “parameter-free” strategy
The major drawback of Laplacian smoothed perplexity is that it depends on a reasonable a priori selection of a value for the smoothing parameter \(\beta\). One further concern is that Laplacian smoothed perplexity is not adaptive and does not account for the amount of evidence from which an input estimate is derived, i.e. the read-depth or the number of quantified reads in a sample. For a fixed value of \(\beta\), the Laplacian smoothed perplexity smooths probabilities inferred from a million fragments equally as much as probabilities inferred from a trillion fragments. However, for the latter estimate that is inferred from much more data, it is more sensible to smooth and redistribute less probability mass.
For example, while varying one of salmon’s hyperparameters, Laplacian smoothed perplexities suggest the existence of locally optimal behavior when computed with a wide range of values for \(\beta\) (see Fig. 16). However, this locally optimal behavior can no longer be observed if Laplacian smoothed perplexities are computed with \(\beta =1\times 10^{-6}\).
A better, adaptive, smoothing strategy would directly estimate the probability of observing fragments from transcripts that are not expressed. Abstractly, the problem to be solved is to estimate the probabilities of observing unobserved events. Here, we turn to the Simple Good-Turing (SGT) method [30] that has been applied in a wide range of areas, including estimating the probabilities of unseen sequences in computational linguistics [30], as well as for the detection of empty droplets in droplet-based single-cell RNA sequencing protocols [31].
Below, we define the Good-Turing smoothed perplexity measure, where smoothed probabilities are derived from SGT smoothed fragment per-transcript counts.
Given frequencies over a population—i.e. the number of reads originating from each transcript—the SGT method estimates:
1. the total probability mass that ought be assigned to unseen events—the “expression” of unexpressed transcripts, and
2. the appropriate adjustments for probabilities of observed events—the adjusted probabilities for expressed transcripts.
It is not immediately obvious how to implement SGT smoothing for the purpose of smoothing transcript abundance estimates. One issue is that the SGT estimator expects as input integer-valued frequencies of observed events, while input abundance estimates for computing perplexity are real-valued estimates of per-transcript counts. For the purposes of smoothing and computing perplexity, we round the estimated number of fragments per transcript, \(c_i\), to the nearest integer and treat these as raw frequencies of events for SGT smoothing.
The SGT method also requires that the input frequencies of frequencies (i.e. the number of transcripts with the same rounded fragment count) be log-linear. Empirically, we show in Fig. 3 that rounded input abundance estimates do, indeed, follow a log-linear distribution. This confirmed log-linear relationship demonstrates the rounding step to be a reasonable approximation.
The SGT method estimates the adjusted frequencies \(r^\star\) for each event observed r times. These adjusted frequencies are then used to compute per-event (or per-transcript) probabilities. Let \(c_i\) be the rounded number of fragments per-transcript \(t_i\). Let the frequency of frequencies \(n_r~=~|\{t_i~\mid ~c_i~=~r\}|\). And let there be \({\mathbf {n}}\) total reads. The SGT method computes and outputs,
1. the adjusted frequencies, \(r^\star = (r+1) \frac{S(n_{r+1})}{S(n_r)}\); and
2. the total probability, \(P_0 = \frac{n_1}{{\mathbf {n}}}\), for observing any transcript with \(c_i = 0\).
Here, \(S(n_r)\) computes a smoothed frequency of frequencies. Frequencies of frequencies \(n_r\) must be smoothed because, in observed data, \(n_r\) is zero for many large r. The precise details for computing the smoothed \(S(n_r)\) are described in [30]. In brief, SGT smooths \(n_r\) by fitting a log-linear function of \(n_r\) against r and reading off the fitted values of \(n_r\) for “large” r.
Good-Turing smoothed perplexity is perplexity computed with the smoothed per-transcript distribution \({\widetilde{\mathcal {P}}}(t_i\mid \mathbf {\theta })\) in place of \(\mathcal {P}(t_i~\mid ~\mathbf {\theta })\). Here, the smoothed per-transcript distribution is derived from adjusted frequencies \(r^\star\) and \(P_0\).
For each “expressed” transcript \(t_i\) with count, \(c_i = r\), greater than zero, the SGT smoothed probability is proportional to the transcript’s adjusted frequency normalized by its effective length, with \({\widetilde{\mathcal {P}}}(t_i\mid \mathbf {\theta }) \propto r^\star /\tilde{\ell _i}\). The smoothed probabilities for expressed transcripts are normalized so that they sum to \((1-P_0)\).
For each “unexpressed” transcript with \(c_i = 0\), the SGT smoothed probability is proportional to the transcript’s effective length, and is derived from distributing the probability mass \(P_0\) uniformly over the effective lengths of all unexpressed transcripts in the reference. Here, the smoothed per-transcript distribution is defined \({\widetilde{\mathcal {P}}}(t_i\mid \mathbf {\theta }) \propto P_0{\ell _i}\). The smoothed probabilities for unexpressed transcripts are normalized so that they sum to \(P_0\).
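Given adjusted frequencies \(r^\star\) and the unseen mass \(P_0\) from an SGT routine, assembling the smoothed per-transcript distribution described in the two paragraphs above can be sketched as follows. The SGT estimation itself (detailed in [30]) is assumed to be done elsewhere, a single lengths vector stands in for both lengths and effective lengths, and the sample is assumed to contain both expressed and unexpressed transcripts:

```python
def sgt_smoothed_distribution(counts, lengths, r_star, p0):
    """Good-Turing smoothed per-transcript distribution.

    counts  : rounded fragment counts c_i per transcript
    lengths : (effective) lengths per transcript
    r_star  : dict r -> adjusted frequency r* from SGT, for observed r > 0
    p0      : total probability mass reserved for transcripts with c_i == 0

    Expressed transcripts get mass proportional to r*/length, normalized to
    sum to 1 - p0; unexpressed transcripts split p0 in proportion to length.
    """
    expressed = [r_star[c] / l if c > 0 else 0.0 for c, l in zip(counts, lengths)]
    unexpressed = [l if c == 0 else 0.0 for c, l in zip(counts, lengths)]
    z_e, z_u = sum(expressed), sum(unexpressed)
    return [(1.0 - p0) * e / z_e + p0 * u / z_u
            for e, u in zip(expressed, unexpressed)]
```

By construction the result sums to one, with exactly \(P_0\) of the mass assigned to transcripts the estimate deems unexpressed.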
For all following sections we shall use perplexity to mean Good-Turing smoothed perplexity unless stated otherwise.
Model selection using perplexity in practice
Arguably, one of the most useful outcomes of being able to evaluate the quality of abundance estimates in the absence of ground truth is the ability to perform model selection for transcript abundance estimation in experimental data. For those familiar with train-then-test experimental protocols for model selection in machine learning or NLP, model selection for transcript abundance estimation via our proposed quantify-then-validate approach is directly analogous. However, since, to our knowledge, this work is the first to propose a quantify-then-validate approach for transcript abundance estimation, we briefly detail how perplexity ought to be used in practice.
Let us consider model selection via fivefold cross-validation using perplexity given some fragment set \({\mathbf {F}}\). First, \({\mathbf {F}}\) is randomly partitioned into five equal sized, mutually exclusive validation sets, \(\{{\widehat{\mathcal {F}}}_1, \dots , {\widehat{\mathcal {F}}}_5\}\)—and quantified sets are subsequently defined, \(\mathcal {F}_i = {\mathbf {F}}-{\widehat{\mathcal {F}}}_i\). Now, suppose we desire to choose between L model configurations (e.g. from L hyperparameter settings). Then for each \(\ell\)-th candidate model, we produce a transcript abundance estimate from each i-th quantified set, \(\mathbf {\theta }^{(\ell )}_i\). To select the best out of the L candidate models, one simply selects the model that minimizes the average perplexity over the five folds, \(\frac{1}{5} \sum _i \overline{PP}({\widehat{\mathcal {F}}}_i, \mathbf {\theta }_i^{(\ell )})\).
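The fivefold procedure above can be written down directly. In the sketch below, quantify and perplexity are hypothetical placeholders for an abundance estimator (e.g. salmon run on the quantified set) and a smoothed perplexity computation, respectively:

```python
import math
import random

def select_model(fragments, candidates, quantify, perplexity, k=5):
    """Select the candidate minimizing k-fold cross-validated perplexity.

    fragments  : the full fragment set F
    candidates : L candidate configurations (e.g. hyperparameter values)
    quantify   : callable (quantified set, candidate) -> estimate theta
    perplexity : callable (validation set, theta) -> smoothed perplexity
    """
    frags = list(fragments)
    random.shuffle(frags)                    # random partition of F
    folds = [frags[i::k] for i in range(k)]  # k near-equal validation sets
    best, best_pp = None, math.inf
    for cand in candidates:
        avg_pp = 0.0
        for i, val_set in enumerate(folds):
            quant_set = [f for j, fold in enumerate(folds) if j != i for f in fold]
            avg_pp += perplexity(val_set, quantify(quant_set, cand)) / k
        if avg_pp < best_pp:
            best, best_pp = cand, avg_pp
    return best, best_pp
```

Each candidate is scored by its average held-out perplexity across the k folds, and the minimizer is returned.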
One additional practical consideration should also be noted. Given any pair of quantification and validation sets \(\mathcal {F}\) and \({\widehat{\mathcal {F}}}\), a validation fragment, \({\hat{f}}_j \in {\widehat{\mathcal {F}}}\), can be necessarily impossible. A necessarily impossible validation fragment is one that maps to a set of transcripts to which no fragments in the quantified set \(\mathcal {F}\) also map. Such a fragment will always be called impossible given any abundance estimate deriving from the quantified set \(\mathcal {F}\), since no fragments in \(\mathcal {F}\) provide any evidence that transcripts to which \({\hat{f}}_j\) map are expressed.
It is of limited meaning to evaluate estimates with respect to necessarily impossible fragments. For the purposes of this work, we shall consider the penalization of an abundance estimate only with respect to impossible fragments that are recoverable—in other words, fragments that could be assigned non-zero probability given a better abundance estimate inferable from \(\mathcal {F}\). As such, we remove necessarily impossible validation fragments from \({\widehat{\mathcal {F}}}\), given \(\mathcal {F}\), prior to computing perplexity whenever fragment sets are partitioned into validation and quantified fragment sets.
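Detecting necessarily impossible validation fragments requires only the transcript sets observed in the quantified set; a minimal sketch over equivalence classes, with a hypothetical data layout:

```python
def drop_necessarily_impossible(val_classes, quant_classes):
    """Remove validation classes that no estimate inferred from F could explain.

    val_classes, quant_classes : lists of (count, transcript label set) pairs.

    A validation class is necessarily impossible when no quantified fragment
    maps to any of its transcripts, so F provides no evidence for them.
    """
    supported = set()
    for _, labels in quant_classes:
        supported |= set(labels)
    return [(n, labels) for n, labels in val_classes
            if supported & set(labels)]
```

Classes sharing at least one transcript with the quantified set are kept, since a better estimate could assign them non-zero probability.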
Data
Sequencing Quality Control (SEQC) project data
We downloaded Illumina HiSeq 2000 sequenced data consisting of 100+100 nucleotide paired-end reads from the Sequencing Quality Control (SEQC) project [20]. SEQC samples are labeled by four different conditions \(\{A, B, C, D\}\), with condition A being Universal Human Reference RNA and B being Human Brain Reference RNA from the MAQC consortium [32], with additional spike-ins of synthetic RNA from the External RNA Control Consortium (ERCC) [33]. Conditions C and D are generated by mixing A and B in 3:1 and 1:3 ratios, respectively.
In this work, we analyze the first four replicates from each condition sequenced at the Beijing Genomics Institute (BGI)—one of three official SEQC sequencing centers. For each sample, we aggregate fragments sequenced by all lanes from the flowcell with the lexicographically smallest identifier. Quantitative PCR (qPCR) data of technical replicates for each sample in each condition are downloaded via the seqc BioConductor package.
Simulated lung transcript expression data
We simulated read-sets based on 10 sequenced healthy lung samples, with Sequence Read Archive accession numbers SRR1265{495-504} [34]. Transcript abundance estimates inferred by salmon using the --useEM flag for each sample are used as ground truth abundances for read simulation (expressed in transcripts per million (TPM) and expected read-per-transcript counts). Then, transcript abundances in samples SRR1265{495-499}, for \(10\%\) of transcripts expressed in at least one of the five samples, are artificially up- or down-regulated by a constant factor (\(2.0\times\)) to simulate differential transcript expression. We treat the resulting read-per-transcript counts as ground truth, and generate for each sample a fragment set of 100+100 nucleotide paired-end reads using Polyester at a uniform error rate of 0.001 with no sequence-specific bias [35].
Evaluation and experiments
The purpose of the experiments in this work is twofold: first, to establish the relationship and correspondence between perplexity and commonly used measures of goodness or accuracy in transcript abundance estimation; and second, to demonstrate how model and hyperparameter selection can be performed using perplexity. In particular, we perform and evaluate hyperparameter selection for salmon with respect to the prior size in the variational Bayesian expectation maximization (VBEM) model used for inference [15]. The user-selected prior size for the VBEM model in salmon encodes the prior belief in the number of reads-per-transcript expected for any inferred abundance estimate. This hyperparameter controls salmon’s preference for inferring sparse or smooth estimates—the smaller the prior size, the sparser an estimate salmon will prefer. As discussed above, prior studies on Bayesian models have not necessarily agreed on how sparse or smooth a good estimate ought to be [11, 12]—the experiments in this work aim to provide a quantitative framework to settle this disagreement.
We perform all experiments according to the proposed quantify-then-validate procedure and report results with respect to various metrics over a fivefold cross-validation protocol. We use the Ensembl human reference transcriptome GRCh37 (release 100) for all abundance estimation and analysis [36].
Evaluation versus parallel SEQC qPCR measurements
We analyze the relationship between perplexity and accurate abundance estimation in experimental data from the SEQC consortium. In SEQC data, we evaluate the accuracy of abundances estimated by salmon by comparing estimates to qPCR gene expression data on biological replicates, a coarse proxy for ground truth. We evaluate the Spearman correlation between gene expressions of qPCR-probed genes in SEQC replicates and the corresponding abundance estimates. Gene expression from estimated transcript expression is aggregated via tximport [6] with transcript-to-gene annotations from EnsDb.Hsapiens.v86 [37]. From gene expression data, Ensembl genes are mapped to corresponding Entrez IDs via biomaRt [38], and 897 genes are found to have a corresponding qPCR measurement in downloaded SEQC data. Expressions for genes with repeated entries in SEQC qPCR data are averaged.
Evaluation versus ground truth on simulated data
In simulated data, since ground truth abundances are available, we compare estimated TPMs (computed by salmon) against ground truth TPMs under two metrics.
First, we consider the Spearman correlation with respect to known expressed transcripts (i.e. transcripts with non-zero expression in ground truth abundances). We choose to evaluate Spearman correlation with respect to ground truth non-zero TPMs because the presence of many unexpressed transcripts in the ground truth means a high number of values tied at rank zero. Here, small deviations from zero can lead to large changes in rank, leading to non-trivial differences in the resulting Spearman correlation metric. We demonstrate this phenomenon with respect to the ground truth abundance of a simulated sample (SRR1265495) with a mean TPM of 5.98, in which 49% of transcripts are unexpressed (82,358 / 167,268). We report the change in Pearson correlation, \(R^2\) score, and Spearman correlation of ground truth TPMs versus ground truth TPMs perturbed with normally distributed noise at varying standard deviations. As we can see from Fig. 4, even small perturbations cause non-trivial changes in Spearman rank correlation, while changes in Pearson correlation are entirely imperceptible. The Pearson correlation, however, suffers from the well-known problem that, in long-tailed distributions spanning a large dynamic range, like those commonly observed for transcript abundances, it is largely dominated by the most abundant transcripts.
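The sensitivity of rank correlation to ties at zero is easy to reproduce. The sketch below uses a small hand-rolled Spearman correlation (average ranks for ties) on synthetic data, not the sample analyzed above: adding tiny noise to a vector that is half zeros noticeably lowers the full Spearman correlation, while the correlation restricted to the non-zero entries is unaffected.

```python
import math
import random

def ranks(xs):
    """Ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

random.seed(0)
truth = [0.0] * 50 + [float(i) for i in range(1, 51)]  # half "unexpressed"
noisy = [t + random.gauss(0.0, 0.01) for t in truth]   # tiny estimation noise
full = spearman(truth, noisy)
restricted = spearman(truth[50:], noisy[50:])          # expressed entries only
```

With this setup the restricted correlation stays at 1.0 while the full correlation drops below 0.95, purely because the noise breaks the ties at rank zero.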
Second, we complement the Spearman correlation on non-zero ground truth TPMs by reporting the area under the receiver operating characteristic curve (AUROC) for recalling ground truth zeros based on estimated abundances. While measuring Spearman correlation on the truly expressed transcripts is robust to small changes in predicted abundance near zero, it fails to account for false positive predictions, even those of non-trivial abundance. The AUROC for recalling ground truth zeros is a useful complement precisely because it is affected by such false positive predictions.
Differential expression analysis on simulated data
We perform transcript-level differential expression analysis and analyze the recall of known differentially expressed transcripts in simulated lung tissue data (see 3.6.2). We perform differential expression analysis at the transcript level with swish [39], using 20 inferential replicates from salmon. We modified salmon to ensure that prior sizes supplied via the --vbPrior flag are propagated to the Gibbs sampling algorithm. We plot receiver operating characteristic (ROC) curves and report the mean AUROC for predicting differentially expressed transcripts over multiple folds. We assign \(P = 1\) to transcripts for which swish does not assign adjusted P-values.
Evaluation of eXpress abundance estimates
We measure the change in perplexities of abundance estimates inferred by eXpress (version 1.5.1) when running 0, 1, and 2 additional rounds of the online expectation maximization (EM) optimization step. We specify the number of additional online EM steps using the --additional-online parameter. We provide to eXpress alignments to the human transcriptome computed by bowtie2 [40] using the parameters recommended by eXpress with: -a -X 600 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 --no-discordant --no-mixed.
To compute perplexity, we use the transcript effective lengths computed by eXpress for each transcript inferred to be expressed. For each transcript inferred to be unexpressed, we use transcript lengths in place of effective lengths, since eXpress sets the effective length for these transcripts to zero. We take effective counts computed by eXpress to be expected fragment per-transcript counts.
Implementation
We implement perplexity in Rust and provide snakemake [41] workflows to (a) set up quantify-validate splits of fragment-sets for K-fold cross-validation, and (b) compute perplexities of salmon abundance estimates with respect to validation fragment sets at: github.com/COMBINE-lab/perplexity. Approximate per-fragment probabilities (Eq. 6) are computed by running salmon with the options --skipQuant and --dumpEq. Code to reproduce the experiments and figures for this work is available at github.com/COMBINE-lab/perplexity-paper.