Estimation of alternative splicing isoform frequencies from RNA-Seq data
© Nicolae et al; licensee BioMed Central Ltd. 2011
Received: 18 October 2010
Accepted: 19 April 2011
Published: 19 April 2011
Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging.
In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/.
Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
Ubiquitous regulatory mechanisms such as the use of alternative transcription start and polyadenylation sites, alternative splicing, and RNA editing result in multiple messenger RNA (mRNA) isoforms being generated from a single genomic locus. Most prevalently, alternative splicing is estimated to take place for over 90% of the multi-exon human genes across diverse cell types , with as much as 68% of multi-exon genes expressing multiple isoforms in a clonal cell line of colorectal cancer origin . Not surprisingly, the ability to reconstruct full length isoform sequences and accurately estimate their expression levels is widely believed to be critical for unraveling gene functions and transcription regulation mechanisms .
Three key interrelated computational problems arise in the context of transcriptome analysis: gene expression level estimation (GE), isoform expression level estimation (IE), and novel isoform discovery (ID). Targeted GE using methods such as quantitative PCR has long been a staple of genetic studies. The completion of the human genome has been a key enabler for genome-wide GE performed using expression microarrays. Since expression microarrays have limited capability of detecting alternative splicing events, specialized splicing arrays have been developed for genome-wide interrogation of both annotated exons and exon-exon junctions. However, despite sophisticated deconvolution algorithms [4, 5], the fragmentary information provided by splicing arrays is typically insufficient for unambiguous identification of full-length transcripts [6, 7]. Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly replacing microarrays as the technology of choice for performing GE due to its wider dynamic range and digital quantitation capabilities . Unfortunately, most RNA-Seq studies to date still ignore alternative splicing or, similar to splicing array studies, restrict themselves to surveying the expression levels of exons and exon-exon junctions. The main difficulty in inferring expression levels for full-length isoforms lies in the fact that current sequencing technologies generate short reads (from a few tens to hundreds of bases), many of which cannot be unambiguously assigned to individual isoforms.
RNA-Seq analyses typically start by mapping sequencing reads onto the reference genome, transcript libraries, exon-exon junction libraries, or combinations thereof. Early RNA-Seq studies have recognized that limited read lengths result in a significant percentage of so called multireads, i.e., reads that map equally well at multiple locations in the genome. A simple (and still commonly used) approach is to discard multireads, and estimate expression levels using only the so called unique reads. Mortazavi et al.  proposed a multiread "rescue" method whereby initial gene expression levels are estimated from unique reads and used to fractionally allocate multireads, with final expression levels obtained by re-estimation based on total counts obtained after multiread allocation. An expectation-maximization (EM) algorithm that extends this scheme by repeatedly alternating between fractional read allocation and re-estimation of gene expression levels was recently proposed in .
A number of recent works have addressed the IE problem, namely isoform expression level estimation from RNA-Seq reads. Under a simplified "exact information" model,  showed that neither single nor paired read RNA-Seq data can theoretically guarantee unambiguous inference of isoform expression levels, although paired reads may be sufficient to deconvolute expression levels for the majority of annotated isoforms. The key challenge in IE is accurate assignment of ambiguous reads to isoforms. Compared to the GE context, read ambiguity is much more significant, since it affects not only multireads, but also reads that map at a unique genome location expressed in multiple isoforms. Estimating isoform expression levels based solely on unambiguous reads, as suggested, e.g., in , results in splicing-dependent biases similar to the transcript-length bias noted in , further complicating the design of unbiased differential expression tests based on RNA-Seq data. To overcome this difficulty,  proposed a Poisson model of single-read RNA-Seq data explicitly modeling isoform frequencies. Under their model, maximum likelihood estimates are obtained by solving a convex optimization problem, and uncertainty of estimates is obtained by importance sampling from the posterior distribution. Li et al.  introduced an expectation-maximization (EM) algorithm similar to that of  but applied to isoforms instead of genes. Unlike the method of , which estimates isoform frequencies only from reads that map to a unique location in the genome, the algorithm of  incorporates multireads as well. The IE problem for single reads is also tackled in , who propose an EM algorithm for inferring isoform expression levels from the read coverage of exons (reads spanning exon junctions are ignored).
The related novel isoform discovery (ID) problem is also receiving much interest in the literature. Although showing encouraging results, de novo transcriptome assembly algorithms such as [15–17] have difficulties in identifying transcripts with moderate coverage. Very recently, [18–20] proposed genome-assisted (i.e., mapping based) methods for simultaneously solving ID and IE based on paired RNA-Seq reads. The method of Feng et al.  generates isoform candidates from the splicing graph derived from annotations and reads spanning exon-exon junctions. After discarding multireads,  formulates IE for a given set of isoforms as a convex quadratic program (QP) that can be efficiently solved for each gene locus. The set of isoform candidates is iteratively refined until the p-value of the objective value of the QP, assumed to follow a χ2 distribution, exceeds an empirically selected threshold of 5%. Paired read information is not directly used in isoform frequency estimation, contributing only as secondary data to filter out false positives in the process of isoform selection. As in , Guttman et al.  construct a splicing graph from the mapped reads and filter candidate isoforms using paired-end information. Isoform specific expression levels are inferred using the method of . After performing spliced alignment of (paired) reads onto the genome using TopHat , the method of Trapnell et al. , referred to as Cufflinks, constructs a read overlap graph and generates candidate isoforms by finding a minimal size path cover via a reduction to maximum matching in a weighted bipartite graph. Reads that match multiple locations in the genome equally well are fractionally allocated to these locations, and estimation is then performed independently at different transcriptional loci, using an extension to paired reads of the methods in .
In this paper we focus on the IE problem, namely estimating isoform expression levels (interchangeably referred to as frequencies) from RNA-Seq reads, under the assumption that a complete list of candidate isoforms is available. Projects such as  and  have already assembled large libraries of full-length cDNA sequences for humans and other model organisms, and the coverage of these libraries is expected to continue to increase rapidly following ultra-deep paired-end transcriptome sequencing projects such as [19, 20] and the widely anticipated deployment of third-generation sequencing technologies such as [24, 25], which deliver reads with significantly increased length. Inferring expression at isoform level provides information for finer-resolution biological studies, and also leads to more accurate estimates of expression at the gene level by allowing rigorous length normalization. Indeed, as shown in the 'Experimental results' section, genome-wide gene expression level estimates derived from isoform level estimates are significantly more accurate than those obtained directly from RNA-Seq data using isoform-oblivious GE methods such as the widely used counting of unique reads, the rescue method of , or the EM algorithm of .
Our main contribution is a novel expectation-maximization algorithm for isoform frequency estimation from any mixture of single and paired RNA-Seq reads. A key feature of our algorithm, referred to as IsoEM, is that it exploits information provided by the distribution of insert sizes, which is tightly controlled during sequencing library preparation under current RNA-Seq protocols. Such information is not modeled in the "exact" information models of [6, 7], challenging the validity of their negative results. Guttman et al.  take into account insert lengths derived from paired read data, but only for filtering candidate isoforms in ID. Trapnell et al.  is the only other work we are aware of that exploits this information for IE, in conjunction with paired read data. We show that modeling insert sizes is highly beneficial for IE even for RNA-Seq data consisting of single reads. Insert sizes contribute to increased estimation accuracy in two different ways. On one hand, they can help disambiguate the isoform of origin for the reads. In IsoEM, insert lengths are combined with base quality scores, and, if available, read pairing and strand information to probabilistically allocate reads to isoforms during the expectation step of the algorithm. As in , the genomic locations of multireads are also resolved probabilistically in this step, further contributing to improved overall accuracy compared to methods that ignore or fractionally pre-allocate multireads. On the other hand, insert size distribution is used to accurately adjust isoform lengths during frequency re-estimation in the maximization step of the IsoEM algorithm.
We also present the results of comprehensive experiments conducted to assess the performance of IsoEM on both synthetic and real RNA-Seq datasets. These results show that IsoEM consistently outperforms existing methods under a wide range of sequencing parameters and distribution assumptions. We also report results of experiments empirically evaluating the effect of sequencing parameters such as read length, read pairing, and strand information on estimation accuracy. Our experiments confirm the surprising finding of  that, for a fixed total number of sequenced bases, longer reads do not necessarily lead to better accuracy for estimation of isoform and gene expression levels.
As with many RNA-Seq analyses, the first step of IsoEM is to map the reads. Our approach is to map them onto the library of known isoforms using any one of the many available ungapped aligners (we used Bowtie  with default parameters in our experiments). An alternative strategy is to map the reads onto the genome using a spliced alignment tool such as TopHat , as done, e.g., in [19, 20]. However, preliminary experiments with TopHat resulted in fewer mapped reads and significantly increased mapping uncertainty, despite providing TopHat with a complete set of annotated junctions. Since further increases in read length coupled with improvements in spliced alignment algorithms could make mapping onto the genome more attractive in the future, we made our IsoEM implementation compatible with both mapping approaches by always converting read alignments to genome coordinates and performing all IsoEM read-isoform compatibility calculations in genome space.
Some reads match multiple positions in the genome; we refer to each such match as an alignment (for paired-end reads, an alignment consists of the positions where the two reads in the pair align with the genome). Each alignment a can in turn be compatible with multiple isoforms that overlap at that position of the genome. During the line sweep, we compute the relative "weight" of assigning a given read/pair r to isoform j as w_{r,j} = ∑_a Q_a F_a O_a, where the sum is over all alignments of r compatible with j, and the factors of the summed products are defined as follows:
Q_a represents the probability of observing the read from the genome locations described by the alignment. This is computed from the base quality scores as Q_a = ∏_k [m_{a,k}(1 − ε_k) + (1 − m_{a,k})ε_k], where m_{a,k} = 1 if position k of alignment a matches the reference genome sequence and 0 otherwise, while ε_k denotes the error probability of the k-th base of r.
For paired-end reads, F_a represents the probability of the fragment length needed to produce alignment a from isoform j; note that the length of this fragment can be inferred from the genome coordinates of the two aligned reads and the available isoform annotation. For single reads, we can only estimate an upper bound u on the fragment length: if the alignment is on the same strand as the isoform, then u is the number of isoform annotated bases between the 5' end of the aligned read and the 3' end of the isoform; otherwise, u is the number of isoform annotated bases between the 5' end of the aligned read and the 5' end of the isoform. In this case F_a is defined as the probability of observing a fragment with length of u bases or fewer.
O_a is 1 if alignment a of r is consistent with the orientation of isoform j, and 0 otherwise. Consistency between the orientations of r and j depends on whether or not the library preparation protocol preserves the strand information. For single reads, O_a = 1 when reads are generated from fragment ends randomly or, for directional RNA-Seq, when they match the known isoform orientation. For paired-end reads, O_a = 1 if the two reads come from different strands, point to each other, and, in the case of directional RNA-Seq, the orientation of the first read matches the known isoform orientation.
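The three factors above can be combined into a short sketch of the weight computation. This is a minimal illustration, not the IsoEM code: the dictionary-based alignment representation and the function names are our own assumptions.

```python
def alignment_quality(matches, error_probs):
    # Q_a: product over read positions of (1 - eps_k) for positions that
    # match the reference and eps_k for mismatches.
    q = 1.0
    for m, eps in zip(matches, error_probs):
        q *= (1.0 - eps) if m else eps
    return q

def read_isoform_weight(alignments, frag_len_prob):
    # w_{r,j} = sum over alignments a compatible with isoform j of
    # Q_a * F_a * O_a; alignments with inconsistent orientation (O_a = 0)
    # contribute nothing.
    w = 0.0
    for a in alignments:
        if not a["orientation_ok"]:
            continue
        q = alignment_quality(a["matches"], a["error_probs"])
        w += q * frag_len_prob(a["fragment_len"])  # F_a
    return w
```

For example, a single perfect-match alignment of a 2-base read with per-base error probability 0.1 and fragment-length probability 0.5 yields a weight of 0.9 × 0.9 × 0.5.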
The expected number of fragments generated from isoform j is proportional to f(j) ∑_k p_k (l(j) − k + 1), since the number of fragments of length k is expected to be proportional to the number of valid starting positions for a fragment of that length in the isoform; here p_k denotes the probability of a fragment of length k and l(j) the length of isoform j. Thus, if the isoform of origin is known for each read, the maximum likelihood estimator for f(j) is given by c(j)/(c(1) + ... + c(N)), where c(j) = n(j)/∑_k p_k (l(j) − k + 1) denotes the length-normalized fragment coverage and n(j) is the number of fragments from isoform j. Note that the length of most isoforms is significantly larger than the mean fragment length μ typical of current sequencing libraries; for such isoforms ∑_k p_k (l(j) − k + 1) ≈ l(j) − μ + 1, and c(j) can be approximated by n(j)/(l(j) − μ + 1).
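The length normalization and the resulting maximum likelihood estimate can be sketched as follows; the helper names are illustrative, and the fragment-length distribution is passed as a simple dictionary from length to probability:

```python
def effective_length(l, frag_len_dist):
    # Expected number of valid fragment start positions in an isoform of
    # length l: sum over fragment lengths k of p_k * (l - k + 1).
    return sum(p * (l - k + 1) for k, p in frag_len_dist.items() if k <= l)

def ml_frequencies(counts, lengths, frag_len_dist):
    # c(j) = n(j) / effective_length(l(j));
    # f(j) = c(j) / (c(1) + ... + c(N)).
    cov = [n / effective_length(l, frag_len_dist)
           for n, l in zip(counts, lengths)]
    total = sum(cov)
    return [c / total for c in cov]
```

With a fixed fragment length of 10, an isoform of length 100 has 91 valid start positions, so two equal-length isoforms with 3 and 1 fragments get frequencies 0.75 and 0.25.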
E-step: Compute the expected number n(j) of reads that come from isoform j under the assumption that isoform frequencies f(j) are correct, based on weights w r, j computed as described in the previous section
M-step: For each j, set the new value of f(j) to c(j)/(c(1) + ... + c(N)), where normalized coverages c(j) are based on expected counts computed in the prior E-step
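The alternation of the two steps can be sketched as below. This is a simplified model, not the IsoEM implementation: reads are allocated proportionally to f(j)·w_{r,j}, isoform lengths are adjusted by the approximation l(j) − μ + 1, and a fixed iteration count stands in for a real convergence test.

```python
def iso_em(weights, lengths, mu, n_iter=200):
    # weights: one dict per read (or read class) mapping isoform index j
    # to the compatibility weight w_{r,j}; lengths: isoform lengths l(j).
    n_iso = len(lengths)
    adj = [max(l - mu + 1, 1.0) for l in lengths]  # adjusted lengths
    f = [1.0 / n_iso] * n_iso                      # uniform start
    for _ in range(n_iter):
        # E-step: expected read counts n(j), allocating each read to
        # compatible isoforms in proportion to f(j) * w_{r,j}.
        n = [0.0] * n_iso
        for w in weights:
            denom = sum(f[j] * wj for j, wj in w.items())
            if denom == 0.0:
                continue
            for j, wj in w.items():
                n[j] += f[j] * wj / denom
        # M-step: f(j) = c(j) / (c(1) + ... + c(N)) with c(j) = n(j)/adj(j).
        cov = [n[j] / adj[j] for j in range(n_iso)]
        total = sum(cov)
        if total == 0.0:
            break
        f = [c / total for c in cov]
    return f
```

With one read unique to the first of two equal-length isoforms and one read shared between them, the likelihood is maximized by assigning everything to the first isoform, and the iteration converges there.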
Below we describe two implementation optimizations that significantly improve the performance of IsoEM by reducing both runtime and memory usage.
The second IsoEM optimization consists of partitioning the set of reads within each compatibility component into equivalence classes. Two reads are equivalent for IsoEM if they are compatible with the same set of isoforms and their compatibility weights to the isoforms are proportional. Keeping only a single representative from each read class (with appropriately adjusted frequency) drastically reduces the number of reads kept in memory (see Figure 3(b)). As the number of reads increases, the number of read classes grows much more slowly, eventually saturating so that no new read classes appear, at which point the runtime of IsoEM becomes virtually independent of the number of reads. Indeed, in practice the runtime bottlenecks are parsing the reads, computing the compatibility graph, and detecting equivalent reads.
Find the connected components of the compatibility graph induced by the reads, and
Collapse equivalent reads into read classes with multiplicity indicating the number of reads in each class.
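The collapsing step can be sketched by hashing a normalized compatibility signature for each read; the signature encoding (weights rescaled to sum to 1, so proportional weight vectors collide) is an illustrative choice, not the IsoEM data structure:

```python
from collections import Counter

def collapse_reads(read_weights):
    # read_weights: one dict per read mapping isoform -> compatibility
    # weight. Two reads with the same isoform set and proportional
    # weights get the same signature and fall into the same class.
    classes = Counter()
    for w in read_weights:
        total = sum(w.values())
        sig = tuple(sorted((iso, round(wj / total, 12))
                           for iso, wj in w.items()))
        classes[sig] += 1  # class multiplicity
    return classes
```

Here four reads, two unique to isoform 0 (with weights 1.0 and 2.0) and two split evenly between isoforms 0 and 1, collapse into two classes of multiplicity two each.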
A straightforward approach is to solve the first problem using a union-find algorithm, then to take the reads corresponding to each connected component and remove equivalent reads, e.g., using hashing. However, there are two drawbacks to this approach:
First, all reads need to be kept in memory until all connected components have been computed.
Second, when the number of reads in a connected component is very large the number of collisions increases, which leads to poor performance.
We overcome the two problems presented above using an online version of the union-find algorithm which computes connected components and eliminates equivalent reads on the fly. This way, equivalent reads will never reside too long in memory. Also, we avoid the problem of large hash tables by using multiple smaller hash tables which are guaranteed to be disjoint.
We start our modified version of union-find with an empty set of trees. A new single-node tree is initialized every time a new isoform is found in a read class. In each node we store a hash-table of read classes. Each read is processed as follows:
If the isoforms compatible with the read correspond to nodes in more than one tree, unite the corresponding trees. The root of the tallest tree becomes the root of the union tree. Then create a new read class for this read (we can be sure it was not seen before; otherwise the isoforms would have been in the same tree) and add it to the hash table of the root node. Notice that at this point the root node is also (trivially) the Lowest Common Ancestor (LCA) of the nodes corresponding to the isoforms in the read class.
If the isoforms correspond to nodes in the same tree, find the LCA of all these nodes. If the class of the read is present in the hash table of the LCA, increment its multiplicity and then drop the read. Otherwise, create a new read class and add it to the LCA's hash table.
Notice that in the second case it suffices to look only in the LCA of the isoforms for an already existing read class. This follows immediately from the fact that we always add reads to the LCA of the nodes (isoforms) compatible with the read. Note that we cannot use path compression to speed up 'find' operations because it would alter the structure of existing trees. Thus, 'find' operations take logarithmic time in the worst case. At the end of the algorithm, each tree in the union-find forest corresponds to a connected component. The read classes in each connected component are obtained by traversing the corresponding tree and collecting all the read classes present in the nodes. At this point we are sure that all the read classes are distinct, so the collection process performs simple concatenations. To further speed up the collection process, we can safely use path compression as we traverse the trees, since we no longer care about the exact topology of the subtrees.
Each union operation takes O(1) time, so for a read with k compatible isoforms we spend at most O(k) time doing unions. By always making the root of the taller tree the root of the union, we ensure that the height of any tree is O(log n), where n is the number of nodes in the tree. Thus, finding the root of a node's tree takes O(log n) time, and processing a read with k compatible isoforms takes at most O(k log n) time. The LCA of two nodes can be computed at constant overhead when performing find operations (by marking the nodes on the paths from isoforms to root). Collecting all the read classes is sped up by using path compression. The whole collecting phase takes O(nα(n)) time, where n is the total number of isoforms and α(n) is the inverse of the Ackermann function. Overall, for q reads with an average of k isoforms per read and n total distinct isoforms, computing read classes and compatibility components using the modified union-find algorithm takes O(qk log n + nα(n)) time.
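A minimal sketch of the union-by-height structure described above, with no path compression during construction so the tree topology needed for the LCA trick is preserved (the per-node read-class hash tables are omitted):

```python
class UnionFind:
    def __init__(self):
        self.parent, self.height = {}, {}

    def add(self, x):
        # New single-node tree for a newly seen isoform.
        if x not in self.parent:
            self.parent[x] = x
            self.height[x] = 0

    def find(self, x):
        # O(log n): walk to the root without compressing paths.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.height[rx] < self.height[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # taller tree's root becomes the union root
        if self.height[rx] == self.height[ry]:
            self.height[rx] += 1
        return rx
```

Union by height guarantees that a tree with n nodes has height at most log n, which bounds each 'find' by O(log n) even without compression.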
Since we already collapse equivalent reads into read classes, we can seamlessly incorporate hexamer weights in the algorithm by slightly changing the definition of a read class' multiplicity to ∑_r w(h(r)), where h(r) denotes the starting hexamer of r and the weight w(h) is the ratio between the average frequency with which hexamer h appears in the middle of reads and its frequency at the start of reads. The effect of this correction procedure is to reduce (respectively increase) the multiplicity of reads with starting hexamers that are overrepresented (respectively underrepresented) at the beginning of reads compared to the middle of reads. The underlying assumption is that the average frequency with which a hexamer appears in the middle of reads is not affected by library preparation biases. Recent methods  also target biases surrounding the start site of the read in addition to within reads.
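The hexamer correction can be sketched as below. The function name and the exact counting windows (every offset past the first counts as "middle of the read") are illustrative assumptions, not the IsoEM definitions:

```python
from collections import Counter

def hexamer_weights(reads):
    # w(h) = frequency of hexamer h in the middle of reads divided by
    # its frequency at the start of reads; w(h) < 1 downweights reads
    # whose starting hexamer is overrepresented at read starts.
    start = Counter(r[:6] for r in reads)
    middle = Counter()
    for r in reads:
        for i in range(1, len(r) - 5):
            middle[r[i:i + 6]] += 1
    n_start = sum(start.values())
    n_mid = sum(middle.values())
    return {h: (middle[h] / n_mid) / (start[h] / n_start)
            for h in start if middle[h] > 0}
```

A read class' multiplicity then becomes the sum of w(h(r)) over its reads instead of a plain count.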
To avoid biases from incorrectly mapped reads originating from repetitive regions, IsoEM will also discard reads that overlap annotated repeats. When applying this correction, isoform lengths are automatically adjusted by subtracting the number of positions resulting in reads that would be discarded.
Single and paired-end reads were randomly generated by sampling fragments from the known isoforms. Each isoform was assigned a true frequency based on the abundance reported for the corresponding gene in the first human tissue of the GNFAtlas2 table and a probability distribution over the isoforms inside a gene cluster. Thus, the true frequency of isoform j is a(g)p(j), where a(g) is the abundance of the gene g for which j is an isoform and p(j) is the probability of isoform j among all the isoforms of g. We simulated datasets with uniform and truncated geometric (ratio r = 1/2) distributions for the isoforms of each gene. For a gene with k isoforms, p(j) = 1/k, j = 1, ..., k, under the uniform distribution. Under the truncated geometric distribution, the respective isoform probabilities are p(j) = 1/2^j for j = 1, ..., k − 1 and p(k) = 1/2^(k−1). Fragment lengths were simulated from a normal probability distribution with mean 250 and standard deviation 25.
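As a quick check, the truncated geometric probabilities can be generated and verified to sum to 1 (the last isoform absorbs the tail of the geometric series):

```python
def truncated_geometric(k):
    # p(j) = 1/2^j for j = 1, ..., k-1 and p(k) = 1/2^(k-1),
    # so the k probabilities always sum to exactly 1.
    p = [2.0 ** -j for j in range(1, k)]
    p.append(2.0 ** -(k - 1))
    return p
```

For k = 4 this gives [0.5, 0.25, 0.125, 0.125].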
We compared IsoEM to several existing algorithms for solving the IE and GE problems. For IE we included in the comparison the isoform analogs of the Uniq and Rescue methods used for GE , an improved version of Uniq (UniqLN) that estimates isoform frequencies from unique read counts but normalizes them using adjusted isoform lengths that exclude ambiguous positions, the Cufflinks algorithm of  (version 0.8.2), and the RSEM algorithm of  (version 0.6). For the GE problem, the comparison included the Uniq and Rescue methods, our implementation of the GeneEM algorithm described in , and estimates obtained by summing isoform expression levels inferred by Cufflinks, RSEM, and IsoEM. All methods use alignments obtained by mapping reads onto the library of isoforms with Bowtie  and then converting them to genome coordinates, except for Cufflinks which uses alignments obtained by directly mapping the reads onto the genome with TopHat , as suggested in .
Frequency estimation accuracy was assessed using the coefficient of determination, r2, along with the error fraction (EF) and median percent error (MPE) measures used in . However, accuracy was computed against true frequencies, not against estimates derived from true counts as in . If f̂_i is the frequency estimate for an isoform with true frequency f_i, the relative error is defined as |f̂_i − f_i|/f_i if f_i ≠ 0, 0 if f̂_i = f_i = 0, and ∞ if f̂_i ≠ 0 = f_i. The error fraction with threshold τ, denoted EF_τ, is defined as the percentage of isoforms with relative error greater than or equal to τ. The median percent error, denoted MPE, is defined as the threshold τ for which EF_τ = 50%.
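The two error measures follow directly from these definitions; the helper names below are ours, and MPE is computed as the median of the relative errors, which is the threshold at which the error fraction crosses 50%:

```python
def relative_error(est, true):
    if true != 0:
        return abs(est - true) / true
    return 0.0 if est == 0 else float("inf")

def error_fraction(est, true, tau):
    # EF_tau: percentage of isoforms with relative error >= tau.
    errs = [relative_error(e, t) for e, t in zip(est, true)]
    return 100.0 * sum(e >= tau for e in errs) / len(errs)

def median_percent_error(est, true):
    # MPE: the threshold tau for which EF_tau = 50%, i.e. the median
    # relative error.
    errs = sorted(relative_error(e, t) for e, t in zip(est, true))
    n = len(errs)
    return errs[n // 2] if n % 2 else (errs[n // 2 - 1] + errs[n // 2]) / 2
```

For estimates [1.1, 2.0, 0.0] against true values [1.0, 1.0, 0.0], the relative errors are [0.1, 1.0, 0.0], so EF_0.15 is one isoform in three and the MPE is 0.1.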
r2 for isoform and gene expression levels inferred from 30 M reads of length 25, simulated assuming uniform and geometric expression of gene isoforms, respectively.
Median percent error (MPE) and 15% error fraction (EF.15) for isoform expression levels inferred from 30 M reads of length 25 simulated assuming geometric isoform expression.
Median percent error (MPE) and 15% error fraction (EF.15) for gene expression levels inferred from 30 M reads of length 25 simulated assuming geometric isoform expression.
In addition to simulation experiments, we validated IsoEM on two real RNA-Seq datasets. The first dataset consists of two samples with approximately 8 million 27 bp Illumina reads each, generated from two human cell lines (embryonic kidney and B cells) as described in . Estimation accuracy was assessed by comparison with quantitative PCR (qPCR) expression levels determined in  for 47 genes with evidence of alternative isoform expression. To facilitate comparison with these qPCR results, expression levels were determined using transcript annotations in ENSEMBL version 46. The second dataset consists of approximately 5 million 32 bp Illumina reads per sample, generated from the RM11-1a strain of S. cerevisiae under two different nutrient conditions . Expression levels were determined using transcript annotations for the reference strain (June 2008 SGD/sacCer2) and compared against qPCR expression levels measured for 192 genes (for a total of 394 datapoints).
r2 for isoform and gene expression levels inferred from 30 M single and paired reads of length 25, respectively, simulated assuming geometric expression of gene isoforms with and without poly(A) tails.
As shown in Figure 10(b), the runtime of IsoEM scales roughly linearly with the number of fragments and is practically insensitive to the type of sequencing data (single or paired reads, directional or non-directional). IsoEM was tested on a Dell PowerEdge R900 server with 4 six-core E7450 Xeon processors at 2.4 GHz (64 bits) and 128 GB of internal memory. None of the datasets required more than 16 GB of memory to complete. However, increasing the available memory significantly decreases runtime by keeping the garbage collection overhead to a minimum. The runtimes in Figure 10 were obtained by allowing IsoEM to use up to 32 GB of memory, in which case none of the datasets took more than 3 minutes to solve.
In this paper we have introduced an expectation-maximization algorithm for isoform frequency estimation assuming a known set of isoforms. Our algorithm, called IsoEM, explicitly models insert size distribution, base quality scores, strand and read pairing information. Experiments on both real and synthetic RNA-Seq datasets generated using two different assumptions on the isoform distribution show that IsoEM consistently outperforms existing algorithms for isoform and gene expression level estimation with respect to a variety of quality metrics.
The open source Java implementation of IsoEM is freely available for download at http://dna.engr.uconn.edu/software/IsoEM/. In ongoing work we are extending IsoEM to perform allelic specific isoform expression and exploring integration of isoform frequency estimation with identification of novel transcripts using the iterative refinement framework proposed in .
MN and IIM were supported in part by NSF awards IIS-0546457, IIS-0916948, and DBI-0543365 and NIFA award 2011-67016-30331. SM and AZ were supported in part by NSF award IIS-0916401 and NIFA award 2011-67016-30331. All authors would like to thank the anonymous referees for many constructive comments that helped improve the presentation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.