Skip to main content

A normalization strategy for comparing tag count data

Abstract

Background

High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses, enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for a more accurate subsequent analysis of differential gene expression. Development of a more robust normalization method is desirable for identifying the true difference in tag count data.

Results

We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq) with their default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC) as a measure for both sensitivity and specificity. We found that packages using our strategy in the data normalization step overall performed well. This result was also observed for a real experimental dataset.

Conclusion

Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can widely be applied to other types of tag count data and to microarray data.

Background

Development of next-generation sequencing technologies has enabled biological features such as gene expression and histone modification to be quantified as tag count data by ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses [1, 2]. Different from hybridization-based microarray technologies [3, 4], sequencing-based technologies do not require prior information about the genome or transcriptome sequences of the samples of interest [5]. Therefore, researchers can profile the expression of not only well-annotated model organisms but also poorly annotated non-model organisms. RNA-seq in such organisms enables the gene structures and expression levels to be determined.

One important task for RNA-seq is to identify differential expression (DE) for genes or transcripts. Similar to microarray analysis, we typically start the analysis with a so-called "gene expression matrix," where each row indicates the gene (or transcript), each column indicates the sample (or library), and each cell indicates the number of reads mapped to the gene in the sample. In general, there are two major factors for accurately quantifying and normalizing RNA-seq data: gene length and sequencing depth (or total read counts). Normalization by gene length is important for comparing different genes within a sample because longer genes tend to have more reads to be sequenced [6]. Previous approaches for normalizing length include defining an effective length of a gene that may have two or more transcript isoforms of different lengths, and normalizing by the length [7–11].

Normalization by sequencing depth is particularly important for comparing genes in different samples because different samples generally have different total read counts. Previous approaches include (i) global scaling so that a summary statistic such as the mean or upper-quartile value of read counts for each sample (or library) becomes a common value and (ii) standardization of distribution so that the read count distributions become the same across samples [12–15]. Some groups recently reported that over-representation of genes with higher expression in one of the samples, i.e., biased differential expression, has a negative impact on data normalization and consequently can lead to biased estimates of true differentially expressed genes (DEGs) [15, 16]. To reduce the effect of such genes on the data normalization step, Robinson and Oshlack reported a simple but effective global scaling method, the trimmed mean of M values (TMM) method, where a scaling factor for the normalization is calculated as a weighted trimmed mean of the log ratios between two classes of samples (i.e., Samples A vs. B) [16]. The concept of the TMM method is the basis for developing our normalization strategy.

In this paper, we focus on normalization related to sequencing depth as well as the TMM normalization method. We believe the TMM method can be improved. Consider, for example, a hypothetical dataset containing a total of 1000 genes, where (i) 200 genes (i.e., 200/1000 = 20%) are detected as DEGs when comparing Samples A vs. B (PDEG = 20%), (ii) 180 of the 200 DEGs are highly expressed in Sample A (i.e., PA = 180/200 = 90%), (iii) the dataset can be perfectly normalized by applying a normalization factor calculated based only on the remaining 800 non-DEGs, and (iv) individual DEGs (or non-DEGs) have a negative (or positive) impact on calculation of the normalization factor. In this case, the two parameters should ideally be estimated as PDEG = 20% and PA = 90%. Currently, the TMM method implicitly uses fixed values for these two parameters (i.e., PDEG = 60 and PA = 50) unless users explicitly provide arbitrary values [16, 17]. This is probably because an automatic estimation of the PDEG value is practically difficult.

Hardcastle and Kelly [18] recently proposed an R [19] package, baySeq, for differential expression analysis of RNA-seq data. A notable advantage of this method is that an objective PDEG value is produced by calculating multiple models of differential expression. This method also inspired us in our improvement of the normalization of RNA-seq data. Our normalization strategy, named TbT, consists of TMM [16] and baySeq [18], used twice and once respectively in a TMM-baySeq-TMM pipeline. We show the importance of estimating the PDEG value according to the true PDEG value for individual datasets. The results were obtained using simulated and real datasets.

Results and Discussion

RNA-seq data must be normalized before differential expression analysis can be conducted on them. Some R packages exist for comparing two groups of samples [17, 18, 20, 21], and each package uses its own normalization method and gene ranking algorithm. For example, the R package edgeR [17] uses the TMM method [16] for data normalization and an exact test for negative binomial (NB) distribution [22] for gene ranking. A good normalization method coupled with gene ranking methods should produce good ranked gene lists where true DEGs can easily be detected as top-ranked and non-DEGs are bottom-ranked, when all genes are ranked according to the degree of DE.

Following from our previous study [23–25], the area under the receiver operating characteristic (ROC) curve (i.e., AUC) values were used for evaluating individual combinations based on sensitivity and specificity simultaneously. A good combination should therefore have a high AUC value (i.e., high sensitivity and specificity). In the remainder of this paper, we first describe our normalization strategy (called TbT). We then evaluate a total of eight package combinations: four R packages for differential expression analysis (edgeR [17], DESeq [20], baySeq [18], and NBPSeq [21]) with default normalization settings (which we call edgeR/default, DESeq/default, baySeq/default, and NBPSeq/default) and the same four packages with TbT normalization (i.e., edgeR coupled with TbT (edgeR/TbT), DESeq/TbT, baySeq/TbT, and NBPSeq/TbT). Finally, we discuss guidelines for meaningful differential expression analysis.

Note that the execution of the baySeq package was performed using data after scaling for the reads per million (RPM) mapped reads in each sample. The procedure in the baySeq package and in the other three packages (edgeR, DESeq, and NBPSeq) is not intended for use with RPM-normalized data, i.e., the original raw count data should be used as the input. However, we found that the use of RPM-normalized data generally yields higher AUC values compared to the use of raw count data when executing the baySeq package. We also found that the use of RPM data did not positively affect the results when the other three packages were executed. Accordingly, all of the results relating to the baySeq package were obtained using the RPM-normalized data. This includes step 2 in the TbT normalization and the gene ranking of DEGs using two baySeq-related combinations (baySeq/TbT and baySeq/default).

Outline of TbT normalization strategy

The key feature of TbT is that data assigned as potential DEGs are removed before the normalization factor is calculated. We will explain the concept of TbT by using simulation data that are negative binomially distributed (three libraries from Sample A vs. three libraries from Sample B; i.e., {A1, A2, A3} vs. {B1, B2, B3}). The simulation conditions were that (i) 20% of genes were DEGs (PDEG = 20%), (ii) 90% of PDEG was higher in Sample A (PA = 90%), and (iii) the level of DE was four-fold.

The NB model is generally applicable when the tag count data are based on biological replicates. It has been noted that the variance of biological replicate read counts for a gene (V) is higher than the mean (μ) of the read counts (e.g., V = μ + ϕμ2 where ϕ > 0) and that the extra dispersion parameter ϕ tends to have large (or small) values when μ is small (or large) [20, 21]. To mimic this mean-dispersion relationship in the simulation, we used an empirical distribution of these values (μ and ϕ) calculated from Arabidopsis data available in the NBPSeq package [21]. For details, see the Methods section.

An M-A plot of the simulation data, after scaling for RPM reads in each library, is shown in Figure 1a. The horizontal axis indicates the average expression level of a gene across two groups, and the vertical axis indicates log-ratios (Sample B relative to Sample A). As shown by the black horizontal line, the median log-ratio for non-DEGs based on the RPM-normalized data (0.543) has a clear offset from zero due to the introduced DEGs with the above three conditions. Therefore, the primary aim of our method is to accurately estimate the percentage of true DEGs (PDEG) and trim the corresponding DEGs so that the median log-ratio for non-DEGs is as close to zero as possible when our TbT normalization factors are used.

Figure 1
figure 1

Outline of TbT normalization strategy. Left panel: M-A plot for negative binomially distributed simulation data from Ref. [21], after scaling for RPM mapped reads in each sample. Magenta and black dots indicate DEGs (20% of all genes; PDEG = 20%) and non-DEGs (80%), respectively. 90% of all DEGs is four-fold higher in Sample A than B (PA = 90%). Each dot represents a gene. Right panel: same plot but colored differently. TbT estimates 16.8% of PDEG using this data. Gray dots indicate genes estimated as non-DEGs by step 2 in TbT. Note that the median log-ratio for true non-DEGs when data normalization is performed using the TbT normalization factors (0.045) is closer to zero than that using the TMM normalization factors (0.170).

To accomplish this, our normalization method consists of three steps: (1) temporary normalization, (2) identification of DEGs, and (3) final normalization of data after eliminating those DEGs. We used the TMM method [16] at steps 1 and 3 and an empirical Bayesian method implemented in the baySeq package [18] at step 2. Other methods could have been used, but our choices seemed to produce good ranked gene lists with high sensitivity and specificity (i.e., a high AUC value). We observed that the median log-ratio for non-DEGs based on our TbT normalization factors (0.045) was closer to zero than the log-ratio based on the TMM normalization factors (0.170) that corresponds to the result of TbT right after step 1 (Figure 1b).

This result suggests the validity of our strategy of removing potential DEGs before calculating the normalization factor. Recall that the true values for PDEG and PA in this simulation were 20% and 90%, respectively. Our TbT method estimated 16.8% of PDEG and 76.3% of PA. We found that 64.4% of the estimated DEGs were true DEGs (i.e., sensitivity = 64.4%) and that the overall accuracy was 89.0%. Some researchers might think that the TMM method (i.e., PDEG = 60% and PA = 50%) must be able to remove many more true DEGs than our TbT method (i.e., higher sensitivity). This is true, but the TMM method tends to trim many more non-DEGs than our method (i.e., lower specificity), especially when most DEGs are highly expressed in one of the samples (corresponding to our simulation conditions with high PDEG and PA values). These characteristics for the two normalization methods and the results shown in Figure 1 indicate that the balance of sensitivity and specificity regarding the assignment of both DEGs and non-DEGs is critical. Our TbT method was originally designed to normalize tag count data for various scenarios including such biased differential expression.

The successful removal of DEGs in the data normalization step generally increases both the sensitivity and specificity of the subsequent differential expression analysis. Indeed, when an exact test implemented in the R package edgeR [17] was used in common for gene ranking, the TbT normalization method showed a higher AUC value (i.e., edgeR/TbT = 90.0%) than the default (the TMM method [16] in this package) normalization method (i.e., edgeR/default = 88.9%). We also observed the same trend for the other combinations: DESeq/TbT = 88.7%, DESeq/default = 87.4%, baySeq/TbT = 90.2%, baySeq/default = 78.2%, NBPSeq/TbT = 90.1%, and NBPSeq/default = 80.9%. These results also suggest that our TbT normalization strategy can successfully be combined with the four existing R packages and that the TbT method outperforms the other normalization methods implemented in these packages.

Simulation results

Note that different trials of simulation analysis generally yield different AUC values even if the same simulation conditions are introduced. It is important to show the statistical significance, if any, of our proposed method. The distributions of AUC values for two edgeR-related combinations (edgeR/TbT and edgeR/default) under three conditions (PA = 50, 70, and 90% with a fixed PDEG value of 20%) are shown in Figure 2. The performances between the two combinations were very similar when PA = 50% (Figure 2a; p-value = 0.95, Wilcoxon rank sum test). This is reasonable because the average estimate of the PA values by TbT in the 100 trials (49.62%) was quite close to the truth (i.e., 50%) and TMM uses a fixed PA value of 50%. The higher the PA value (> 50%) TbT estimates, the higher the performance of TbT (compared to TMM) that can be expected.

Figure 2
figure 2

Distributions of AUC values for two edgeR -related combinations. Simulation results for 100 trials under PA = (a) 50%, (b) 70%, and (c) 90%, with PDEG = 20%. Left panel: box plots for AUC values. Right panel: scatter plots for AUC values. When the performances between the two combinations are completely the same, all the points should be on the black (y = x) line. Point below (or above) the black line indicates that the AUC value from the edgeR/TbT combination is higher (or lower) than that from the edgeR/default combination.

Different from the above unbiased case (PA = 50%), we observed the obvious superiority of TbT under the other two conditions (PA = 70 and 90%). A significant improvement resulting from use of TbT may seem doubtful because of the very small difference between the two average AUC values (e.g., 90.52% for edgeR/TbT and 90.26% for edgeR/default when PA = 70%; left panel of Figure 2b), but the edgeR/TbT combination did outperform the edgeR/default combination in all of the 100 trials under the two conditions (right panels of Figures 2b and 2c), and the p-values were lower than 0.01 (Wilcoxon rank sum test).

Table 1 shows the average AUC values for the two edgeR-related combinations under the various simulation conditions (PDEG = 5-30% and PA = 50-100%). Overall, edgeR/TbT performed better than edgeR/default for most of the simulation conditions analyzed. The relative performance of TbT compared to the default method (i.e., the TMM method [16] in this case) can be seen to improve according to the increased PA values starting from 50%. This is because our estimated values for PDEG and PA are closer to the true values than the fixed values of TMM (PDEG = 60% and PA = 50%; see Table 2). The closeness of those estimations will inevitably increase the overall accuracy of assignment for DE and lead directly to the higher AUC values. This success primarily comes from our three-step normalization strategy, TbT (the TMM-baySeq-TMM pipeline).

Table 1 Average AUC values for two edgeR-related combinations.
Table 2 Estimated values for PDEG and PA by TbT.

Table 3 shows the simulation results for the other six combinations. As can be seen, TbT performed better than the individual default normalization methods implemented in the other three packages (DESeq [20], baySeq [18], and NBPSeq [21]). When we compare the results of the four default procedures (edgeR/default, DESeq/default, baySeq/default, and NBPSeq/default), the edgeR/default combination outperforms the others. This result suggests the superiority of the default normalization method (i.e., TMM) implemented in the edgeR package and the validity of our choices at steps 1 and 3 in our TbT normalization strategy. For reproducing the research, the R-code for obtaining a small portion of the above results is given in Additional file 1.

Table 3 Average AUC values for other six combinations.

Recall that the level of DE for DEGs was four-fold in this simulation framework and the shape of the distribution for introduced DEGs is the same as that of non-DEGs (left panel of Figure 1). This indicates that some DEGs introduced as higher expression in Sample A (or Sample B) can display positive (or negative) M values even after adjustment by the median M value for non-DEGs. In other words, there are some DEGs whose log-ratio signs (i.e., directions of DE) are different from the original intentions. Although the simulation framework regarding the introduction of DEGs was the same as that described in the TMM study [16], this may weaken the validity of the current simulation framework.

To mitigate this concern, we performed simulations with compatible directions of DE by adding a floor value of fold-changes (> 1.2-fold) when introducing DEGs. In this simulation, the fold-changes for DEGs were randomly sampled from "1.2 + a gamma distribution with shape = 2.0 and scale = 0.5." Accordingly, the minimum and mean fold-changes were approximately 1.2 and 2.2 (= 1.2 + 2.0 × 0.5), respectively. We confirmed the superiority of TbT under the various simulation conditions (PDEG = 5-30% and PA = 50-100%) with the above simulation framework (data not shown). An M-A plot of the simulation result when PDEG = 20% and PA = 90% is given in Additional file 2. The R-code for obtaining the full results under the simulation condition is given in Additional file 3.

Iterative normalization approach

Recall that the outperformance of TbT compared to TMM (see Table 1 and Figure 2) is by virtue of our DEG elimination strategy for normalizing tag count data and that the identification of DEGs in TbT is performed using baySeq with the TMM normalization factors at step 2. From these facts, it is expected that the accuracy of the DEG identification at step 2 can be increased by using baySeq with the TbT factors instead of the TMM factors when PA > 50%. The advanced DEG elimination procedure (the TbT-baySeq-TMM pipeline) can produce different normalization factors (say "TbT1") from the original ones. As also illustrated in Figure 3a, this procedure can repeatedly be performed until the calculated normalization factors become convergent.

Figure 3
figure 3

Results of iterative TbT approach. (a) Procedure for iterative TbT approach until the third iteration, and simulation results under PA = (b) 50%, (c) 70%, and (d) 90%, with PDEG = 20%. Left panel: accuracies of DEG identifications when step 2 in our DEG elimination strategy is performed using the following normalization factors: TMM (Default), TbT (First), TbT1 (Second), and TbT2 (Third). Right panel: AUC values when the following normalization factors are combined with the edgeR package: TbT (Default), TbT1 (First), TbT2 (Second), and TbT3 (Third).

The results under three simulation conditions (PA = 50, 70, and 90% with a fixed PDEG value of 20%) are shown in Figures 3b-d. The left panels show the accuracies of DEG identifications when step 2 in our DEG elimination procedures is performed using the following normalization factors: TMM (Default), TbT (First), TbT1 (Second), and TbT2 (Third). As expected, the iterative approach does not positively affect the results when PA = 50% (Figure 3b). Indeed, the performances between the baySeq/TMM combination (Default) and the baySeq/TbT2 combination (Third) are not statistically distinguished (p = 0.38, Wilcoxon rank sum test). Meanwhile, the use of the baySeq/TbT combination (First) can clearly increase the accuracy compared to use of the baySeq/TMM combination (Default), though the subsequent iterations do not improve the accuracies when PA = 70% (Figure 3c, left panel). An advantageous trend for the iterative approach was also observed until the second iteration (Second; the baySeq/TbT1 combination) when PA = 90% (Figure 3d, left panel).

The right panels for Figures 3b-d show the AUC values when the following normalization factors are combined with the edgeR package: TbT (Default), TbT1 (First), TbT2 (Second), and TbT3 (Third). The overall trend is the same as that of the accuracies shown in the left panels: the iterative TbT approach can outperform the original TbT approach when the degree of biased differential expression is high (PA > 50%). We confirmed the utility of the iterative approach with the other three packages (DESeq, baySeq, and NBPSeq) (data not shown). These results suggest that the iterative approach can be recommended, especially when the PA value estimated by the original TbT method is displaced from 50%.

Nevertheless, we should emphasize that the improvement of the iterative TbT approach compared to the original TbT approach is much smaller than that of the TbT compared to the default normalization methods implemented in the four R packages investigated (Figures 2 and 3). For example, the average difference of the AUC values between the edgeR/TbT3 and the edgeR/TbT is 0.02% (Figure 3c) while the average difference of the AUC values between the edgeR/TbT and the edgeR/default is 0.26% (Figure 2b), when PA = 70%. Note also that the baySeq package used in step 2 in our TbT method is much more computationally intensive than the other three packages, indicating that the n times iteration of TbT roughly requires n-fold computation time. In this sense, a speed-up of our proposed DEG elimination strategy should be performed next as future work. The R-code for obtaining a small portion of the above results is given in Additional file 4.

Real data (wildtype vs. RDR6 knockout dataset used in baySeq study)

Finally, we show results from an analysis similar to that described in Ref. [18]. In brief, Hardcastle and Kelly compared two wildtype and two RNA-dependent RNA polymerase 6 (RDR6) knockout Arabidopsis thaliana leaf samples by sequencing small RNAs (sRNAs). From a total of 70,619 unique sRNA sequences, they identified 657 differentially expressed (DE) sRNAs that uniquely match tasRNA, which is produced by RDR6, and that are decreased in RDR6 mutants and regarded as provisional true positives. Therefore, we assume that the logical values for PDEG and PA are at least 0.93% (= 657/70,619) and around 100%, respectively. In accordance with that study [18], the evaluation metric here is that a good method should be able to rank those true positives as highly as possible. Recall that the strategy for TbT is to normalize data after the elimination of such DE sRNAs for such a purpose.

The TbT estimated 9.0% of PDEG (5,495 potential DE sRNAs) and 70.2% of PA. We found that the 5,495 sRNAs included 255 of the 657 true positives. This suggests that our strategy was effective because the original percentage (657/70,619 = 0.93%) of true positives decreased ((657 - 255)/(70,619 - 5,495) = 0.62%) before the TbT normalization factor was calculated at step 3. In summary, the TbT normalization factor was calculated based on 65,124 (= 70,619 - 5,495) potentially non-DE sRNAs after 255 out of the 657 provisional DE sRNAs were eliminated.

A true discovery plot (the number of provisional true positives when an arbitrary number of top-ranked sRNAs is selected as differentially expressed) is shown in Figure 4a. Note that this figure is essentially the same as Figure five in Ref. [18], so we chose the colors for indicating individual R packages and the ranges for both axes to be as similar as possible to the original. Since the original study [18] reported that another package (DEGseq [26]) was the best when the range in the figure was evaluated, we also analyzed the package with the same parameter settings as in Ref. [18] and obtained a reproducible result for DEGseq.

Figure 4
figure 4

Results for real data. (a) Number of tasRNA-associated sRNAs (i.e., provisional true discoveries) for given numbers of top-ranked sRNAs obtained from individual combinations. Combinations of individual R packages with TbT and default normalization methods are indicated by dashed and solid lines, respectively. For easy comparison with the previous study, results of DEGseq with the same parameter settings as in the previous study are also shown (solid yellow line). (b) Full ROC plots. Plots on left side (roughly the [0.00, 0.05] region on the x-axis) are essentially the same as those shown in Figure 4a. The R-code for producing Figure 4 is available in Additional file 5.

Three combinations (baySeq/TbT, edgeR/TbT, and edgeR/default) outperformed the DEGseq package. The higher performances of these combinations were also observed from the full ROC curves (Figure 4b). The baySeq/TbT combination displayed the highest AUC value (74.6%), followed by edgeR/default (70.0%) and edgeR/TbT (69.3%). Recall that the edgeR/default combination uses the TMM normalization method [16] and that the basic strategy (i.e., potential DEGs are not used) for data normalization is essentially the same as that of our TbT. This result confirms the previous findings [15, 16]: potential DE entities have a negative impact on data normalization, and their existences themselves consequently interfere with their opportunity to be top-ranked.

Three combinations (edgeR/default, DESeq/default, and baySeq/default) performed differently between the current study and the original one [18]. The difference for the first two combinations can be explained by the different choices for the default normalization methods. Hardcastle and Kelly [18] used a simple normalization method by adjusting the total number of reads in each library for both packages with a reasonable explanation for why the recommended method (i.e., the default method we used here) implemented in the DESeq package was not used. The TMM normalization method that we used as the default in the edgeR package was probably not implemented in the package when they conducted their evaluation. We found that both procedures (i.e., edgeR and DESeq packages with library-size normalization) performed poorly on average (data not shown).

The difference between the current result (baySeq/default; solid red line in Figure 4a) and the previous result (dashed red line in Figure five in Ref. [18]) might be explained by the fact that bootstrap resampling was conducted a different number of times for estimating the empirical distribution on the parameters of the NB distribution. Although the current result was obtained using 10,000 iterations of resampling as suggested in the package, we sometimes obtained a similar result to the previous one when we analyzed baySeq/default using 1,000 iterations of resampling. We therefore determined that the previous result was obtained by taking a small sample, such as 1,000 iterations. In any case, we found that those results for the baySeq/default combination with different parameter settings were overall inferior to the baySeq/TbT combination. For reproducing the research, the R-code for obtaining the results in Figure 4 and AUC values for individual combinations is given in Additional file 5.

Conclusion

We described a strategy (called TbT as an acronym for the TMM-baySeq-TMM procedure) for normalizing tag count data. We evaluated the feasibility of TbT based on three commonly used R packages (edgeR, DESeq, and baySeq) and a recently published package NBPSeq, using a variety of simulation data and a real dataset. By comparing the default procedures recommended in the individual packages (edgeR/default, DESeq/default, baySeq/default, and NBPSeq/default) and procedures where our proposed TbT was used in the normalization step instead of the default normalization method (edgeR/TbT, DESeq/TbT, baySeq/TbT, and NBPSeq/TbT), the effectiveness of TbT has been suggested for increasing the sensitivity and specificity of differential expression analysis of tag count data such as RNA-seq.

Our study demonstrated that the elimination of potential DEGs is essential for obtaining good normalized data. In other words, the elimination of the DEGs before data normalization can increase both sensitivity and specificity for identifying DEGs. Conventional approaches consisting of two steps (i.e., data normalization and gene ranking) cannot accomplish this aim in principle. The two-step approach includes the default procedures recommended in individual packages (edgeR/default, DESeq/default, baySeq/default, and NBPSeq/default). Our proposed approach consists of a total of four steps (data normalization, DEG identification, data normalization, and DEG identification). This procedure enables potential DEGs to be eliminated before the second normalization (step 3).

Our TbT normalization strategy is a proposed pipeline for the first three steps, where the TMM normalization method is used at steps 1 and 3 and the empirical Bayesian method implemented in the baySeq package is used at step 2. This is because our strategy was originally designed to improve the TMM method, the default method implemented in the edgeR package. As demonstrated in the current simulation results comparing two groups (for example, samples A and B), the use of default normalization methods implemented in the existing R packages performed poorly in simulations where almost all the DEGs are highly expressed in Sample A (i.e., the case of PA > > 50% when the range is defined as 50% ≤ PA ≤ 100%). Although the negative impact derived from such biased differential expression gradually increases according to the increased proportion of DEGs in the data, our strategy can eliminate some of those DEGs before data normalization (Tables 1, 2, and 3). The use of the empirical Bayesian method implemented in the baySeq package primarily contributes to solving this problem.

Although we focused on expression-level data in this study, similar analysis of differences in ChIP-seq tag counts would benefit from this method. It is natural to expect that loss of the function of histone modification enzymes will lead to biased distribution of the difference between compared conditions in the corresponding ChIP-seq analysis, in a similar way to the RDR6 case. We observed relatively high performances for NBPSeq/TbT when analyzing simulation data (Tables 1 and 3) and baySeq/TbT when analyzing a real dataset (Figure 4). However, this might simply be because the simulation and real data used in this study were derived from the NBPSeq study [21] and the baySeq study [18], respectively. In this sense, the edgeR/TbT combination might be suitable because it performed comparably to the individual bests. The DEG elimination strategy we proposed here could be applied for many other combinations of methods, e.g., the use of an exact test for NB distribution [22] for detecting potential DEGs at step 2. A more extensive study with other recently proposed methods (e.g., Ref. [27]) based on many real datasets should still be performed.

Methods

All analyses were basically performed using R (ver. 2.14.1) [19] and Bioconductor [28].

Simulation details

The negative binomially distributed simulation data used in Tables 1, 2, and 3 and Figures 1, 2, and 3 were produced using an R generic function rnbinom. Each dataset consisted of 20,000 genes × 6 samples (3 of Sample A vs. 3 of Sample B). Of the 20,000 genes, the PDEG % were DEGs at the four-fold level, and PA % of the PDEG % was higher in Sample A. For example, the simulation condition for Figure 1 used 20% of PDEG and 90% of PA, giving 4,000 (= 20,000 × 0.20) DEGs, 3,600 (= 20,000 × 0.20 × 0.9) of which are highly expressed in Sample A in the simulation dataset.

The variance of the NB distribution can generally be modelled as V = μ + ϕμ2. The empirical distribution of read counts for producing the mean (μ) and dispersion (ϕ) parameters of the model was obtained from Arabidopsis data (three biological replicates for both the treated and non-treated samples) in Ref. [21]. The simulations were performed using a total of 24 combinations of PDEG (= 5, 10, 20, and 30%) and PA (= 50, 60, 70, 80, 90, and 100%) values. The full R-code for obtaining the simulation data is described in Additional file 1. The parameter param1 in Additional file 1 corresponds to the degree of fold-change.

Simulations with different types of DEG distribution were also performed in this study. The fold-change values for individual genes were randomly sampled from a gamma distribution with shape and scale parameters. Specifically, an R generic function rgamma with respective values of 2.0 and 0.5 for the shape and scale parameters was used. This roughly gives respective values of 0.0 and 1.0 for the minimum and mean fold-changes. We added an offset value of 1.2 to prevent low fold-changes for introduced DEGs, giving respective values of 1.2 and 2.2 for the minimum and mean fold-changes. The full R-code for obtaining the simulation data is described in Additional file 3. The values in param1 in Additional file 3 correspond to those parameters.

Wildtype vs. RDR6 knockout dataset used in baySeq study

The dataset was obtained by e-mail from the author of Ref. [18]. The dataset (named "rdr6_wt.RData") consists of 70,619 sRNAs × 4 samples (2 wildtype and 2 RDR6 knockout samples). Of the 70,619 sRNAs, 657 were used as true DE sRNAs whose expressions were higher in the wildtype than the RDR6 knockout samples.

Gene ranking with default procedure

Ranked gene lists according to the differential expression are pre-required for calculating AUC values. The input data for differential expression analysis using five R packages (edgeR ver. 2.4.1, DESeq ver. 1.6.1, baySeq ver. 1.8.1, NBPSeq ver. 0.1.4, and DEGseq ver. 1.6.2) is basically the raw count data where each row indicates the gene (or transcript), each column indicates the sample (or library), and each cell indicates the number of reads mapped to the gene in the sample. The execution of the baySeq package was performed using data after scaling for RPM mapped reads.

The analysis using the edgeR packages with default settings (i.e., the edgeR/default combination) was performed using four functions (calcNormFactors, estimateCommonDisp, estimateTagwiseDisp, and exactTest) in the package [17]. The TMM normalization factor can be obtained from the output object after applying the calcNormFactors function [16]. The genes were ranked in ascending order of the p-values.

The DESeq/default combination was performed using three functions (estimateSizeFactors, estimateDispersions, and nbinomTest) in the package. The genes were ranked in ascending order of the p-values adjusted for multiple-testing with the Benjamini-Hochberg procedure.

The baySeq/default combination was performed using two functions (getPriors.NB and getLikelihoods.NB) in the package [18] for the RPM data. The empirical distribution on parameters of the NB distribution was estimated by bootstrapping from the data. We took sample sizes of (i) 2,000 iterations for the simulation data shown in Tables 1, 2, and 3, Figures 1 and 2, and Additional file 2 (see Additional files 1 and 3), (ii) 5,000 iterations for the simulation data shown in Figure 3 (Additional file 4), and (iii) 10,000 iterations for real data (Additional file 5). The genes were ranked in descending order of the posterior likelihood of the model for differential expression.

The NBPSeq/default combination was performed using the nbp.test function in the package [21]. The genes were ranked in ascending order of the p-values of the exact NB test.

The analysis using the DEGseq package [26] was performed for benchmarking the current study and a previous study [18], both of which analyzed the same real dataset. There are multiple methods in the DEGseq package [26]. Following from the previous study, we used an MA plot-based method with random sampling (MARS), i.e., the DEGexp function with method = "MARS" option was used. A higher absolute value for the statistics indicates a higher degree of differential expression. Accordingly, the genes were ranked in descending order of the absolute value. Note that the execution of this package (ver. 1.6.2) was performed using R 2.13.1 because we encountered an error when executing the more recent version (ver. 1.8.0) using R 2.14.1.

TbT normalization strategy

Our proposed strategy is an analysis pipeline consisting of three steps. In step 1, the TMM normalization factors are calculated by using the calcNormFactors function in the edgeR package with the raw count data. These factors are used for calculating effective library sizes, i.e., library sizes multiplied by the TMM factors.

In step 2, potential DEGs are identified by using the baySeq package with the RPM data. Different from the above baySeq/default combination, the analysis is performed using the effective library sizes. The effective library sizes are introduced when constructing a countData object, the input data for the getPriors.NB function. By applying the subsequent getLikelihoods.NB function, the percentage of DEGs in the data (the PDEG value) and the corresponding potential DEGs can be obtained.

In step 3, TMM normalization factors are again calculated based on the raw count data after eliminating the estimated DEGs. The TbT normalization factors are defined as (the TMM normalization factors calculated in this step) × (library sizes after eliminating the DEGs)/(library sizes before eliminating the DEGs). As the TbT normalization factors are comparable with the original TMM normalization factors such as those calculated in step 1, effective library sizes can also be calculated by multiplying library sizes by the TbT factors.

The four combinations coupled with the TbT normalization strategy (edgeR/TbT, DESeq/TbT, baySeq/TbT, and NBPSeq/TbT) were analyzed to compare the above four combinations coupled with the default normalization strategy. The edgeR/TbT combination introduced the TbT normalization factors instead of the original TMM factors. The NBPSeq/TbT combination introduced the TbT normalization factors in the nbp.test function. The remaining two combinations (DESeq/TbT and baySeq/TbT) introduced the effective library sizes, i.e., the original library sizes multiplied by the TbT factors.

Abbreviations

DE:

differential expression

DEG:

differentially expressed gene

EB:

embryonic body

RPM:

reads per million (normalization)

sRNA:

small RNA

tasRNA:

TAS locus-derived small RNA

TMM:

trimmed mean of M values (method).

References

  1. Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007, 144 (1): 32-42. 10.1104/pp.107.096677

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  2. Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133-141. 10.1016/j.tig.2007.12.007

    Article  PubMed  CAS  Google Scholar 

  3. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270 (5235): 467-470. 10.1126/science.270.5235.467

    Article  PubMed  CAS  Google Scholar 

  4. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996, 14 (13): 1675-1680. 10.1038/nbt1296-1675

    Article  PubMed  CAS  Google Scholar 

  5. Asmann YW, Klee EW, Thompson EA, Perez EA, Middha S, Oberg AL, Therneau TM, Smith DI, Poland GA, Wieben ED, Kocher JP: 3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009, 10: 531- 10.1186/1471-2164-10-531

    Article  PubMed  PubMed Central  Google Scholar 

  6. Oshlack A, Wakefield MJ: Transcript length bias in RNA-seq data confounds systems biology. Biology Direct. 2009, 4: 14- 10.1186/1745-6150-4-14

    Article  PubMed  PubMed Central  Google Scholar 

  7. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226

    Article  PubMed  CAS  Google Scholar 

  8. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321 (5891): 956-960. 10.1126/science.1160342

    Article  PubMed  CAS  Google Scholar 

  9. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoforms switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  10. Lee S, Seo CH, Lim B, Yang JO, Oh J, Kim M, Lee S, Lee B, Kang C, Lee S: Accurate quantification of transcriptome from RNA-Seq data by effective length normalization. Nucleic Acids Res. 2010, 39 (2): e9-

    Article  PubMed  PubMed Central  Google Scholar 

  11. Nicolae M, Mangul S, Mandoiu II, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011, 6: 9- 10.1186/1748-7188-6-9

    Article  PubMed  PubMed Central  Google Scholar 

  12. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5 (7): 613-619. 10.1038/nmeth.1223

    Article  PubMed  CAS  Google Scholar 

  13. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 1509-1517. 10.1101/gr.079558.108

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  14. Balwierz PJ, Carninci P, Daub CO, Kawai J, Hayashizaki Y, Van Belle W, Beisel C, van Nimwegen E: Methods for analyzing deep sequencing expression data: contructing the human and mouse promoteome with deepCAGE data. Genome Biol. 2009, 10 (7): R79- 10.1186/gb-2009-10-7-r79

    Article  PubMed  PubMed Central  Google Scholar 

  15. Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94- 10.1186/1471-2105-11-94

    Article  PubMed  PubMed Central  Google Scholar 

  16. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11: R25- 10.1186/gb-2010-11-3-r25

    Article  PubMed  PubMed Central  Google Scholar 

  17. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139-140. 10.1093/bioinformatics/btp616

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  18. Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 422- 10.1186/1471-2105-11-422

    Article  PubMed  PubMed Central  Google Scholar 

  19. R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, R Foundation for Statistical computing, Vienna, Austria

    Google Scholar 

  20. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106- 10.1186/gb-2010-11-10-r106

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  21. Di Y, Schafer DW, Cumbie JS, Chang JH: The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat Appl Genet Mol Biol. 2011, 10: art24-

    Google Scholar 

  22. Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321-332.

    Article  PubMed  Google Scholar 

  23. Kadota K, Nakai Y, Shimizu K: A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008, 3: 8- 10.1186/1748-7188-3-8

    Article  PubMed  PubMed Central  Google Scholar 

  24. Kadota K, Nakai Y, Shimizu K: Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol. 2009, 4: 7- 10.1186/1748-7188-4-7

    Article  PubMed  PubMed Central  Google Scholar 

  25. Kadota K, Shimizu K: Evaluating methods for ranking differentially expressed genes applied to MicroArray Quality Control data. BMC Bioinformatics. 2011, 12: 227- 10.1186/1471-2105-12-227

    Article  PubMed  PubMed Central  Google Scholar 

  26. Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010, 26 (1): 136-138. 10.1093/bioinformatics/btp612

    Article  PubMed  Google Scholar 

  27. Bergemann TL, Wilson J: Proportion statistics to detect differentially expressed genes: a comparison with log-ratio statistics. BMC Bioinformatics. 2011, 12: 228- 10.1186/1471-2105-12-228

    Article  PubMed  PubMed Central  Google Scholar 

  28. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch f, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80- 10.1186/gb-2004-5-10-r80

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

The authors thank Dr. TJ Hardcastle for providing the dataset used in the baySeq study. This study was supported by KAKENHI (21710208 and 24500359 to KK and 22128008 to TN) from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Koji Kadota.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KK performed analyses and drafted the paper. TN provided helpful comments and refined the manuscript. KS supervised the critical discussion. All the authors read and approved the final manuscript.

Electronic supplementary material

13015_2011_145_MOESM1_ESM.R

Additional file 1: R-code for simulation analysis. After execution of this R-code with default parameter settings (i.e., rep_num = 100, param1 = 4,.., and param6 = 090), two output files named "Fig1.png" and "resultNB_020_090.txt" can be obtained. The former is the same as Figure 1. The latter output file will contain raw data for Tables 1, 2, 3 when PDEG = 20% and PA = 90%. The numbers given as rep_num, param1,..., and param6 indicate the number of trials (rep_num), degree of differential expression of fold-change (param1-fold), number of libraries for sample A (param2), number of libraries for sample B (param3), total number of genes (param4), true PDEG (param5), and true PA (param6), respectively. Accordingly, for example, respective values for param5 and param6 should be changed to "030" and "060", to obtain the raw results when PDEG = 30% and PA = 60%. (R 13 KB)

13015_2011_145_MOESM2_ESM.PPT

Additional file 2: Result of TbT using simulation data with > 1.2-fold of DEGs. Legends in this figure are essentially the same as those described in Figure 1. The difference between the two is the distributions of DEGs (magenta dots). This simulation does not have DEGs with low fold-changes (< = 1.2-fold) and the average fold-change is theoretically 2.2. The R code for obtaining the full results under the simulation condition (i.e., PDEG = 20% and PA = 90%) is given in Additional file 3. (PPT 116 KB)

13015_2011_145_MOESM3_ESM.R

Additional file 3: R-code for obtaining simulation results with > 1.2-fold of DEGs. After execution of this R-code with default parameter settings (i.e., rep_num = 100, param1 = c(1.2, 2.0, 0.5),..., and param6 = 090), two output files named "Additional2.png" and "resultNB2_020_090.txt" can be obtained. The former is the same as Additional file 2. The format of the latter output file is essentially the same as the "resultNB_020_090.txt" file obtained by executing Additional file 1. The main difference between the current code and Additional file 1 is in the parameter settings for producing the distributions of DEGs at param1. The parameter values (1.2, 2.0, and 0.5) indicated in param1 are used for the minimum fold-change (= 1.2) and for random sampling of fold-change values from a gamma distribution with shape (= 2.0) and scale (= 0.5) parameters, respectively. (R 14 KB)

13015_2011_145_MOESM4_ESM.R

Additional file 4: R-code for obtaining raw results shown in Figure 3. After execution of this R-code with default parameter settings (i.e., rep_num = 100, param1 = 4,..., and param7 = 5000), four output files named "iteration0_020_090.txt", "iteration1_020_090.txt", "iteration2_020_090.txt", and "iteration3_020_090.txt" can be obtained. The box plots for Default, First, Second, and Third shown in Figure 3 are produced using values in two columns (named "accuracy" and "AUC(edgeR/TbT)") in the first, second, third, and fourth file, respectively. The p-values were calculated based on the Wilcoxon rank sum test. (R 19 KB)

13015_2011_145_MOESM5_ESM.R

Additional file 5: R-code for producing Figure 4 and AUC values for individual combinations. We obtained an input file (named "rdr6_wt.RData") from Dr. T.J. Hardcastle (the corresponding author of Ref. [18]). After execution of this R-code, three output files (arbitrarily named "Fig4a.png", "Fig4b.png", and "AUCvalue_Fig4b.txt") can be obtained. (R 11 KB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kadota, K., Nishiyama, T. & Shimizu, K. A normalization strategy for comparing tag count data. Algorithms Mol Biol 7, 5 (2012). https://doi.org/10.1186/1748-7188-7-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1748-7188-7-5

Keywords