Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments
 Vadim Farztdinov^{1}Email author and
 Fionnuala McDyer^{1}
DOI: 10.1186/17487188729
© Farztdinov and McDyer; licensee BioMed Central Ltd. 2012
Received: 18 June 2012
Accepted: 22 October 2012
Published: 2 November 2012
Abstract
Background
Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. This filtering approach is not a statistical test and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time the fold change by itself provide valuable information and it is important to find unambiguous ways of using this information in expression data treatment.
Results
A new method of finding differentially expressed genes, called distributional fold change (DFC) test is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signaltonoise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating curves and tested on 11 data sets from Gene Omnibus Database with independently verified differentially expressed genes and compared with the ttest and shrinkage ttest. Overall the DFC test performed the best – on average it had higher sensitivity and partial AUC and its elevation was most prominent in the low range of differentially expressed features, typical for formalinfixed paraffinembedded sample sets.
Conclusions
The distributional fold change test is an effective method for finding and ranking differentially expressed probesets on microarrays. The application of this test is advantageous to data sets using formalinfixed paraffinembedded samples or other systems where degradation effects diminish the applicability of correlation adjusted methods to the whole feature set.
Keywords
Differential expression Microarray Feature selection Fold change Statistical test ROC curve FFPEBackground
The development of technology over the past two decades has established microarrays as a standard tool for genomic research and discovery[1, 2]. Nowadays, scientists can simultaneously measure the expression of tens of thousands of genes from an experimental sample and identify those genes, which demonstrate a significant change in expression level under the impact of certain experimental conditions. Numerous methods have been proposed to determine differentially expressed genes (DEGs), see, for example[2–9] and references cited therein. In the majority of cases, the utility of these methods was demonstrated by application to the analysis of expression levels of RNA extracted from fresh frozen (FF) tissue samples. However, clinical genomic research is often focused on retrospective studies, utilizing archival samples stored in formalinfixed and paraffinembedded (FFPE) blocks^{a}. By nature of the fixation method, FFPE samples are partially degraded and contain low amounts of total RNA ([10] and references therein for more details) leading to increased expression variability[10, 11]. This RNA degradation is dependent on a number of factors, including fixation protocol, storage time and storage conditions with the resulting variability introducing a number of challenges for gene expression studies[10, 11]. Apart from high technical variance, FFPE samples typically exhibit low gene expression intensities and a compression of fold change across experimental groups relative to matched FF samples (see, for example[11]), thereby compromising the ability to detect DEGs in samples preserved in this manner. Additionally, RNA transcripts from FFPE samples degrade at different rates and to different levels[11–13], which can introduce false negative and false positive correlations between the expression levels of genes. These differential degradation effects impede the direct application of correlation adjusted methods[14, 15] to FFPE samples, and a preselection of the most stable (decaying at the same rate) genes should be considered[12]. Therefore, the development of a method dedicated to the analysis of RNA differential expression from FFPE samples is necessary to support the many studies attempting to make discoveries from the wealth of FFPE archival material available. The absence of such a method is especially surprising in the view of enormous improvement of the methods and protocols for the extraction of RNA from FFPE samples in recent years[16].
In order to shrink the large technical variance inherent in expression levels measured from FFPE tissue samples, one should have enough samples, N_{s} >> 1. Typically microarrays have very large number of probesets N_{p} > 10^{4}[17]. Therefore FFPEderived gene expression experiments fall within the N_{p} >> N_{s} >> 1 paradigm, with the associated complications for subsequent analysis[18]. If we assume that asymptotically, N_{p} → ∞, we may then introduce a dependence of distributions of variables such as fold change and total variance on the expression level and develop an approach where the significance of a gene’s differential expression estimation accounts for its expression level.
Compression of the expression distribution in FFPE samples towards the lower side[10, 11] necessitates a DEG selection method that work equally well with features at any expression level. Spanning the full expression scale will enable the selection of features with low expressions (typically comprising the main distribution of features in FFPE samples) and with high expressions.
Summarizing the requirements for successful DEG selection method for FFPE sample sets, we can say that it should work with reasonable number of samples N_{s} >> 1, pick up DEGs equally well at any expression level and be not bounded to specific preprocessing method. The same requirements are actually applicable to successive method working with samples obtained by any preservation method, be it FF or FFPE or some other[19, 20].
In the following paper, we will use term feature, instead of probeset, transcript, gene, or protein, to emphasize that the methodology presented has general applicability.
This paper presents the description of a method, called the distributional fold change (DFC) test, which is based on the analysis of the distribution of intensities of all features on a microarray mapped to a three dimensional feature space composed of the average difference of gene expression (logarithm of fold change), total variance and average expression level. It introduces a score based on signaltonoise ratio that can be used for accurate ranking of DEGs independently of the expression range they come from – high, medium or low, which is extremely important for DEGs from FFPE samples. It also allows the introduction of a statistical (and expression dependent) threshold for the fold change and in this way removes one of the drawbacks of standard methods of filtering based on fold change – the arbitrariness of a cutoff value.
We evaluate the performance of the new ranking method by comparison with the standard ttest (selected as a basic reference test) and with shrinkage CATtest[7, 14], which was shown[7] (see also[9]) to be a good representative of the set of methods[4–6] developed to stabilize gene expression variance. Account of variance in the data is very important for FFPE data sets and in the performance evaluation of DFC test we limited our comparison to only these tests. Extended comparison of AUC values obtained by DFC test with those from ttest based methods[4–7] and fold change based tests[9] is provided in Additional file1. The MATLAB source code of the DFC test program is provided in Additional file2.
Data sets with established DEGs were selected for testing as these had been previously used for comparison of different methods for detecting differential expression[8, 9]. We limited our comparison to such real life data sets in order to exclude any possibility of bias that could foster the advantage of DFC test.
Methods
Distributional fold change test: general approach
The connection between FC and d is FC = 2^{ d } when expression variances in both classes are close (and/or when expectations in (2) are replaced by medians).
Here${\stackrel{\u2015}{v}}_{\mathit{EE}}$ is an average variance of unregulated features having (nearly) the same expression (see eq. (9) below for definition of${\stackrel{\u2015}{v}}_{\mathit{EE}}$). Note that definition (4) extrapolates the variance from standard unbiased definition of variance when$v\left(X\right)={\stackrel{\u2015}{v}}_{\mathit{EE}}\left(\mu \right)$ and is equivalent to the definition from likelihood maximization when$v\left(X\right)>>{\stackrel{\u2015}{v}}_{\mathit{EE}}\left(\mu \right)$. More complicated shrinkage approaches can be applied to improve test performance on data sets with very small sample size < 10.
Next, we suppose that all features on a microarray can be considered as a mixture of unregulated (equally expressed) and regulated (differentially expressed) features. We will also suppose, for simplicity, that the logFC distribution of unregulated features d_{0} at each expression level, μ, can be described by normal distribution$N\left({\stackrel{\u2015}{d}}_{0}=0,{v}_{0}\left(\mu \right)={\sigma}_{0}{\left(\mu \right)}^{2}\right)$.
This statistic is an intermediate between the normal Zstatistic and Tstatistic because of the presence of the variance of null features logFC distribution, which is expected to be (almost) independent of the sample size. Note that this definition of significance level statistic is similar to those of moderated tstatistics, used in a series of papers on variance stabilization[7] (and references cited therein), but principally differs from them in that the additional term v_{0}(μ) in variance is defined not through the variance of mean internal variance, but mainly through the variance of null features logFC distribution and only to a limited extent through the features’ internal variance.
Even without knowing the exact statistic for the DFCscore, it can be used for ranking features and selection of a fixed number, or best fraction of features with highest score.
Null (unregulated) features distribution and variance threshold
Previously we supposed that we knew the properties of the null features distribution. Here we consider how one can establish them.
As mentioned previously, the log fold change d and total variance v_{T} depend on average expression μ. We suppose that the number of features is large and enough to accurately define these dependences, which will be exact in the limit N_{p} → ∞.
Significance level and power for testing each individual feature
Here Φ^{−1} is normal inverse cumulative distribution function. Below this threshold, all features are considered as having insufficient evidence for differential expression at the confidence level α. As this is specified for the null distribution obtained from analysis of all features on a microarray with nearly the same expression that is through sharing information across these features, the parameter α indicates the significance level of taking multiple testing into account. For α = 1 the threshold (13) turns to 0 and no information about multiple testing is included into finding differentially expressed features.
Here T^{–1} is Student's t inverse cumulative distribution function. Note that in the definition of nonnull features {d_{ D }(μ)}, the requirement for the variance to be above the threshold is also included in order to reflect that condition (11) was used to define properties of null features distribution. The condition is not directly required and is optional in software implementation^{c}.
Strictly speaking in (16.a) we should not assume that d(μ)/s(μ) follows the Student’s tdistribution as stabilized variances (4) are used to calculate s(μ) (14), but keeping in mind that Welch’s definition of degrees of freedom (15) is an approximate solution of BehrensFisher problem[21] and that correction (4) is small except in rare cases of very small number of samples, we suppose that the tdistribution is a sufficient approximation.
Thus the DFC filter incorporates three different statistical filters: the multiple testing based threshold through parameter α, the ttest conditioned on the values of α through parameter β and the variance filter. Compared with a traditional fold change filter where the threshold is arbitrarily selected, the DFC threshold is defined by the features significance level and conditional power and depends on the properties of a particular data set. This method has the advantage of being selfadjusting through the accurate estimation of the unregulated features distribution d_{0} and taking into account the d(μ) distribution of regulated features thus providing an option to impose power requirements. The two significance parameters, α and β, allow for a controlled tuning of filtering threshold.
When α = 1, the method is reduced to the selection of features by a standard ttest with threshold p_{ Th } = 2β_{ Th } combined with variance filter; when β_{ Th } = 0.5 (and α < 1) the method is reduced to selection based on the ‘Unusual Ratio’ variant of fold change method (see, for example,[2]) with internal definition of the null feature distribution. There is no need in setting restrictive values for α and β, standard settings α = 0.05 and β = 0.2 should be sufficient as their intention is to remove unregulated features. Once the (α, β_{ Th }) selection criteria are applied and unregulated features removed, ranking of differentially expressed features can be performed by DFC score (5) and used for selecting best subset of differentially expressed features.
Evaluation method
To evaluate the performance of the DFC algorithm, we use the receiver operating characteristic (ROC) curve[22]. This is a graphical plot of the parametric dependence of the fraction of true positives τ = true positive rate (TPR) on the fraction of false positives η = false positive rate (FPR) as the number of features predicted to be differentially expressed (K or, equivalently, ν = K/N_{ p }), varies. For a given range of η or τ, one ROC curve is better than another if it is lying to the northwest (τ is higher for fixed η, or η is lower for fixed τ) of the first.
as one of criteria for comparison, because it has an important statistical property: the AUC of a test is equivalent to the probability that the test will rank a randomly chosen positive instance higher than a randomly chosen negative instance[23]. AUCs and ROC curves have been used in some previous works for comparison of different feature selection tests see, for examples[7–9], and are standard metrics used for the evaluation and comparison of diagnostic tests.
Data sets from GEO database
N  GEO data set  Experiment summary/Title  N _{ A }  N _{ B }  N _{ PC }  N _{ Ka } 

1  GSE8441  Study of whether inadequate protein intake differentially affects skeletal muscle transcript levels and expression profiles in older adults[24]  11  11  9  5 
2  GSE9499  DNA methyltransferase 3B (DNMT3B) mutations in ICF syndrome[25]  15  7  77  6 
3  GSE2638 and 2639  GSE2639: HUVEC were left untreated or stimulated for 5h with 2 ng/ml TNF. Comparsion of the gene profiles revealed TNFmediated gene expression changes in HUVEC[26]. Study TNF stimulated vs controls.  7  7  13  8 
4  GSE2638 and 2639  GSE2638: HMEC cultures were left untreated or stimulated for 5h with 2 ng/ml TNF. Comparison of the gene expression profiles revealed the TNFmediated gene expression changes[26]. Study HMEC vs HUVEC  3  4  16  9 
5  GSE3860  Comparison of Hutchinson–Gilford Progeria Syndrome fibroblast cell lines to control fibroblast cell lines[27].  9  9  8  11 
6  GSE6344  Gene expression in Stage 1,2 Normal and Tumor kidney cancer[28]  10  10  19  15 
7  GSE7765  Dioxininduced gene expression changes in MCF7 human breast cancer cells[29]  3  3  13  18 
8  GSE6740_1  Comparison of transcriptional profiles of CD4+ and CD8+ T cells from HIVinfected patients and uninfected control group[30]. Study of CD4+ T cells  10  10  40  24 
9  GSE6740_2  Comparison of transcriptional profiles of CD4+ and CD8+ T cells from HIVinfected patients and uninfected control group[30]. Study of CD8+ T cells  10  10  62  25 
10  GSE6011  Expression data from quadriceps muscle of young DMD patients and age matched controls[31]  14  23  10  30 
11  GSE2531  Total RNA from two commonly used choriocarcinoma cell lines, JEG3 and BeWo, are compared in this experiment to identify differentially expressed transcripts[32].  3  4  17  36 
Total N_{ P }  284 
We use standardized partial area (SPA) curves and their ratios as the main criteria for comparison. Note that standardized partial area SPA ≤ SPA(1) = AUC and its value shows how close the performance of a method is to the performance of an ideal method in the range of FPR [0, η]. SPA can be also considered as the average TPR over the same range [0, η]. We use both AUC and SPA to assess the performance of the DFC test.
In typical for FFPE data sets situations where N_{p} >> N_{T}, ROC curves on a normal scale (η) are of little use and are much more informative on logarithmic scale; hence we present our result on log_{10}η scale.
Results
Data sets
We evaluated the performance of the DFC test using 11 publicly available Homo sapiens microarray data sets, listed in Table1, each of which have had a portion of discovered DEGs experimentally validated by a realtime polymerase chain reaction (RTPCR). They are chosen from FF sample sets, listed and described in Ref.[9]. The selection of experimental data sets was based on the requirement that total number of DEGs confirmed by RTPCR should be above ~10 (see Additional file1 for details of subset selection). Having a large number (>>1) of verified DEGs^{d} is important for building representative ROC curves and for the estimation of area and partial area under ROC curves.
It is known[8] that the majority of true DEGs verified by RTPCR in experimental studies on FF samples tend to have high expression levels. This was also exploited in some feature selection methods[9]. The DFC method is designed to pick up DEGs independent of their expression level and therefore should work in these as well as in FFPE data sets where the expression values tend to be comparatively lower.
Following[8, 9] we consider that the evaluation of results based on real experimental data sets should take precedence over those based on artificial data sets. Therefore analysis of the test performance is based on realworld experimental data sets only.
There are several methods available for preprocessing data profiled on Affymetrix microarrays[1, 34]. We used Affymetrix Expression Console with standard settings to apply two of the most frequently used preprocessing methods: MAS5[35, 36], which is designed to work on a single chip basis, and RMA[37, 38], a multiarraybased approach. As can be seen from Figure1, these two methods provide very different distributions of features in expression – variance space and we considered it sufficient to concentrate only on these two methods.
Evaluation
Within the DFC algorithm, features are ranked on the basis of the Z_{ d } score (5) and their relevance to differential expression is assessed using two criteria (13,16): fold change should have an appropriate significance level < α and power > 1 – β_{Th}. The latter two are complemented by requirement that variance should be above a specified threshold. To create continuous ROC curves we set α = 1 and β_{Th} = 0.5 and ranked features using Z_{ d } pvalues, calculated based on the assumption that Z_{ d } follows normal distribution^{e}. Specific values of α and β_{Th} define starting point on the curve and their selection is equivalent to setting appropriate cutoff pvalues. For t and shrinkage t test this is typically done by controlling the false discovery rate.
Our aim is to develop and check performance of a test for systems where technical variation is large (such as FFPE samples sets) and assessment of reliability of detecting differential expression is of extreme importance. Therefore we compared the performance of the DFC test with ttest based methods: the standard ttest and with the CATtest[14] with the ‘diagonal’ option^{f}. This option is equivalent[14] to shrinkage ttest[7], which was shown[7, 9] (see also Additional file1) to perform similarly to other variance stabilization derivatives of the ttest[4–6], and can be considered as their representative. The ordinary ttest is provided as a reference for the improvement of any ttest based method, which DFC test and CAT test clearly are. According to[7] the ordinary t statistic shows average though never optimal performance (regardless of the variance structure across features). Detailed comparison of AUCs for DFC test and a set of ttest based methods[4–7], as well as with fold change test and its ad hoc modification weighted average difference (WAD)[9] method is presented in the Additional file1.
AUC performance of DFC test, ttest, and shrinkage ttest
GEO data set  N _{s}  AUC for MAS5 preprocessed data  AUC for RMA preprocessed data  

ttest  ShrinkT^{a}  DFC  ttest  ShrinkT^{a}  DFC  
GSE8441  22  0.92912  0.94404  0.96996  0.91206  0.92842  0.96812 
GSE9499  22  0.96425  0.98255  0.98529  0.94735  0.97241  0.9718 
GSE2639  14  0.99782  0.99838  0.9987  0.99851  0.99784  0.99896 
GSE2638  7  0.79197  0.83621  0.86199  0.75527  0.82421  0.83175 
GSE3860  18  0.98986  0.99581  0.99742  0.98647  0.99246  0.99568 
GSE6344  20  0.97165  0.98078  0.98854  0.97586  0.98216  0.9889 
GSE7765  6  0.96323  0.97846  0.98564  0.96267  0.98146  0.98939 
GSE6740_1  20  0.99491  0.99676  0.99701  0.9972  0.99803  0.99803 
GSE6740_2  20  0.99115  0.99313  0.99283  0.97599  0.98248  0.98487 
GSE6011  37  0.86072  0.8674  0.90942  0.97544  0.98126  0.97892 
GSE2531  7  0.91614  0.94288  0.9379  0.93889  0.94368  0.94107 
Average^{b}  0.9718  0.9812  0.9857  0.9745  0.9815  0.9861 
Significance of differences in AUC
Test  MAS5  RMA  

DFC – ttest  DFC – CAT  DFC – ttest  DFC – CAT  
Wilcoxon on AUC  0.0005  0.0122  0.0005  0.0322 
ttest on 0.5⋅ln(AUC/(1AUC))  3e5  0.0017  3e4  0.0103 
One of the most important characteristics of the method is its ability to find DEGs independently of the preprocessing method applied to data. This should be evident from AUC as an overall characteristic of the test’s performance. Calculation of correlation coefficients between (logit transformed) AUCs for MAS5 and RMA preprocessed data (see Table A4 in the Additional file1) showed that the DFC test has the highest correlation between AUCs (ρ_{DFC} = 0.92), although its prevalence is not high enough to make it significantly different from other tests (ρ_{ttest} = 0.88 and ρ_{shrinkT} = 0.87), with differences in the correlation coefficients having pvalues above 0.3 (see also Additional file1 for broader range of comparisons).
The behaviour of the DFC test ROC and SPA curves displayed in Figures4 and5 agrees with what one would expect from a test performing better than the standard ttest on a reasonably sized (more than 10 samples) data set with ~ 100 differentially expressed features. When a high fraction of features, ν > 0.5, is taken as differentially expressed the difference between the DFC test and ttest is minimal, as both tests remove the most easily detectable, nonexpressed features. When a very small fraction of features ν ~ 1/N_{p} is taken as differentially expressed, resulting in only few features selected, the difference between the DFC test and ttest will be small again, as the differential expression of the few features should be very strong and can be effectively selected by ttest alone. One can expect an improvement of DFC over ttest when dealing with an intermediate range (20).
Improvement of the DFC test over the CATtest is in a narrower region. This can be clearly seen from Figure7, where the improvement in the partial area under ROC curve is significant for ν > 0.0015 only. It decreases from ~30 Ã· 50% to 10% when ν changes from 0.0015 to 0.01 and then gradually to ~ 1% at ν =1.
Youden Index YI , CI – 80% confidence interval for YI and ν _{ max } for DFC, CAT and ttest
MAS5  RMA  

ttest  CATtest  DFCtest  ttest  CATtest  DFCtest  
YI  0.77  0.83  0.84  0.77  0.80  0.83 
CI  [0.72,0.85]  [0.78, 0.88]  [0.79, 0.90]  [0.72, 0.85]  [0.76, 0.87]  [0.78, 0.90] 
ν_{max}  0.12  0.08  0.04  0.18  0.12  0.09 
Discussion
We have proposed a new method for removing nondifferentially expressed features and ranking differentially expressed features from gene expression data.
It was designed to work with expression data from microarrays containing large number of features ( N_{p} > 10^{4}), allowing one to analyse the distribution of all features on a microarray mapped to a three dimensional space composed of average difference of feature expression (logarithm of fold change), total variance and average expression level. A simple approach was introduced to define the expression dependent null features distribution and to estimate null features expression dependent average variance (9) and variance of logFC (12). These dependences are incorporated into the DFC test score Z_{ d } (5) for individual feature, which in this way explicitly takes into account information about presence of other features and can be used for accurate feature ranking.
The definition of the score Z_{ d } (5) is similar to moderated tstatistics, used in a series of papers on variance stabilization ([1, 7] and references sited therein), but principally differs from them in that the variance stabilization is defined through the variance of null features logFC distribution (12) and to a limited extent through the features’ internal variance.
When α = 1, the method is reduced to selection of features by ttest with threshold p_{ Th } = 2β_{ Th } (combined with variance filter), when β_{ Th } = 0.5 the method is reduced to selection based on the ‘Unusual Ratio’ variant of fold change method[2] with an internal definition of null features distribution. Once the selection criteria (α, β_{ Th }) are applied and the set of unexpressed features removed, ranking of differentially expressed features can be performed by the DFC score (5).
Standard approaches for multiple test correction[1, 2, 18] (and references therein) do not take into account expression dependence of the threshold (22). This problem will be considered in a separate publication. Here we note only that multiplicity correction affects only the arbitrary threshold choice and does not change the ranking of features[1]. Ranking of features with score (5) should be complemented with functional analysis (see, e.g. [1, chapter 5]) for final reduction of the number of false positives based on biological grounds.
The definition of the Type II error (17) has some similarity with recentered t statistic[40], but differs from the TREAT method in the way how threshold is defined. In ref.[40] “a prespecified threshold (τ) for the logfoldchange below which differential expression is not of material interest”[34] is introduced in order to address the thresholded null hypethesis H_{0}: d ≤ τ against alternative H_{1}: d > τ. The relevance of particular choice (τ=log_{2}(1.1), or τ=log_{2}(1.5) or τ=log_{2}(2) were used in[40] for three data sets) to particular dataset actually has to be independently verified, while in our approach the threshold (13) is 1) expression dependent and 2) is defined through the significance parameter α and it fully reflects properties of particular experiment. Ranking of features in[40] is performed using TREAT test pvalue, which is equivalent to 2β (17) but with replacement of ∆_{1Th}(α μ) by an arbitrary threshold τ . Parameter β (conditional on the value of α (or τ according to definition in[40])) is good for defining the threshold (16) above which features differential expression can be considered as reliably detected, but we believe is not well suited for ranking of features (see also[41] for a discussion of fold change and pvalue cutoffs). The best parameter for this purpose is signaltonoise ratio Z_{d} (5) and as it is shown in the paper and Additional file1 it outperforms ranking by moderated t test statistics and fold change based methods.
The performance of the DFC test was verified using 11 real experimental data sets, with DEGs independently verified by RTPCR. Their selection was based on the requirement of having in each set sufficiently large number of verified DEGs to build AUC. The total number of verified DEGs in these data sets was 284. We demonstrated that the DFC test is significantly better than the ttest in terms of the total and partial area under receiver operating curves. The improvement was dramatic (on average > 30%) in the most important (for FF and FFPE sample sets) range of the number of selected features K < 1000.
Some improvement was obtained in comparison with shrinkage ttest[7, 14], which can be considered as one of the best variance stabilizing methods, although improvement in partial area under ROC curve was within confidence limits (for 0.1 confidence level) for a number of selected features below ~30. Variance stabilization is very important for small data sets, although, as comparison shows, even for medium range data sets of 10 Ã· 30 samples, improvement can be significant. Taking into account that the DFC test was not optimized for variance stabilization (FFPE sample sets are seldom small), its performance can potentially benefit from the inclusion of expression dependent stabilization of variance.
Analysis of correlation coefficients between AUCs for MAS5 and RMA preprocessed data showed that DFC method works equally well with both methods. Correlation is very high (ρ_{DFC} = 0.92) and is higher (though not significantly) than for the other tests considered. This demonstrates that the DFC method does accurately take into account expression dependence of fold change and total variance, which are very much different in MAS5 and RMA preprocessed data, see, for example, Figure1 for variance dependences.
We already mentioned above that our comparison was limited by only tests that take into account feature’s variance (which is very important for FFPE datasets as they have high technical variance[10, 11]). The fold change test has no associated value that can indicate the level of confidence in the designation of feature as DE. Its performance depends on features variances which can be very different for different preprocessing methods applied to data[42], see for example Figure1 for comparison of MAS5 and RMA preprocessed data. Fold change test was shown[7] to be good only if features variances are all fairly similar[7]. Basing on this observation and taking into account that features variances are fairly similar for RMA preprocessed data in the high expression range (e.g., 9 – 12 on Figure1) and decrease with expression for MAS5 preprocessed data (e.g., for expressions in the range 6 – 12 on Figure1) one can expect that fold change test should perform well on RMA preprocessed data when a small number of features is looked after and fail on MAS5 preprocessed data. On the contrary, the WAD method[9] should perform well on the data with variances inversely proportional to the expression. Therefore it should work well for MAS5 preprocessed data, and fail on RMA preprocessed data. This corroborates with findings in[9] (see also Additional file1). Nevertheless, when the set sizes and number of independently verified features are restricted to be reasonable, N_{s} and N_{PC} > 10, the DFC test and moderated t tests[4–7] perform better than either of them (see Additional file1).
The independence of fold change test on features variances triggered researchers to look for combined approaches – to require that DE features satisfy both pvalue and fold change criteria simultaneously[40]. Here the question arises as to how to combine these two tests – it was shown recently[41] that the cutoffs can significantly alter microarray interpretations. DFC test is free from these shortcomings as the ranking of features is performed using the signaltonoise ratio (5) and the threshold (16) is defined by expression dependent properties of particular experiment and only removes unreliable features. No artificial fold change thresholds are introduced.
Summarizing discussion we can say that DFC method was developed and shown to work with reasonable number of samples N_{s} >> 1, pick up DEGs equally well at any expression level and is not bounded to specific preprocessing method.
Conclusions
We have proposed a new method, called distributional fold change test for removing nondifferentially expressed genes, and ranking differentially expressed genes from gene expression data. The method was designed to work with data sets of FFPE samples profiled on microarrays, containing large number of genes (> 10^{4}) and to accurately select and rank differentially expressed genes, taking into account their expression level.
The method is based on analysis of the distribution of all genes on a microarray mapped to a three dimensional feature space composed of average difference of gene expression (logarithm of fold change), total variance and average expression level. It allows for the imposition of a statistical (and expression dependent) threshold for the fold change and the introduction a score based on signaltonoise ratio which is used for accurate gene ranking.
Performance of the DFC test was verified using 11 real experimental data sets, with DEGs independently verified by RTPCR. We demonstrated that DFC test is significantly better than the ttest in terms of detecting DEGS as measured by the total and partial area under receiver operating curves. Its advantage is most prominent in the range of low fraction of DEGs, which is the most important range for the analysis of fresh frozen and especially FFPE sample sets. Given its excellent performance we believe that the DFC test should be routinely used for the analysis of microarray data.
Endnotes
^{a}Such studies benefit from the availability of complete (or near complete) clinical information on patient history, treatments and prognosis/survival.
^{b}Details of fitting procedure to get the dependence σ_{0}(μ) is provided in the Additional file1.
^{c}The condition log_{2}v_{T} > LV_{Th}(μ) is a convenient way if imposing expression dependent variance filter with threshold defined by properties of the null features distribution (see eq. 11). Its application is favourable in situations of imposing stringent selection criteria. When imposing mild selection criteria or looking for ranking of all features it shall be switched off (see also endnote e).
^{d}These DEGs may comprise only a portion of true DEGs – not all DEGs can be physically checked by RTPCR due to limitations of the method – but nevertheless allow a comparative analysis of the DFC test’s performance compared to the reference tests.
^{e}For two data sets, GSE6740_2 (MAS5 preprocessing) and GSE9499 (RMA preprocessing), we had to lift the variance filter in order to calculate the AUC.
^{f}This option was chosen because, for extremely highdimensional data, estimating correlation is very difficult and in such instances it is recommended to conduct diagonal analysis[15].
Abbreviations
 AUC:

Area under ROC curve
 CAT:

Correlation adjusted t (test)
 CAT(diag):

CAT –test with option ‘diagonal’
 DE:

Differentially expressed
 DEG:

Differentially expressed gene
 DFC:

Distributional fold change (test)
 EE:

Equally expressed
 FC:

Fold change
 FF:

Fresh frozen
 FFPE:

Formalinfixed and paraffinembedded
 FPR:

False positive rate
 logFC:

Logarithm to base 2 of fold change
 LTA:

Logit transformed AUC, 0.5⋅ln(AUC/(1AUC)
 MAS5:

(Affymetrix) MicroArray Suite version 5
 RMA:

Robust multichip average
 ROC:

Receiver operating characteristic
 RTPCR:

Realtime polymerase chain reaction
 ShrinkT:

Shrinkage ttest, same as CAT(diag)
 SPA:

Standardized partial area under ROC curve
 TPR:

True positive rate
 WAD:

Weighted average difference
 YI:

Youden Index.
Declarations
Acknowledgements
This research was conducted as a part of the Almac Diagnostics company program for developing methods specifically applicable for expression analysis of RNA extracted from FFPE samples. It was supported by the Invest Northern Ireland grant 1009/101038722 and partly by the European Sustainable Competitiveness Programme 2007–2013 under the European Regional Development Fund. The authors gratefully acknowledge Vitali Proutski for continuous support during this work and Miika Ahdesmäki for providing the Matlab version of shrinkage CAT score package. Discussions with colleagues Steve Deharo, Gera Jellema, Eamonn O’Brien, Vitali Proutski, and others are highly appreciated. The authors are thankful to Miika Ahdesmäki and Timothy Davison for their suggestions for improvement of the manuscript content. Timothy Davison also made contribution to improving the language of the manuscript.
Authors’ Affiliations
References
 Göhlmann H, Talloen W: Gene Expression Studies Using Affymetrix Microarrays. Boca Raton: CRC Press, 2009.Google Scholar
 Zhang A: Advanced analysis of gene expression data. Singapore: World Scientific, 2006.View ArticleGoogle Scholar
 Kim SY, Lee JW, Sohn IS: Comparison of various statistical methods for identifying differential gene expression in replicated microarray data. Stat Methods Med Research. 2006, 15: 320. 10.1191/0962280206sm423oaView ArticleGoogle Scholar
 Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3 (1): Article 3.Google Scholar
 Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 51165121. 10.1073/pnas.091062498PubMedPubMed CentralView ArticleGoogle Scholar
 Sartor MA, Tomlinson CR, Wesselkamper SC, Sivaganesan S, Leikauf GD, Medvedovic M: Intensitybased hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics. 2006, 7: 538. 10.1186/147121057538PubMedPubMed CentralView ArticleGoogle Scholar
 OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distribution free shrinkage approach. Statist Appl Genet Mol Biol. 2007, 6: 9.Google Scholar
 Hu J, Xu J: Density based pruning for identification of differentially expressed genes from microarray data. BMC Genomics. 2010, 11 (Suppl 2): S3. 10.1186/1471216411S2S3PubMedPubMed CentralView ArticleGoogle Scholar
 Kadota K, Nakai Y, Shimizu K: A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithm Mol Biol. 2008, 3: 810.1186/1748718838. 10.1186/1748718838View ArticleGoogle Scholar
 Farragher SM, Tanney A, Kennedy RD, Harkin PD: RNA expression analysis from formalin fixed paraffin embedded tissues. Histochem Cell Biol. 2008, 130: 435445. 10.1007/s0041800804797PubMedView ArticleGoogle Scholar
 Abdueva D, Wing M, Schaub B, Triche T, Davicioni E: Quantitative expression profiling in formalinfixed paraffinembedded samples by affymetrix microarrays. J Mol Diagn. 2010, 12: 40917. 10.2353/jmoldx.2010.090155PubMedPubMed CentralView ArticleGoogle Scholar
 Kennedy RD, Bylesjo M, Kerr P, Davison T, Black JM, Kay EW, Holt RJ, Proutski V, Ahdesmaki M, Farztdinov V, Goffard N, Hey P, McDyer F, Mulligan K, Mussen J, O'Brien E, Oliver G, Walker SM, Mulligan JM, Wilson C, Winter A, O'Donoghue D, Mulcahy H, O'Sullivan J, Sheahan K, Hyland J, Dhir R, Bathe OF, Winqvist O, Manne U: Development and independent validation of a prognostic assay for stage II colon cancer using formalinfixed paraffinembedded tissue. J Clin Oncol. 2011, 29: 46204626. 10.1200/JCO.2011.35.4498PubMedView ArticleGoogle Scholar
 Mittempergher L, de Ronde JJ, Nieuwland M, Kerkhoven RM, Simon I: Gene expression profiles from formalin fixed paraffin embedded breast cancer tissue are largely comparable to fresh frozen matched tissue. PLoS One. 2011, 6 (2): e17163. 10.1371/journal.pone.0017163PubMedPubMed CentralView ArticleGoogle Scholar
 Zuber V, Strimmer K: Gene ranking and biomarker discovery under correlation. Bioinformatics. 2009, 25: 27002707. 10.1093/bioinformatics/btp460PubMedView ArticleGoogle Scholar
 Ahdesmäki M, Strimmer K: Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann Appl Stat. 2010, 4: 503519. 10.1214/09AOAS277View ArticleGoogle Scholar
 Klopfleisch R, Weiss AT, Gruber AD: Excavation of a buried treasure–DNA, mRNA, miRNA and protein analysis in formalin fixed, paraffin embedded tissues. Histol Histopathol. 2011, 26 (6): 797810.PubMedGoogle Scholar
 , : Technical Note: Design and Performance of the GeneChip Human Genome U133 Plus 2.0 and Human Genome U133A Plus 2.0 Arrays. 2003, Affymetrix, Inc. Technical Note: GeneChip® Expression Platform: Comparison, Evolution, and Performance, 2004.http://media.affymetrix.com/support/technical/technotes/expression_comparison_technote.pdf, Affymetrix, Inc. Technical Note: GeneChip® Expression Platform: Comparison, Evolution, and Performance, 2004.Google Scholar
 Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning; Data Mining, Inference and Prediction. New York: Springer, 2nd, 2009.Google Scholar
 Braun M, Menon R, Nikolov P, Kirsten R, Petersen K, Schilling D, Schott C, Gündisch S, Fend F, Becker KF, Perner S: The HOPE fixation technique–a promising alternative to common prostate cancer biobanking approaches. BMC Cancer. 2011, 11: 511. 10.1186/1471240711511PubMedPubMed CentralView ArticleGoogle Scholar
 Klopfleisch R, von Deetzen M, Weiss AT, Weigner J, Weigner F, Plendl J, Gruber AD: Weigners fixativean alternative to formalin fixation for histology with improved preservation of nucleic acids. Vet Pathol. Apr 26. [Epub ahead of print], 2012.Google Scholar
 Sawilowsky SS: Fermat, Schubert, Einstein, and Behrens–Fisher: the probable difference between two means when σ1 ≠ σ2. Journal Mod App Stat Meth. 2002, 1: 461472.Google Scholar
 Krzanowski WJ, Hand DJ: ROC curves for continuous data. [Monographs on statistics and applied probability, vol 111], Boca Raton: CRC Press, 2009.View ArticleGoogle Scholar
 Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett. 2006, 27: 861874. 10.1016/j.patrec.2005.10.010View ArticleGoogle Scholar
 ThalackerMercer AE, Fleet JC, Craig BA, Carnell NS: Inadequate protein intake affects skeletal muscle transcript profiles in older humans. Am J Clin Nutr. 2007, 85: 134452.PubMedPubMed CentralGoogle Scholar
 Jin B, Tao Q, Peng J, Soo HM: DNA methyltransferase 3B (DNMT3B) mutations in ICF syndrome lead to altered epigenetic modifications and aberrant expression of genes regulating development, neurogenesis and immune function. Hum Mol Genet. 2008, 17: 690709.PubMedView ArticleGoogle Scholar
 Viemann D, Goebeler M, Schmid S, Nordhues U: TNF induces distinct gene expression programs in microvascular and macrovascular human endothelial cells. J Leukoc Biol. 2006, 80: 17485. 10.1189/jlb.0905530PubMedView ArticleGoogle Scholar
 Csoka AB, English SB, Simkevich CP, Ginzinger DG: Genomescale expression profiling of HutchinsonGilford progeria syndrome reveals widespread transcriptional misregulation leading to mesodermal/mesenchymal defects and accelerated atherosclerosis. Aging Cell. 2004, 3: 23543. 10.1111/j.14749728.2004.00105.xPubMedView ArticleGoogle Scholar
 Gumz ML, Zou H, Kreinest PA, Childs AC: Secreted frizzledrelated protein 1 loss contributes to tumor phenotype of clear cell renal cell carcinoma. Clin Cancer Res. 2007, 13: 47409. 10.1158/10780432.CCR070143PubMedView ArticleGoogle Scholar
 Hsu EL, Yoon D, Choi HH, Wang F: A proposed mechanism for the protective effect of dioxin against breast cancer. Toxicol Sci. 2007, 98: 43644. 10.1093/toxsci/kfm125PubMedView ArticleGoogle Scholar
 Hyrcza MD, Kovacs C, Loutfy M, Halpenny R: Distinct transcriptional profiles in ex vivo CD4+ and CD8+ T cells are established early in human immunodeficiency virus type 1 infection and are characterized by a chronic interferon response as well as extensive transcriptional changes in CD8+ T cells. J Virol. 2007, 81: 347786. 10.1128/JVI.0155206PubMedPubMed CentralView ArticleGoogle Scholar
 Pescatori M, Broccolini A, Minetti C, Bertini E: Gene expression profiling in the early phases of DMD: a constant molecular signature characterizes DMD muscle from early postnatal life throughout disease progression. FASEB J. 2007, 21: 121026. 10.1096/fj.067285comPubMedView ArticleGoogle Scholar
 Burleigh DW, Kendziorski CM, Choi YJ, Grindle KM: Microarray analysis of BeWo and JEG3 trophoblast cell lines: identification of differentially expressed transcripts. Placenta. 2007, 28: 3839. 10.1016/j.placenta.2006.05.001PubMedView ArticleGoogle Scholar
 Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profilesdatabase and tools update. Nucleic Acids Res. 2007, 35 (Database issue): D760D765.PubMedPubMed CentralView ArticleGoogle Scholar
 Bolstad B: Preprocessing and Normalization for Affymetrix GeneChip Expression Microarrays. Methods in microarray normalization. Edited by: Stafford P. Boca Raton: CRC Press, 2008, 4160.View ArticleGoogle Scholar
 Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics. 2002, 18: 15851592. 10.1093/bioinformatics/18.12.1585PubMedView ArticleGoogle Scholar
 White paper: Statistical Algorithms Description Document. 2002,http://www.affymetrix.com/support/technical/whitepapers/saddwhitepaper.pdfGoogle Scholar
 Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31 (4): e15. 10.1093/nar/gng015PubMedPubMed CentralView ArticleGoogle Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249264. 10.1093/biostatistics/4.2.249PubMedView ArticleGoogle Scholar
 Cramer JS: Logit Models from Economics and Other Fields. Cambridge: Cambridge University Press, 2003.View ArticleGoogle Scholar
 McCarthy DJ, Smyth GK: Testing significance relative to a foldchange threshold is a TREAT. Bioinformatics. 2009, 25 (6): 76571. 10.1093/bioinformatics/btp053PubMedPubMed CentralView ArticleGoogle Scholar
 Dalman MR, Deeter A, Nimishakavi G, Duan ZH: Fold change and pvalue cutoffs significantly alter microarray interpretations. BMC Bioinformatics. 2012, 13 (Suppl. 2): S11.PubMedPubMed CentralView ArticleGoogle Scholar
 Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003, 4: 210. 10.1186/gb200344210PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.