Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments

Background Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and to consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time, the fold change by itself provides valuable information, and it is important to find unambiguous ways of using this information in the treatment of expression data.
Results A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped to a three-dimensional feature space composed of average expression level, average difference of gene expression and total variance. The proposed method allows one to rank each feature based on the signal-to-noise ratio and to ascertain for each feature the confidence level and power for being differentially expressed. The performance of the new method was evaluated using the total and partial area under receiver operating characteristic (ROC) curves. It was tested on 11 data sets from the Gene Expression Omnibus database with independently verified differentially expressed genes and compared with the t-test and the shrinkage t-test. Overall, the DFC test performed the best: on average it had higher sensitivity and partial AUC, and its advantage was most prominent in the low range of differentially expressed features, typical for formalin-fixed paraffin-embedded sample sets.
Conclusions The distributional fold change test is an effective method for finding and ranking differentially expressed probesets on microarrays. The application of this test is advantageous to data sets using formalin-fixed paraffin-embedded samples or other systems where degradation effects diminish the applicability of correlation adjusted methods to the whole feature set.

These assumptions are the grounds for the applicability of fold change and variance filters. Using (A2.a) and notation we can rewrite eq. (A1) in integral form The relationship (A3) can be simplified if we find such a value LV Th at which and therefore, taking into account that π << 1, one can replace the expression in curly brackets by 1. Note that due to (A2.c) this can be done for LV << E [LV |d=0,µ]. To find a higher threshold, let us consider the conditional expectation of the logarithm of the total variance (ν T ) of the feature expression, which depends on the internal variance v I and the log fold change d (A4) One can show that i.e. for a given µ, the conditional expectation of the logarithm of the total variance has a minimum at d = 0. This property can be used to set up a threshold: The relationship can be further simplified for the range of |d| around d = 0, where It allows reducing eq. (A3) to We will suppose that approximation ( and can be used as a boundary to set up a variance filter. We supposed above that . Based on approximation (A8) and using the definition (A9), the dependence σ 0 (µ) can be estimated from where MAD stands for median absolute deviation.
Our aim was to develop an approach for finding the FC distribution of null features that was both simple and transparent, while recognizing that more elaborate approaches could be developed. The implementation of the algorithm is based on splitting the expression range into n (= 11) slices, finding LV Th (µ i ) and ln(σ 0 (µ i )) in each, and fitting third-order polynomial approximations, which are then used to interpolate the dependences LV Th (µ) and ln(σ 0 (µ)) over the whole range of expression. The number and width of the expression intervals and the polynomial order are tuneable parameters and can be adjusted if necessary (see Additional file 2 for details). The intervals were selected to provide essentially equal-sized feature subsets, as equally spaced quantiles of the cumulative distribution of the average expression µ. Only the lowest quantile was made larger, as there is little interest in features with very low expression, and the two highest quantiles were made progressively smaller in order to capture the proper dependence for highly expressed features and to catch the effects of expression saturation at high concentration levels. The third polynomial order was selected as the lowest one that provides a smooth curve encompassing the potentially different behaviour in three ranges: a low expression range dominated by noise, a medium expression range with strong signal, and a high expression range where saturation effects can be noticeable. An example of a complicated expression dependence can be found in Figure 1, where the conditional expectation E [LV | µ] is shown as a red line. Figure 2 shows an application of condition (A9) to remove unregulated features in data set GSE6011 using the implementation of the algorithm with default settings.
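The slicing and interpolation steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name fit_log_sigma0 is hypothetical, all slices are plain equal-sized quantiles (the enlarged lowest slice and the narrowed top slices are not reproduced), the variance filter of condition (A9) is omitted, and σ 0 in each slice is taken as 1.4826 × MAD of the log fold changes, the usual normal-consistency scaling, which may differ from the estimator used in the paper. The data are simulated.

```python
import numpy as np

def fit_log_sigma0(mu, d, n_slices=11, order=3):
    """Fit ln(sigma0) as a polynomial in mu from per-slice MAD estimates."""
    mu = np.asarray(mu, dtype=float)
    d = np.asarray(d, dtype=float)
    # Equal-sized slices: edges are equally spaced quantiles of mu.
    edges = np.quantile(mu, np.linspace(0.0, 1.0, n_slices + 1))
    # Slice index 0..n_slices-1 for every feature.
    idx = np.clip(np.searchsorted(edges, mu, side="right") - 1, 0, n_slices - 1)
    centers, log_sigma = [], []
    for i in range(n_slices):
        in_slice = idx == i
        if in_slice.sum() < 2:
            continue  # too few features to estimate a spread
        d_i = d[in_slice]
        mad = np.median(np.abs(d_i - np.median(d_i)))
        if mad <= 0:
            continue  # the logarithm is undefined for a zero spread
        centers.append(mu[in_slice].mean())
        # Assumption: normal-consistency factor 1.4826 converts MAD to sigma.
        log_sigma.append(np.log(1.4826 * mad))
    # Low-order polynomial used to interpolate over the whole expression range.
    return np.poly1d(np.polyfit(centers, log_sigma, order))

# Simulated null features whose noise level decreases with expression.
rng = np.random.default_rng(0)
mu = rng.uniform(4.0, 14.0, size=5000)
d = rng.normal(0.0, 0.1 + 0.5 * np.exp(-0.3 * mu))
log_sigma0 = fit_log_sigma0(mu, d)
sigma_hat = float(np.exp(log_sigma0(9.0)))  # interpolated sigma0 at mu = 9
```

Fitting ln(σ 0 ) rather than σ 0 keeps the interpolated scale positive everywhere, which is presumably one reason the logarithm is used in the text.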

Selection of sample sets for testing
To evaluate the performance of the DFC test we decided to use the 36 publicly available Homo sapiens microarray sample sets listed in [9], in which a portion of the discovered DEGs was experimentally validated by RT-PCR. This collection of sample sets was used to compare a large number of feature selection methods, which makes our comparison easier. However, not all of these sets are suitable for building representative ROC curves and for estimating the area and partial area under ROC curves; therefore a reduction of the sets is required.
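For reference, the area and partial area under a ROC curve can be computed from a ranking statistic and a list of validated labels as in the minimal Python sketch below. The function names and the toy scores are illustrative assumptions; the sketch uses the plain trapezoidal rule and does not reproduce any normalisation of the partial area that the paper may have applied.

```python
def roc_points(scores, is_deg):
    """ROC points (FPR, TPR) obtained by sweeping a threshold down the ranking."""
    pairs = sorted(zip(scores, is_deg), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Tied scores are processed together, giving one diagonal segment.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate TPR linearly at the FPR cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy ranking: six features, three of them validated DEGs (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = partial_auc(points)         # full area, 8/9 for this toy ranking
pauc = partial_auc(points, 0.5)   # area restricted to FPR <= 0.5, 7/18 here
```

With only a handful of validated features the ROC curve consists of a few coarse steps, which illustrates why sets with very few verified DEGs give unstable area estimates.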
The set selection procedure was applied as follows: from the 36 FF sample sets listed in [9] we selected all sets with a number of validated probesets N PC > 10. In this list, the sets with a small number of samples (N s ≤ 10) were overrepresented, comprising 50% against 33% in the full set. Therefore, 2 sets with a very small number of samples (N s = 6 (N PC = 12) and N s = 8 (N PC = 11)) were removed. To the remaining 8 sets, we added 3 sets -set with N s =37 . This selection procedure (see Table 1 for details and Figure A1) significantly improved the distribution over the number of verified DEGs, while the distribution over the sample sizes remains close to that of the whole set of 36 data sets: the Kolmogorov-Smirnov test [A1,A2] p-value for similarity is 0.96.
(shrinkT), same as CAT(diag) [14]. The AUC values for MAS5- and RMA-pre-processed data for the selected experimental data sets (described in Table 1 in the paper) are shown in Table A2.
One can see that, on average, the DFC test produces a higher AUC than any of the t-test based methods [4][5][6][7]. On MAS5 pre-processed data it is the best among all the tests compared, while for the RMA pre-processed data it is the second best after the AD method.
The observed AUC values are very close to 1 and, consequently, their distributions and the distributions of their differences cannot be well approximated by normal distributions. To obtain a more reliable estimate of the significance of the differences, we applied a paired-sample single-sided t-test to logit-transformed AUC values, LTA = 0.5⋅ln(AUC/(1-AUC)).
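A minimal Python sketch of this comparison is given below. It computes the LTA values and the paired-sample t statistic for the differences LTA i - LTA j; the function names are hypothetical and the AUC values are invented for illustration, they are not taken from Table A2.

```python
import math
from statistics import mean, stdev

def lta(auc):
    """Logit-transformed AUC as defined in the text: 0.5 * ln(AUC / (1 - AUC))."""
    return 0.5 * math.log(auc / (1.0 - auc))

def paired_t_statistic(auc_a, auc_b):
    """Paired-sample t statistic for LTA(a) - LTA(b); returns (t, degrees of freedom)."""
    diffs = [lta(a) - lta(b) for a, b in zip(auc_a, auc_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative AUC values for two methods on the same four data sets.
auc_method_a = [0.995, 0.990, 0.985, 0.998]
auc_method_b = [0.990, 0.985, 0.970, 0.995]
t_stat, dof = paired_t_statistic(auc_method_a, auc_method_b)
```

The single-sided p-value is the upper-tail probability of Student's t distribution with dof degrees of freedom evaluated at t_stat; if SciPy is available it can be obtained, for example, with scipy.stats.t.sf(t_stat, dof).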
The logit transformation [39] maps the interval (0,1) onto (-∞, +∞) and makes the transformed variables more normally distributed, so that the t-test is more applicable. The differences AUC or the difference is very small when compared to any of the moderated t-test methods [4][5][6][7]. There is only one data set (the first set in the list, with N s = 22; N PC = 9) for which the WAD method is dramatically better than all other methods on MAS5 pre-processed data and the AD method is dramatically better than all others on RMA pre-processed data.
The p-values for the significance of differences in LTAs, as measured by the paired-sample single-sided t-test t(LTA i -LTA j ), are presented in Table A3. For MAS5 pre-processed data the DFC test is significantly better (at a significance level below 0.05) than any of the tests except WAD. For RMA pre-processed data the DFC test is significantly better than any of the t-test based methods and performs as well as AD. Although the WAD test is second only to the t-test in terms of poor performance on RMA pre-processed data, the t-test did not show a significant difference because of the very large variance in the WAD data (see Figure A2).

Table A3 -Significance of differences in AUC.
Paired-sample single-sided t-test p-values calculated for LTA = 0.5×ln(AUC/(1-AUC)). Notations are the same as in Table A2.
One of the most important characteristics of the method is its ability to find DEGs independently of the pre-processing method applied to the data. This should be evident from the AUC as an overall characteristic of the test's performance. Calculation of correlation coefficients between logit-transformed AUCs for MAS5 and RMA pre-processed data (see Table A4) showed that the DFC test has the highest correlation between AUCs, ρ DFC = 0.92, although its advantage is not large enough to make it significantly different from the other t-test based tests. The difference in correlation coefficients between DFC and the AD and WAD tests can be accepted as significant, but only at the 0.1 significance level.
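The consistency measure used here can be illustrated with a short Python sketch that computes the correlation between logit-transformed AUCs obtained with two pre-processing methods. The Pearson coefficient is assumed, since the text does not name the type of correlation; the function names are hypothetical and the AUC values are invented, they are not the values behind Table A4.

```python
import math

def lta(auc):
    """Logit-transformed AUC: 0.5 * ln(AUC / (1 - AUC))."""
    return 0.5 * math.log(auc / (1.0 - auc))

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative AUCs of one test on the same data sets after two pre-processings.
auc_mas5 = [0.995, 0.980, 0.990, 0.960, 0.999]
auc_rma = [0.993, 0.985, 0.991, 0.970, 0.998]
rho = pearson([lta(a) for a in auc_mas5], [lta(a) for a in auc_rma])
```

A coefficient close to 1 indicates that the test ranks the data sets in the same way whichever pre-processing is used, which is the sense in which the text calls a method independent of pre-processing.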
The behaviour of the fold change methods on differently pre-processed data is very inconsistent: the AD test performs the poorest on MAS5 pre-processed data, while WAD is the second poorest (after the t-test) on RMA pre-processed data. Both methods have the lowest correlation between AUC values obtained on MAS5 and RMA pre-processed data. This makes their application rather limited even though they can potentially achieve very good performance, as it will always be tied to a particular choice of pre-processing method. From Figure A2 it is seen that the good performance of the WAD method on MAS5 data and of the AD method on RMA data is due to one data set only (the first set in the list, with N s = 22; N PC = 9), which has a small number of verified DEGs.
We conclude that the DFC test was consistently the best, independently of the pre-processing method applied to the data, and performed as well as WAD on MAS5 pre-processed data and as well as AD on RMA pre-processed data. This finding agrees very well with the results of [9] where, using the large collection of 36 data sets (though biased towards small set sizes and/or small numbers of verified DEGs), it was found that the WAD test performed the best on MAS5 pre-processed data and AD on RMA pre-processed data.
We believe that the very good performance of the WAD and AD tests (apart from being a consequence of the dependence of variance on expression under a particular pre-processing, mentioned in the Discussion) is a consequence of the bias of the testing data sets towards small set sizes and/or small numbers of verified DEGs. To check this we narrowed the selection of sets to only those with a large sample size, N s > 10.
The results presented in the next section show that both the WAD and AD tests are behind DFC and the moderated t-test type methods [4][5][6][7], independent of the pre-processing method applied.

Large data sets
The sample set selection procedure was as follows: from the 36 FF sample sets listed in [9] we selected all sets with a number of validated probesets N PC > 10 and a number of samples N s > 10. This resulted in 5 sets (see Figure A1), to which we added one set with N s = 37, lying on the selection boundary (N PC = 10). The resulting sample is presented in Table A5. Comparison of AUCs revealed that the DFC test has the highest average AUC among the methods compared (see Table A6). Both the WAD and AD tests are behind DFC and the moderated t-test type methods [4][5][6][7], independent of the pre-processing method applied.
The advantage of the DFC test was evaluated with the paired-sample single-sided t-test t(LTA i -LTA j ), and the results are presented in Table A7. For MAS5 pre-processed data the DFC test is significantly better (at a significance level below 0.05) than any of the tests except WAD. For RMA pre-processed data the higher performance of the DFC test is much less pronounced.
Note also that there is no difference in performance among the moderated t-tests [4][5][6][7]: all produce the same average AUC, 0.989 for MAS5 pre-processed data and 0.991 for RMA pre-processed data (see Table A6).