We have proposed a new method for removing non-differentially expressed features and ranking differentially expressed features from gene expression data.

It was designed to work with expression data from microarrays containing a large number of features (*N*_{p} > 10^{4}), allowing one to analyse the distribution of all features on a microarray mapped to a three-dimensional space composed of the average difference of feature expression (logarithm of fold change), the total variance and the average expression level. A simple approach was introduced to define the expression-dependent null features distribution and to estimate the null features' expression-dependent average variance (9) and variance of log*FC* (12). These dependences are incorporated into the DFC test score *Z*_{d} (5) for an individual feature, which in this way explicitly takes into account information about the presence of other features and can be used for accurate feature ranking.

The definition of the score *Z*_{d} (5) is similar to the moderated t-statistics used in a series of papers on variance stabilization ([1, 7] and references cited therein), but differs from them principally in that the variance stabilization is defined through the variance of the null features' log*FC* distribution (12) and only to a limited extent through the features' internal variance.
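To make the idea of an expression-dependent null distribution concrete, the Python sketch below estimates the variance of log*FC* as a function of average expression by binning, and forms a signal-to-noise score in the spirit of *Z*_{d}. The binning scheme, the robust variance estimate, and the exact form of the score are illustrative assumptions, not the paper's estimators (9), (12) and (5).

```python
import numpy as np

def binned_null_stats(mean_expr, log_fc, n_bins=20):
    """Estimate an expression-dependent variance of logFC for (assumed)
    null features by binning on average expression. Illustrative sketch:
    the paper's estimators (9) and (12) are defined differently."""
    edges = np.quantile(mean_expr, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, mean_expr, side="right") - 1,
                  0, n_bins - 1)
    var_null = np.empty(n_bins)
    for b in range(n_bins):
        in_bin = log_fc[idx == b]
        # robust spread (MAD-based) of logFC in the bin approximates
        # the null variance; resistant to a minority of regulated features
        mad = np.median(np.abs(in_bin - np.median(in_bin)))
        var_null[b] = (mad / 0.6745) ** 2
    return edges, var_null, idx

def dfc_like_score(log_fc, var_null_per_feature):
    # hypothetical signal-to-noise score in the spirit of Z_d (5):
    # the expression difference normalized by the null spread at the
    # feature's own expression level
    return log_fc / np.sqrt(var_null_per_feature)
```

Because the denominator is taken from the pooled null distribution at the feature's expression level rather than from the feature's own few replicates, the score is stabilized against spuriously small internal variances.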

The same dependences (9) and (12) were used to introduce a statistical (and expression-dependent) threshold for the fold change, based on specifying the power 1 − *β* at a given significance level *α*. This method has the advantage of being self-adjusting through the accurate estimation of the unregulated features' distribution *f*(*d*_{0}), and of taking into account the *f*(*d*|*μ*) distributions of regulated features, thus providing an option to impose power requirements. The two parameters, *α* and *β*, control the Type I and Type II errors and allow the threshold (16), below which features are considered to lack sufficient evidence to be called differentially expressed, to be tuned to the particular purposes of an experiment. One can show that all features passing the DFC test have (ordinary t-test) p-values below an expression-dependent threshold *p* ≤ *p*_{Th} (we use the notation *p*_{Th} to distinguish it from *α*), which includes a correction that depends on the properties of the unregulated features' distribution

$$p_{\mathit{Th}}\left(\alpha,\beta\mid\mu\right)=2\left\{1-T\left[\frac{\sigma_{0}\left(\mu\right)}{s\left(\mu\right)}\,\Phi^{-1}\!\left(1-\frac{\alpha}{2}\right)+T^{-1}\!\left(1-\beta_{\mathit{Th}},\,DF\right),\,DF\right]\right\}$$

(22)

When *α* = 1, the method reduces to selection of features by the t-test with threshold *p*_{Th} = 2*β*_{Th} (combined with a variance filter); when *β*_{Th} = 0.5, the method reduces to selection based on the 'Unusual Ratio' variant of the fold change method [2], with an internal definition of the null features distribution. Once the selection criteria (*α*, *β*_{Th}) are applied and the set of unexpressed features removed, ranking of differentially expressed features can be performed by the DFC score (5).
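The threshold (22) is straightforward to evaluate numerically. The sketch below assumes SciPy's normal and Student-t distributions, with σ_{0}(μ) and *s*(μ) supplied as scalars already evaluated at a given expression level μ; the two limiting cases quoted above serve as sanity checks.

```python
from scipy.stats import norm, t

def p_threshold(alpha, beta_th, sigma0, s, df):
    """Expression-dependent t-test p-value threshold, eq. (22).
    sigma0, s: null and observed standard deviations at the feature's
    expression level; df: degrees of freedom of the t-distribution."""
    arg = (sigma0 / s) * norm.ppf(1 - alpha / 2) + t.ppf(1 - beta_th, df)
    return 2 * (1 - t.cdf(arg, df))
```

At *α* = 1 the normal quantile vanishes and the function returns 2*β*_{Th}, the plain t-test limit; at *β*_{Th} = 0.5 the t-quantile vanishes, leaving the 'Unusual Ratio' limit driven by σ_{0}(μ)/*s*(μ) alone.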

Standard approaches to multiple-test correction [1, 2, 18] (and references therein) do not take into account the expression dependence of the threshold (22). This problem will be considered in a separate publication. Here we note only that the multiplicity correction affects only the arbitrary threshold choice and does not change the ranking of features [1]. Ranking of features by the score (5) should be complemented with functional analysis (see, e.g., [1], Chapter 5) for a final reduction of the number of false positives on biological grounds.
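The observation that multiplicity correction moves the cutoff but not the ranking can be seen directly in the standard Benjamini-Hochberg procedure: the adjusted p-values are a monotone transform of the raw ones. A minimal NumPy implementation:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values. The adjustment is monotone
    in the raw p-values, so FDR control changes the selection threshold
    but never reorders the features."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # raw BH values p_(i) * n / i on the sorted scale
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards, cap at 1
    adj = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    out = np.empty(n)
    out[order] = adj
    return out
```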

The definition of the Type II error (17) has some similarity with the re-centered t-statistic [40], but differs from the TREAT method in how the threshold is defined. In ref. [40], "a pre-specified threshold (τ) for the log-fold-change below which differential expression is not of material interest" [34] is introduced in order to address the thresholded null hypothesis *H*_{0}: |*d*| ≤ *τ* against the alternative *H*_{1}: |*d*| > *τ*. The relevance of a particular choice of *τ* (*τ* = log_{2}(1.1), *τ* = log_{2}(1.5) or *τ* = log_{2}(2) were used in [40] for three data sets) to a particular dataset has to be verified independently, whereas in our approach the threshold (13) is 1) expression dependent and 2) defined through the significance parameter *α*, so that it fully reflects the properties of the particular experiment. Ranking of features in [40] is performed using the TREAT test p-value, which is equivalent to 2*β* (17) but with ∆_{1Th}(*α*, *μ*) replaced by the arbitrary threshold *τ*. The parameter *β* (conditional on the value of *α*, or of *τ* according to the definition in [40]) is well suited for defining the threshold (16) above which differential expression can be considered reliably detected, but we believe it is not well suited for ranking features (see also [41] for a discussion of fold change and p-value cutoffs). The best parameter for this purpose is the signal-to-noise ratio *Z*_{d} (5); as shown in the paper and in Additional file 1, it outperforms ranking by moderated t-test statistics and by fold change based methods.

The performance of the DFC test was verified using 11 real experimental data sets, with DEGs independently verified by RT-PCR. These sets were selected on the requirement that each contain a sufficiently large number of verified DEGs to build AUCs; the total number of verified DEGs across the data sets was 284. We demonstrated that the DFC test is significantly better than the t-test in terms of the total and partial areas under the receiver operating characteristic (ROC) curves. The improvement was dramatic (on average > 30%) in the range most important for FF and FFPE sample sets, a number of selected features *K* < 1000.
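The partial-AUC comparison restricted to the top-ranked features can be sketched as follows. The function below computes the area under the ROC curve built from only the top *k* ranked features, taking RT-PCR-verified DEGs as ground truth; the exact partial-AUC definition used in the paper may differ, so this is an illustrative assumption.

```python
import numpy as np

def auc_at_k(scores, is_verified_deg, k=1000):
    """ROC AUC computed over the top-k ranked features only, with the
    partial curve normalized to its bounding box. 'is_verified_deg' is a
    0/1 array marking RT-PCR-confirmed DEGs (illustrative sketch)."""
    order = np.argsort(scores)[::-1][:k]          # rank by score, keep top k
    labels = np.asarray(is_verified_deg)[order]
    tp = np.cumsum(labels)                         # true positives so far
    fp = np.cumsum(1 - labels)                     # false positives so far
    if tp[-1] == 0 or fp[-1] == 0:
        return 0.0
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    # trapezoidal integration of TPR over FPR
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
```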

Some improvement was also obtained in comparison with the shrinkage t-test [7, 14], which can be considered one of the best variance stabilizing methods, although the improvement in partial area under the ROC curve was within the confidence limits (at the 0.1 confidence level) when the number of selected features was below ~30. Variance stabilization is very important for small data sets, although, as the comparison shows, the improvement can be significant even for medium-size data sets of 10 to 30 samples. Taking into account that the DFC test was not optimized for variance stabilization (FFPE sample sets are seldom small), its performance can potentially benefit from the inclusion of expression-dependent stabilization of variance.

Analysis of the correlation coefficients between AUCs for MAS5 and RMA pre-processed data showed that the DFC method works equally well with both pre-processing methods. The correlation is very high (*ρ*_{DFC} = 0.92) and higher (though not significantly) than for the other tests considered. This demonstrates that the DFC method accurately takes into account the expression dependence of fold change and total variance, which differ considerably between MAS5 and RMA pre-processed data; see, for example, Figure 1 for the variance dependences.

As already mentioned above, our comparison was limited to tests that take into account a feature's variance (which is very important for FFPE datasets, as they have high technical variance [10, 11]). The fold change test has no associated value that can indicate the level of confidence in designating a feature as DE. Its performance depends on the feature variances, which can differ widely between pre-processing methods applied to the data [42]; see, for example, Figure 1 for a comparison of MAS5 and RMA pre-processed data. The fold change test was shown [7] to be good only if the feature variances are all fairly similar. Based on this observation, and taking into account that feature variances are fairly similar for RMA pre-processed data in the high expression range (e.g., 9 to 12 in Figure 1) and decrease with expression for MAS5 pre-processed data (e.g., for expressions in the range 6 to 12 in Figure 1), one can expect the fold change test to perform well on RMA pre-processed data when only a small number of features is sought, and to fail on MAS5 pre-processed data. Conversely, the WAD method [9] should perform well on data with variances inversely proportional to expression; it should therefore work well for MAS5 pre-processed data and fail on RMA pre-processed data. This is corroborated by the findings in [9] (see also Additional file 1). Nevertheless, when the set sizes and the number of independently verified features are restricted to reasonable values, *N*_{s} and *N*_{PC} > 10, the DFC test and the moderated t-tests [4–7] perform better than either of them (see Additional file 1).
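For reference, the WAD statistic of [9], as we recall its published definition, weights the log fold change by the average expression rescaled to [0, 1], which up-weights highly expressed genes; consult [9] for the authoritative form.

```python
import numpy as np

def wad_scores(log_fc, mean_expr):
    """Weighted Average Difference (WAD) ranking statistic, per our
    reading of [9]: log fold change times a min-max rescaling of the
    average expression across all samples. Sketch for illustration."""
    w = (mean_expr - mean_expr.min()) / (mean_expr.max() - mean_expr.min())
    return log_fc * w
```

The weight is 1 for the most highly expressed gene and 0 for the least expressed, which is why the method favours data where variance falls with expression (MAS5) and is penalized where it does not (RMA).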

The independence of the fold change test from the feature variances has prompted researchers to look for combined approaches, requiring that DE features satisfy both p-value and fold change criteria simultaneously [40]. Here the question arises of how to combine the two tests; it was shown recently [41] that the choice of cutoffs can significantly alter microarray interpretations. The DFC test is free from these shortcomings: the ranking of features is performed using the signal-to-noise ratio (5), and the threshold (16) is defined by the expression-dependent properties of the particular experiment and only removes unreliable features. No artificial fold change thresholds are introduced.

Summarizing the discussion, we can say that the DFC method was developed and shown to work with a reasonable number of samples *N*_{s} >> 1, to pick up DEGs equally well at any expression level, and not to be bound to a specific pre-processing method.