Differential co-expression framework to quantify goodness of biclusters and compare biclustering algorithms
- Burton Kuan Hui Chia and
- R Krishna Murthy Karuturi
https://doi.org/10.1186/1748-7188-5-23
© Hui and Karuturi; licensee BioMed Central Ltd. 2010
- Received: 21 January 2010
- Accepted: 28 May 2010
- Published: 28 May 2010
Abstract
Background
Biclustering is an important analysis procedure for understanding biological mechanisms from microarray gene expression data. Several algorithms have been proposed to identify biclusters, but very little effort has been made to compare the performance of different algorithms on real datasets or to combine the resulting biclusters into one unified ranking.
Results
In this paper we propose a differential co-expression framework and a differential co-expression scoring function to objectively quantify the quality, or goodness, of a bicluster of genes, based on the observation that genes in a bicluster are co-expressed in the conditions belonging to the bicluster and not co-expressed in the other conditions. Furthermore, we propose a scoring function to stratify biclusters into three types of co-expression. We used the proposed scoring functions to understand the performance and behavior of four well-established biclustering algorithms on six real datasets from different domains by combining their output into one unified ranking.
Conclusions
The differential co-expression framework provides a quantitative and objective assessment of the goodness of biclusters of co-expressed genes and of the performance of biclustering algorithms in identifying co-expression biclusters. It also helps to combine the biclusters output by different algorithms into one unified ranking, i.e. meta-biclustering.
Keywords
- Strong Gene
- Unify Ranking
- Biclustering Algorithm
- Lung Dataset
- Iterative Signature Algorithm
Background
The advent of microarrays has facilitated the quantification of gene expression at genomic scale across large sets of conditions in a time- and cost-effective manner, resulting in a wealth of massive gene expression datasets. Appropriate analysis of these datasets leads to an understanding of the roles of various genes and pathways at genomic scale.
Illustrating difference between clustering and biclustering. Heatmaps (red for induction and green for repression) illustrating difference between clustering and biclustering (a) a cluster of genes, genes are co-expressed across most conditions; (b) a bicluster of genes, genes are co-expressed only on a subset of conditions (heatmap on the left) and the heatmap on the right shows no co-expression on the remaining conditions.
Biclustering plays an important role in microarray gene expression analysis. The expression of a cluster of genes may be modulated only in a small subset of conditions, revealing condition-dependent transcriptional co-regulation and potentially leading to an understanding of the underlying mechanisms. For example, in knock-out studies, certain groups of genes are activated or suppressed only in a small subset of knock-out conditions. Similarly, in cancer studies, owing to the heterogeneity of tumors, certain groups of genes involved in a given pathway may be co-expressed only in a subset of tumors. In traditional clustering, the genes co-expressed over all conditions dominate the analysis, and the genes co-expressed only in a small subset of conditions may not be detected.
Biclusters with comparable co-expression of the bicluster genes across non-bicluster conditions. Heatmaps (red for induction and green for repression, genes are indicated in rows and conditions are shown in columns) of biclusters with comparable co-expression of the bicluster genes across non-bicluster conditions. In each figure, the left heatmap shows expression of the bicluster genes (rows) in the bicluster conditions (columns) and the right heatmap shows expression of the bicluster genes in the remaining conditions. All of them were chosen from top 10 biclusters output by the respective algorithms, the rank is indicated in the parenthesis.
On real data, Prelic et al.'s [11] evaluation was based on the number of gene ontology (GO) terms enriched for the biclusters. It may not be a good measure for four reasons: (1) it depends solely on the genes in the biclusters and does not account for the conditions involved; (2) GO terms may be highly enriched even for normal clusters of genes, which may not lack co-expression in any subset of the conditions; (3) it does not distinguish good biclusters from traditional clusters; and (4) it may be subjective owing to the hierarchical structure of the GO.
Hence, it is important to develop an objective scoring function that works well on real data to assess the quality or goodness of biclusters and hence the reliability of the biclustering algorithms. Such a function will also help in combining the results of applying different biclustering algorithms to a dataset into a single unified ranking, i.e. a meta-biclustering, which has not been addressed so far. This would be of great help because different algorithms may behave differently on different datasets, and a unified ranking makes the best use of all of them.
In this paper we propose such a scoring function based on a differential co-expression framework similar to that proposed by Kostka and Spang [17]. In this framework, for a given bicluster, we fit two linear models for the expression of the bicluster genes: one for the conditions in the bicluster and one for the remaining (non-bicluster) conditions. The resulting models are used together to assess the goodness of the bicluster using our differential co-expression scoring function. Note that the aim of this paper is not to assess the efficiency of the biclustering algorithms in retrieving underlying biclusters in the data, but to assess how good the identified biclusters are and how to provide a good unified ranking of the biclusters output by them (a meta-biclustering algorithm). Using our scoring function we compare the performance of different biclustering algorithms on six real datasets.
Results
Differential co-expression framework for biclustering



1 ≤ i ≤ I; 1 ≤ j ≤ Jk; 1 ≤ k ≤ 2, where $\hat{\tau}_{ik}$, $\hat{\beta}_{jk}$ and $\hat{\mu}_k$ are the estimates of τik, βjk and μk, respectively.
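For orientation, a two-way additive model consistent with the index ranges and parameter names above can be sketched as follows. This is a reconstruction under the assumption of a standard ANOVA-style parameterisation with sum-to-zero constraints, not the typeset equation of the paper; the noise term $\varepsilon_{ijk}$ and the dot-averaged means $\bar{X}$ are symbols introduced here for illustration.

```latex
% Assumed model for condition group k (k = 1: bicluster conditions, k = 2: the rest)
X_{ijk} = \mu_k + \tau_{ik} + \beta_{jk} + \varepsilon_{ijk},
\qquad \sum_{i=1}^{I} \tau_{ik} = 0, \quad \sum_{j=1}^{J_k} \beta_{jk} = 0

% Least-squares estimates under these constraints
\hat{\mu}_k = \bar{X}_{\cdot\cdot k}, \qquad
\hat{\tau}_{ik} = \bar{X}_{i\cdot k} - \bar{X}_{\cdot\cdot k}, \qquad
\hat{\beta}_{jk} = \bar{X}_{\cdot jk} - \bar{X}_{\cdot\cdot k}
```

Under this reading, τik captures how gene i deviates from the group mean and βjk how condition j does, which matches the gene-effect and condition-effect vocabulary used for the co-expression types.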
Different types of co-expression. Heatmaps (red for induction and green for repression, genes are indicated in rows and conditions are shown in columns) illustrating 3 types of co-expression: (1) T-type, gene effects only; (2) B-type, condition effects only; and, (3) μ-type, gene and condition effects.
I(b) is the number of genes in b and Jk(b) is the number of conditions in Gk for b. A similar interpretation holds for the other variables.
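The fitting step can be illustrated with a minimal sketch, assuming a plain two-way additive model (overall mean, gene effects, condition effects) fitted by least squares to each condition group separately. The function names `fit_additive` and `fit_bicluster` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_additive(X):
    """Least-squares fit of X[i, j] ~ mu + tau[i] + beta[j] (assumed model form)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean()                       # overall mean of the group
    tau = X.mean(axis=1) - mu           # gene (row) effects, sum to zero
    beta = X.mean(axis=0) - mu          # condition (column) effects, sum to zero
    resid = X - (mu + tau[:, None] + beta[None, :])
    return mu, tau, beta, resid

def fit_bicluster(data, genes, conds):
    """Fit the bicluster genes on bicluster conditions (G1) and on the rest (G2)."""
    data = np.asarray(data, dtype=float)
    in_b = np.zeros(data.shape[1], dtype=bool)
    in_b[list(conds)] = True
    sub = data[list(genes), :]          # keep only the I(b) bicluster genes
    return fit_additive(sub[:, in_b]), fit_additive(sub[:, ~in_b])

# Toy data (invented): 20 genes x 12 conditions with gene effects planted
# in the first 5 genes over the first 4 conditions.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 12))
data[:5, :4] += np.arange(5)[:, None]
g1, g2 = fit_bicluster(data, genes=range(5), conds=range(4))
```

Here `g1` and `g2` each hold the estimated mean, gene effects, condition effects and residuals for one group, so the two fits can be compared side by side.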
Theorem: Tk and Bk are unbiased estimators of […] and […], respectively, under the assumption that the noise in Xijk follows N(0, […]).
As Ek is an unbiased estimator of the noise variance, […] is an unbiased estimator of βk. Similarly, […] is an unbiased estimator of Γk. ■
In the above proof, $\chi^2_n(c)$ is a non-central Chi-square distribution with n degrees of freedom and non-centrality parameter c; ⟨Z⟩ is the expectation of the random variable Z.
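As a reminder of why the non-central Chi-square distribution appears in unbiasedness arguments of this kind, the following standard identity may help. The normalisation of the non-centrality parameter (the sum of squared means, not half of it) is an assumption about the convention in use, and $Z_i$ and $m_i$ are illustrative symbols not taken from the text.

```latex
Z_i \sim N(m_i, 1) \ \text{independent}
\;\Longrightarrow\;
\sum_{i=1}^{n} Z_i^2 \sim \chi^2_n(c), \quad c = \sum_{i=1}^{n} m_i^2,
\qquad
\left\langle \chi^2_n(c) \right\rangle = n + c
```

The expectation splits into a noise part (n) and a signal part (c), so subtracting a suitably scaled unbiased estimate of the noise variance from a sum of squared estimated effects leaves only the signal part in expectation.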
Scoring goodness of biclusters
where 0 < a ≪ 1 is a small fudge factor that offsets large ratios arising from very small co-expression in both groups of a bicluster. A strongly positive SB(b) indicates strong co-expression in G1 and weaker or no co-expression in G2, and vice versa.
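The role of the fudge factor can be illustrated with a small sketch. The exact expression for SB(b) is not reproduced here, so the code assumes a hypothetical form: the log-ratio of the stronger of the gene-effect and condition-effect strengths in G1 to that in G2, each offset by a. The function name `differential_score`, the clipping at zero and the default `a=0.01` are illustrative assumptions.

```python
import math

def differential_score(t1, b1, t2, b2, a=0.01):
    """Hypothetical log-ratio score; NOT the paper's exact SB(b).

    t_k, b_k: gene-effect and condition-effect strengths in group k.
    a: small fudge factor (0 < a << 1) that damps ratios of two tiny values.
    """
    s1 = max(t1, b1, 0.0)   # strongest co-expression signal in G1 (clipped at 0)
    s2 = max(t2, b2, 0.0)   # strongest co-expression signal in G2 (clipped at 0)
    return math.log((s1 + a) / (s2 + a))

strong_in_g1 = differential_score(2.0, 0.1, 0.1, 0.05)   # positive
strong_in_g2 = differential_score(0.1, 0.05, 2.0, 0.1)   # negative
both_tiny = differential_score(1e-6, 0.0, 2e-6, 0.0)     # near zero despite 2x ratio
```

Without the offset, the last case would score log(1/2) even though neither group shows meaningful co-expression; with it, the score stays close to zero.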
Though we score a bicluster based on its differential co-expression, our quantification of differential co-expression by SB(b) differs from that used by Kostka and Spang, S(b) = log(E1(b)/E2(b)), and from their variance standardization approach, for two reasons: (1) S(b) accounts mainly for B-type co-expression; and (2) variance standardization does not account for different signal variances in the two groups.
Stratifying biclusters
where k = 1 if SB(b) > 0 and k = 2 if SB(b) < 0.
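The group selection rule and the stratification cut-offs can be sketched as follows. The cut-offs of ±1 are those quoted in the stratification figure captions. The function names, the handling of SB(b) = 0, and the assignment of boundary values to the μ-type are assumptions made for illustration, and the stratification score itself is taken as a given input.

```python
def select_group(sb):
    """Pick the condition group k whose model is stratified: 1 if SB(b) > 0, 2 if SB(b) < 0."""
    if sb > 0:
        return 1
    if sb < 0:
        return 2
    # SB(b) == 0 is not covered by the rule in the text; fail loudly (assumption).
    raise ValueError("SB(b) == 0: group is undefined")

def coexpression_type(ts, cutoff=1.0):
    """Map a stratification score to a co-expression type using the +/-1 cut-offs."""
    if ts > cutoff:
        return "T-type"    # gene effects dominate
    if ts < -cutoff:
        return "B-type"    # condition effects dominate
    return "mu-type"       # boundary values are assigned here by assumption

labels = [coexpression_type(ts) for ts in (-2.3, 0.4, 1.7)]
```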
Evaluating Biclustering Algorithms and Combining Bicluster Lists
Datasets used in the analysis
S. No | Dataset | Experiment | References | No. of Genes | No. of Samples |
---|---|---|---|---|---|
1 | Breast | Breast Cancer | Wang et al. [16] | 22283 | 286 |
2 | Liver | Liver Cancer | Chen et al. [15] | 10200 | 203 |
3 | Yeast | Knock Out in Yeast | Gasch et al. [20] | 2993 | 173 |
4 | Lymphoma | Lymphoma and Normal | Alizadeh et al. [21] | 4026 | 96 |
5 | Lung | Lung Cancer | Broët et al. [14] | 54837 | 79 |
6 | Path_Metabolic | Plant | Wille et al. [22] | 734 | 69 |
We have evaluated the biclustering algorithms based on four criteria: (1) number of biclusters found; (2) median number of conditions in the biclusters; (3) ranking of the biclusters generated by an algorithm in the combined ranking of all biclusters generated by all algorithms; and, (4) types of biclusters generated.
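The first three criteria can be computed from a pooled list of scored biclusters, as in the sketch below. It assumes that each bicluster already carries a goodness score, that higher scores rank first, and that biclusters with fewer than 5 conditions are dropped before ranking. The record layout, the function names and the toy scores are invented for illustration.

```python
from collections import Counter
from statistics import median

def combined_ranking(biclusters, min_conditions=5):
    """Pool biclusters from all algorithms and sort by score, best first."""
    kept = [b for b in biclusters if len(b["conditions"]) >= min_conditions]
    return sorted(kept, key=lambda b: b["score"], reverse=True)

def summarize(ranked, top_n=100):
    """Per-algorithm counts, median condition counts and percent share of the top ranks."""
    counts = Counter(b["algorithm"] for b in ranked)
    medians = {
        alg: median(len(b["conditions"]) for b in ranked if b["algorithm"] == alg)
        for alg in counts
    }
    top = Counter(b["algorithm"] for b in ranked[:top_n])
    n_top = min(top_n, len(ranked))
    share = {alg: 100.0 * top[alg] / n_top for alg in counts}
    return counts, medians, share

# Toy input: scores and condition sets are made up, not results from the paper.
pooled = [
    {"algorithm": "ISA", "conditions": set(range(8)), "score": 2.1},
    {"algorithm": "ISA", "conditions": set(range(6)), "score": 1.4},
    {"algorithm": "SAMBA", "conditions": set(range(10)), "score": 1.8},
    {"algorithm": "CC", "conditions": set(range(3)), "score": 0.9},   # dropped: < 5 conditions
    {"algorithm": "OPSM", "conditions": set(range(5)), "score": -0.2},
]
ranked = combined_ranking(pooled)
counts, medians, share = summarize(ranked, top_n=2)
```

Because every bicluster is scored on the same scale regardless of which algorithm produced it, the same sorted list serves both as the unified ranking and as the basis for comparing algorithms.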
Number of biclusters. The number of biclusters (y-axis) output by different biclustering algorithms for 6 different datasets. The broken curve shows the number of conditions in each dataset.
Median number of conditions. The median number of conditions (y-axis) in the biclusters output by different biclustering algorithms for 6 different datasets after filtering out small condition sized (<5) biclusters.
Rank distribution of biclusters. Rank distribution of the biclusters from each algorithm in a combined ranking on different datasets.
Rank composition of top 100 biclusters. Rank composition of the top 100 biclusters obtained by combined ranking of biclusters from each algorithm on 6 different datasets. The rank is shown on x-axis and the percent contribution of each algorithm is shown on y-axis.
Stratification of biclusters. Cumulative distribution of TS1(b) of the biclusters from each algorithm on 6 datasets. Highly negative TS1(b) (< -1) shows B-type co-expression, highly positive TS1(b) (> 1) shows T-type co-expression and TS1(b) close to zero (-1 < TS1(b) < 1) indicates μ-type co-expression.
Stratification of top 100 biclusters. Cumulative distribution of TS1(b) of the biclusters from each algorithm on 6 datasets contributing to the top 100 biclusters from the combined ranking. TS1(b) < -1 shows B-type co-expression, TS1(b) > 1 shows T-type co-expression and -1 < TS1(b) < 1 indicates μ-type co-expression.
Discussion and Conclusions
Our study on real data has shown that evaluating biclustering algorithms on idealized simulated data may not reflect their actual performance on real data, owing to the complexity of real data. We therefore proposed a conceptually and statistically sound framework, based on the concept of differential co-expression, to objectively compare the performance of biclustering algorithms on real data and to combine their output into a single unified ranking. The framework rests on the observation that a bicluster is revealed because the grouping of the bicluster genes is strong only within the bicluster conditions. As several biclustering algorithms do not consider the effect of non-bicluster conditions when scoring and discovering biclusters, we found several biclusters whose genes also group strongly on the non-bicluster conditions. Such gene sets do not qualify as biclusters, because the genes could be grouped nearly as strongly using all conditions together, i.e. the co-expression is more of a global effect. The strength of grouping can be represented by condition and gene effects, and the differential of these effects between bicluster and non-bicluster conditions for the bicluster genes indicates true biclusters. Unlike a typical differential co-expression study, we considered three types of co-expression, and the ranking is based on the model coefficients rather than the model errors so as to reflect the different types of co-expression. In this formulation, we explicitly estimate the effects of genes and conditions in the bicluster conditions and in the non-bicluster conditions. Strong effects of either genes or conditions indicate co-expression of genes in the given group of conditions. Taking the ratio of the co-expression scores between bicluster and non-bicluster conditions gives a measure of the goodness of a bicluster.
Further, we proposed a bicluster stratification score to classify biclusters by their co-expression patterns. A high score means that genes are co-expressed similarly across the conditions in the bicluster, but the genes can be divided into two groups, one with induction and the other with repression. A low score means that genes are co-expressed across conditions, but the conditions can be divided into two groups, one with induction of all genes and the other with repression. A medium or near-zero score means that all genes are either induced or repressed, but not a combination, in all conditions. The framework we used is analogous to ANOVA, with Tk, Bk and μk being similar to the variance terms and the non-centrality parameter being 0 under the null.
We have compared four well-known biclustering algorithms: ISA, OPSM, CC and SAMBA. Their application to six different datasets revealed that ISA outputs the best biclusters, but its performance depends on the number of conditions in the dataset; SAMBA performs well across datasets with varying numbers of conditions; OPSM does not perform well on most datasets, but is still useful on certain datasets such as the Lung cancer data; and CC outputs the biclusters of lowest goodness, with high stratification scores. Further, the types of co-expression present in the biclusters depend on the data: all algorithms output predominantly B-type biclusters on the Breast and Lung datasets and a mix of B-type and μ-type biclusters on the Liver, Yeast and Lymphoma datasets, with μ-type biclusters slightly more numerous. Strikingly, OPSM outputs mostly B-type biclusters and CC is the only algorithm that outputs T-type biclusters.
The evaluation presented in this paper may vary with a change in the parameter settings of the individual algorithms. The scoring function can nevertheless be used to compare different parameter settings for a given algorithm and to choose suitable ones. Hence, as shown here, it is useful for obtaining a unified ranking of the biclusters (i.e. a meta-biclustering algorithm) produced by different algorithms under different parameter settings. We are also working on devising an algorithm based on the differential co-expression framework, as it may find novel biclusters with strong differential co-expression.
Moreover, though the proposed goodness scoring function is tailored to assess the goodness of the biclusters of co-expressed genes, the general framework of differential co-expression can be extended to evaluate the goodness of the other types of biclusters such as low error in the expression which requires a scoring function proposed by Kostka & Spang i.e. ratio of error variances = E2(b)/E1(b).
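For the low-error case mentioned above, the ratio of error variances can be sketched as follows. The sketch assumes that Ek is the residual mean square of a two-way additive fit (overall mean, gene effects, condition effects) to the bicluster genes in condition group k. That definition, the function names and the synthetic data are assumptions for illustration, not the formulation of Kostka and Spang or of this paper.

```python
import numpy as np

def error_variance(X):
    """Residual mean square of an additive fit X[i, j] ~ mu + tau[i] + beta[j]."""
    X = np.asarray(X, dtype=float)
    n_genes, n_conds = X.shape          # requires at least 2 genes and 2 conditions
    mu = X.mean()
    tau = X.mean(axis=1, keepdims=True) - mu
    beta = X.mean(axis=0, keepdims=True) - mu
    resid = X - mu - tau - beta
    return float((resid ** 2).sum() / ((n_genes - 1) * (n_conds - 1)))

def error_ratio(X1, X2):
    """E2 / E1: large when the fit is much tighter on the bicluster conditions."""
    return error_variance(X2) / error_variance(X1)

# Synthetic example: a tight additive pattern in G1, unstructured noise in G2.
rng = np.random.default_rng(0)
gene_eff = rng.normal(size=(6, 1))
cond_eff = rng.normal(size=(1, 5))
X1 = gene_eff + cond_eff + 0.1 * rng.normal(size=(6, 5))
X2 = rng.normal(size=(6, 8))
ratio = error_ratio(X1, X2)
```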
Declarations
Acknowledgements
We thank Ian, Huaien and Juntao for their valuable comments during the work. We also thank the anonymous reviewers for their valuable constructive comments which helped improve the manuscript. The research was funded by Genome Institute of Singapore, BMRC, Agency for Science, Technology and Research (A-STAR), Singapore.
References
1. Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6 (3-4): 281-297. 10.1089/106652799318274
2. Dhillon IS, Marcotte EM, Roshan U: Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics. 2003, 19 (13): 1612-1619. 10.1093/bioinformatics/btg209
3. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531
4. Mirkin B: Mathematical classification and clustering. Nonconvex Optimization and Its Applications. Edited by: Pardalos PM. 1996, 11: Boston-Dordrecht: Kluwer Academic Publishers
5. Cheng Y, Church GM: Biclustering of expression data. Proceedings of Intl Conf Intell Syst Mol Biol 19-23 August 2000; UC San Diego La Jolla. Edited by: Bourne P, Gribskov M, Altman R, Jensen N, Hope D, Lengauer T, Mitchell J, Scheeff E, Smith C, Strande S, Weissig H. 2000, 93-103. AAAI
6. Ayadi W, Elloumi M, Hao JK: A biclustering algorithm based on a Bicluster Enumeration Tree: application to DNA microarray data. BioData Mining. 2009, 2: 9. 10.1186/1756-0381-2-9
7. Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004, 20 (13): 1993-2003. 10.1093/bioinformatics/bth166
8. Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering Local Structure in Gene Expression Data: The Order Preserving Submatrix Problem. J Comput Biol. 2003, 10 (3-4): 373-384. 10.1089/10665270360688075
9. Kluger Y, Basri R, Chang JT, Gerstein M: Spectral biclustering of microarray data: coclustering genes and conditions. Genome Research. 2003, 13: 703-716. 10.1101/gr.648603
10. Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: A survey. IEEE Transactions on Computational Biology and Bioinformatics. 2004, 1 (1): 24-45. 10.1109/TCBB.2004.2
11. Prelic A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22 (9): 1122-1129. 10.1093/bioinformatics/btl060
12. Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002, 18 (1): 136-144.
13. Yang J, Wang W, Yu P: Enhanced biclustering on expression data. Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering: 10-12 March 2003. 2003, 321-327. Bethesda: IEEE Computer Society
14. Broët P, Camilleri-Broët S, Zhang S, Alifano M, Bangarusamy D, Battistella M, Wu Y, Tuefferd M, Régnard JF, Lim E, Tan P, Miller LD: Prediction of clinical outcome in multiple lung cancer cohorts by integrative genomics: implications for chemotherapy selection. Cancer Res. 2009, 69 (3): 1055-1062. 10.1158/0008-5472.CAN-08-1116
15. Chen X, Cheung ST, So S, Fan ST, Barry C, Higgins J, Lai KM, Ji J, Dudoit S, Ng IOL, Rijn M, Botstein D, Brown PO: Gene expression patterns in human liver cancers. Mol Biol Cell. 2002, 13 (6): 1929-1939. 10.1091/mbc.02-02-0023.
16. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-679.
17. Kostka D, Spang R: Finding disease specific alterations in the co-expression of genes. Bioinformatics. 2004, 20 (Suppl 1): i194-i199. 10.1093/bioinformatics/bth909
18. Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox. Bioinformatics. 2006, 22 (10): 1282-1283. 10.1093/bioinformatics/btl099
19. Ulitsky I, Maron-Katz A, Shavit S, Sagir D, Linhart C, Elkon R, Tanay A, Sharan R, Shiloh Y, Shamir R: Expander: from expression microarrays to networks and functions. Nature Protocols. 2010, 5 (2): 303-322. 10.1038/nprot.2009.230
20. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Molecular Biology of the Cell. 2000, 11 (12): 4241-4257.
21. Alizadeh AA, Eisen MB, Davis ER, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lisheng Lu, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501
22. Wille A, Zimmermann P, Vranová E, Fürholz A, Laule O, Bleuler S, Hennig L, Prelic A, von Rohr P, Thiele L, Zitzler E, Gruissem W, Bühlmann P: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology. 2004, 5 (11): R92. 10.1186/gb-2004-5-11-r92
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.