A robust approach based on Weibull distribution for clustering gene expression data
© Wang et al; licensee BioMed Central Ltd. 2011
Received: 24 December 2010
Accepted: 31 May 2011
Published: 31 May 2011
Clustering is a widely used technique for the analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarity of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.
In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data in which the expression values of each gene are treated as a random variable following its own Weibull distribution. The WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, so genes are clustered via their Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets, from lung cancer, B-cell follicular lymphoma and bladder carcinoma, and obtained well-clustered results. We compared the performance of the WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of the WDCM are higher than those of the other methods. We also used the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. A merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of the WDCM is also evaluated on incomplete data sets.
The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.
Changes in gene expression levels are very common in complex human diseases, such as cancers [1–3]. The advent of microarray technologies has made it possible to measure simultaneously the expression levels of many thousands of genes over different time points and/or under different experimental conditions [4–6]. Numerous computational techniques have been developed to analyze these gene expression data. Among them, clustering is a primary approach for grouping genes with similar expression patterns across different conditions, which enables the identification of differentially expressed gene sets in cancerous tissues [7–9]. Clustering is an unsupervised learning technique that assigns a set of objects (genes) to subsets (called clusters) so that the objects in the same cluster are similar according to some similarity metric [10, 11]. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Since clustering was first proposed, an increasing number of clustering approaches have been developed and improved for the analysis of gene expression data. Common clustering methods include k-means [12, 13], hierarchical clustering, and Self Organizing Map (SOM) [14, 15]. Each method has its own strengths and weaknesses. k-means is an important clustering algorithm that partitions n objects into k clusters, in which each object belongs to the cluster with the nearest mean. In k-means clustering, the number of clusters k is an input parameter, and an inappropriate choice of k may yield poor clustering results. The main advantages of this algorithm are its simplicity and computational speed, which allow it to run on large data sets. However, it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments. In addition, it performs poorly with overlapping clusters and is sensitive to noisy data. Hierarchical clustering aims to create a hierarchy of clusters, which may be represented by a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all objects, and the leaves correspond to individual objects. The hierarchical technique requires relatively smooth data, and the clusters themselves need to be well defined. As with k-means, noisy data strongly affect the resulting clusters. SOM is a type of artificial neural network that is trained using unsupervised learning to produce a two-dimensional, discretized representation of the input space of observations. It requires the geometry of the nodes as input; the nodes are mapped into two-dimensional space, initially at random, and then iteratively adjusted. SOM imposes structure on the data, with neighboring nodes tending to define related clusters. SOM has good computational properties and is suited to clustering large data sets.
One major drawback of SOM is the "boundary effect" of nodes on the edges of the network, which may lead to less effective clustering results. Moreover, all of the clustering methods mentioned above require a complete data set as input, so gene rows containing missing values are either removed or filled in by an imputation method prior to clustering analysis. Removing the gene rows with missing values may omit some important genes, such as genes related to diseases, whereas badly estimated missing values may degrade the quality of the data, which could affect the accuracy of the clustering results.
In this article, we propose a Weibull distribution-based clustering method called WDCM. The assumption of this method is that the expression values of each gene can be considered as a random variable following its own Weibull distribution, and that a group of genes tend to be clustered together if the Weibull distributions of their expression values have similar distribution parameters. Here, we use the expression values of each gene to construct its corresponding Weibull distribution and then group the genes by clustering their corresponding distribution parameters.
The remaining sections of this paper are organized as 'Results', 'Discussion and conclusion' and 'Methods'. In the 'Results' section, we first introduce the three cancer gene expression data sets we used and then visually demonstrate the clustering results obtained using the WDCM for the three data sets. Second, to assess the performance of the WDCM, we compare the functional consistency of the gene clusters produced by the WDCM with that of the k-means and SOM methods for the same data sets. We also use the external measure Adjusted Rand Index to establish the performance of the WDCM, and compare it with the other algorithms at the same time. Finally, we test the robustness of the WDCM on clustering incomplete data sets. In the 'Discussion and conclusion' section, we first summarize the main work of this study and discuss the strengths and limitations of the WDCM. We then briefly mention possible improvements of the WDCM and future work. In the 'Methods' section, we introduce the WDCM together with the algorithm used for clustering the Weibull distribution parameters, the method for assessing the functional consistency of the clustering result, and the external validation index Adjusted Rand Index. The robustness test of the WDCM on clustering incomplete data sets is also presented in that section.
In this section, the WDCM is described as follows. Given an m × n gene expression matrix, let g_ij be the jth expression value of gene i, i = 1, ..., m, and j = 1, ..., n. We treat the expression of each gene as a random variable and construct the distribution of the expression values of gene i. We then choose the subset of genes whose expression distributions belong to the common Weibull family. Because these genes share the same type of distribution function, we consider that genes with similar distribution parameters tend to share similar expression patterns and are probably involved in the same biological processes or functions. We further cluster the genes in the selected subset by clustering their corresponding distribution parameters, as each gene corresponds to its own distribution parameters. In the following, we introduce the procedure for constructing the distribution functions.
Weibull distributions of gene expressions construction
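We first construct the empirical distribution function of the n expression values of gene i:

$$F_n^{(i)}(x) = \frac{1}{n} \sum_{j=1}^{n} I(g_{ij} \le x),$$

where I(∙) is the indicator function.

We then consider the two-parameter Weibull distribution, whose cumulative distribution function is

$$F(x) = 1 - \exp\left[-\left(\frac{x}{a}\right)^{b}\right], \quad x \ge 0,$$

where a > 0 is the scale parameter and b > 0 is the shape parameter of the distribution. The scale parameter a determines the range of the distribution. The shape parameter b is what gives the Weibull distribution its flexibility. By changing the value of the shape parameter, the Weibull distribution can fit a wide variety of data.

To test the null hypothesis that the expression values of gene i are a random sample from a Weibull distribution F(i)(x), we use the Kolmogorov-Smirnov statistic

$$D_n = \sup_x \left| F_n^{(i)}(x) - F^{(i)}(x) \right|.$$

The scaled statistic $\sqrt{n} D_n$, under the null hypothesis, converges to the Kolmogorov distribution. The null hypothesis is rejected at significance level α if $\sqrt{n} D_n > K_\alpha$; otherwise it is accepted, where K_α is the critical value of the Kolmogorov distribution. Given α = 0.05, we select appropriate parameters for F(i)(x) such that the null hypothesis is accepted (p-value > 0.05), that is, such that the random sample can be regarded as coming from the Weibull distribution F(i)(x), i = 1, 2, ..., m. Following the above procedure, we obtain the Weibull distributions of the m genes, denoted by F(1)(x), F(2)(x), ..., F(m)(x).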
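The fitting and testing procedure for a single gene can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes the standard two-parameter Weibull form F(x) = 1 − exp(−(x/a)^b), strictly positive expression values, and maximum-likelihood estimation of (a, b); the text does not state how the parameters are selected. The function names are hypothetical. The p-value uses the asymptotic Kolmogorov series and ignores the fact that the parameters were estimated from the same sample, so it tends to be conservative.

```python
import math
import random


def weibull_cdf(x, a, b):
    """CDF of the two-parameter Weibull with scale a > 0 and shape b > 0."""
    return 1.0 - math.exp(-((x / a) ** b)) if x > 0 else 0.0


def fit_weibull_mle(values, lo=0.01, hi=100.0, iters=100):
    """Maximum-likelihood (a, b) for positive values, by bisection on the shape b.

    Assumption: MLE is used here only for illustration.
    """
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)

    def score(b):
        # Profile-likelihood equation for b; weights are v**b rescaled by max(v)**b
        # so that large expression values cannot overflow.
        w = [math.exp(b * (l - max_log)) for l in logs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / b - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:   # score is increasing in b, so the root lies below mid
            hi = mid
        else:
            lo = mid
    b = 0.5 * (lo + hi)
    w_mean = sum(math.exp(b * (l - max_log)) for l in logs) / len(logs)
    a = math.exp(max_log + math.log(w_mean) / b)
    return a, b


def ks_statistic(values, a, b):
    """Kolmogorov-Smirnov distance between the empirical CDF and Weibull(a, b)."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        f = weibull_cdf(x, a, b)
        d = max(d, (k + 1) / n - f, f - k / n)
    return d


def kolmogorov_pvalue(d, n, terms=100):
    """Asymptotic p-value P(sqrt(n) * D_n > observed) from the Kolmogorov series."""
    lam = math.sqrt(n) * d
    if lam < 0.2:            # the series is numerically unreliable here; p is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))
```

Under these assumptions, a gene would be kept for clustering when `kolmogorov_pvalue(ks_statistic(values, a, b), len(values))` exceeds 0.05, and its pair `(a, b)` would then serve as the point to be clustered.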
Weibull distribution parameters of gene expressions clustering
Let θ_i denote the parameter of the Weibull distribution F(i)(x), i = 1, ..., m. Here θ_i consists of the parameter pair (a_i, b_i). We then cluster the m parameters θ_1, θ_2, ..., θ_m using a clustering algorithm based on hub points. This algorithm, presented by Robert Clason, designates a single point as a hub for each cluster, finds the distance from each remaining point to each hub, and assigns each point to the hub to which it is closest. Its merit is that it automatically determines the number of clusters on the basis of the distances between data points. A detailed description of the algorithm is provided in Additional file 1.
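The assignment step described above can be illustrated with a short Python sketch. It covers only the nearest-hub assignment; how the hubs are chosen and how the number of clusters is determined are described in Additional file 1 and are not reproduced here. The hubs are therefore supplied by hand, the Euclidean distance is an assumption, and the function name is hypothetical.

```python
import math


def assign_to_hubs(params, hubs):
    """Return, for each (a, b) parameter pair, the index of its nearest hub.

    Assumption: Euclidean distance between parameter pairs.
    """
    labels = []
    for p in params:
        distances = [math.dist(p, h) for h in hubs]
        labels.append(distances.index(min(distances)))
    return labels


# Illustrative parameter pairs (a_i, b_i) and two hand-picked hubs.
params = [(1.0, 1.0), (1.1, 0.9), (5.0, 3.0), (5.2, 3.1)]
hubs = [(1.0, 1.0), (5.0, 3.0)]
labels = assign_to_hubs(params, hubs)
```

Because the scale and shape parameters can differ in magnitude, rescaling them before computing distances may be worth considering; the text does not say whether this is done.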
Functional consistency of clustering result
To evaluate the performance of the proposed WDCM, we also apply the k-means and Self Organizing Map (SOM) clustering algorithms to the same gene subsets as the WDCM and obtain the corresponding gene clusters. We compare the functional consistency of the gene clusters produced by the WDCM with that of the clusters produced by the other methods. For this purpose, we consider the biological annotations of the gene clusters in terms of the Gene Ontology (GO). The GO project provides three structured, controlled vocabularies that describe gene products in terms of their associated biological processes (BP), cellular components (CC) and molecular functions (MF). The annotation ratios of each gene cluster in the three GO ontologies were calculated using the web-accessible DAVID 2008 tool. For each cluster found by one of the three clustering methods, under the BP ontology, we find the single GO term in which the most genes of the cluster are enriched, and define the BP annotation ratio for this cluster as the number of genes that belong to both the assigned GO term and the cluster, divided by the number of genes in the cluster. After calculating the BP annotation ratios for all clusters, we take the mean of these ratios as the final BP annotation ratio. The CC and MF annotation ratios are defined in the same manner. A higher annotation ratio means that the genes are better clustered by function, indicating a more functionally consistent clustering result.
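As a minimal sketch of this definition, the annotation ratio can be computed from a mapping of GO terms to annotated genes. The sketch assumes that the assigned term is simply the one covering the most genes of the cluster; the enrichment analysis performed by DAVID is not reproduced. The gene and term identifiers below are made up for illustration, and the function names are hypothetical.

```python
def annotation_ratio(cluster, term_to_genes):
    """Largest fraction of the cluster's genes annotated to any single term."""
    cluster = set(cluster)
    if not cluster:
        return 0.0
    best = max((len(cluster & set(genes)) for genes in term_to_genes.values()),
               default=0)
    return best / len(cluster)


def mean_annotation_ratio(clusters, term_to_genes):
    """Mean of the per-cluster annotation ratios for one ontology (BP, CC or MF)."""
    ratios = [annotation_ratio(c, term_to_genes) for c in clusters]
    return sum(ratios) / len(ratios)


# Illustrative data only: two clusters and two hypothetical terms.
clusters = [["g1", "g2", "g3", "g4"], ["g5", "g6"]]
term_to_genes = {"term_A": ["g1", "g2", "g3"], "term_B": ["g4", "g5"]}
bp_ratio = mean_annotation_ratio(clusters, term_to_genes)
```

In this toy example the first cluster has ratio 3/4 and the second 1/2, so the final ratio is their mean.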
Adjusted Rand Index validation index
The Adjusted Rand Index measures the agreement between a clustering result C and a reference partition X. Its expected value is 0 for random partitions and its maximum is 1 for identical partitions; a higher value means that C is more similar to X.
Considering that genes with similar expression patterns may be functionally related to each other, we group the genes in the given data set according to functional similarity and define these gene clusters as X. The clustering results C_s are then given by the proposed WDCM, k-means and SOM. We compute and compare the values of the Adjusted Rand Index (ARI) between X and each C_s to evaluate the performance of the WDCM. To this end, we first use the Gene Functional Classification Tool of DAVID to group the genes into highly functionally related gene clusters and then compute the ARI values. A higher value indicates that the corresponding clustering method performs better.
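For reference, the ARI between two label assignments can be computed by pair counting. The sketch below assumes the standard Hubert-Arabie form of the index; the function name is hypothetical, and the labels stand for the cluster of each gene under X and under one clustering result C.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_c):
    """Adjusted Rand Index between two labelings of the same genes."""
    n = len(labels_x)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of genes placed together in both, in X only, and in C only.
    together_both = sum(comb(v, 2) for v in Counter(zip(labels_x, labels_c)).values())
    together_x = sum(comb(v, 2) for v in Counter(labels_x).values())
    together_c = sum(comb(v, 2) for v in Counter(labels_c).values())
    expected = together_x * together_c / total_pairs
    maximum = 0.5 * (together_x + together_c)
    if maximum == expected:      # both partitions trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The index does not depend on how the clusters are named, only on which genes are grouped together, so relabeling the clusters of C leaves the value unchanged.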
Robustness of the WDCM on clustering incomplete data set
The WDCM can be applied to cluster an incomplete gene expression data set without imputing the missing values. To test the robustness of this approach, we compare the degree of overlap between the gene clusters obtained for incomplete data sets and those obtained for the complete data set. A higher degree of overlap indicates a more robust clustering method. To this end, we first randomly remove 5-25% of the values of the complete data set to create incomplete gene expression data sets, and then apply the WDCM to both the complete and the incomplete data sets to obtain the respective clustering results. Here, a Cluster Overlap Ratio (COR) index is introduced for assessing the degree of overlap at each percentage of missing values.
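The construction of an incomplete data set can be sketched as follows. The sketch assumes that entries are removed uniformly at random over the whole matrix and that missing entries are represented by `None`; the text does not specify either detail. The function names are hypothetical.

```python
import random


def mask_values(matrix, fraction, seed=None):
    """Return a copy of the matrix with about `fraction` of its entries set to None."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    masked = [list(row) for row in matrix]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def observed(row):
    """Expression values of one gene that are still available after masking."""
    return [v for v in row if v is not None]


# Illustrative 4 x 5 matrix with 25% of its entries removed.
matrix = [[float(i * 5 + j + 1) for j in range(5)] for i in range(4)]
incomplete = mask_values(matrix, 0.25, seed=1)
```

Each gene's distribution would then be constructed from `observed(row)` alone, which is why no imputation step is needed.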
Cluster Overlap Ratio index
Here |∙| denotes the number of genes in a cluster, and thus p_i represents the proportion of genes in gene cluster I_i. x_i denotes the maximum number of overlapping genes between I_i and each individual C_k (k = 1, ..., n), divided by |I_i|.
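A possible computation of the COR index from these quantities is sketched below. It assumes that COR is the p_i-weighted sum of the x_i, with p_i taken as |I_i| divided by the total number of genes over all clusters I_i; this combination is an assumption, since only the definitions of p_i and x_i are given in the text. The function name and gene identifiers are hypothetical.

```python
def cluster_overlap_ratio(clusters_i, clusters_c):
    """Weighted overlap between clusters I_i and their best-matching clusters C_k.

    Assumption: COR = sum_i p_i * x_i, with p_i = |I_i| / sum_j |I_j|.
    """
    clusters_i = [set(c) for c in clusters_i]
    clusters_c = [set(c) for c in clusters_c]
    total = sum(len(c) for c in clusters_i)
    cor = 0.0
    for cluster in clusters_i:
        if not cluster:
            continue
        p = len(cluster) / total
        best_overlap = max((len(cluster & other) for other in clusters_c), default=0)
        x = best_overlap / len(cluster)
        cor += p * x
    return cor


# Illustrative clusters only.
clusters_i = [{"g1", "g2", "g3"}, {"g4", "g5"}]
clusters_c = [{"g1", "g2"}, {"g3", "g4", "g5"}]
cor = cluster_overlap_ratio(clusters_i, clusters_c)
```

Under this assumption the index equals 1 when every cluster I_i is contained in some cluster C_k and decreases as genes are scattered across different clusters.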
Identification of six gene clusters for lung cancer data set
It is evident from Figure 1A that the distribution parameters of the genes within a cluster lie close together and form a compact group, which indicates that the Weibull distribution parameters were clustered well. The expression profiles of the corresponding clustered genes are shown in Figure 1B, from which it is also evident that the expression profiles of the genes within the same cluster are quite similar, whereas the profiles of genes belonging to different clusters differ from each other.
Identification of four gene clusters for follicular lymphoma data set
In Figure 2A, the four parameter clusters are clearly distinguished from each other. The expression profiles of the genes within the same cluster are similar, whereas those of genes in different clusters are distinct (see Figure 2B). The results indicate that clearly distinct gene clusters were found using the WDCM on the follicular lymphoma data set.
Identification of four gene clusters for bladder carcinoma data set
Comparison of clustering performance
To show the performance of the WDCM, we applied the k-means and Self Organizing Map (SOM) algorithms to the same gene subsets clustered by the WDCM and compared the functional consistency of the gene clusters produced by the WDCM with that of the gene clusters produced by the other methods (see Methods). The ARI values of the WDCM, k-means and SOM algorithms on the three data sets were also compared (see Methods).
ARI values of WDCM, k-means and SOM algorithms for the lung cancer, B-cell follicular lymphoma and bladder carcinoma gene expression data sets
The above comparative analyses of the functional annotation ratios of the three algorithms demonstrate that the genes in each cluster obtained using the WDCM show not only similar expression patterns but also more consistent functional annotations, which means that these genes are more likely to be involved in the same biological functions. The Adjusted Rand Index comparison also indicates that the proposed WDCM performs better than the other algorithms.
Test for robustness of the WDCM on clustering incomplete data set
COR indices with respect to the specified percentages of missing values for the lung cancer, B-cell follicular lymphoma and bladder carcinoma data sets
The results of the cluster overlap comparison tests indicate that, at low percentages of missing values, the gene clusters found by the WDCM overlap to a high degree with those found for the complete data set, highlighting the robustness and potential of the WDCM. We think that this result might stem from the fact that, at low percentages of missing values, the missing expression values of individual genes have little influence on the estimation of their Weibull distribution parameters.
Discussion and conclusion
In this article, we propose a robust approach based on the Weibull distribution (WDCM) for clustering gene expression data. It is based on the idea that a group of genes tend to be clustered together if the distributions of their expression values belong to the common Weibull family and have similar distribution parameters. Consequently, we cluster the genes by clustering the distribution parameters of their expression values. A dynamic clustering algorithm based on hub points is used to cluster the distribution parameters. The number of clusters in a gene expression data set is determined automatically by this clustering algorithm. The performance of the proposed WDCM has been compared with that of the k-means and SOM clustering algorithms in terms of biological annotation ratios to show its effectiveness on three cancer gene expression data sets. The results show that the WDCM is more capable of grouping together genes with similar expression patterns and strong functional consistency. We also used the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. Moreover, the WDCM can be applied to cluster an incomplete gene expression data set without imputing the missing values. The results demonstrate that there is high overlap between the gene clusters for the incomplete data set and those for the complete data set, which illustrates the robustness of the WDCM on clustering incomplete data sets at low percentages of missing values.
In general, owing to the complex nature of gene expression data sets and the experimental errors in measuring gene expression, it is difficult to identify a single clustering approach that is acknowledged to be the best. In the clustering process, the WDCM disregards the few genes whose expression distributions fail to fit the Weibull distribution. In future work, we will consider replacing the single Weibull distribution with a mixture distribution in order to cluster the whole data set. We will also seek to increase the robustness of this approach on incomplete gene expression data sets containing a moderate percentage of missing values. For the gene clusters found by the WDCM, we would like to investigate which gene clusters and genes are correlated with particular cancer phenotypes, and which biological processes or molecular functions the genes in these clusters are involved in. Our study may help to gain insights into complex diseases.
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 30871394 and 61073136), the National High Tech Development Project of China, the 863 Program (Grant No. 2007AA02Z329), the National Basic Research Program of China, the 973 Program (Grant No. 2008CB517302), the National Science Foundation of Heilongjiang Province (Grant Nos. JC200711, ZD200816-01), the Graduate Student Creation Science Foundation of Heilongjiang Province (Grant No. YJSCX2008-123HLJ) and the Scientific Research Foundation of Heilongjiang Provincial Education Department (Grant No. 11551362).
1. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24: 227-235. 10.1038/73432
2. Schlom J, Tsang KY, Kantor JA, Abrams SI, Zaremba S, Greiner J, Hodge JW: Cancer vaccine development. Expert Opin Investig Drugs. 1998, 7: 1439-1452.
3. Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science. 1997, 276: 1268-1272. 10.1126/science.276.5316.1268
4. Khademhosseini A: Chips to Hits: microarray and microfluidic technologies for high-throughput analysis and drug discovery. September 12-15, 2005, MA, USA. Expert Rev Mol Diagn. 2005, 5: 843-846.
5. Khan J, Bittner ML, Chen Y, Meltzer PS, Trent JM: DNA microarray technology: the anticipated impact on the study of human disease. Biochim Biophys Acta. 1999, 1423: M17-28.
6. Watson A, Mazumder A, Stewart M, Balasubramanian S: Technology for microarray analysis of gene expression. Curr Opin Biotechnol. 1998, 9: 609-614. 10.1016/S0958-1669(98)80138-9
7. Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6: 281-297. 10.1089/106652799318274
8. Guess MJ, Wilson SB: Introduction to hierarchical clustering. J Clin Neurophysiol. 2002, 19: 144-151. 10.1097/00004691-200203000-00005
9. Rahnenfuhrer J: Clustering algorithms and other exploratory methods for microarray data analysis. Methods Inf Med. 2005, 44: 444-448.
10. Boutros PC, Okey AB: Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief Bioinform. 2005, 6: 331-343. 10.1093/bib/6.4.331
11. Sierra A, Corbacho F: Reclassification as supervised clustering. Neural Comput. 2000, 12: 2537-2546. 10.1162/089976600300014836
12. MacQueen JB: Some Methods for classification and Analysis of Multivariate Observations. the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281-297. University of California Press
13. Gourevitch B, Le Bouquin-Jeannes R: K-means clustering method for auditory evoked potentials selection. Med Biol Eng Comput. 2003, 41: 397-402. 10.1007/BF02348081
14. Cottrell M, Ibbou S, Letremy P: SOM-based algorithms for qualitative variables. Neural Netw. 2004, 17: 1149-1167. 10.1016/j.neunet.2004.07.010
15. Lee BH, Scholz M: Application of the self-organizing map (SOM) to assess the heavy metal removal performance in experimental constructed wetlands. Water Res. 2006, 40: 3367-3374. 10.1016/j.watres.2006.07.027
16. Weibull W: A statistical distribution function of wide applicability. J Appl Mech-Trans ASME. 1951, 18: 293-297.
17. Turnbull BW: The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society Series B. 1976, 38: 290-295.
18. Frank J, Massey J: The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association. 1951, 46: 68-78. 10.2307/2280095
19. Huang S, Yeo AA, Li SD: Modification of Kolmogorov-Smirnov test for DNA content data analysis through distribution alignment. Assay Drug Dev Technol. 2007, 5: 663-671. 10.1089/adt.2007.071
20. Ong LD, LeClare PC: The Kolmogorov-Smirnov test for the log-normality of sample cumulative frequency distributions. Health Phys. 1968, 14: 376-
21. Clason R: Finding Clusters: An application of the Distance Concept. The Mathematics Teacher. 1990
22. Blake JA, Harris MA: The Gene Ontology (GO) project: structured vocabularies for molecular biology and their application to genome and expression analysis. Curr Protoc Bioinformatics. 2008, 7: Unit 7 2
23. Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57.
24. Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17: 309-318. 10.1093/bioinformatics/17.4.309
25. R Giancarlo DS, Utro F: Statistical Indexes for Computational and Data Driven Class Discovery in Microarray Data. In Biological Data Mining. 2009, Chapman and Hall
26. Mosca E, Bertoli G, Piscitelli E, Vilardo L, Reinbold RA, Zucchi I, Milanesi L: Identification of functionally related genes using data mining and data integration: a breast cancer case study. BMC Bioinformatics. 2009, 10 (Suppl 12): S8. 10.1186/1471-2105-10-S12-S8
27. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001, 98: 13790-13795. 10.1073/pnas.191502998
28. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002, 8: 68-74. 10.1038/nm0102-68
29. Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen JL, Marcussen N, Hamilton-Dutoit S, Wolf H, Orntoft TF: Identifying distinct classes of bladder carcinoma using microarrays. Nat Genet. 2003, 33: 90-96. 10.1038/ng1061
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.