Survival associated pathway identification with group L_{ p } penalized global AUC maximization
- Zhenqiu Liu^{1, 2}Email author,
- Laurence S Magder^{2},
- Terry Hyslop^{3} and
- Li Mao^{4}
https://doi.org/10.1186/1748-7188-5-30
© Liu et al; licensee BioMed Central Ltd. 2010
Received: 7 January 2010
Accepted: 16 August 2010
Published: 16 August 2010
Abstract
It has been demonstrated that genes in a cell do not act independently. They interact with one another to complete certain biological processes or to implement certain molecular functions. How to incorporate biological pathways or functional groups into the model and identify survival associated gene pathways is still a challenging problem. In this paper, we propose a novel iterative gradient based method for survival analysis with group L_{ p } penalized global AUC summary maximization. Unlike LASSO, L_{ p } (p < 1) (with its special implementation entitled adaptive LASSO) is asymptotic unbiased and has oracle properties [1]. We first extend L_{ p } for individual gene identification to group L_{ p } penalty for pathway selection, and then develop a novel iterative gradient algorithm for penalized global AUC summary maximization (IGGAUCS). This method incorporates the genetic pathways into global AUC summary maximization and identifies survival associated pathways instead of individual genes. The tuning parameters are determined using 10-fold cross validation with training data only. The prediction performance is evaluated using test data. We apply the proposed method to survival outcome analysis with gene expression profile and identify multiple pathways simultaneously. Experimental results with simulation and gene expression data demonstrate that the proposed procedures can be used for identifying important biological pathways that are related to survival phenotype and for building a parsimonious model for predicting the survival times.
Keywords
Background
Biologically complex diseases such as cancer are caused by mutations in biological pathways or functional groups instead of individual genes. Statistically, genes sharing the same pathway have high correlations and form functional groups or biological pathways. Many databases about biological knowledge or pathway information are available in the public domain after many years of intensive biomedical research. Such databases are often named metadata, which means data about data. Examples of such databases include the gene ontology (GO) databases (Gene Ontology Consortium, 2001), the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [2], and several other pathways on the internet (e.g., http://www.superarray.com; http://www.biocarta.com). Most current methods, however, are developed purely from computational points without utilizing any prior biological knowledge or information. Gene selections with survival outcome data in the statistical literature are mainly within the penalized Cox or additive risk regression framework [3–8]. The L_{1} and L_{ p } (p < 1) penalized Cox regressions can work for simultaneous individual gene selection and survival prediction and have been extensively studied in statistics and bioinformatics literature [8–11]. The performance of the survival model is evaluated by the global area under the ROC curve summary (GAUCS) [12]. Unfortunately, those methods are mainly for individual gene selections and cannot be used to identify pathways directly. In microarray analysis, several popular tools for pathway analysis, including GENMAP, CHIPINFO, and GOMINER, are used to identify pathways that are over-expressed by differentially expressed genes. These gene set enrichment analysis (GSEA)methods are very informative and are potentially useful for identifying pathways that related to disease status [13]. One drawback with GSEA is that it considers each pathway separately and the pathway information is not utilized in the modeling stage. Wei and Li [14] proposed a boosting algorithm incorporating related pathway information for classification and regression. However, their method can only be applied for binary phenotypes. Since most complex diseases such as cancer are believed to be associated with the activities of multiple pathways, new statistical methods are required to select multiple pathways simultaneously with the time-to-event phenotypes.
A ROC curve provides complete information on the set of all possible combinations of true-positive and false-positive rates, but is also more generally useful as a graphic characterization of the magnitude of separation between the case and control distributions. AUC is known to measure the probability that the marker value (score) for a randomly selected case exceeds the marker value for a randomly selected control and is directly related to the Mann-Whitney U statistic [15, 16]. In survival analysis, a survival time can be viewed as a time-varying binary outcome. Given a fixed time t, the instances for which t_{ i } = t are regarded as cases and samples with t_{ i } > t are controls. The global AUC summary (GAUCS) is then defined as GAUCS = P (M_{ j } > M_{ k }|t_{ j } < t_{ k }), which indicates that subject who died at an earlier time has a larger score value, where M is a score function. Heagerty and Zheng [12] have shown that GAUCS is a weighted average of the area under time-specific ROC curves. Liu et al. [17] proposed a L_{1} penalized quadratic support vector machine (SVM) method for GAUCS maximization and individual gene selection with survival outcomes and showed the method outperformed the Cox regression. However, that method is only for gene selections and can not be directly used for identifying pathways without additional criteria. Group LASSO (L_{1}) related penalized methods have been extensively studied recently in logistic regression [18], multiple kernel learning [19], and microarray analysis [20] with binary phenotypes. The methods are designed for selecting groups of variables and identifying important covariate groups (pathways). However, LASSO is biased. L_{ p } (p ≤ 1) (with one specific implementation entitled adaptive LASSO [1, 21]) is asymptotic unbiased and has oracle properties. Therefore, it is reasonable to extend the L_{ p } to group L_{ p } for pathway identifications. In this paper, we therefore extend L_{ p } to group L_{ p } penalty and develop a novel iterative gradient based algorithm for GAUCS maximization (IGGAUCS), which can effectively integrate genomic data and biological pathway information and identify disease associated pathways with right censored survival data. In Section 2, we formulate the GAUCS maximization and group L_{ p } penalty model and propose an efficient EM algorithm for survival prediction. Its performance is compared with its group lasso penalized Cox regression (implemented by ourselves). We also propose an integrated algorithm with GAUCS for microarray data analysis. The proposed approach is demonstrated with simulation and gene expression examples in Section 3. Concluding remarks are discussed in Section 4.
Group L_{ p } Penalized GAUCS Maximization
which indicates the probability that the subject who died (cases) at the early time has a larger value of the marker, where S(t) and g(t) are the survival and corresponding density functions, respectively.
Assuming there are r clusters in the input covariates, our primary aim is to identify a small number of clusters associated with survival time t_{ i }. Mathematically, for each input x_{ i } ϵ ℝ^{ m }, we are given a decomposition of ℝ^{ m } as a product of r clusters:
where M_{ j } = w^{ T } x_{ j } . The ideal situation is that M(x_{ j } ) > M(x_{ k } ) or w^{ T } (x_{ j } - x_{ k }) > 0, ∀ couple (x_{ j } , x_{ k }) with corresponding times t_{ j } < t_{ k } (or j < k) and δ_{ j } = 1.
where 1_{a>b} = 1 if a > b, and 0 otherwise. Obviously, GAUCS is a measure to rank the patients' survival time. The perfect GAUCS = 1 indicates that the order of all patients' survival time are predicted correctly and GAUCS = 0.5 indicates for a completely random choice.
where λ is a penalized parameter controlling model complexity. Equation (5) is the maximum a posterior (MAP) estimator of w with Laplace prior provided we treat the sigmoid function as the pair-wise probability, i.e. Pr(M_{ j } > M_{ k }) = σ(w^{ T } (x_{ j } - x_{ k })). When p = 1, E_{ p } is a convex function. A global optimal solution is guaranteed.
The IGGAUCS Algorithm
In equation (8), the lower bound E(w, θ) is differentiable w.r.t both w and θ. We therefore propose a EM algorithm to maximize E(w, θ) w.r.t w while keeping θ fixed and maximize E(w, θ) w.r.t the variational parameter θ to tighten the variational bound while keeping w fixed. Convergence to the local optimum is guaranteed. Since maximization w.r.t the variational parameters 0 = (|θ_{1}|,|θ_{2}|,..., |θ_{ r }|), with w being fixed, can be solved with the stationary equation $\frac{\partial E(\text{w},\theta )}{\partial \left|{\theta}_{l}\right|}=0$ we have θ_{ l } = w_{ l }, for l = 1, 2,..., r.
Given r candidate pathways potentially associated with the survival time, m_{ l } survival associated genes with the expression of x_{ l } on each pathway l (l = 1, 2,..., r), and letting w = (w_{1}, w_{2},..., w_{ r })^{ T }be a vector of the corresponding coefficients and $g(\text{w})=\frac{{\partial}_{\text{w}}E(\text{w},\theta )}{\partial \text{w}}$, we have the following iterative gradient algorithm for E(w, θ) maximization:
The IGGAUCS Algorithm
Given p, λ, and ϵ = 10^{-6}, initializing ${\text{w}}^{1}={({\text{w}}_{1}^{1},{\text{w}}_{2}^{1},...,{\text{w}}_{r}^{1})}^{T}$ randomly with nonzero ${\text{w}}_{l}^{1},l=1,...,r$, and set θ^{1} = w^{1}.
Update w with θ fixed:
w^{t+1}= w^{ t }+ α^{ t }d^{ t }, where t: the number of iterations and α^{ t }: the step size, d^{ t } is updated with the conjugate gradient method:
d^{ t } = g(w^{ t }) + u^{ t }d^{ t } and ${u}^{t}=\frac{{[g({\text{w}}^{t})-g({\text{w}}^{t-1})]}^{T}g({\text{w}}^{t})}{g{({\text{w}}^{t-1})}^{T}g({\text{w}}^{t-1})}$.
Update θ with w fixed:
θ^{t+1}= w^{t+1}
Stop when |w^{t+1}- w^{ t }| < ϵ or maximal number of iterations exceeded.
Choice of Parameters
There are two parameters p and λ in this method, which can be determined through 10-fold cross validation. One efficient way is to set p = 0.1, 0.2,..., and 1 respectively, and search for an optimal λ for each p using cross validation. The best (p, λ) pair will be found with the maximal test GAUCS value. Theoretically when p = 1, E(w, θ) is convex and we can find the global maximum easily, but the solution is biased and small values of p would lead to better asymptotic unbiased solutions. Our results with limited experiments show that optimal p usually happens at a small p such as p = 0.1. For comparison purposes, we implement the popular Cox regression with group LASSO (GL_{1}Cox), since there is no software available in the literature. Our implementation is based on group LASSO penalized partial log-likelihood maximization. The best λ is searched from λ ϵ [0.1, 25] for IGGAUCS and from λ ϵ [0.1, 40] for GL_{1}Cox method with the step size of 0.1, as the L_{ p } penalty goes to zero much quicker than L_{1}. We suggest that the larger step size such as 0.5 can be used for most applications, since the test GAUCS does not change dramatically with a small change of λ.
Computational Results
Simulation Data
We first perform simulation studies to evaluate how well the IGGAUCS procedure performs when input data has a block structure. We focus on whether the important variable groups that are associated with survival outcomes can be selected using the IGGAUCS procedure and how well the model can be used for predicting the survival time for future patients. In our simulation studies, we simulate a data set with a sample size of 100 and 300 input variables with 100 groups (clusters). The triple variables x_{1} - x_{3}, x_{4} - x_{6}, x_{7} - x_{9},..., x_{298} - x_{300} within each group are highly correlated with a common correlation γ and there are no correlations between groups. We set γ = 0.1 for weak correlation, γ = 0.5 for moderate, and γ = 0.9 for strong correlation in each triple group and generate training and test data sets of sample size 100 with each γ respectively from a normal distribution with the band correlation structure. We assume that the first three groups(9 covariates) (x_{1} - x_{3}, x_{4} - x_{6}, x_{7} - x_{9}) are associated with survival and the 9 covariates are set to be w = [-2.9 2.1 2.4 1.6 -1.8 1.4 0.4 0.8 -0.5]^{ t }. With this setting, 3 covariates in the first group have the strongest association (largest covariate values) with survival time and 3 covariates in group 3 have less association with survival time. The survival time is generated with H = 100 exp(-w^{ T }x + ε) and the Weibull distribution, and the census time is generated from 0.8*median(time) plus a random noise. Based on this setting, we would expect about 25% - 35% censoring. To compare the performance of IGGAUCS and GL_{1}Cox, we build the model based on training data set and evaluate the model with the test data set. We repeat this procedure 100 times and use the time-independent GAUCS to assess the predictive performance.
Frequency of Three Survival Associated Groups Selected in 100 Replications
IGGAUCS/GL_{1}Cox | |||
---|---|---|---|
Parameters | γ = 0.1 | γ = 0.5 | γ = 0.9 |
w_{1} = -2.9 | |||
w_{2} = 2.1 | 100/100 | 100/100 | 100/100 |
w_{3} = 2.4 | |||
w_{4} = 1.6 | |||
w_{5} = -1.8 | 100/78 | 100/84 | 100/96 |
w_{6} = 1.4 | |||
w_{7} = 0.4 | |||
w_{8} = 0.8 | 47/2 | 53/4 | 94/24 |
w_{9} = -0.5 |
Test GAUCS of Simulated Data w ith Different Correlation Structures
Correlation | IGGAUCS | GL_{1}Cox |
---|---|---|
γ = 0.1 | 0.921(±0.023) | 0.897(±0.031) |
γ = 0.5 | 0.889(±0.021) | 0.871(±0.024) |
γ = 0.9 | 0.866(±0.017) | 0.828(±0.025) |
Follicular Lymphoma (FL) Data
Follicular lymphoma is a common type of Non-Hodgkin Lymphoma (NHL). It is a slow growing lymphoma that arises from B-cells, a type of white blood cell. It is also called an "indolent" or "low-grad" lymphoma for its slow nature, both in terms of its behavior and how it looks under the microscope. A study was conducted to predict the survival probability of patients with gene expression profiles of tumors at diagnosis [27].
Candidate Survival Associated Pathways
Pathways | # of Genes | Pathways | # of Genes |
---|---|---|---|
Propanoate metabolism | 5 | Melanoma | 10 |
Type II diabetes mellitus | 6 | Thyroid cancer | 5 |
Adipocytokine signaling pathway | 10 | Prostate cancer | 13 |
Melanogenesis | 13 | Glycolysis/Gluconeogenesis | 8 |
GnRH signaling pathway | 11 | Butanoate metabolism | 6 |
Insulin signaling pathway | 15 | Endometrial cancer | 11 |
Sphingolipid metabolism | 5 | Pancreatic cancer | 10 |
Glycerophospholipid metabolism | 9 | Colorectal cancer | 12 |
T cell receptor signaling pathway | 11 | RNA polymerase | 6 |
Hematopoietic cell lineage | 10 | Huntington's disease | 5 |
Glycerolipid metabolism | 6 | Focal adhesion | 20 |
Toll-like receptor signaling pathway | 13 | Apoptosis | 12 |
Antigen processing and presentation | 9 | Adherens junction | 10 |
Complement and coagulation cascades | 8 | Tryptophan metabolism | 7 |
ECM-receptor interaction | 14 | Histidine metabolism | 6 |
Wnt signaling pathway | 20 | Fatty acid metabolism | 10 |
Ubiquitin mediated proteolysis | 15 | Acute myeloid leukemia | 9 |
Neuroactive ligand-receptor interaction | 28 | Bladder cancer | 6 |
gamma-Hexachlorocyclohexane degradation | 5 | Focal adhesion | 20 |
Calcium signaling pathway | 21 | ErbB signaling pathway | 11 |
MAPK signaling pathway | 31 | PPAR signaling pathway | 16 |
Valine, leucine and isoleucine degradation | 6 | Glioma | 7 |
Pyrimidine metabolism | 12 | Chronic myeloid leukemia | 10 |
Glycan structures - degradation | 5 | Non-small cell lung cancer | 11 |
Porphyrin and chlorophyll metabolism | 7 |
Genes on pathway, relevance accounts, and estimated parameters
w _{ i } | GeneID | Gene Name |
---|---|---|
ECM-receptor interaction (relevance counts: 200) | ||
0.2239 | CD36 | cd36 antigen (collagen type i receptor, thrombospondin receptor) |
0.0409 | FNDC1 | fibronectin type iii domain containing 1 |
0.0746 | SV2C | synaptic vesicle glycoprotein 2c |
0.0804 | SDC1 | syndecan 1 |
-0.1255 | FN1 | fibronectin 1 |
0.0211 | LAMC1 | laminin, gamma 1 (formerly lamb2) |
-0.0854 | GP5 | glycoprotein v (platelet) |
-0.1130 | CD47 | cd47 antigen (rh-related antigen, integrin-associated signal transducer) |
-0.1296 | THBS2 | thrombospondin 2 |
-0.0547 | COL1A2 | collagen, type i, alpha 2 |
-0.1024 | COL5A2 | collagen, type v, alpha 2 |
0.0861 | LAMB4 | laminin, beta 4 |
-0.0315 | COL1A1 | collagen, type i, alpha 1 |
0.0395 | AGRN | agrin RG |
Focal adhesion (relevance counts 145) | ||
0.0054 | PAK3 | p21 (cdkn1a)-activated kinase 3 |
0.0446 | PIK3R3 | phosphoinositide-3-kinase, regulatory subunit 3 (p55, gamma) |
0.0044 | PDPK1 | 3-phosphoinositide dependent protein kinase-1 |
-0.0045 | BAD | bcl2-antagonist of cell death |
0.0087 | PARVA | parvin, alpha |
-0.0144 | FN1 | fibronectin 1 |
0.0041 | LAMC1 | laminin, gamma 1 (formerly lamb2) |
-0.0202 | PARVG | parvin, gamma |
-0.0158 | THBS2 | thrombospondin 2 |
0.0134 | PPP1R12A | protein phosphatase 1, regulatory (inhibitor) subunit 12a |
-0.0044 | SOS1 | son of sevenless homolog 1 (drosophila) |
-0.0084 | COL1A2 | collagen, type i, alpha 2 |
-0.0122 | COL5A2 | collagen, type v, alpha 2 |
0.0091 | LAMB4 | laminin, beta 4 RG Homo sapiens |
-0.0068 | RAF1 | v-raf-1 murine leukemia viral oncogene homolog 1 |
-0.0038 | ACTN1 | actinin, alpha 1 |
-0.0034 | COL1A1 | collagen, type i, alpha 1 |
-0.0067 | GSK3B | glycogen synthase kinase 3 beta |
-0.0065 | MAPK8 | mitogen-activated protein kinase 8 |
-0.0030 | MYL7 | myosin, light polypeptide 7, regulatory |
Neuroactive ligand-receptor interaction (relevance counts: 200) | ||
0.0894 | P2RY6 | pyrimidinergic receptor p2y, g-protein coupled, 6 |
-0.2753 | PTAFR | platelet-activating factor receptor |
0.1648 | GLRA3 | glycine receptor, alpha 3 |
0.0857 | FPRL1 | formyl peptide receptor-like 1 |
-0.1783 | EDNRA | endothelin receptor type a |
0.3233 | HRH4 | histamine receptor h4 |
0.2106 | GRM2 | glutamate receptor, metabotropic 2 |
-0.1112 | GRIN1 | glutamate receptor, ionotropic, n-methyl d-aspartate 1 |
-0.0220 | PTHR1 | parathyroid hormone receptor 1 |
0.0971 | OPRM1 | opioid receptor, mu 1 |
-0.4303 | CTSG | cathepsin g |
-0.0404 | P2RY8 | purinergic receptor p2y, g-protein coupled, 8 |
-0.0783 | BDKRB1 | bradykinin receptor b1 |
0.3247 | FSHR | follicle stimulating hormone receptor |
-0.1430 | ADRA1B | adrenergic, alpha-1b-, receptor |
0.1464 | C3AR1 | complement component 3a receptor 1 |
0.1120 | P2RX2 | purinergic receptor p2x, ligand-gated ion channel, 2 |
0.0311 | AVPR1B | arginine vasopressin receptor 1b |
0.2646 | FPR1 | formyl peptide receptor 1 |
0.2003 | GABRA5 | gamma-aminobutyric acid (gaba) a receptor, alpha 5 |
-0.0278 | PRLR | prolactin receptor |
-0.1070 | ADORA1 | adenosine a1 receptor |
0.2652 | HTR7 | 5-hydroxytryptamine (serotonin) receptor 7 (adenylate cyclase-coupled) |
-0.0194 | GABRA4 | gamma-aminobutyric acid (gaba) a receptor, alpha 4 |
0.0145 | GHRHR | growth hormone releasing hormone receptor |
-0.3163 | MAS1 | mas1 oncogene |
-0.0760 | PTGER3 | prostaglandin e receptor 3 (subtype ep3) |
0.2196 | PARD3 | par-3 partitioning defective 3 homolog (c. elegans) |
Ubiquitin mediated proteolysis (relevance counts: 200) | ||
0.0186 | UBE2B | ubiquitin-conjugating enzyme e2b (rad6 homolog) |
-0.0677 | CUL4A | cullin 4a |
0.0051 | PML | promyelocytic leukemia |
-0.1023 | UBE3B | ubiquitin protein ligase e3b |
-0.1581 | UBE3C | ubiquitin protein ligase e3c |
0.1390 | BTRC | beta-transducin repeat containing |
-0.0669 | HERC3 | hect domain and rld 3 |
0.00009 | RBX1 | ring-box 1 |
0.0011 | CUL5 | cullin 5 |
0.0267 | ANAPC4 | anaphase promoting complex subunit 4 |
-0.0253 | UBE2L3 | ubiquitin-conjugating enzyme e2l 3 |
0.0096 | KEAP1 | kelch-like ech-associated protein 1 |
-0.0267 | UBE2E1 | ubiquitin-conjugating enzyme e2e 1 (ubc4/5 homolog, yeast) |
0.0116 | CBL | cas-br-m (murine) ecotropic retroviral transforming sequence |
0.0328 | BIRC6 | baculoviral iap repeat-containing 6 (apollon) |
Porphyrin and chlorophyll metabolism (relevance counts: 185) | ||
0.0330 | BLVRA | biliverdin reductase a |
-0.0103 | FTH1 | ferritin, heavy polypeptide 1 |
0.0227 | ALAD | aminolevulinate, delta-, dehydratase |
0.1983 | HMOX1 | heme oxygenase (decycling) 1 |
0.0070 | UROS | uroporphyrinogen iii synthase (congenital erythropoietic porphyria) |
-0.1596 | GUSB | glucuronidase, beta |
0.0077 | UGT2B15 | udp glucuronosyltransferase 2 family, polypeptide b15 |
Calcium signaling pathway (relevance counts: 200) | ||
-0.0025 | BST1 | bone marrow stromal cell antigen 1 |
0.0003 | BDKRB1 | bradykinin receptor b1 |
-0.0016 | PTAFR | platelet-activating factor receptor |
-0.0002 | ADRA1B | adrenergic, alpha-1b-, receptor |
-0.0007 | PPP3CC | protein phosphatase 3 (formerly 2b), catalytic subunit, gamma isoform |
-0.0001 | ADCY7 | adenylate cyclase 7 |
0.0008 | GNA11 | guanine nucleotide binding protein (g protein), alpha 11 (gq class) |
-0.0013 | AVPR1B | arginine vasopressin receptor 1b |
0.0014 | P2RX2 | purinergic receptor p2x, ligand-gated ion channel, 2 |
-0.0013 | CACNA1E | calcium channel, voltage-dependent, alpha 1e subunit |
0.0004 | EDNRA | endothelin receptor type a |
0.00009 | SLC8A1 | solute carrier family 8 (sodium/calcium exchanger), member 1 |
0.0006 | CACNA1B | calcium channel, voltage-dependent, l type, alpha 1b subunit |
-0.0013 | PLCD1 | phospholipase c, delta 1 |
-0.0029 | HTR7 | 5-hydroxytryptamine (serotonin) receptor 7 (adenylate cyclase-coupled) |
0.0014 | GRIN1 | glutamate receptor, ionotropic, n-methyl d-aspartate 1 |
0.0028 | CAMK2A | calcium/calmodulin-dependent protein kinase (cam kinase) ii alpha |
-0.0024 | CACNA1I | calcium channel, voltage-dependent, alpha 1i subunit |
-0.0002 | TNNC1 | troponin c type 1 (slow) |
0.0009 | PTGER3 | prostaglandin e receptor 3 (subtype ep3) |
0.0012 | CACNA1F | calcium channel, voltage-dependent, alpha 1f subunit |
Fatty acid metabolism (relevance counts: 192) | ||
0.0043 | ACSL3 | acyl-coa synthetase long-chain family member 3 |
0.0675 | ALDH2 | aldehyde dehydrogenase 2 family (mitochondrial) |
-0.0491 | ACAT2 | acetyl-coenzyme a acetyltransferase 2 (acetoacetyl coenzyme a thiolase) |
0.0120 | ALDH1B1 | aldehyde dehydrogenase 1 family, member b1 |
0.0597 | CYP4A11 | cytochrome p450, family 4, subfamily a, polypeptide 11 |
-0.1447 | ACADSB | acyl-coenzyme a dehydrogenase, short/branched chain |
0.0736 | CPT1A | carnitine palmitoyltransferase 1a (liver) |
0.1075 | CPT1B | carnitine palmitoyltransferase 1b (muscle) |
-0.0366 | ACADVL | acyl-coenzyme a dehydrogenase, very long chain |
0.2168 | ADH4 | alcohol dehydrogenase 4 (class ii), pi polypeptide |
MAPK signaling pathway (relevance counts: 200) | ||
-0.1570 | PLA2G10 | phospholipase a2, group x |
-0.2415 | MAPKAPK5 | mitogen-activated protein kinase-activated protein kinase 5 |
-0.1636 | IL1B | interleukin 1, beta |
-0.0651 | ZAK | sterile alpha motif and leucine zipper containing kinase azk |
0.0572 | PPP3CC | protein phosphatase 3 (formerly 2b), catalytic subunit, gamma isoform |
-0.0674 | MAP3K2 | mitogen-activated protein kinase kinase kinase 2 |
-0.1729 | JUND | jun d proto-oncogene |
-0.1718 | SOS1 | son of sevenless homolog 1 (drosophila) |
0.1082 | FGF14 | fibroblast growth factor 14 |
0.3102 | PTPN5 | protein tyrosine phosphatase, non-receptor type 5 |
-0.3903 | CACNB1 | calcium channel, voltage-dependent, beta 1 subunit |
0.2678 | MAP3K7 | mitogen-activated protein kinase kinase kinase 7 |
0.3176 | CACNG8 | calcium channel, voltage-dependent, gamma subunit 8 |
0.0893 | FGF19 | fibroblast growth factor 19 |
-0.0853 | RRAS2 | related ras viral (r-ras) oncogene homolog 2 |
0.0215 | NLK | nemo-like kinase |
0.0452 | MAP4K4 | mitogen-activated protein kinase kinase kinase kinase 4 |
0.1639 | CACNA1E | calcium channel, voltage-dependent, alpha 1e subunit |
-0.0290 | ARRB1 | arrestin, beta 1 |
-0.1169 | STK4 | serine/threonine kinase 4 |
0.1008 | CACNA1B | calcium channel, voltage-dependent, l type, alpha 1b subunit |
0.0839 | MOS | v-mos moloney murine sarcoma viral oncogene homolog |
-0.1244 | MEF2C | mads box transcription enhancer factor 2, polypeptide c |
0.1572 | RAF1 | v-raf-1 murine leukemia viral oncogene homolog 1 |
0.1757 | MAPK8IP1 | mitogen-activated protein kinase 8 interacting protein 1 |
0.2908 | IKBKB | inhibitor of kappa light polypeptide gene enhancer in b-cells, kinase beta |
-0.3452 | CACNA1I | calcium channel, voltage-dependent, alpha 1i subunit |
-0.2473 | MAPK8 | mitogen-activated protein kinase 8 |
-0.2339 | CACNA1F | calcium channel, voltage-dependent, alpha 1f subunit |
0.0979 | CD14 | cd14 antigen |
-0.1047 | MRAS | muscle ras oncogene homolog |
Pathway Ranks
Pathway | ∑_{ j }|w_{ j }|/L | Rank |
---|---|---|
MAPK signaling pathway | 0.1614 | 1 |
Neuroactive ligand-receptor interaction | 0.1562 | 2 |
ECM-receptor interaction | 0.0863 | 3 |
Fatty acid metabolism | 0.0772 | 4 |
Porphyrin and chlorophyll metabolism | 0.0627 | 5 |
Ubiquitin mediated proteolysis | 0.0494 | 6 |
Focal adhesion | 0.0100 | 7 |
Calcium signaling pathway | 0.0012 | 8 |
Genes in red color are highly expressed in patients with aggressive FL and genes in yellow are highly expressed in the earlier stage of FL cancers. Many important cancer related genes are identified with our methods. For example, SOS1, one of the RAS genes (e.g., MIM 190020), encodes membrane-bound guanine nucleotide-binding proteins that function in the transduction of signals that control cell growth and differentiation. Binding of GTP activates RAS proteins, and subsequent hydrolysis of the bound GTP to GDP and phosphate inactivates signaling by these proteins. GTP binding can be catalyzed by guanine nucleotide exchange factors for RAS, and GTP hydrolysis can be accelerated by GTPase-activating proteins (GAPs). SOS1 plays a crucial role in the coupling of RTKs and also intracellular tyrosine kinases to RAS activation. The deregulation of receptor tyrosine kinases (RTKs) or intracellular tyrosine kinases coupled to RAS activation has been involved in the development of a number of tumors, such as those in breast cancer, ovarian cancer and leukemia. Another gene, IL1B, is one of a group of related proteins made by leukocytes (white blood cells) and other cells in the body. IL1B, one form of IL1, is made mainly by one type of white blood cell, the macrophage, and helps another type of white blood cell, the lymphocyte, fight infections. It also helps leukocytes pass through blood vessel walls to sites of infection and causes fever by affecting areas of the brain that control body temperature. IL1B made in the laboratory is used as a biological response modifier to boost the immune system in cancer therapy.
As shown in Figure 2, the genes SOS1, IL1B, RAS, CACNB1, MEF2C, JUND, and MAPKAPK5 are highly expressed in patients who were diagnosed earlier and lived longer and the genes FGF14, PTPN5, MOS, RAF1, CD14 are highly expressed in patients who were diagnosed at more aggressive stages and died earlier, which may indicate that oncogenes such such SOS1, JUND, and RAS may initialize FL cancer and genes such as MOS, IKK, and CD14 may cause FL cancer to be more aggressive. There are several causal relations among the identified genes on MAPK. For instance, the down-expressed SOS and RAS cause the up-expressed RAF1 and MOS and the up-stream gene IL1 is coordinately expressed with CASP and the gene MST1/2.
Conclusions
Since a large amount of biological information on various aspects of systems and pathways is available in public databases, we are able to utilize this information in modeling genomic data and identifying pathways and genes and their interactions that might be related to patient survival. In this study, we have developed a novel iterative gradient algorithm for group L_{ p } penalized global AUC summary (IGGAUCS) maximization methods for gene and pathway identification, and for survival prediction with right censored survival data and high dimensional gene expression profile. We have demonstrated the applications of the proposed method with both simulation and the FL cancer data set. Empirical studies have shown the proposed approach is able to identify a small number of pathways with nice prediction performance. Unlike traditional statistical models, the proposed method naturally incorporates biological pathways information and it is also different from the commonly used Gene Set Enrichment Analysis (GSEA) in that it simultaneously considers multiple pathways associated with survival phenotypes.
With comprehensive knowledge of pathways and mammalian biology, we can greatly reduce the hypothesis space. By knowing the pathway and the genes that belong to particular pathways, we can limit the number of genes and gene-gene interactions that need to be considered in modeling high dimensional microarray data. The proposed method can efficiently handle thousands of genes and hundreds of pathways as shown in our analysis of the FL cancer data set.
There are several directions for our future investigations. For instance, we may want to further investigate the sensitivity of the proposed methods to the misspecification of the pathway information and misspecification of the model. We may also extend our method to incorporate gene-gene interactions and gene (pathway)- environmental interactions.
Even though we have only applied our methods to gene expression data, it is straightforward to extend our methods to SNP, miRNA CGH, and other genomic data without much modification.
Declarations
Acknowledgements
We thank the Associate Editor and the two anonymous referees for their constructive comments which helped improve the manuscript. ZL was partially supported by grant 1R03CA133899-01A210 from the National Cancer Institute.
Authors’ Affiliations
References
- Zou H: The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association. 2006, 101 (476): 1418-1429. 10.1198/016214506000000735View ArticleGoogle Scholar
- Kanehisa L, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2002, 28: 27-30. 10.1093/nar/28.1.27View ArticleGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996, 58 (1): 267-288.Google Scholar
- Tibshirani R: The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997, 16 (4): 385-95. 10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3PubMedView ArticleGoogle Scholar
- Gui J, Li H: Variable Selection via Non-concave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, Theory and Methods. 2001, 96: 456-Google Scholar
- Van Houwelingen H, Bruinsma T, Hart A, Van't Veer L, Wessels L: Cross-validated Cox regression on microarray gene expression data. Stat Med. 2006, 25: 3201-3216. 10.1002/sim.2353PubMedView ArticleGoogle Scholar
- Segal M: Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics. 2006, 7: 268-285. 10.1093/biostatistics/kxj006PubMedView ArticleGoogle Scholar
- Ma S, Huang J: Additive risk survival model with microarray data. BMC Bioinformatics. 2007, 8: 192- 10.1186/1471-2105-8-192PubMedPubMed CentralView ArticleGoogle Scholar
- Liu Z, Jiang F: Gene identification and survival prediction with Lp Cox regression and novel similarity measure. Int J Data Min Bioinform. 2009, 3 (4): 398-408. 10.1504/IJDMB.2009.029203PubMedView ArticleGoogle Scholar
- Park M, Hastie T: L_{1} regularization path algorithm for generalized linear models. J R Stat Soc B. 2007, 69: 659-677. 10.1111/j.1467-9868.2007.00607.xView ArticleGoogle Scholar
- Sohn I, Kim J, Jung S, Park C: Gradient Lasso for Cox Proportional Hazards Model. Bioinformatics. 2009, 25 (14): 1775-1781. 10.1093/bioinformatics/btp322PubMedView ArticleGoogle Scholar
- Heagerty P, Zheng Y: Survival model predictive accuracy and ROC curves. Biometrics. 2005, 61 (1): 92-105. 10.1111/j.0006-341X.2005.030814.xPubMedView ArticleGoogle Scholar
- Tian L, Greenberg S, Kong S, Altschuler J, Kohane I, Park P: Discovering statistically significant pathways in expression profiling studies. PNAS. 2005, 103: 13544-13549. 10.1073/pnas.0506577102View ArticleGoogle Scholar
- Wei Z, Li H: Nonparametric pathways-based regression models for analysis of genomic data. Biostatistics. 2007, 8 (2): 265-284. 10.1093/biostatistics/kxl007PubMedView ArticleGoogle Scholar
- Pepe M: The Statistical Evaluation of Medical Tests for Classification and Prediction. 2003, Oxford: Oxford University Press,Google Scholar
- Pepe M: Evaluating technologies for classification and prediction in medicine. Stat Med. 2005, 24 (24): 3687-3696. 10.1002/sim.2431PubMedView ArticleGoogle Scholar
- Liu Z, Gartenhaus R, Chen X, Howell C, Tan M: Survival Prediction and Gene Identification with Penalized Global AUC Maximization, Journal of Computational Biology. Journal of Computational Biology. 2009, 16 (12): 1661-1670. 10.1089/cmb.2008.0188PubMedPubMed CentralView ArticleGoogle Scholar
- Meier L, van de Geer S, Buhlmann P: The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008, 70 (1): 53-71(19). 10.1111/j.1467-9868.2007.00627.xView ArticleGoogle Scholar
- Bach F: Consistency of the Group Lasso and Multiple Kernel Learning. The Journal of Machine Learning Research. 2008, 9: 1179-1225.Google Scholar
- Ma S, Song X, Huang J: Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics. 2007, 8: 60- 10.1186/1471-2105-8-60PubMedPubMed CentralView ArticleGoogle Scholar
- Zou H, Li R: One-step sparse estimates in non-concave penalized likelihood models. The Annals of Statistics. 2008, 36 (4): 1509-1533. 10.1214/009053607000000802PubMedView ArticleGoogle Scholar
- Liu Z, Tan M: ROC-Based Utility Function Maximization for Feature Selection and Classification with Applications to High-Dimensional Protease Data. Biometrics. 2008, 64 (4): 1155-1161. 10.1111/j.1541-0420.2008.01015.xPubMedView ArticleGoogle Scholar
- Jordan M, Ghahramani Z, Jaakkola T, Saul L: An Introduction to Variational Methods for Graphical models. Learning in Graphical Models. Edited by: Jordan M. 1998, Cambridge: The MIT Press,View ArticleGoogle Scholar
- Kaban A, Durrant R: Learning with L_{ q } < 1 vs L_{1}-norm regularization with exponentially many irrelevant features. Proc of the 19th European Conference on Machine Learning (ECML08). 2008, 15-19.Google Scholar
- Fan J, Li R: Penalized Cox Regression Analysis in the High-Dimensional and Low-sample Size Settings, with Applications to Microarray Gene Expression Data. Bioinformatics. 2005, 21: 3001-3008. 10.1093/bioinformatics/bti422View ArticleGoogle Scholar
- Hastie T, Tibashirani DR, Botstein , Brown P: Supervised harvesting of expression trees. Genome Biology. 2001, 2: 3.1-3.12. 10.1186/gb-2001-2-1-research0003View ArticleGoogle Scholar
- Dave S, Wright G, Tan B, Rosenwald A, Gascoyne R, Chan W, Fisher R, Braziel R, Rimsza L, Grogan T, Miller T, LeBlanc M, Greiner T, Weisenburger D, Lynch J, Vose J, Armitage J, Smeland E, Kvaloy S, Holte H, Delabie J, Connors J, Lansdorp P, Ouyang Q, Lister T, Davies A, Norton A, Muller-Hermelink H, Ott G, Campo E, Montserrat E, Wilson W, Jaffe E, Simon R, Yang L, Powell J, Zhao H, Goldschmidt N, Chiorazzi M, Staudt L: Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. N Engl J Med. 2004, 351 (21): 2159-2169. 10.1056/NEJMoa041869PubMedView ArticleGoogle Scholar
- Liu Z, Jiang F, Tian G, Wang S, Sato F, Meltzer S, Tan M: Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology. 2007, 6 (1): Article 6-10.2202/1544-6115.1248. 10.2202/1544-6115.1248View ArticleGoogle Scholar
- Elenitoba-Johnson K, Jenson S, Abbott R, Palais R, Bohling S, Lin Z, Tripp S, Shami P, Wang L, Coupland R, Buckstein R, Perez-Ordonez B, Perkins S, Dube I, Lim M: Involvement of multiple signaling pathways in follicular lymphoma transformation: p38-mitogen-activated protein kinase as a target for therapy. Proc Natl Acad Sci USA. 2003, 100 (12): 7259-64. 10.1073/pnas.1137463100PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.