- Research
- Open Access
A markov classification model for metabolic pathways
- Timothy Hancock^{1}Email author and
- Hiroshi Mamitsuka^{1}
https://doi.org/10.1186/1748-7188-5-10
© Hancock and Mamitsuka; licensee BioMed Central Ltd. 2010
- Received: 12 August 2009
- Accepted: 4 January 2010
- Published: 4 January 2010
Abstract
Background
This paper considers the problem of identifying pathways through metabolic networks that relate to a specific biological response. Our proposed model, HME3M, first identifies frequently traversed network paths using a Markov mixture model. Then by employing a hierarchical mixture of experts, separate classifiers are built using information specific to each path and combined into an ensemble prediction for the response.
Results
We compared the performance of HME3M with logistic regression and support vector machines (SVM) for both simulated pathways and on two metabolic networks, glycolysis and the pentose phosphate pathway for Arabidopsis thaliana. We use AltGenExpress microarray data and focus on the pathway differences in the developmental stages and stress responses of Arabidopsis. The results clearly show that HME3M outperformed the comparison methods in the presence of increasing network complexity and pathway noise. Furthermore an analysis of the paths identified by HME3M for each metabolic network confirmed known biological responses of Arabidopsis.
Conclusions
This paper clearly shows HME3M to be an accurate and robust method for classifying metabolic pathways. HME3M is shown to outperform all comparison methods and further is capable of identifying known biologically active pathways within microarray data.
Keywords
- Support Vector Machine
- Metabolic Network
- Correct Classification Rate
- Dominant Pathway
- Iterative Reweighted Little Square
Background
Networks are a natural way of understanding complex processes involving interactions between many variables. Visualizing a process as a network allows the researcher to form an intuitive understanding of complex phenomena. A clear example of the effective use of networks is the visualization of metabolic networks to provide a detailed map of key chemical reactions and their genetic dependencies that occur within a cell. However the size and complexity of metabolic networks has increased to the point where the ability to understand the entire network is lost. Researchers must now rely on models of the network structure to capture the key functional components that relate to an observed response. In this paper we propose a model capable of identifying the key pathways through metabolic networks that are related to a specific biological response.
Metabolic networks, as described in databases such as KEGG [1], can be represented as directed graphs, with the vertices denoting the compounds and the edges labeled by the reactions. The reactions within metabolic networks are catalyzed by specific genes. If a gene is active, then it is possible for the corresponding reaction to occur. If a reaction is active then a pathway is created between two metabolic compounds that is labeled by the gene that catalyzed the reaction. Information about the activity of genes within metabolic networks can be readily obtained from microarray experiments. Microarray experiments are then used to view differences in gene activity under varying experimental conditions such as (y = 1) patients treated with drug A and (y = 2) patients treated with drug B. The question asked by such experiments is: are there any gene pathways that are differentially expressed when patients are given drug A or B? The abundance of publicly available microarray expression observations found in databases such as ArrayExpress [2] along with the detailed biological knowledge contained within pathway databases like KEGG, has spurred biologists to want to combine these two sources of information and model the metabolic network dynamics under different experimental conditions.
This paper proposes a novel classification model for identifying frequently observed paths within a specified network structure that can be used to classify known response classes. Our proposed model is a probabilistic combination of a Markov mixture model which identifies frequently observed pathway clusters and an ensemble of supervised techniques each trained locally within each pathway cluster to classify the response. We require the prior specification of the metabolic network, gene expression data and response variable that labels the experimental conditions of interest.
Graphical methods such as Bayesian networks present a framework capable of modeling a network structure imposed upon a dataset [5]. Bayesian networks search for the most likely network configuration by drawing edges connecting dependent variables. However, when considering mining the dominant paths within a known network such an approach may not be the most direct solution. For example constructing a Bayesian network of a metabolic pathway will join related genes by assuming a conditional dependence between each gene and its parent genes within the network. This dependency is valid when considering problems concerning the prediction of unknown structure [6, 7] though may be inappropriate for the prediction of frequently observed paths through a known network structure. To predict frequently observed paths, a more natural assumption is accommodated by Markov methods which assume that the decision on the next step taken along a path only requires information on the current and next set of genes within the network.
Hidden Markov Models (HMM) are commonly used for identifying structure within sequence information [8]. HMMs assume that the nodes of the network are unknown and the observed sequences are a direct result of transition between these hidden states. However, if the network structure is known, a more direct approach is available through a mixture of Markov chains. Markov mixture models such as 3M [9] directly search for dominant pathways within sequence data by assuming each mixture component is a Markov chain through a known network structure. For metabolic networks, Markov mixture models, such as 3M, have been shown to provide an accurate and highly interpretable model of dominant pathways throughout a known network structure. However, both HMM and 3M are unsupervised models and therefore are not able to direct their search to explicitly uncover pathways that relate to specific experimental conditions.
The creation of a supervised classification technique that exploits the intuitive nature of Markov mixture models would be a powerful interpretable tool for biologists to analyze network pathways. In this paper we propose a supervised version of the 3M model using the Hierarchical Mixture of Experts (HME) framework [10]. We choose the mixture of experts framework as our supervised model because it provides a complete probabilistic framework for localizing a classification model to specific clusters within a dataset. Our proposed model, called HME3M employs a HME to combine the 3M with penalized logistic regressions classifiers as the experts within each cluster to classify the response.
Experiments
Our problem has the following inputs: the network structure, microarray observations and a response variable. A pathway through the network, x_{ i }, is assumed to be a binary vector, where a 1 indicates a traversed edge and 0 represents a non-traversed edge. The decision on which edges can be traversed is made for each microarray observation based on the expression of each gene. Once the set of valid edges have been defined, for each microarray observation all valid pathways are extracted. After extracting all observed pathways we label each path with the response label of the original microarray experiment. Once this is completed for all observations it is possible to set up a supervised classification problem where the response vector y denotes the response label of each pathway, and the predictor matrix X is an N × P binary matrix of pathways, where N is the number of pathways and P is the number of edges within the network. The binary predictor matrix, X and its response y can now be directly analyzed by our proposed pathway classifier, HME3M, and also with standard supervised techniques. We assess the performance of HME3M in both simulated and real data environments and compare it to PLR and Support Vector Machines (SVM) with three types of kernels, linear, polynomial (degree = 3) and radial basis. The implementation of SVM used for these experiments is sourced from the R package e1071 [11].
We point out here that the predictor matrix X is a list of all pathways through the network observed within the original dataset. Therefore X contains all available information on the given network structure contained within the original dataset. Using this information as input into the PLR and SVM models is supplying these methods with the same network information that is provided to the HME3M model. As the supplied information is the same for all models the comparison is fair. The performance of the models are expected to differ because SVM and PLR do not consider the Markov nature of the input pathways whereas HME3M explicitly models this property with a first order Markov mixture model.
Experiments comparing HME3M to standard classification techniques are performed first on simulated network pathways and then on real metabolic pathways and microarray expression data. We now describe the details of each experiment.
Synthetic Data
To construct the simulation experiments we assume that the dataset is comprised of dominant pathways that define the groups and random noise pathways. To ensure that the pathway structure is the major information within the dataset, we specify the network structure and simulate only the binary pathway information. A dominant pathway is defined as a frequently observed path within a response class. The level of expression of a dominant pathway is defined to be the number of times it is observed within a group. A noise pathway is defined to be a valid pathway within the network that leads from the start to the end compounds but is not any of the specified dominant pathways. As the percent of noise increases, the relative expression of the dominant paths decreases, making correct classification harder.
KEGG Networks
KEGG Arabidopsis Glycolysis Pathway
In Figure 3 we extract from KEGG the core component of the glycolysis network for Arabidopsis between C00668 (Alpha-D-Glucose) and C00022 (Pyruvate). The extracted network in Figure 3 is a significantly more complex graph than our simulated designs and has 103680 possible pathways between C00668 and C00022. We extract the gene expression observations for all genes on this pathway from the AltGenExpress development series microarray expression data [12] downloaded from the ArrayExpress database [2]. The AltGenExpress development database [12] is a microarray expression record of each stage within the growth cycle of Arabidopsis and contains expression observations of 22814 genes over 79 replicated conditions. For our purposes we extract observations for "rosette leaf" (n = 21) and "flower" (n = 15) and specify "flower" to be target class (y = 1) and "rosette leaf" to be the comparison class (y = 0). For the glycolysis experiment we set the HME3M parameters to be: λ = 1 and α = 0.7.
Number of pathways extracted for the Arabidopsis glycolysis network for each gene activity tolerance.
Expression Tolerance | Flower Pathways | Rosette Leaf Pathways | Total |
---|---|---|---|
-0.1 | 12720 | 32664 | 45384 |
0 | 4288 | 20608 | 24896 |
0.1 | 3024 | 14496 | 17520 |
KEGG Arabidopsis Pentose Phosphate Pathway
In Figure 4 we extract from KEGG the core component of the pentose phosphate network for Arabidopsis between C00668 (Alpha-D-Glucose) and C00118 (D-Glyceraldehyde 3-Phosphate). The extracted network is more complex again than the glycolysis network and has 1305924 possible pathways between C00668 and C00118. We extract the gene expression observations for all genes on this pathway from the AltGenExpress abiotic stress microarray expression data [13].
The AltGenExpress abiotic stress database [12] contains gene expression measurements on the responses of the "Shoots" or "Roots" of Arabidopsis to various stress stimuli. For our purposes we extract observations for Arabidopsis "Shoots" in both the oxidative stress and control groups for all observed times from 0.25 to 3 hours. This results in six experiments from the "Oxidative" (n = 6) and 10 experiments from the "Control" (n = 10) and we specify "Oxidative" to be target class (y = 1) and "Control" to be the comparison class (y = 0).
We select this particular subset of the AltGenExpress abiotic stress as observations on the metabolite abundance for the pentose phosphate pathway [14] clearly show that within the first 3 hours of exposure to oxidative stress a significant increase in the abundance of C00117 (D-Ribose 5-phosphate) is observed. In [14] it was suggested that this increase was a result of an increase in the flux through the oxidative branch of the pentose phosphate pathway (Figure 4). In this paper we try to confirm this observation within the AltGenExpress abiotic stress with HME3M.
Number of pathways extracted for the Arabidopsis pentose phosphate network for each gene activity tolerance.
Expression Tolerance | Oxidative Pathways | Control Pathways | Total |
---|---|---|---|
-0.1 | 157468 | 48557 | 206025 |
-0.05 | 84503 | 44320 | 128823 |
0 | 19225 | 43777 | 63002 |
0.05 | 42846 | 14422 | 57268 |
0.1 | 10086 | 8935 | 19021 |
Results and Discussion
Synthetic Data
The median and range of the 10 × 10-fold correct classification rates (CCR) for all simulation experiments.
Percent Within Group Noise | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Graph | Model | M = 2 | M = 3 | ||||||||
0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | ||
Small | HME3M | 96.79 | 90.94 | 85.79 | 79.80 | 74.97 | 96.14 | 92.64 | 86.46 | 79.86 | 76.13 |
CCR Range | 4.15 | 6.99 | 4.49 | 7.42 | 6.73 | 3.36 | 7.69 | 3.98 | 16.83 | 7.36 | |
PLR | 50.67* | 50.98* | 50.68* | 50.35* | 50.96* | 50.83* | 50.70* | 50.58* | 50.62* | 50.70* | |
CCR Range | 1.53 | 1.54 | 2.13 | 1.66 | 1.55 | 1.11 | 1.71 | 1.35 | 1.42 | 0.68 | |
SVM (linear) | 95.14* | 90.48* | 85.28 | 78.94 | 72.53* | 94.77* | 89.01* | 85.16* | 78.66 | 75.21 | |
CCR Range | 0.72 | 2.31 | 3.16 | 2.61 | 7.02 | 1.61 | 4.39 | 2.72 | 11.55 | 14.54 | |
SVM (polynomial) | 95.25* | 90.05* | 84.19* | 79.42 | 75.03 | 94.91* | 89.71* | 84.82* | 78.66 | 75.73 | |
CCR Range | 1.52 | 1.18 | 3.74 | 3.15 | 6.92 | 2.33 | 3.67 | 4.38 | 3.09 | 8.41 | |
SVM (radial) | 95.28* | 89.73* | 84.19* | 79.39 | 75.07 | 94.37 | 89.48* | 84.82* | 78.94 | 75.35 | |
CCR Range | 1.53 | 2.81 | 4.20 | 2.61 | 8.02 | 2.32 | 4.53 | 3.28 | 4.14 | 7.29 | |
Medium | HME3M | 98.94 | 94.29 | 88.68 | 80.78 | 77.05 | 98.80 | 96.22 | 90.11 | 84.02 | 77.89 |
CCR Range | 2.92 | 6.18 | 5.73 | 12.51 | 5.56 | 2.40 | 5.59 | 10.57 | 8.98 | 8.76 | |
PLR | 50.48* | 50.62* | 50.68* | 50.48* | 50.55* | 50.55* | 50.35* | 50.62* | 50.55* | 50.48* | |
CCR Range | 2.04 | 0.99 | 1.30 | 2.13 | 1.23 | 0.97 | 0.69 | 1.63 | 1.31 | 1.13 | |
SVM (linear) | 94.68* | 89.80* | 83.91 | 79.64* | 76.02 | 94.28* | 89.52* | 84.80* | 80.19* | 73.41* | |
CCR Range | 2.74 | 22.91 | 3.56 | 4.17 | 4.12 | 1.72 | 2.60 | 5.17 | 2.89 | 5.40 | |
SVM (polynomial) | 94.92* | 90.30* | 84.53* | 79.32* | 75.07* | 94.22* | 90.35* | 84.53* | 79.72* | 75.21* | |
CCR Range | 2.73 | 21.22 | 4.00 | 5.80 | 2.83 | 1.65 | 3.69 | 6.85 | 2.85 | 4.91 | |
SVM (radial) | 94.60* | 90.30* | 84.52* | 79.28* | 74.70* | 94.19* | 89.68* | 84.31* | 80.72* | 74.80* | |
CCR Range | 2.45 | 4.35 | 2.82 | 2.99 | 4.56 | 1.90 | 3.49 | 4.67 | 2.93 | 3.59 | |
Large | HME3M | 99.54 | 97.36 | 93.29 | 83.73 | 77.88 | 99.39 | 97.52 | 94.08 | 84.78 | 83.11 |
CCR Range | 2.96 | 5.00 | 9.44 | 10.17 | 9.97 | 1.77 | 6.12 | 10.23 | 10.05 | 11.56 | |
PLR | 50.69* | 50.27* | 50.49* | 50.56* | 50.62* | 50.41* | 50.70* | 50.34* | 50.83* | 50.82* | |
CCR Range | 1.40 | 1.27 | 1.69 | 1.69 | 0.82 | 1.11 | 1.39 | 1.51 | 0.98 | 1.09 | |
SVM (linear) | 94.91* | 89.31* | 85.58* | 79.55* | 73.79* | 94.23* | 89.34* | 85.30* | 79.67* | 74.34* | |
CCR Range | 0.76 | 3.29 | 4.66 | 3.72 | 4.51 | 2.24 | 2.18 | 2.68 | 2.77 | 7.30 | |
SVM (polynomial) | 94.78* | 89.45* | 85.80* | 78.67* | 74.05* | 94.28* | 89.86* | 84.69* | 79.44* | 74.33* | |
CCR Range | 2.22 | 2.87 | 4.94 | 3.46 | 4.60 | 2.24 | 15.39 | 3.24 | 4.66 | 4.86 | |
SVM (radial) | 94.98* | 89.74* | 85.09* | 79.36* | 74.21* | 94.62* | 89.83* | 84.47* | 79.56* | 74.62* | |
CCR Range | 1.43 | 4.00 | 5.29 | 2.62 | 2.66 | 1.69 | 2.39 | 2.28 | 4.99 | 6.10 |
The performance of PLR for the simulated pathways is particularly poor because the dataset is noisy and binary. PLR can only optimize on these noisy binary variables and is supplied with no additional information such as the kernels of the SVM models and the pathway information of HME3M. Additionally, the L2 ridge penalty is not a severe regularization and will estimate coefficients for pure noise pathway edges. Combining the lack of information within the raw binary variables with the nature of L2 regularization, it is clear in this case that PLR will overfit and lead to poor performance.
Table 3 also demonstrates that as you increase the number of mixture components in the HME3M model, M, the model's resistance to noise increases. The increased robustness of HME3M is observed in the increase in median performance from M = 2 to M = 3 when the noise levels are 30% or more (≥ 0.3). A supporting observation of particular note is that when the performances of HME3M with M = 2 is compared with the linear kernel SVM on the medium graph and 50% noise there is no significant difference between the model's performances. However, by increasing M to 3, HME3M is observed to significantly outperform linear kernel SVM. Further, in a similar but less significant case, for the small graph with 50% added noise, by increasing M from 2 to 3 the median performance of HME3M becomes greater than that of linear kernel SVM. Although this increase did not prove to be significant the observed increasing trend within the median performance is clearly driving the results of the t-test.
It is noticeable in Table 3 that the HME3M performance can be less precise than SVM or PLR models. However the larger range of CCR performances is not large enough to affect the significance of the performance gains made by HME3M. The imprecision of HME3M in this case is most likely due to the constant specification of λ, α and M over the course of the simulations. In the microarray data experiments we show that careful choice of M produces stable model performances with a comparable CCR range than the nearest SVM competitor.
KEGG Arabidopsis Glycolysis Pathway
The ROC curves for each HME3M component are presented in Figure 6 and clearly show that the third component is the most important with an AUC of 0.752, whereas the other three components seem to hold limited or no predictive power. A bar plot of the HME3M transition probabilities (θ_{ m }) for the third (m = 3) component is presented in Figure 7. Overlaying the transition probabilities from Figure 7 onto the full network in Figure 3 it is found that for three transitions only single genes are required for the reaction to proceed:
A further analysis of the genes identified reveals the interaction between AT1G09780 (θ = 1) and AT1G74030 (θ = 0.969) is of particular importance in stress response of Arabidopsis. A literature search on these genes identified both AT1G09780 and AT1G74030 as important in the response of Arabidopsis to environmental stresses such as cold exposure, salt and osmotic stress [15, 16]. However, AT2G21180, apart from being involved in glycolysis, has not previously been found to be strongly involved in any specific biological function. Interestingly however, a search of TAIR [17] revealed that AT2G21180 is found to be expressed in the same growth and developmental stages as well as in the same plant structure categories as both AT1G09780 and AT1G74030. These findings are indicative of a possible relationship between these three genes in particular in the response to environmental stress.
The second path connecting compounds C00197 through C00631 to C00074 is found by HME3M to have a high probability of being differently expressed when comparing glycolysis in flowers and rosette leaves. The branching of glycolysis at Glycerate-3P (C 00197) through to Phosphoenol-Pyruvate (C 00074) corresponds known variants of the glycolysis pathway in Arabidopis; the glycolysis I pathway located in the cytosol and the glycolysis II pathway located in the plastids [17]. The key precursor that leads to the branching within cytosol variant by the reactions to convert Beta-D-Fructose-6P (C 05378) to Beta-D-Fructose-1,6P (C 05378) using diphosphate rather than ATP [17]. Referencing the included pathway genes in Figure 7 within the reference Arabidopsis database TAIR [17] we observe that the genes specific to the percursor reactions for the cytosol variant of glycolysis are included within the pathway, i.e. the genes [AT 1G 12000, AT 1G 20950, AT 4G 0404] for converting beta-D-fructose-6P (C 005345) into beta-D-fructose-1,6P2 (C 005378) utilizing diphosphate rather than ATP. HME3M's identification of the plant cytosol variant of the glycolysis pathway confirms this pathway as a flower specific, because the plastids variant is clearly more specific to rosette leaves due to their role in photosynthesis.
KEGG Arabidopsis Pentose Phosphate Pathway
In contrast increasing the tolerance level to 0.1 we observe a decrease in the performance of HME3M as M is increased from M = 2 to M = 4 (Figure 8). This uncharacteristic drop in performance of HME3M is the result of insufficient variation within the pathway dataset. This assertion is supported by HME3M finding the optimum model over all datasets at tolerance of 0.05. However when the gene activity tolerance is increased to 0.1 the optimal performance observed at a tolerance of 0.05 is never reached. Therefore increasing the tolerance to 0.1 is removing important pathways are required to produce the optimal model. HME3M then attempts to compensate for this lack of variation within the pathways observed at a tolerance of 0.1 by overfitting. This overfitting then leads to the decrease in performance observed as the model complexity of HME3M is increased.
Observed sign differences in HME3M PLR coefficients β for each path identified in the Pentose Phosphate pathway.
β_{m= 2}< 0 | β_{m= 2}= 0 | β_{m= 2}> 0 | |
---|---|---|---|
β_{m = 1}< 0 | 28 | 2 | 25 |
β_{m = 1}= 0 | 1 | 6 | 0 |
β_{m = 1}> 0 | 3 | 0 | 15 |
Conclusions
In this paper we have presented a novel approach for the detection of dominant pathways within a network structure for binary classification using the Markov mixture of experts model, HME3M. Simulations clearly show HME3M to outperform both PLR and SVM with linear, polynomial and radial basis kernels. When applied to actual metabolic networks with real microarray data HME3M not only maintained its superior performance but also produced biologically meaningful results.
Naturally it would be interesting to explore the performance of HME3M in other contexts where the properties of the datasets and networks are different. Future work on HME3M could be to assess the performance of different pathway activity definitions, other than simply over expressed genes. Furthermore, the 3M component of HME3M is also able to be extended to include other gene information such as protein class and function. Incorporating additional information on specific gene functions or using different pathway definitions would allow HME3M to examine metabolic pathways at several resolutions and help improve the understanding of the underlying dynamics of the metabolic network.
Methods
Hierarchical Mixture of Experts (HME)
where β_{ m }are the parameters of each expert and θ_{ m }are the parameters of mixture component m. A HME does not restrict the source of the mixture weights p(m|x, θ_{ m }) and as such can be generated from any model that returns posterior component probabilities for the observations. Taking advantage of this flexibility we propose a HME as a method to supervise the Markov mixture model for metabolic pathways 3M [9]. Combining HME with a Markov mixture model first employs the Markov mixture to find dominant pathways. Posterior probabilities are then assigned to each sequence based on its similarity to the dominant pathway. These are then passed as input weights into the parameter estimation procedure within the supervised technique. Using the posterior probabilities of 3M to weight the parameter estimation of each supervised technique is in effect localizing each expert to summarize the predictive capability of each dominant pathway. Therefore incorporating the 3M Markov mixture model within a HME is creating a method capable of combining network structures with standard data table information. We now formally state the base 3M model and provide the detail of our proposed model, Hierarchical Mixture Experts 3M (HME3M) classifier.
3M Mixture of Markov Chains
where π_{ m }is the mixture model component probability, p(c_{1}|θ_{1m}) is the probability of the initial state c_{1}, and p(c_{ t }, x_{ t }|c_{t-1}, θ_{ tm }) is the probability of a path traversing the edge x_{ t }linking states c_{t-1}and c_{ t }. The 3M model is simply a mixture model and as such its parameters are conveniently estimated by an EM algorithm [9]. The result of 3M is M mixture components, where each component, m, corresponds to a first order Markov model defined by θ_{ m }= {θ_{1m}, [θ_{2m}, ..., θ_{ tm }, ..., θ_{ Tm }]} which are the estimated probabilities for each transition along the m^{ th }dominant path.
HME3M
The parameters of (3) can be estimated using the EM algorithm by defining the esponsibilities variable h_{ im }to be the probability that a sequence i belongs to component m, given x, θ_{ m }, β_{ m }and y. These parameters are iteratively optimized with the following E and M steps:
M-Step: Estimate the Markov mixture and expert model parameters:
(1) Estimate the mixture parameters
where δ (x_{ it }= 1) denotes whether a transition t is active within observation i, or x_{ it }= 1. This condition enforces the constraint that the probabilities of each set of transitions between any two states must sum to one. Additionally it can be shown that for this model all initial state probabilities p(c_{1}|θ_{1m}) = 1.
(2) Estimate the expert parameters
where is the vector of probabilities and W_{ m }is a diagonal matrix of weights such that and z_{ m }is the working response for the IRLS algorithm . However, in this setting, X is a sparse matrix of binary pathways where we expect and are explicitly looking for dominant pathways. Thus, simple IRLS maximization of (6) is likely to be inaccurate. Furthermore, the severity of the sparsity within X is compounded by the additional weighting required by the experts' inclusion into the HME architecture. These conditions will manifest themselves in duplicate rows within X, causing rank deficiency and results in unstable estimates for the parameters of a logistic regression model. Therefore the simple IRLS scheme proposed by [10] is inappropriate for use in this case. To overcome the rank deficiency issue we propose using a regularized form of logistic regression [19].
Penalized logistic regression (PLR)
where Λ is a P × P diagonal matrix with λ along the diagonal where P is the number of variables in X and z_{ m }is the working response as specified in (7).
However, another issue is that the Iterative Reweighted Least Squares algorithm (IRLS) used for estimating the parameters of a PLR is known to be unstable and not guaranteed to converge [20].
where Λ is a diagonal matrix with the regularization parameter λ along the diagonal and W_{ m }is a diagonal matrix of observation weights combining information from the IRLS algorithm and the HME architecture. The observation weights are defined to be , where weights the observations to optimally predict y by sourced from the IRLS algorithm, and h_{ im }are the EM responsibilities (4). This update for β_{ m }gives control over the size of the coefficients through λ and speed in which these parameters are learned through α. It is noted by [18] that this method will converge to the same solution as the IRLS method, however the effect of α will increase the number of iterations for convergence. In (10) the action of λ is to control the size of each β_{ m }by artificially inflating their variance.
Declarations
Acknowledgements
Timothy Hancock was supported by a Japan Society for the Promotion of Science (JSPS) fellowship and BIRD. Hiroshi Mamitsuka was supported in part by BIRD of Japan Science and Technology Agency (JST).
Authors’ Affiliations
References
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27PubMedPubMed CentralView ArticleGoogle Scholar
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone S: ArrayExpress-a public repository for microarray gene expression data at the EBI. Nucl Acids Res. 2003, 31: 68-71. 10.1093/nar/gkg091PubMedPubMed CentralView ArticleGoogle Scholar
- Pang H, Lin A, Holford M, Enerson B, Lu B, Lawton MP, Floyd E, Zhao H: Pathway analysis using random forests classification and regression. Bioinformatics. 2006, 22 (16): 2028-36. 10.1093/bioinformatics/btl344PubMedView ArticleGoogle Scholar
- Pireddu L, Poulin B, Szafron D, Lu P, Wishart DS: Pathway Analyst - Automated Metabolic Pathway Prediction. Proceedings of the 2005 IEEE Symposium on Computational Intelligence. 2005, http://metabolomics.ca/News/publications/2005cibcb-path.pdfGoogle Scholar
- Jordan M: Learning in Graphical Models. 1998, Norwell, MD: Kluwer Academic PublishersView ArticleGoogle Scholar
- Imoto S, Goto T, Miyano S: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Proc Pac Symp on Biocomputing. 2002, 7: 175-186.Google Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. RECOMB. 2000, 127-135. full_text.View ArticleGoogle Scholar
- Evans WJ, Grant GR: Statistical methods in bioinformatics: An introduction. 2005, New York: Springer, 2Google Scholar
- Mamitsuka H, Okuno Y, Yamaguchi A: Mining biologically active patterns in metabolic pathways using microarray expression profiles. SIGKDD Explorations. 2003, 5 (2): 113-121. 10.1145/980972.980986.View ArticleGoogle Scholar
- Jordan M, Jacobs R: Hierarchical mixtures of experts and the EM algorithm. Neural Computation. 1994, 6 (2): 181-214. 10.1162/neco.1994.6.2.181.View ArticleGoogle Scholar
- Dimitdadou E, Hornik K, Leisch F, Meyer D, Weingessel A: e1071 - misc functions of the Department of Statistics. 2002, http://cran.r-project.org/Google Scholar
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Schölkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nature Genetics. 2005, 37 (5): 501-506. 10.1038/ng1543PubMedView ArticleGoogle Scholar
- Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D'Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set:protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. The Plant Journal. 2007, 50: 347-363. 10.1111/j.1365-313X.2007.03052.xPubMedView ArticleGoogle Scholar
- Baxter C, Redestig H, Schauer N, Repsilber D, Patil K, Nielsen J, Selbig J, Liu J, Fernie A, Sweetlove L: The metabolic response of heterotrophic Arabidopsis cells to oxidative stress. Plant physiology. 2007, 143: 312 10.1104/pp.106.090431PubMedPubMed CentralView ArticleGoogle Scholar
- Chawade A, Bräutigam M, Lindlöf A, Olsson O, Olsson B: Putative cold acclimation pathways in Arabidopsis thaliana identified by a combined analysis of mRNA co-expression patterns, promoter motifs and transcription factors. BMC Genomics. 2007, 8: 304 10.1186/1471-2164-8-304PubMedPubMed CentralView ArticleGoogle Scholar
- Ndimba BK, Chivasa S, Simon WJ, Slabas AR: Identification of Arabidopsis salt and osmotic stress responsive proteins using two-dimensional difference gel electrophoresis and mass spectrometry. Proteomics. 2005, 5 (16): 4185-4196. 10.1002/pmic.200401282PubMedView ArticleGoogle Scholar
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucl Acids Res. 2007, 36: D1009-14. 10.1093/nar/gkm965PubMedPubMed CentralView ArticleGoogle Scholar
- Waterhouse SR, Robinson AJ: Classification Using Mixtures of Experts. IEEE Workshop on Neural Networks for Signal Processing. 1994, 177-186. full_text. IVView ArticleGoogle Scholar
- Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics. 2008, 9 (1): 30-50. 10.1093/biostatistics/kxm010PubMedView ArticleGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: Elements of Statistical Learning. 2001, New York: SpringerView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.