Using the message passing algorithm on discrete data to detect faults in boolean regulatory networks
- Anwoy Kumar Mohanty^{1},
- Aniruddha Datta^{1}Email author and
- Vijayanagaram Venkatraj^{2}
https://doi.org/10.1186/s13015-014-0020-6
© Mohanty et al.; licensee BioMed Central Ltd. 2014
Received: 16 April 2014
Accepted: 9 July 2014
Published: 16 August 2014
Abstract
Background
An important problem in systems biology is to model gene regulatory networks which can then be utilized to develop novel therapeutic methods for cancer treatment. Knowledge about which proteins/genes are dysregulated in a regulatory network, such as in the Mitogen Activated Protein Kinase (MAPK) Network, can be used not only to decide upon which therapy to use for a particular case of cancer, but also help in discovering effective targets for new drugs.
Results
In this work we demonstrate how one can start from a model signal transduction network derived from prior knowledge, and infer from gene expression data the probable locations of dysregulations in the network. Our model is based on Boolean networks, and the inference problem is solved using a version of the message passing algorithm. We have done simulation experiments on synthetic data to verify the efficacy of the algorithm as compared to the results from the much more computationally intensive Markov Chain Monte-Carlo methods. We also applied the model to analyze data collected from fibroblasts, thereby demonstrating how this model can be used on real world data.
Keywords
Message passing Sum-product Markov chain Monte-CarloBackground
Modeling cellular behavior is a first step towards the holistic understanding of the multivariate interactions among various genes. One possible approach to do that is through gene regulatory networks. These networks could also help in developing better intervention strategies in order to shift the state of the cell or the tissue to a more favorable one. Many different approaches have been proposed in the literature for modeling the behavior of genetic regulatory networks. These include differential equations [1], deterministic and probabilistic Boolean networks [2],[3], and Bayesian and dynamic Bayesian networks [4],[5]. Some of these methods rely on the assumption that the transition probabilities are provided beforehand. Such an assumption may not be realistic since the sheer volume of data required to effectively estimate the transition probabilities makes it a practically difficult proposition. Some methods such as the REVEAL algorithm [6] provide approaches to learn deterministic Boolean networks from discretized time course data. However time course data from biological samples itself can be difficult to come by.
One way to get around the problem of insufficient data is to use prior knowledge about the regulatory interactions between the various biological molecules in a cell. In the biological literature, a lot of information is available regarding the various regulatory interactions. This information has been collected by biologists over a long period of time. These regulatory interactions, collectively referred to as pathway knowledge, are generally not incorporated into the various methods of modeling gene regulatory networks. Using this information, however, would result in models which describe cellular behavior more accurately.
Figure 1 also shows the locations where some important kinase inhibitory drugs act. The inputs in the Boolean network corresponding to these drugs are ‘1’ when the drugs are administered, ‘0’ otherwise. When the input corresponding to a drug is ‘1’, it causes the target variable to get downregulated irrespective of the variables upstream of it, which in turn affects other downstream variables. At the bottom of the network, various observable variables such as the transcription factors (FOS-JUN, SP1, SRF-ELK) are represented as arrows. Other proteins of interest (BCL2, BCL2L1, CCND1) are also represented as arrows. Activity of transcription factors can be determined by the use of appropriate reporter genes. Hence this network can be considered as a multi input multi output system with the inputs being the exposure to various drug combinations and the output being the activity of the observable variables. The growth factors can also be considered as inputs, but in this paper we are simply considering the drugs as inputs whereas the growth factors are all considered to be one, that is, constantly active.
Locating stuck-at faults in a given Boolean regulatory network could help in the identification of key dysregulated genes that have a strong impact on the observable variables. This in turn could be used to identify targets for new drugs. Knowledge about the locations of the stuckat faults along with knowledge about the targets of the kinase inhibitory drugs can be used to come up with optimal intervention strategies. A method to devise optimal intervention strategies using such Boolean regulatory networks with stuck-at faults is described in [9]. Accordingly, the problem we pose is this: given data points, where each data point consists of a combination of drugs used as the input and the activity of the observable variables as outputs, is it possible to locate the variables where stuck-at-faults have occurred? In the following sections we represent the problem as a statistical model with unknown parameters which are estimated from the data points using the message passing algorithm. This algorithm allows for rapid computation of the posterior probabilities of the parameters. The estimates obtained are evaluated by comparison with the results given by Markov Chain Monte Carlo methods.
Model description
There are many ways to model a gene regulatory network which describes the behavior of neoplastic tissue. The general rule is that the more the number of unknown parameters, the more the amount of data that is required to get an effective estimate of those parameters. Hence the modeling must be done keeping in mind the limited amount of data available from biological experiments.
Before we go into the details, we would like to point out that literature survey would enable us to know the most likely locations in the Boolean network where stuck-at faults can take place. As stated in the previous section, in 30% of human breast cancers there is an over expression of the ERBB2 gene, and in 90% of the pancreatic cancer cases we see a mutated Ras gene. These are among many examples where prior knowledge about locations of faults is available. This knowledge would allow us to limit the search space for faults in the network. For example we may provide a set of locations where we want to search for faults.
One important assumption made in the modeling of mutations is that they are random events that occur independently of each other [11]-[13]. We make use of this assumption in our model by assuming that the faults occur unconditionally independent from each other with certain unknown probabilities associated with them. These unknown probability parameters are to be estimated from the collected data. These estimated probabilities will indicate our confidence about where the faults have occurred in the Boolean regulatory network.
The second network has a single stuck-at-one fault at RHEB alone, and it’s probability is given by P(M=1/ρ)=(1−ρ_{1})(1−ρ_{2})ρ_{3}. Similarly, the third network has a single stuck-at-one fault at IRS1 alone, with a probability of P(M=2/ρ)=(1−ρ_{1})ρ_{2}(1−ρ_{3}), and so on. The variable M is the decimal equivalent of the binary number representing the different fault combinations and could equivalently represent the particular faulty Boolean network being considered. Since there are three possible locations where stuck-at-faults can take place in this example, M can take 2^{3}=8 different values. In our convention, we use integers from 0 to 2^{3}−1 to represent the values taken by M. For example M=6 corresponds to a network with faults at RAF and IRS1 but not at RHEB and has a corresponding probability of ρ_{1}ρ_{2}(1−ρ_{3}).
where R_{v,m} is either 0 or 1 and m can vary from 0 to 2^{ V }−1.
In our example, we will proceed by assuming that the observable variables (the O_{ j }’s) are independent given the faulty network and the drug combinations. This assumption can be easily relaxed for the case when the 7 observable variables represented as arrows at the bottom of Figure 1 are observed together for each drug combination used as the input. In this case, instead of P(O_{ j }/M), we will be working with P(O_{1},O_{2},…,O_{7}/M). This however does not affect our fundamental results and is a simple extension of our example.
which is nothing but the likelihood function. In order to handle experimental repeats, we can have the the drug combinations I_{ i } to be the same for more than one value of the index i.
Following this the posterior means can also be calculated.
However the number of additive terms in equation 5 represented by the summing variable k in general rises exponentially with the number of data points collected. In the worst case, the left hand side of equation (3) will contain 2^{ V } terms. Since the number of multiplicative terms in equation 4 is NJ (the number of data points collected), upon expanding the right hand side of equation 4 we get 2^{ V N J } additive terms in equations 5, 7, and 8. Thus the computational cost to compute the mean of any given ρ_{ l } is O(2^{ V N J }). Hence the total computation cost to compute the posterior means of all the elements of ρ (ρ_{1} through ρ_{ V }) is O(V×2^{ V J N }). Therefore the straightforward approach for calculating the posterior distributions of ρ_{ l }’s and their posterior means will get intractable as the amount of data collected increases.
To get around this difficulty we will use an iterative algorithm to obtain an approximation of the marginal distributions of the elements of the parameter vector ρ. From the marginal distribution it will be straightforward to obtain the posterior means and confidence intervals of the individual elements of ρ.
Factor graph representation of the model
Factor Graphs are an important tool used in various applications such as signal processing and telecommunications. Many algorithms can be easily understood and derived using the factor graph approach. These include Kalman Filters, the Viterbi Algorithm, the Forward-Backward algorithm and Turbo Codes to name a few. The approach involves first representing the probability model as a factor graph and then applying the message passing algorithm along the edges. The reader is referred to [14],[15] for an in-depth coverage of factor graphs and the message passing algorithm. Here we provide a short primer to the subjects and go into the details of only our particular example.
A simple example
which sums over 2A different values. For continuous variables, the summation is replaced by integration. The optimal strategy for calculating the marginal of x_{2} is straightforward to derive in this simple example. However a systematic approach to find the optimal strategy to calculate the marginal of any variable for any given probability function is given by the message passing algorithm which acts on the factor graph representation of the function.
The marginal distribution of a variable is simply the product of all the messages being received by the corresponding variable node. Hence $\sum _{{x}_{1},{x}_{3}}g({x}_{1},{x}_{2},{x}_{3})={\mu}_{{f}_{1}\to {x}_{2}}\left({x}_{2}\right)\times {\mu}_{{f}_{2}\to {x}_{2}}\left({x}_{2}\right)$ and thus equation (9) is derived using factor graphs and the message passing algorithm. Calculating the rest of the messages would allow us to calculate the marginals of x_{1} and x_{3} as well. The message passing algorithm would terminate when messages along both directions of all the edges in the graph have been calculated.
The message passing algorithm terminates and gives exact marginals for the cases where the factor graph has no cycles. But the most interesting applications are for those cases where the factor graph has cycles, where the marginals are calculated by iteratively updating the messages (for example the iterative decoding of turbo codes). We similarly use an iterative version of the message passing algorithm in our model to approximate the marginal posterior distribution of the unknown parameters.
Using factor graphs and the message passing algorithm on the signal transduction network model
Now, P(ρ/O,I)∝P(O/ρ,I) as is evident from equation (6), while the expression for P(O/ρ,I) is given in equation (4). Let P_{i,j} represent the multiplicative factor P(O_{ j }=o_{i,j}/I_{ i },ρ) in equation (4). In a factor graph, each multiplicative factor is represented by a factor node and each element of ρ is represented by a variable node. Hence there are NJ number of factor nodes with each corresponding to one particular multiplicative term in equation 4, and there are V number of variable nodes with each corresponding to one particular unknown parameter (one out of ρ_{1} through ρ_{ V }). The purpose of this algorithm is to compute the posterior marginal distributions of the unknown parameters ρ_{1} through ρ_{ V }, which can then be used to compute their means and confidence intervals.
- 1.
initialize all ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)=1$.
- 2.
calculate all ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$ as per equation (11).
- 3.
calculate all ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)$ as per equation (10).
- 4.
repeat steps 2 and 3 in that order.
Since we are dealing with continuous variables between 0 and 1, the summations are replaced by integrations. Every time ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$ are computed in step 2, they come out to be polynomials of degree one due to the multiplicatively separable nature of the integrands involved and that all the parameters ρ_{ v } are being integrated from 0 to 1 (a rectangular integration region). Let them be represented as b_{0,v,i,j}+b_{1,v,i,j}×ρ_{ v }. Hence ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$ can be represented by a vector b_{v,i,j}=(b_{0,v,i,j}b_{1,v,i,j})^{ T }. Every time ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)$ are computed in step 3, they will be polynomials of degree N J−1 since they are simply the product of all incoming messages except one. Let them be represented as $\sum _{k=0}^{\mathit{\text{NJ}}-1}{a}_{k,v,i,j}{\rho}_{v}^{k}$. Hence ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)$ can be represented by a vector a_{v,i,j}=(a_{0,v,i,j}a_{1,v,i,j} … a_{N J−1,v,i,j})^{ T }.
By comparing coefficients of either side of equation (17), the values of the elements of the vector a_{v,i,j} are updated. This is also equivalent to the convolution of the message vectors b_{v,g,h} for g≠i,h≠j. At each iteration, the message vectors can be multiplied by constants so as to prevent overflow or underflow when implementing the algorithm on a digital computer with finite precision. In that case the final solutions we get are simply the required marginal distributions scaled by some unknown constant. If we are simply interested in the marginal distributions, then it is not necessary to keep track of the multiplied constants. We simply need to normalize the marginals so that their integrals from 0 to 1 give unity.
The message vectors a_{v,i,j} and b_{v,i,j} are iteratively updated until some convergence criteria is satisfied (for example if the Hellinger distance between the marginals of two successive iterations is below a certain threshold). In our simulations, we saw that as few as 2 iterations gave satisfactory results in terms of convergence. Hence the time complexity of the algorithm is dependent on steps 2 and 3 of the algorithm.
In order to calculate ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)$ in step 3, first calculate the polynomial ${U}_{v}\left({\rho}_{v}\right)=\prod _{g,h}{\mu}_{{P}_{g,h}\to {\rho}_{v}}\left({\rho}_{v}\right)$ of degree NJ. Then find the quotient of the division operation ${U}_{v}\left({\rho}_{v}\right)\xf7{\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$. This gives ${\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)$. Along with that, we can also calculate and store the value of ${\theta}_{v,i,j,1}=\underset{0}{\overset{1}{\int}}{\rho}_{v}{\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)d{\rho}_{v}$ and ${\theta}_{v,i,j,0}=\underset{0}{\overset{1}{\int}}(1-{\rho}_{v}){\mu}_{{\rho}_{v}\to {P}_{i,j}}\left({\rho}_{v}\right)d{\rho}_{v}$ which will be used in step 2. Note that ${\theta}_{v,i,j,1}=\sum _{k=0}^{\mathit{\text{NJ}}-1}\frac{{a}_{k,v,i,j}}{k+2}$ and ${\theta}_{v,i,j,0}=\sum _{k=0}^{\mathit{\text{NJ}}-1}\frac{{a}_{k,v,i,j}}{(k+1)(k+2)}$. Calculating the coefficients of U_{ v }(ρ_{ v }) is of time complexity at most O((N J)^{2}). This is because it involves the convolution of NJ different first degree polynomials. Calculating the quotient of ${U}_{v}\left({\rho}_{v}\right)\xf7{\mu}_{{P}_{i,j}\to {\rho}_{v}}$, and θ_{v,i,j,1} and θ_{v,i,j,0} are of time complexity O(N J). The last three operations of O(N J) have to be done for all NJ of the factor nodes for each variable node. Hence the time complexity of calculating the messages from one variable node to all factor nodes is of time complexity O((N J)^{2}). Repeating this action for all V variable nodes gives us the time complexity of step 3 of the algorithm to be O((N J)^{2}V).
If we look at equations (15) and (16), the computation of b_{v,i,j} seems to be of O(N J V 2^{ V }) time complexity. Since there are NJV of b_{v,i,j} to be computed, step 2 seems to be of O((N J)^{2}(V)^{2}2^{ V }) time complexity. However some of the computations are repeated and storing these computations for reuse can reduce the time complexity. Let ${\kappa}_{m,i,j}=\prod _{l=1}^{V}{\theta}_{l,i,j,1}^{{R}_{l,m}}{\theta}_{l,i,j,0}^{1-{R}_{l,m}}$. Then ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)=\sum _{m=0}^{{2}^{V}-1}{S}_{m,i,j}{\rho}_{v}^{{R}_{v,m}}{(1-{\rho}_{v})}^{1-{R}_{v,m}}\times \phantom{\rule{2.77626pt}{0ex}}\frac{{\kappa}_{m,i,j}}{{\theta}_{v,i,j,1}^{{R}_{v,m}}{\theta}_{v,i,j,0}^{1-{R}_{v,m}}}$. Computation of κ_{m,i,j} for all m is of O(V 2^{ V }) time complexity for a given factor node P_{i,j}. Computation of ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$ for all v is of O(V 2^{ V }) time complexity for a given factor node P_{i,j}. Hence computation of ${\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)$ from a single factor node to all variable nodes is of O(V 2^{ V }) time complexity. Hence total computation for all factor nodes in step 2 comes out to be of O(N J V 2^{ V }) time complexity.
Hence the complexity of each iteration of the algorithm comes out to be O(N J V(2^{ V }+C N J)), where C is a constant. This is quadratic with respect to the number of data points NJ, as opposed to the exponential complexity of the straightforward approach discussed in section ‘Model description’.
where γ is a normalization constant which can be calculated to give $\underset{0}{\overset{1}{\int}}P\left({\rho}_{v}/O,I\right)d{\rho}_{v}=1$.
Simulation experiments
We did simulations where the algorithm was tested on synthetic data as well as applied to real world data. The marginal posterior distributions estimated using the iterative message passing algorithm were compared with the marginal posteriors estimated using the time consuming and computationally intensive Markov Chain Monte Carlo (MCMC) methods and the estimates obtained using both methods came out to be close thereby verifying the iterative message passing algorithm’s correctness.
Various literature on MCMC methods exist [16]-[18]. We will describe the details used in our simulations instead of going into a detailed discussion of MCMC methods. The Markov Chain Monte Carlo Method involves creating a Markov Chain whose stationary distribution is the required posterior distribution. The Metropolis-Hastings Algorithm will be used to generate such a Markov Chain since the samples need to be generated from a non standard probability distribution. This method will be used to generate samples from the posterior distribution of the unknown parameters of the vector ρ. These samples can then be used to get an estimate of the joint as well as the marginal posterior distributions of the unknown parameters using kernel density estimation.
- 1.
Initialize all elements of ρ ^{(0)} to be 0.5.
- 2.
At the n ^{ t h } iteration of the MH algorithm, generate ρ ^{∗} from the proposal distribution U(ρ/ρ ^{(n)},Δ). The proposal distribution and the tuning parameter Δ will be discussed in the next paragraph.
- 3.Calculate the acceptance ratio(Recall that the prior of the parameter vector is constant). P(O/ρ ^{∗},I) and P(O/ρ ^{(n)},I) can be easily calculated for known values of ρ ^{∗} and ρ ^{(n)} without the expansion of P(O/ρ,I) described in equation (5). Accept ρ ^{∗} as the next sample ρ ^{(n+1)} with probability m i n(1,D), or keep ρ ^{(n+1)} equal to ρ ^{(n)} with probability 1−m i n(1,D).$D=\frac{P\left(O/{\rho}^{\ast},I\right)U\left({\rho}^{\left(n\right)}/{\rho}^{\ast},\Delta \right)}{P\left(O/{\rho}^{\left(n\right)},I\right)U\left({\rho}^{\ast}/{\rho}^{\left(n\right)},\Delta \right)}$(19)
- 4.
Repeat steps 2 and 3 to generate samples from the posterior of P(ρ/O,I).
where B e t a(x,y) is the beta function with parameters x and y and Δ is a scalar tuning parameter which controls the variance of the distributions of the ρ_{ i }’s. It can be adjusted to give autocorrelation properties of the Markov Chain within acceptable ranges.
Experiments with synthetic data
To demonstrate the working of the algorithm, we ran simulations of the message passing algorithm as well as the MH algorithm on synthetic data. We generated synthetic data from the example described in section ‘Model description’ which was derived from the MAPK signal transduction network, which is a well understood network.
The set of locations where faults can take place was taken to be composed of RAF, IRS1, and RHEB. The probabilities of stuck-at-one faults at these locations (The parameters ρ_{1}, ρ_{2}, and ρ_{3}) were taken as 0.7, 0.4, and 0.2. Synthetic observations of the observable variable (the variables shown at the bottom of Figure 1 as arrows) were generated for various drug combinations as inputs (the drugs being AG1024, AG825, Lapatinib, LY294002, U0126, and Temsirolimus, whose action on the Boolean network of the MAPK network is shown in Figure 1) according to the probability model described in the previous sections. The inputs at the top of the network corresponding to growth factors (EGF, HBEGF, IGF, and NRG1) were all taken as 1 (if the cells were being grown on petridishes, then this would be equivalent to the case where all the four growth factors have been supplied in the serum). Hence the data set {(o_{i,1},o_{i,2},…,o_{i,J}),I_{ i }} is generated. There are 6 drugs in the Boolean model. All the 2^{6}−1 drug combinations were used to generate the data points. Hence i varies from 1 to 63.
where σ_{ v } is the bandwidth of the Gaussian kernel which is set to $\frac{{\delta}_{v}}{{L}^{\frac{1}{5}}}$. L is the number of samples generated by the MH algorithm (5000 in our case) and δ_{ v } is the standard deviation of the generated samples. This rule of thumb to calculate the bandwidth of the Gaussian kernel is discussed in [19].
As we can see in Figure 4, there is almost no difference in the inference of the marginal posterior distributions of the unknown parameters between the message passing algorithm and the MCMC approach. The posterior mean of ρ_{ v } is calculated from the message passing algorithm as $\underset{0}{\overset{1}{\int}}{\rho}_{v}\gamma \prod _{i,j}{\mu}_{{P}_{i,j}\to {\rho}_{v}}\left({\rho}_{v}\right)d{\rho}_{v}$ and from the MCMC approach as $\frac{1}{L}\sum _{n}{\rho}_{v}^{\left(n\right)}$. These come out to be (0.7254 0.3891 0.2799) and (0.7326 0.3961 0.2830) respectively. These estimates are close to each other and to the actual values of (0.7 0.4 0.2).
This simulation shows that the message passing algorithm successfully calculates the posterior marginal distributions of the unknown parameters ρ_{1} through ρ_{3} and gives the same inferences as the Metropolis-Hastings algorithm. We did simulations with various values of ρ and for different sets of locations of faults. The iterative message passing algorithm gave estimates of the posterior marginal distributions of the parameters same as those estimated using the MCMC approach for all the test cases considered in our simulations.
Applications to real data
To test our model, we performed experiments on healthy adult fibroblasts where it is fair to assume that there are no cancer causing mutations present in the tissue. Hence it is fair to assume that a Boolean regulatory network with no faults would best model this tissue.
Adult fibroblasts were grown in petri-dishes till confluence and then maintained in 0.2% FBS (Fetal Bovine Serum) for four days. It is a general assumption that FBS contains most of the important growth factors. After this, the cells were exposed to 0.2% FBS and 100 μ M Anisomycin for 30 minutes. Anisomycin is a protein synthesis inhibitor which activates the MAPK signal transduction network and keeps it responsive to kinase specific inhibitors [20],[21]. That is, with the addition of Anisomycin, we anticipate the MAPK signal transduction network to respond to the addition of kinase inhibitors such as U0126. Anisomycin, being a protein synthesis inhibitor, would also cut off any feedback path which has a translation (protein synthesis) step in it. The cells were then grouped into three groups (group 0, group 1, and group 2). Group 0 was the control group which was exposed to 100 μ M Anisomycin only. Group 1 was exposed to 100 μ M Anisomycin and 50 μ M of LY294002. Group 2 was exposed to 100 μ M Anisomycin, 50 μ M of LY294002, and 10 μ M of U0126. All three groups were also exposed to 20% FBS along with the other chemicals. LY294002 and U0126 are highly specific inhibitors of PI3 Kinase (PI3K in Figure 1) and MEK1 respectively. The molecular targets of LY294002 and U0126 are shown in Figure 1. Genes having the SP1 and SRF-ELK response elements in their promoters were quantified through real time PCR and the delta-delta method [22] with GAPDH as the reference gene and group 0 as the control. The genes were measured in quadruplets for each experiment.
Gene expression levels and their discrete values for the gene EGR1
group 1 | normalized gene expression | 0.5987 | 0.7320 | 0.5586 | 0.6199 |
discrete value | 1 | 1 | 1 | 1 | |
group 2 | normalized gene expression | 0.4796 | 0.2892 | 0.2535 | 0.2698 |
discrete value | 1 | 0 | 0 | 0 |
Table showing the normalized gene expression ratios and their Reference Sequence (RefSeq) numbers
EGR1 | JUN | CMYC | DECORIN | IRF3 | VEGFA | |
---|---|---|---|---|---|---|
RefSeq | NM_001964.2 | NM_002228.3 | NM_002467.4 | NM_133503.2 | NM_001571.5 | NM_003376.5 |
Group 1 | 0.5987 | 0.4931 | 0.3209 | 0.4353 | 0.5176 | 0.4444 |
0.7320 | 0.6736 | 0.2852 | 0.4601 | 0.4204 | 0.4989 | |
0.5586 | 0.6598 | 0.3439 | 0.4147 | 0.3560 | 0.5176 | |
0.6199 | 0.7792 | 0.2994 | 0.4323 | 0.3345 | 0.5105 | |
Group 2 | 0.4796 | 0.1550 | 0.2570 | 0.2793 | 0.2624 | 0.3164 |
0.2892 | 0.2793 | 0.2059 | 0.3789 | 0.2553 | 0.4601 | |
0.2535 | 0.3015 | 0.2717 | 0.3737 | 0.2253 | 0.4633 | |
0.2698 | 0.3415 | 0.2679 | 0.3536 | 0.2031 | 0.3660 |
As we can see in Figure 5, the posterior marginal distribution associated with ERK1/2 comes out to be quite tightly distributed with a mean of 0.1538 while that for IRS1 comes out to be uniformly distributed between 0 and 1. This is because the data does not contain any discriminating information about the occurrence of any fault at IRS1 under this MAPK Boolean model. But it does tell us that the probability of occurrence of a fault at the variable corresponding to ERK1/2 is pretty low, judging by its mean to be having a low value of close to 15%. This is expected since the data comes from adult fibroblasts, where we can be fairly sure that no cancer causing mutations are present. If data had been collected after exposure to other combinations of other drugs (for instance Lapatinib or Temsirolimus) then the data might have allowed the model to make meaningful inferences regarding occurrences of faults at locations besides ERK1/2 as well as give sharper confidence intervals than that shown in Figure 5.
Conclusion
In this paper we have described a method to estimate the probabilities with which certain faults have taken place in a given Boolean Regulatory network, provided we have the observations of the observable variables whose behavior is determined by the network. We have described the probability model and described a fast algorithm based on message passing to make the inferences about the posterior marginal probability distributions of the unknown parameters of the model (These parameters being the probabilities of the occurrences of the faults). We have compared the performance of the algorithm with Markov Chain Monte Carlo techniques (the Metropolis-Hastings Algorithm) through simulations, and we have shown that the message passing algorithm gives results comparable to those obtained using the MCMC methods with the added advantage of much smaller computation times. We also applied the model to analyze data collected from fibroblasts, thereby demonstrating how this model can be used on real world data. Such a computationally manageable approach has the potential to allow the inference of locations of faults in a Boolean regulatory network in a probabilistic setting from data, such as gene expression data.
Locating the points of dysregulations in a deterministic Boolean signal transduction network could be used to suggest therapies as described in [9]. Since we are locating faults in a probabilistic setting, the therapy could be designed keeping in mind the tradeoff between treating cancer and managing the side effects of the treatment. For example, consider a case where we have two possible locations of faults. Let the computed probability of the occurrence of a fault at the first location be smaller than that of the second location. Then we may only consider the second fault in our therapy design process, thereby reducing the exposure of the patient to excessive drugs which may have unwanted side effects.
Future work could focus on performing experiments on cancerous cell lines being exposed to various combinations of drugs and infer from the collected data the likely locations of dysregulations in the corresponding Boolean regulatory network. Also, algorithms could be developed to automate the process of selecting the set of locations of faults instead of having the user provide it to the algorithm.
Availability of codes and data
The MATLAB codes and the data can be obtained by sending a request to anwoy.rkl@gmail.com.
Declarations
Acknowledgements
This work was supported by the National Science Foundation under Grant ECCS-1068628 and by a Texas A&M Engineering Genomics and Bioinformatics Seed Grant.
Authors’ Affiliations
References
- Bower JM, Bolouri H: Computational Modeling of Genetic and Biochemical Networks, 1st edition, Boston: MIT Press; 2001., Google Scholar
- Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18(2):261–274., View ArticlePubMedGoogle Scholar
- Datta A, Dougherty E: Introduction to Genomic Signal Processing with Control, New York: CRC Press; 2007., Google Scholar
- Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks to analyze expression data. J Comput Biol 2000, 7(3–4):601–620., View ArticlePubMedGoogle Scholar
- Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21:71–79., View ArticlePubMedGoogle Scholar
- Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 1998, 3(3):18–29., Google Scholar
- Layek RK, Datta A, Dougherty ER: From biological pathways to regulatory networks. Mol BioSyst 2011, 7:843–851., View ArticlePubMedGoogle Scholar
- Mohanty AK, Datta A, Venkatraj V: A model for cancer tissue heterogeneity. IEEE T Bio-Med Eng 2014, 61(3):966–974., View ArticleGoogle Scholar
- Layek RK, Datta A, Bittner M, Dougherty ER: Cancer therapy design based on pathway logic. Bioinformatics 2011, 27(4):548–555., View ArticlePubMedGoogle Scholar
- Weinberg RA: The Biology of Cancer, 1st edition, Princeton: Garland Science; 2006., Google Scholar
- Greenman C, Wooster R, Futreal PA, Stratton MR, Easton DF: Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 2006, 173(4):2187–2198., View ArticlePubMedPubMed CentralGoogle Scholar
- Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 1994, 11(5):725–736., PubMedGoogle Scholar
- Yang Z, Ro S, Rannala B: Likelihood models of somatic mutation and codon substitution in cancer genes. Genetics 2003, 165:695–705., PubMedPubMed CentralGoogle Scholar
- Kschischang FR, Frey BJ, Loeliger HA: Factor graphs and the sum-product algorithm. IEEE T Inform Theory 2001, 47(2):498–519., View ArticleGoogle Scholar
- Wymeersch H: Iterative Receiver Design, New York: Cambridge University Press; 2007., View ArticleGoogle Scholar
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis, 2nd edition, Boca Raton, London, New York, Washington D.C.: Chapman and Hall/CRC; 2004., Google Scholar
- Gelman A, Hill J: Data Analysis Using Regression and Multi-level/hierarchical Models, New York: Cambridge University Press; 2007., Google Scholar
- Hoff PD: A First Course in Bayesian Statistical Methods, Dordrecht, Heidelberg, London, New York: Springer Texts in Statistics; 2009., View ArticleGoogle Scholar
- Scott DW: Multivariate Density Estimation: Theory, Practice, and Visualization, New York, Chichester, Brisbane, Toronto, Singapore: John Wiley & Sons; 1992., View ArticleGoogle Scholar
- Bébien M, Salinas S, Becamel C, Richard V, Linares L, Hipskind RA: Immediate-early gene induction by the stresses anisomycin and arsenite in human osteosarcoma cells involves MAPK cascade signaling to Elk-1, CREB and SRF. Oncogene 2003, 22(12):1836–1847., View ArticlePubMedGoogle Scholar
- Dhawan P, Bell A, Kumar A, Golden C, Mehta KD: Critical role of p42/44(MAPK) activation in anisomycin and hepatocyte growth factor-induced LDL receptor expression: activation of Raf-1/Mek-1/p42/44(MAPK) cascade alone is sufficient to induce LDL receptor expression. J Lipid Res 1999, 40(10):1911–1919., PubMedGoogle Scholar
- Livak KJ, Schmittgen TD: Analysis of relative gene expression data using real-time quantitative PCR and the 2^{ −ΔΔ C }_{ t }method. Methods 2001, 25(4):402–408., View ArticlePubMedGoogle Scholar
- Clarkson RW, Shang CA, Levitt LK, Howard T, Waters MJ: Ternary complex factors Elk-1 and Sap-1a mediate growth hormone induced transcription of Egr-1 (early growth response factor-1) in 3T3-F442A Preadipocytes. Mol Endocrinol 1999, 13(4):619–631., View ArticlePubMedGoogle Scholar
- Rozek D, Pfeifer GP: In vivo protein-DNA interactions at the c jun promoter: preformed complexes mediate the UV response. Mol Cell Biol 1993, 13(9):5490–5499., View ArticlePubMedPubMed CentralGoogle Scholar
- Levens D: How the c-myc promoter works and why it sometimes does not. J Natl Cancer I Monographs 2008, 39:41–43., View ArticlePubMedGoogle Scholar
- Verrecchia F, Rossert J, Mauviel A: Blocking sp1 transcription factor broadly inhibits extracellular matrix gene expression in vitro and in vivo: implications for the treatment of tissue fibrosis. J Invest Dermatol 2001, 116(5):755–763., View ArticlePubMedGoogle Scholar
- Xu HG, Jin R, Ren W, Zou L, Wang Y, Zhou GP: Transcription factors Sp1 and Sp3 regulate basal transcription of the human IRF-3 gene. Biochimie 2012, 94(6):1390–1397., View ArticlePubMedGoogle Scholar
- Samson SL, Wong NC: Role of Sp1 in insulin regulation of gene expression. J Mol Endocrinol 2002, 29(3):265–279., View ArticleGoogle Scholar
- Pagés G, Pouysségur J: Transcriptional regulation of the vascular endothelial growth factor gene-a concert of activating factors. Cardiovasc Res 2005, 65(3):564–573., View ArticlePubMedGoogle Scholar
- Otsu N: A threshold selection method from gray-level histograms. Automatica 1975, 11(285–296):23–27., Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.