Refining transcriptional regulatory networks using network evolutionary models and gene histories
 Xiuwei Zhang^{1} and
 Bernard ME Moret^{1}
DOI: 10.1186/1748-7188-5-1
© Zhang and Moret; licensee BioMed Central Ltd. 2010
Received: 10 August 2009
Accepted: 4 January 2010
Published: 4 January 2010
Abstract
Background
Computational inference of transcriptional regulatory networks remains a challenging problem, in part due to the lack of strong network models. In this paper we present evolutionary approaches to improve the inference of regulatory networks for a family of organisms by developing an evolutionary model for these networks and taking advantage of established phylogenetic relationships among these organisms. In previous work, we used a simple evolutionary model and provided extensive simulation results showing that phylogenetic information, combined with such a model, could be used to gain significant improvements on the performance of current inference algorithms.
Results
In this paper, we extend the evolutionary model so as to take into account gene duplications and losses, which are viewed as major drivers in the evolution of regulatory networks. We show how to adapt our evolutionary approach to this new model and provide detailed simulation results, which show significant improvement over the reference network inference algorithms. Different evolutionary histories for gene duplications and losses are studied, showing that our adapted approach is feasible under a broad range of conditions. We also provide results on biological data (cis-regulatory modules for 12 species of Drosophila), confirming our simulation results.
Introduction
Transcriptional regulatory networks are models of the cellular regulatory system that governs transcription. Because establishing the topology of the network from bench experiments is very difficult and time-consuming, regulatory networks are commonly inferred from gene-expression data. Various computational models, such as Boolean networks [1], Bayesian networks [2], dynamic Bayesian networks (DBNs) [3], and differential equations [4, 5], have been proposed for this purpose, along with associated inference algorithms. Results, however, have proved mixed: the high noise level in the data, the paucity of well-studied networks, and the many simplifications made in the models all combine to make inference difficult.
Bioinformatics has long used comparative and, more generally, evolutionary approaches to improve the accuracy of computational analyses. Work by Babu's group [6–8] on the evolution of regulatory networks in E. coli and S. cerevisiae has demonstrated the applicability of such approaches to regulatory networks. They posit a simple evolutionary model for regulatory networks, under which network edges are simply added or removed; they study how well such a model accounts for the dynamic evolution of the two most studied regulatory networks; they then investigate the evolution of regulatory networks with gene duplications [8], concluding that gene duplication plays a major role, in agreement with other work [9].
Phylogenetic relationships are well established for many groups of organisms; as the regulatory networks evolved along the same lineages, the phylogenetic relationships informed this evolution and so can be used to improve the inference of regulatory networks. Indeed, Bourque and Sankoff [10] developed an integrated algorithm to infer regulatory networks across a group of species whose phylogenetic relationships are known; they used the phylogeny to reconstruct networks under a simple parsimony criterion. In previous work [11], we presented two refinement algorithms, both based on phylogenetic information and using a likelihood framework, that boost the performance of any chosen network inference method. On simulated data, the receiver-operating characteristic (ROC) curves for our algorithms consistently dominated those of the standard approaches used alone; under comparable conditions, they also dominated the results from Bourque and Sankoff. Both our previous approach and that of Bourque and Sankoff are based on an evolutionary model that considers only edge gains and losses, so that the networks must all have the same number of genes (orthologous across all species). Moreover, the gain or loss of an edge in that model is independent of any other event. However, this process accounts for only a small part of regulatory network evolution; in particular, gene duplication is known to be a crucial source of new genetic function and a mechanism of evolutionary novelty [8, 9].
In this paper we present a model of network evolution that takes into account gene duplications and losses and their effect on regulatory network structures. Such a model provides a direct evolutionary mechanism for edge gains and losses, while also enabling broader application and more flexible parameterization. For example, in the networks to be refined, the genes can have different numbers of copies for different organisms. Within this broader framework, the phylogenetic information that we use lies on two levels: the evolution of gene contents of the networks and the regulatory interactions of the networks. The former can be regarded as the basis of the latter, and can be obtained by inferring the history of gene duplications and losses during evolution. We then extend our refinement algorithms [11] to handle this data and use different models of gene duplications and losses to study their effect on the performance of the refinement algorithms.
Our experimental results confirm that our new algorithms provide significant improvements over the base inference algorithms, and support our analysis of the performance of refinement algorithms under different models of gene duplications and losses.
Background
Our refinement algorithms [11] work iteratively in two phases after an initialization step. First, we obtain the regulatory networks for the family of organisms; typically, these networks are inferred from gene-expression data for these organisms, using standard inference methods. We place these networks at the corresponding leaves of the phylogeny of the family of organisms and encode them into binary strings by simply concatenating the rows of their adjacency matrix. We then enter the iterative refinement cycle. In the first phase, we infer ancestral networks for the phylogeny (strings labelling internal nodes), using our own adaptation of the FastML [12] algorithm; in the second phase, these ancestral networks are used to refine the leaf networks. These two phases are then repeated as needed. Our refinement algorithms are formulated within a maximum likelihood (ML) framework and focused solely on refinement: they are algorithmic boosters for one's preferred network inference method. Our new algorithms retain the same general approach, but include many changes to use the duplication/loss data and handle the new model.
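The encoding step described above can be illustrated with a minimal sketch. The function names `encode_network` and `decode_network` are ours, chosen for illustration, and the sketch assumes an unweighted 0/1 adjacency matrix stored as a list of rows.

```python
def encode_network(adj):
    """Concatenate the rows of a 0/1 adjacency matrix into one binary string."""
    return "".join("1" if value else "0" for row in adj for value in row)


def decode_network(bits, n_genes):
    """Rebuild the n_genes x n_genes adjacency matrix from its row-wise string."""
    return [[int(bits[r * n_genes + c]) for c in range(n_genes)]
            for r in range(n_genes)]


# Toy 3-gene network: gene 0 -> gene 1 -> gene 2 -> gene 0.
adjacency = [[0, 1, 0],
             [0, 0, 1],
             [1, 0, 0]]
encoded = encode_network(adjacency)  # "010001100"
```

Each position of the string then plays the role of one character (site) in the ancestral reconstruction.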
Base network inference methods
We use both DBN and differential equation models as base inference methods in our experiments. When DBNs are used to model regulatory networks, an associated structure-learning algorithm is used to infer the networks from gene-expression data [3, 13, 14]; to avoid overly complex networks, a penalty on graph structure complexity is usually added to the ML score, thereby reducing the number of false-positive edges. In [11] we used a coefficient k_{p} to adjust the weight of this penalty and studied different tradeoffs between sensitivity and specificity, yielding the optimization criterion log Pr(D | G, \hat{Θ}_{G}) − k_{p} #G log N, where D denotes the dataset used in learning, G is the (structure of the) network, \hat{Θ}_{G} is the ML estimate of the parameters for G, #G is the number of free parameters of G, and N is the number of samples in D.
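As a small illustration of how the penalty coefficient trades fit against complexity, the criterion can be evaluated as follows. This is only a sketch: the function name and the toy numbers are ours, and the log-likelihood is assumed to have been computed elsewhere by the structure-learning algorithm.

```python
import math


def penalized_score(log_likelihood, n_free_params, n_samples, k_p):
    """log Pr(D | G, Theta_hat) - k_p * #G * log N for one candidate structure."""
    return log_likelihood - k_p * n_free_params * math.log(n_samples)


# Hypothetical candidates: (log-likelihood, number of free parameters).
sparse_graph = (-120.0, 4)
dense_graph = (-110.0, 12)
n_samples = 100

for k_p in (0.0, 0.25, 0.5):
    sparse = penalized_score(*sparse_graph, n_samples, k_p)
    dense = penalized_score(*dense_graph, n_samples, k_p)
    print(k_p, "dense preferred" if dense > sparse else "sparse preferred")
```

With k_{p} = 0 the criterion reduces to the plain ML score, which favours the denser graph in this toy case; larger values of k_{p} shift the preference towards sparser graphs.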
In models based on differential equations [4, 5], a regulatory network is represented by the equation system dx/dt = f(x(t)) − Kx(t), where x(t) = (x_{1}(t), ⋯, x_{n}(t)) denotes the expression levels of the n genes and K (a matrix) denotes the degradation rates of the genes. The regulatory relationships among genes are then characterized by f(·). To get networks with different levels of sparseness, we applied different thresholds to the connection matrix to get final edges. In our experiments we use Murphy's Bayesian Network Toolbox [15] for the DBN approach and TRNinfer [5] for the differential equation approach; we refer to them as DBI and DEI, respectively.
Refinement algorithms in our previous work
The principle of our phylogenetic approach is that phylogenetically close organisms are likely to have similarly close regulatory networks; thus independent network inference errors at the leaves get corrected in the ancestral reconstruction process along the phylogeny. In [11], we gave two refinement algorithms, RefineFast and RefineML. Each uses the globally optimized parents of the leaves to refine the leaves, but the first simply picks a new leaf network by sampling from the inferred distribution (given the parent network, the evolutionary model parameters, and the phylogeny), while the second combines the inferred distribution with a prior, the existing leaf network, using a precomputed belief coefficient that indicates our confidence in the current leaf network, and returns the most likely network under these parameters.
Inference of gene duplication and loss history
To infer ancestral networks with the extended network evolution model, we need a full history of gene duplications and losses for the gene families. Reconciliation of gene trees with the species tree [16–18] is one way to infer this history. The species tree is the phylogenetic tree whose leaves correspond to the modern organisms; gene duplications and losses occur along the branches of this tree. A gene tree is a phylogenetic tree whose leaves correspond to genes in orthologous gene families across the organisms of interest; in such a tree, gene duplications and speciations are associated with internal nodes.
When gene duplications and losses occur, the species trees and the gene trees may legitimately differ in topology. Reconciling these superficially conflicting topologies (that is, explaining the differences through a history of gene duplications and losses) is known as lineage sorting or reconciliation and produces a list of gene duplications and losses along each edge in the species tree. While reconciliation is a hard computational problem, algorithms have been devised for it in a Bayesian framework [16] or using a simple parsimony criterion [17]. In our experiments, we use the parsimony-based reconciliation tool Notung [17], but we also investigate the effect of using alternate histories.
The evolutionary model
The new network evolutionary model
Although transcriptional regulatory networks produced from bench experiments are available for only a few model organisms, other types of data have been used to assist in the comparative study of regulatory mechanisms across organisms [19–21]. For example, gene-expression data [21], sequence data like transcription factor binding site (TFBS) data [19, 20], and cis-regulatory elements [21] have all been used in this context. Moreover, a broad range of model organisms have been studied, including bacteria [7], yeast [19, 21], and fruit fly [20]. While these studies offer some insights, they have not to date sufficed to establish a clear model for regulatory networks or their evolution. Our new model remains simple, but can easily be generalized or associated with other models, such as the evolutionary model of TFBSs [22]. In this new model, the networks are represented by binary adjacency matrices.
The evolutionary operations are:

Gene duplication: a gene is duplicated with probability p_{d}. After a duplication, edges for the newly generated copy can be assigned as follows:
Neutral initialization: Create connections between the new copy and other genes randomly according to the proportion π_{1} of edges in the background network, independently of the original copy.
Inheritance initialization: Connections of the duplicated copy are reported to correlate with those of the original copy [7–9]. This observation suggests letting the new copy inherit the connections of the original, then lose some of them or gain new ones at some fixed rate [23].
Preferential attachment: The new copy gets preferentially connected to genes with high connectivity [23, 24].

Gene loss: a gene is deleted along with all its connections with probability p_{ l }.

Edge gain: an edge between two genes is generated with probability p_{01}.

Edge loss: an existing edge is deleted with probability p_{10}.
The model parameters are thus p_{d}, p_{l}, the proportions of 0s and 1s in the networks Π = (π_{0} π_{1}), the substitution matrix of 0s and 1s, P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}, plus parameters suitable to the initialization model.
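The four operations can be sketched as a single evolution step along one branch. This is a minimal illustration under several assumptions of ours: losses are applied before duplications, new copies use neutral initialization, the matrix is unweighted, and every entry (including the diagonal) may carry an edge. The function name `evolve_branch` is hypothetical.

```python
import random


def evolve_branch(adj, p_d, p_l, p01, p10, pi1, rng):
    """Return a child 0/1 adjacency matrix evolved from `adj` along one branch."""
    # Gene loss: each gene is deleted, with all its connections, with probability p_l.
    kept = [g for g in range(len(adj)) if rng.random() >= p_l]
    child = [[adj[r][c] for c in kept] for r in kept]

    # Gene duplication: each surviving gene spawns a copy with probability p_d.
    n_dup = sum(1 for _ in kept if rng.random() < p_d)
    for _ in range(n_dup):
        n = len(child)
        # Neutral initialization: each new entry is an edge with probability pi1.
        for row in child:
            row.append(1 if rng.random() < pi1 else 0)
        child.append([1 if rng.random() < pi1 else 0 for _ in range(n + 1)])

    # Edge gain (0 -> 1 with probability p01) and edge loss (1 -> 0 with p10).
    for row in child:
        for c, value in enumerate(row):
            if value == 0 and rng.random() < p01:
                row[c] = 1
            elif value == 1 and rng.random() < p10:
                row[c] = 0
    return child


rng = random.Random(0)
parent = [[0, 1], [1, 0]]
child = evolve_branch(parent, p_d=0.1, p_l=0.1, p01=0.05, p10=0.1, pi1=0.2, rng=rng)
```

Applying this step recursively from the root down a tree yields one network per leaf.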
Models of gene duplications and losses
While networks evolve according to the network evolutionary model described above, a history of gene duplications and losses is created along the evolution. However, during reconstruction, this history may not be exactly reconstructed. Therefore, we propose other models of gene duplications and losses to approximate the true history:

The duplication-only model: We assume that different gene contents are due exclusively to gene duplication events.

The loss-only model: We assume that different gene contents are due exclusively to gene loss events.
We also compare outcomes when the true history is known.
The new refinement methods
 1. Reconstruct the history of gene duplications and losses, from which the gene contents for the ancestral regulatory networks (at each internal node of the species tree) can be determined. We present algorithms for history reconstruction with different gene duplication and loss models.
 2. Infer the edges in the ancestral networks once we have the genes of these networks. We do this using a revised version of FastML.
 3. Refine the leaf networks with new versions of RefineFast and RefineML.
 4. Repeat steps 2 and 3 as needed.
Inferring gene duplication and loss history
The duplication-only and loss-only models simplify the inference of the gene duplication and loss history and of the gene contents of the ancestors. For a given internal node of the phylogenetic tree, under the duplication-only assumption, its set of genes is the intersection of the gene sets of all the leaves in the subtree rooted at this node, while under the loss-only assumption, its set of genes is the union of the gene sets of all the leaves of the subtree. Gene duplication and loss histories inferred with these methods have a minimum number of gene duplications or losses, respectively; they are optimal under the model.
Note that this approach assumes 1-1 orthologies, whereas orthology is a many-to-many relationship. In biological practice, however, 1-1 orthologies are by far the most common.
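The intersection and union rules can be sketched in a few lines. The tree encoding (a dictionary from each internal node to its children), the gene labels, and the function name `ancestral_gene_sets` are illustrative assumptions; genes are taken to be identified by 1-1 orthology labels shared across species.

```python
def ancestral_gene_sets(tree, root, leaf_genes, model):
    """Gene content of every node under the duplication-only or loss-only model."""
    if model == "duplication-only":
        combine = set.intersection   # ancestors hold only genes present in all leaves
    elif model == "loss-only":
        combine = set.union          # ancestors hold every gene seen in any leaf
    else:
        raise ValueError("model must be 'duplication-only' or 'loss-only'")
    contents = {}

    def visit(node):
        if node in leaf_genes:
            contents[node] = set(leaf_genes[node])
        else:
            contents[node] = combine(*[visit(child) for child in tree[node]])
        return contents[node]

    visit(root)
    return contents


tree = {"root": ["A", "u"], "u": ["B", "C"]}
leaf_genes = {"A": {"g1", "g2"}, "B": {"g1", "g2", "g3"}, "C": {"g1", "g3"}}
dup_only = ancestral_gene_sets(tree, "root", leaf_genes, "duplication-only")
loss_only = ancestral_gene_sets(tree, "root", leaf_genes, "loss-only")
```

In this toy example the duplication-only root holds only g1, whereas the loss-only root holds g1, g2, and g3.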
Inferring ancestral networks
Given P', let i, j, k denote tree nodes and let a, b, c ∈ S' denote possible values of a character at some node. For each character a at each node i, we maintain two variables:

L_{ i }(a): the likelihood of the best reconstruction of the subtree with root i, given that the parent of i is assigned character a.

C_{ i }(a): the optimal character for i, given that its parent is assigned character a.
 1. If leaf i has character b, then, for each a ∈ S', set C_{i}(a) = b and L_{i}(a) = p'_{ab}, where p'_{ab} denotes the entry of P' giving the probability that character a at a parent becomes b at its child.
 2. If i is an internal node other than the root, with children j and k that have both been processed, and i itself has not yet been processed, then

if i has character x, for each a ∈ S', set L_{i}(a) = p'_{ax}·L_{j}(x)·L_{k}(x) and C_{i}(a) = x;

otherwise, for each a ∈ S', set L_{i}(a) = max_{c∈{0,1}} p'_{ac}·L_{j}(c)·L_{k}(c) and C_{i}(a) = argmax_{c∈{0,1}} p'_{ac}·L_{j}(c)·L_{k}(c).

 3. If there remain unvisited nonroot nodes, return to Step 2.
 4. If i is the root node, with children j and k, and its character is not already identified as x, assign it the value a ∈ {0,1} that maximizes π_{a}·L_{j}(a)·L_{k}(a).
 5. Traverse the tree from the root, assigning to each node i the character C_{i}(a), where a is the character assigned to its parent.
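The procedure above can be sketched for a single site (one entry of the adjacency matrix) as follows. This is an illustration, not the revised FastML itself: it assumes a binary tree, one transition table shared by all branches (branch lengths are ignored), and a placeholder table `P_ext` over {0, 1, x} whose numerical values are invented for the example. The names `fastml_site`, `fixed_x`, and `P_ext` are ours.

```python
def fastml_site(tree, root, leaf_char, fixed_x, P, pi):
    """Joint ML assignment of one site over a binary tree.

    tree:      internal node -> (left child, right child)
    leaf_char: leaf -> '0', '1' or 'x'
    fixed_x:   internal nodes already known (from the gene history) to carry 'x'
    P:         P[a][b], assumed probability that parent character a becomes b
    pi:        pi[a], proportion of character a, used at the root
    """
    L, C = {}, {}

    def up(i):
        if i in leaf_char:                       # Step 1: leaves
            b = leaf_char[i]
            L[i] = {a: P[a][b] for a in "01x"}
            C[i] = {a: b for a in "01x"}
            return
        j, k = tree[i]
        up(j)
        up(k)
        if i == root:
            return
        options = "x" if i in fixed_x else "01"  # Step 2: internal nonroot nodes
        L[i], C[i] = {}, {}
        for a in "01x":
            best = max(options, key=lambda c: P[a][c] * L[j][c] * L[k][c])
            C[i][a] = best
            L[i][a] = P[a][best] * L[j][best] * L[k][best]

    up(root)
    j, k = tree[root]
    if root in fixed_x:                          # Step 4: root
        assign = {root: "x"}
    else:
        assign = {root: max("01", key=lambda a: pi[a] * L[j][a] * L[k][a])}

    def down(i):                                 # Step 5: top-down traversal
        if i in tree:
            for child in tree[i]:
                assign[child] = C[child][assign[i]]
                down(child)

    down(root)
    return assign


# Placeholder transition table and proportions (illustrative values only).
P_ext = {"0": {"0": 0.8, "1": 0.1, "x": 0.1},
         "1": {"0": 0.2, "1": 0.7, "x": 0.1},
         "x": {"0": 0.15, "1": 0.05, "x": 0.8}}
pi = {"0": 0.5, "1": 0.5}
tree = {"r": ("u", "C"), "u": ("A", "B")}
result = fastml_site(tree, "r", {"A": "1", "B": "1", "C": "0"}, set(), P_ext, pi)
```

Running the same routine once per site reconstructs every entry of the ancestral adjacency matrices.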
Refining leaf networks: RefineFast
 1. For each entry (i, j) of each leaf network A_{l}, with A_{p} its parent network:

if A_{l}(i, j) ≠ x and A_{p}(i, j) ≠ x, evolve A_{p}(i, j) by P to get A'_{l}(i, j);

otherwise, assign A'_{l}(i, j) = A_{l}(i, j).

 2. Use A'_{l}(i, j) to replace A_{l}(i, j).
In this algorithm, the original leaf networks are used only in the first round of ancestral reconstruction, after which they are replaced with the sample networks drawn from the distribution of possible children of the parents.
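A minimal sketch of this sampling step is given below. It assumes that the leaf and parent matrices are aligned on a common gene index, with 'x' marking entries whose gene is absent, and that "evolving by P" means drawing the child character from the row of P indexed by the parent character. The function name `refine_fast` and the numerical values of P are illustrative.

```python
import random


def refine_fast(leaf, parent, P, rng):
    """Resample each leaf entry from its parent entry; keep entries involving 'x'."""
    refined = []
    for leaf_row, parent_row in zip(leaf, parent):
        new_row = []
        for b, a in zip(leaf_row, parent_row):
            if b != "x" and a != "x":
                # Draw the child character from the distribution P[a].
                new_row.append("1" if rng.random() < P[a]["1"] else "0")
            else:
                new_row.append(b)
        refined.append(new_row)
    return refined


P = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.3, "1": 0.7}}
leaf = [["0", "1", "x"], ["1", "0", "x"], ["x", "x", "x"]]
parent = [["1", "1", "0"], ["0", "0", "1"], ["0", "0", "0"]]
refined = refine_fast(leaf, parent, P, random.Random(0))
```

Because the new entry depends only on the parent entry, the current leaf value affects later rounds only through the ancestral reconstruction.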
Refining leaf networks: RefineML
where h_{ i }= 0 if leaf i has the corresponding genes, h_{ i }= 1 otherwise.
 1. Learn the CPT parameters for the leaf networks reconstructed by the base inference algorithm and calculate the belief coefficient k_{b} for every site.
 2. For each entry (i, j) of each leaf network A_{l}, with A_{p} its parent network:

If A_{l}(i, j) ≠ x and A_{p}(i, j) ≠ x, let a = A_{p}(i, j), b = A_{l}(i, j),
 (a) let Q(c) = k_{b} if b = c, 1 − k_{b} otherwise;
 (b) calculate the likelihood L(a) = max_{c∈{0,1}} p_{ac}·Q(c);
 (c) assign A'_{l}(i, j) = arg max_{c∈{0,1}} p_{ac}·Q(c).

Otherwise, assign A'_{l}(i, j) = A_{l}(i, j).

 3. Use A'_{l}(i, j) to replace A_{l}(i, j).
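Step 2 can be sketched for a single entry as follows. The belief coefficient k_{b} is taken as given (its computation is not reproduced here), and the function name `refine_ml_entry` and the values of P are illustrative assumptions.

```python
def refine_ml_entry(a, b, P, k_b):
    """Return (new character, likelihood) for parent character a and leaf character b."""
    if a == "x" or b == "x":
        return b, None                      # entries involving an absent gene are kept

    def score(c):
        q = k_b if c == b else 1.0 - k_b    # Q(c): prior from the current leaf value
        return P[a][c] * q                  # p_ac * Q(c)

    best = max("01", key=score)
    return best, score(best)


P = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.3, "1": 0.7}}
weak_belief = refine_ml_entry("0", "1", P, k_b=0.6)     # parent overrides the leaf
strong_belief = refine_ml_entry("0", "1", P, k_b=0.95)  # leaf value is retained
```

With a weak belief in the inferred leaf edge, the parent's value wins; with a strong belief, the leaf keeps its edge even though the parent lacks it.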
Experimental design
To test the performance of our approach, we need regulatory networks as the input to our refinement algorithms. In our simulation experiments, we evolve networks along a given tree from a chosen root network to obtain the "true" leaf networks. Then, in order to reduce the correlation between generation and reconstruction of networks, we use the leaf networks to create simulated expression data and use our preferred network inference method to reconstruct networks from the expression data. These inferred networks are the true starting point of our refinement procedure; we use the simulated gene-expression data only to achieve better separation between the generation of networks and their refinement, and also to provide a glimpse of a full analysis pipeline for biological data. We then compare the inferred networks before and after refinement against the "true" networks (generated in the first step).
Despite the advantages of such simulation experiments (which allow an exact assessment of the performance of the inference and refinement algorithms), results on biological data are highly desirable, as such data may prove quite different from what was generated in our simulations. TFBS data is used to study regulatory networks, assuming that the regulatory interactions determined by transcription factor (TF) binding share many properties with the real interactions [19, 20, 25]. Given this close relationship between regulatory networks and TFBSs and given the large amount of available data on TFBSs, we chose to use TFBS data to derive regulatory networks for the organisms as their "true" networks rather than generate these networks through simulation. In this fashion, we produce datasets for the cis-regulatory modules (CRMs) for 12 species of Drosophila.
With the extended evolutionary model, conducting experiments with real data involves several extra steps besides the refinement step, each of which is a potential source of errors. For example, assuming we have identified gene families of interest, we need to build gene trees or assign orthologies for these genes to be able to reconstruct a history of duplications and losses. Any error in gene tree reconstruction or orthology determination leads to magnified errors in the history of duplications and losses. Assessing the results under such circumstances (no knowledge of the true networks and many complex sources of error) is not possible, so we turned to simulation for this part of the testing. This decision does not prejudice our ability to apply our approach to real data and to infer high-quality networks: it only reflects our inability to compute precise accuracy scores on biological data.
Experiments on biological data with the basic evolutionary model
We use regulatory networks derived from TFBS data as the "true" networks for the organisms rather than generating these networks through simulations. Such data is available for the Drosophila family (whose phylogeny is well studied) with 12 organisms: D. simulans, D. sechellia, D. melanogaster, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. The TFBS data is drawn from the work of Kim et al. [22], where the TFBSs are annotated for all 12 organisms on 51 CRMs.
We conduct separate experiments on different CRMs. For each CRM, we choose orthologous TFBS sequences of 6 transcription factors (TFs): Dstat, Bicoid, Caudal, Hunchback, Kruppel, and Tailless, across the 12 organisms. Then for each organism, we can get a network with these 6 TFs and the target genes indicated by the TFBS sequences, where the arcs are determined by the annotation of TFBSs and the weights of arcs are calculated from the binding scores provided in [22]. (In this paper we do not distinguish TFs and target genes and call them all "genes.") These networks are regarded as the "true" regulatory networks for the organisms.
Gene-expression data is then generated from these "true" regulatory networks; data is generated independently for each organism, using procedure DBNSim, based on the DBN model [11]. Following [14], DBNSim uses binary gene-expression levels, where 1 and 0 indicate that the gene is, respectively, on and off. Denote the expression level of gene g_{i} by x_{i}, x_{i} ∈ {0,1}; if m_{i} nodes have arcs directed to g_{i} in the network, let the expression levels of these nodes be denoted by the vector y = (y_{1}, y_{2}, ⋯, y_{m_i}) and the weights of their arcs by the vector w = (w_{1}, w_{2}, ⋯, w_{m_i}). From y and w, we can get the conditional probability Pr(x_{i} | y). Once we have the full parameters of the leaf networks, we generate simulated time-series gene-expression data. At the initial time point, the value of x_{i} is generated by the initial distribution Pr(x_{i}); x_{i} at time t is generated based on y at time t − 1 and the conditional probability Pr(x_{i} | y). We generate 100 time points of gene-expression data for each network in this manner. With this data we can apply our approach. DBI is applied to infer regulatory networks from the gene-expression data. The inferred networks are then refined by RefineFast and RefineML. The whole procedure is run 10 times to provide smoothing and we report average performance over these runs.
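The following sketch illustrates the shape of such a binary time-series generator. It is not DBNSim: the text does not specify how Pr(x_{i} | y) is derived from y and w, so the sketch assumes a logistic function of the weighted parent states, a uniform initial distribution, and a signed weight matrix. The function name `simulate_binary_series` is hypothetical.

```python
import math
import random


def simulate_binary_series(weights, n_steps, rng, p_init=0.5):
    """weights[j][i] is the signed weight of the arc from gene j to gene i (0 = no arc)."""
    n = len(weights)
    # Initial time point: each gene is on with probability p_init (assumed).
    series = [[1 if rng.random() < p_init else 0 for _ in range(n)]]
    for _ in range(n_steps - 1):
        prev = series[-1]
        current = []
        for i in range(n):
            # Assumed form of Pr(x_i = 1 | y): logistic in the weighted parent
            # states, with off/on parents mapped to -1/+1.
            drive = sum(weights[j][i] * (2 * prev[j] - 1) for j in range(n))
            p_on = 1.0 / (1.0 + math.exp(-drive))
            current.append(1 if rng.random() < p_on else 0)
        series.append(current)
    return series


# Toy 3-gene network: gene 0 activates gene 1, gene 1 represses gene 2.
weights = [[0.0, 2.0, 0.0],
           [0.0, 0.0, -2.0],
           [0.0, 0.0, 0.0]]
data = simulate_binary_series(weights, n_steps=100, rng=random.Random(0))
```

The value at time t depends only on the parents' values at time t − 1, which is the first-order dependence a DBN structure-learning algorithm expects.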
Experiments on simulated data with the extended model
Data simulation
In these experiments, the "true" networks for the organisms and their gene-expression data are both generated, starting from three pieces of input information: the phylogenetic tree, the network at the root, and the evolutionary model. While simulated data allows an absolute evaluation of our refinement algorithms, specific precautions need to be taken against systematic bias during data simulation and result analysis. We use a wide variety of phylogenetic trees from the literature (of modest sizes: between 20 and 60 taxa) and several choices of root networks, the latter variations on part of the yeast network from the KEGG database [26], as also used by Kim et al. [3]; we also explore a wide range of evolutionary rates, especially different rates of gene duplication and loss. The root network is of modest size, between 14 and 17 genes, a relatively easy case for inference algorithms and thus also a more challenging case for a boosting algorithm.
We first generate the leaf networks that are used as the "true" regulatory networks for the chosen organisms. Since we need quantitative relationships in the networks in order to generate gene-expression data from each network, in the data generation process, we use adjacency matrices with signed weights. Weight values are assigned to the root network, yielding a weighted adjacency matrix A_{p}. To get the adjacency matrix for its child A_{c}, according to the extended network evolution model, we follow two steps: evolve the gene contents and evolve the regulatory connections. First, genes are duplicated or lost with probabilities p_{d} and p_{l}. If a duplication happens, a row and column for this new copy are added to A_{p}, the values initialized either according to the neutral initialization model or the inheritance initialization model. (We conducted experiments under both models.) We denote the resulting adjacency matrix as A'_{p}. Second, edges in A'_{p} are mutated according to p_{01} and p_{10} to get A_{c}. We repeat this process as we traverse down the tree to obtain weighted adjacency matrices at the leaves, which is standard practice in the study of phylogenetic reconstruction [27, 28].
To test our refinement algorithms on different kinds of data, besides DBNSim, we also use Yu's GeneSim [29] to generate continuous gene-expression data from the weighted leaf networks. Denoting the gene-expression levels of the genes at time t by the vector x(t), the values at time t + 1 are calculated according to x(t + 1) = x(t) + (x(t) − z)C + ε, where C is the weighted adjacency matrix of the network, the vector z represents constitutive expression values for each gene, and ε models noise in the data. The values of x(0) and x_{i}(t) for those genes without parents are chosen uniformly at random from the range [0,100], while the values of z are all set to 50. The term (x(t) − z)C represents the effect of the regulators on the genes; this term needs to be amplified for the use of DBI, because of the required discretization. We use a factor k_{e} with the regulation term (set to 7 in our experiments), yielding the new equation x(t + 1) = x(t) + k_{e}(x(t) − z)C + ε.
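One update of the amplified equation can be sketched as below. This is not GeneSim itself: the noise term is assumed to be Gaussian with a standard deviation chosen for illustration, x is treated as a row vector so that entry i of (x − z)C sums over the regulators j of gene i, and the resampling of parentless genes is omitted. The function name `genesim_step` is hypothetical.

```python
import random


def genesim_step(x, C, z, k_e, rng, noise_sd=1.0):
    """One step of x(t + 1) = x(t) + k_e * (x(t) - z) C + eps."""
    n = len(x)
    deviation = [x[j] - z[j] for j in range(n)]
    return [x[i]
            + k_e * sum(deviation[j] * C[j][i] for j in range(n))
            + rng.gauss(0.0, noise_sd)      # assumed Gaussian noise
            for i in range(n)]


# Toy 2-gene network: gene 0 activates gene 1 with weight 0.1.
C = [[0.0, 0.1],
     [0.0, 0.0]]
z = [50.0, 50.0]
x_next = genesim_step([60.0, 50.0], C, z, k_e=7, rng=random.Random(0))
```

In the noise-free case, gene 1 rises by k_{e} × (60 − 50) × 0.1 = 7 while gene 0, which has no regulator, is unchanged.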
Groups of experiments
With two data generation methods, DBNSim and GeneSim, and two base inference algorithms, DBI and DEI, we can conduct experiments with different combinations of data generation methods and inference algorithms to verify that our boosting algorithms work under all circumstances. First, we use DBNSim to generate data for DBI. 13 × n time points are generated for a network with n genes, since larger networks generally need more samples to gain inference accuracy comparable to smaller ones. Second, we apply DEI to datasets generated by GeneSim to infer the networks. Since the DEI tool TRNinfer does not accept large datasets (with many time points), here we use smaller datasets than the previous group of experiments with at most 75 time points. For each setup, experiments with different rates of gene duplication and loss are conducted.
For each combination of rates of gene duplication and loss, data generation methods, and base network inference methods, we get the networks inferred by DBI or DEI for the family of organisms. We then run refinement algorithms on each set of networks with different gene duplication and loss histories: the duplication-only history, the loss-only history, the history reconstructed by FastML given the true orthology assignment, and that reconstructed by Notung [17] without orthology information as input. In addition, since simulation experiments allow us to record the true gene duplication and loss history during data generation, we can also test the accuracy of the refinement algorithms with the true history, without mixing their performance with that of gene tree reconstruction or reconciliation. Each experiment is run 10 times to obtain average performance.
Measurements
We want to examine the predicted networks at different levels of sensitivity and specificity. For DBI, on each dataset, we apply different penalty coefficients to predict regulatory networks, from 0 to 0.5, with an interval of 0.05, which results in 11 discrete coefficients. For each penalty coefficient, we apply RefineFast and RefineML on the predicted networks. For DEI, we also choose 11 thresholds for each predicted weighted connection matrix to get networks on various sparseness levels. For each threshold, we apply RefineFast on the predicted networks. We measure specificity and sensitivity to evaluate the performance of the algorithms and plot the values, as measured on the results for various penalty coefficients (for DBI) and thresholds (for DEI) to yield ROC curves. In such plots, the larger the area under the curve, the better the results.
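The two measurements can be computed from a true and a predicted adjacency matrix as in the sketch below. It assumes that both matrices are 0/1, have the same shape, and that every entry (including the diagonal) counts as a potential edge; the function name is ours.

```python
def sensitivity_specificity(true_adj, pred_adj):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = fp = tn = fn = 0
    for true_row, pred_row in zip(true_adj, pred_adj):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif t:
                fn += 1
            elif p:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


true_net = [[0, 1], [1, 0]]
pred_net = [[0, 1], [0, 1]]
sens, spec = sensitivity_specificity(true_net, pred_net)  # (0.5, 0.5)
```

Evaluating this pair once per penalty coefficient (or threshold) gives the points from which the ROC curve is drawn.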
Results and analysis
Results on biological data with the basic evolutionary model
The proportion of edges shared by different numbers of species
Number of species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Proportion of edges | 0.19 | 0.18 | 0.03 | 0.07 | 0.03 | 0.09 | 0.03 | 0.09 | 0.02 | 0.07 | 0.07 | 0.13
Cumulative fraction | 0.19 | 0.37 | 0.40 | 0.47 | 0.50 | 0.59 | 0.62 | 0.71 | 0.73 | 0.80 | 0.87 | 1.00
Results with extended evolutionary model
We used different evolutionary rates to generate the networks for the simulation experiments. In [11] we tested mainly edge gain or loss rates; here we focus on testing different gene duplication and loss rates. We also conducted experiments on various combinations of gene-expression data generation methods and network inference methods. The inferred networks were then refined by refinement algorithms with different models of gene duplications and losses. We do not directly compare the extended model with the basic, as the two do not lend themselves to a fair comparison: for instance, the basic model requires equal gene contents across all leaves, something that can only be achieved by restricting the data to a common intersection, thereby catastrophically reducing sensitivity.
Since the results of using neutral initialization and inheritance initialization in data generation are very similar, we only show results with the neutral initialization model. We first refine networks with the true gene duplication and loss history to test the pure performance of the refinement algorithms, then we present and discuss the results of refinement algorithms with several other gene evolution histories, which are more suitable for the application on real biological data. All results we show below are averages over 10 runs.
Refine with true history of gene duplications and losses
Refine with duplicationonly and lossonly histories
We have seen from Figs. 2 and 3 that our two refinement algorithms improve the networks inferred by both DBI and DEI. Since the accuracy of DBI is much better than that of DEI, which causes more difficulty for refinement algorithms, and since RefineML does clearly better than RefineFast, hereafter we only show results with DBI inference and RefineFast refinement, which are on the same datasets as used in Fig. 2.
Refine with inferred histories of gene duplications and losses
In both Fig. 5(A) and Fig. 5(B), the FastML-reconstructed history with correct orthology does as well as the true history. In fact, the history is very accurately reconstructed, which explains why the two curves agree so closely. However, with the history reconstructed by FastML under random orthology assignments, the refinement algorithm improves only slightly over the base algorithm. With Notung inference, RefineFast still dominates DBI in Fig. 5(B), but not in Fig. 5(A), which has higher evolutionary rates.
On using histories of gene duplications and losses, and orthology assignments
 1.
Good orthology assignments are important.
 2.
When we have good orthology assignments, the refinement algorithms need not rely on the true history of gene duplications and losses. We can use the loss-only history or the history reconstructed by FastML, both of which are easy to build and lead to performance similar to that of the true history.
Conclusions and future work
We presented a model, associated algorithms, and experimental results to test the hypothesis that a more refined model of transcriptional regulatory network evolution would support additional refinements in accuracy. Specifically, we presented a new version of our evolutionary approach to refine the accuracy of transcriptional regulatory networks for phylogenetically related organisms, based on an extended network evolution model, which takes into account gene duplication and loss. As these events are thought to play a crucial role in evolving new functions and interactions [8, 9], integrating them into the model both extends the range of applicability of our refinement algorithms and enhances their accuracy. Furthermore, to give a comprehensive analysis of the factors which affect the performance of the refinement algorithms, we conducted experiments with different histories of gene duplications and losses, and different orthology assignments. Results of experiments under various settings show the effectiveness of our refinement algorithms with the new model throughout a broad range of gene duplications and losses.
We also collected regulatory networks from the TFBS data of 12 Drosophila species and applied our approach (using the basic model), with very encouraging results. These results confirm that phylogenetic relationships carry over to the regulatory networks of a family of organisms and can be used to improve the network inference and to help with further analysis of regulatory systems and their dynamics and evolution.
Our positive results with the extended network evolution model show that refined models can be used in inference to good effect. Our current model can itself be refined by using the widely studied evolution of TFBS [20, 22].
Declarations
Acknowledgements
A preliminary version of this paper appeared in the proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI'09) [30].
References
1. Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Proc. 4th Pacific Symp. on Biocomputing (PSB'99), World Scientific. 1999, 4: 17-28.
2. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian Networks to Analyze Expression Data. J Comput Bio. 2000, 7 (3-4): 601-620. 10.1089/106652700750050961.
3. Kim SY, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics. 2003, 4 (3): 228-235. 10.1093/bib/4.3.228.
4. Chen T, He HL, Church GM: Modeling gene expression with differential equations. Proc. 4th Pacific Symp. on Biocomputing (PSB'99), World Scientific. 1999, 29-40.
5. Wang R, Wang Y, Zhang X, Chen L: Inferring Transcriptional Regulatory Networks from High-throughput Data. Bioinformatics. 2007, 23 (22): 3056-3064. 10.1093/bioinformatics/btm465.
6. Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA: Structure and evolution of transcriptional regulatory networks. Curr Opinion in Struct Bio. 2004, 14 (3): 283-291. 10.1016/j.sbi.2004.05.004.
7. Babu MM, Teichmann SA, Aravind L: Evolutionary Dynamics of Prokaryotic Transcriptional Regulatory Networks. J Mol Bio. 2006, 358 (2): 614-633. 10.1016/j.jmb.2006.02.019.
8. Teichmann SA, Babu MM: Gene regulatory network growth by duplication. Nature Genetics. 2004, 36 (5): 492-496. 10.1038/ng1340.
9. Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA: Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution. 2007, 308B: 58-73. 10.1002/jez.b.21124.
10. Bourque G, Sankoff D: Improving gene network inference by comparing expression time-series across species, developmental stages or tissues. J Bioinform Comput Biol. 2004, 2 (4): 765-783. 10.1142/S0219720004000892.
11. Zhang X, Moret BME: Boosting the Performance of Inference Algorithms for Transcriptional Regulatory Networks Using a Phylogenetic Approach. Proc. 8th Int'l Workshop Algs. in Bioinformatics (WABI'08). 2008, 5251: 245-258. Lecture Notes in Computer Science, Springer.
12. Pupko T, Pe'er I, Shamir R, Graur D: Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences. Mol Bio Evol. 2000, 17 (6): 890-896.
13. Friedman N, Murphy KP, Russell S: Learning the structure of dynamic probabilistic networks. Proc. 14th Conf. on Uncertainty in Art. Intell. UAI'98. 1998, 139-147.
14. Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Proc. 3rd Pacific Symp. on Biocomputing (PSB'98), World Scientific. 1998, 3: 18-29.
15. Murphy KP: The Bayes Net Toolbox for MATLAB. Computing Sci and Statistics. 2001, 33: 331-351.
16. Arvestad L, Berglund AC, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. Proc. 8th Ann. Int'l Conf. Comput. Mol. Bio. (RECOMB'04). 2004, 326-335. New York, NY, USA: ACM.
17. Durand D, Halldórsson BV, Vernot B: A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. J Comput Bio. 2006, 13 (2): 320-335. 10.1089/cmb.2006.13.320.
18. Page RDM, Charleston MA: From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem. Molecular Phylogenetics and Evolution. 1997, 7 (2): 231-240. 10.1006/mpev.1996.0390.
19. Crombach A, Hogeweg P: Evolution of Evolvability in Gene Regulatory Networks. PLoS Comput Biol. 2008, 4 (7): e1000112. 10.1371/journal.pcbi.1000112.
20. Stark A, Kheradpour P, Roy S, Kellis M: Reliable prediction of regulator targets using 12 Drosophila genomes. Genome Res. 2007, 17: 1919-1931. 10.1101/gr.6593807.
21. Tanay A, Regev A, Shamir R: Conservation and evolvability in regulatory networks: The evolution of ribosomal regulation in yeast. Proc Nat'l Acad Sci USA. 2005, 102 (20): 7203-7208. 10.1073/pnas.0502521102.
22. Kim J, He X, Sinha S: Evolution of Regulatory Sequences in 12 Drosophila Species. PLoS Genet. 2009, 5: e1000330. 10.1371/journal.pgen.1000330.
23. Bhan A, Galas DJ, Dewey TG: A duplication growth model of gene expression networks. Bioinformatics. 2002, 18 (11): 1486-1493. 10.1093/bioinformatics/18.11.1486.
24. Barabási AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004, 5: 101-113. 10.1038/nrg1272.
25. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.
26. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34: D354-D357. 10.1093/nar/gkj102.
27. Hillis DM: Approaches for assessing phylogenetic accuracy. Syst Bio. 1995, 44: 3-16.
28. Moret BME, Warnow T: Reconstructing optimal phylogenetic trees: A challenge in experimental algorithmics. Experimental Algorithmics. Edited by: Fleischer R, Moret B, Schmidt E. 2002, 2547: 163-180. Lecture Notes in Computer Science. Springer Verlag.
29. Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004, 20 (18): 3594-3603. 10.1093/bioinformatics/bth448.
30. Zhang X, Moret BME: Improving Inference of Transcriptional Regulatory Networks Based on Network Evolutionary Models. Proc. 9th Int'l Workshop Algs. in Bioinformatics (WABI'09). 2009, 5724: 412-425. Lecture Notes in Computer Science, Springer.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.