 Research
 Open Access
Using graph models to find transcription factor modules: the hitting set problem and an exact algorithm
Algorithms for Molecular Biology volume 8, Article number: 2 (2013)
Abstract
Background
Systematically perturbing a cellular system and monitoring the effects of the perturbations on gene expression provide a powerful approach to studying signal transduction in gene expression systems. A critical step in revealing a signal transduction pathway regulating gene expression is to identify the transcription factors transmitting signals in the system. In this paper, we address the task of identifying modules of cooperative transcription factors based on results derived from systems-biology experiments at two levels: First, a graph algorithm is developed to identify a minimum set of cooperative TFs that covers the differentially expressed genes under each systematic perturbation. Second, using a clique-finding approach, modules of TFs that tend to consistently cooperate under various perturbations are further identified. Our results indicate that this approach is capable of identifying many known TF modules based on individual experiments; thus we provide a novel graph-based method of identifying context-specific and highly reused TF modules.
Background
In order to survive, a cell responds to a variety of environmental and internal perturbations, e.g., environmental stresses and gene mutations, respectively. A common response to cellular perturbations is to activate gene expression programs that induce or repress the expression of genes to cope with the altered homeostasis. Signals originating from a perturbation are often propagated to transcription factors (TFs), which serve as bottlenecks of the signal transduction pathways that regulate transcription programs. Often multiple TFs are involved in regulating one set of genes in a cooperative manner; hence they are referred to as a TF module, and their binding sites in the genome are often referred to as cis-regulatory modules.
Figure 1 shows the conceptual structure of this information flow, where information is transmitted through signaling proteins to TF modules that regulate gene transcription. The binding relations between TFs and genes can be represented by a bipartite graph.
There is an extensive body of literature on identifying cis-regulatory modules by combining a variety of data [1–7]. The most commonly used approach is to study the combinatorial patterns of transcription factor binding sites (TFBSs) in a set of co-expressed genes, often derived from clustering analysis of multiple gene expression datasets. Since the space of TF combinations is extraordinarily large, contemporary TF-module-searching methods usually adopt heuristic or stochastic search algorithms, which cannot guarantee optimal solutions under the given search criteria. In addition, searching for TF modules based on clustering analysis of gene expression data introduces an implicit assumption that the enriched TFBSs regulate the expression of the genes under all conditions. In reality, however, activation of TFs and their subsequent binding to TFBSs often depend on the state of the cellular signaling system in a context-specific fashion. There is thus a need for methods capable of identifying an optimal combination of TFs that explains gene expression patterns in a context-specific setting, a need that cannot be effectively addressed within the conventional statistical framework. Therefore, in this study, we address the task of identifying cooperative TFs from a new perspective: a graph-based approach.
The main ideas underlying our approach are as follows: First, given a set of differentially expressed genes identified from an individual microarray experiment, we identify the cooperative TFs by searching for a minimum set of TFs that bind (cover) these genes with high reliability (reflected as weights associated with TFs) such that each gene is covered by at least t TFs, a weighted t-cover hitting set problem. Our results indicate that the latter constraint enables our approach to recover known cooperative TFs. Second, based on the sets of TFs identified from each microarray experiment, we further find sets of TFs that tend to function cooperatively in multiple instances, thus revealing TF modules at the systems level.
The t-cover hitting set problem is NP-hard. As the previous best algorithm for solving this problem cannot deal with the weighted case and has an impractical time complexity for our task, we developed a new exact algorithm, which not only handles the weighted case of the problem, but also has a significantly lower time complexity, enabling us to find exact solutions for the problems in our study.
Problem Formulation
Data sets

We collected the gene expression data from the seminal study by Hughes et al. [8], in which transcriptional responses to systematic genetic and pharmacological perturbations were investigated.

We collected a protein-DNA interaction graph from Huang et al. and Yeger-Lotem et al. [9, 10]. In this bipartite graph, vertices on one side are TF proteins while vertices on the other side are potential target genes. An edge between a TF and a gene indicates that the TF likely binds to the promoter of the gene, based on ChIP-chip experimental results [9, 10], and the edge weight reflects the reliability of the binding event between the corresponding TF and gene pair.
Finding cooperative TFs for a set of co-regulated genes
Given a set of co-regulated genes, finding a set of cooperative TFs regulating them is a challenging problem in the study of transcriptional regulation [2]. In this study, we cast the task as a graph problem, referred to as the weighted t-cover hitting set problem, and designed an efficient algorithm to solve it.
For a set of co-regulated genes, we induce a subgraph from the bipartite graph representing protein-DNA interactions. The new graph remains bipartite, with one part being all TFs and the other being the given co-regulated genes. If a TF is connected to a gene, we say that the TF covers the gene. The task is to find a subset of TFs with minimum weight, called the t-TF cover, such that each gene is covered by at least t TFs in the cover. The weight of the t-TF cover is the sum of the weights of the TFs in the cover. We define the weight of a TF to be the reciprocal of the sum of its edge weights, as defined in the input bipartite graph by Huang et al. [9, 10]. More specifically, if TF t_{1} is connected to genes g_{1}, g_{2}, …, g_{ k }, then $\mathit{\text{weight}}\left({t}_{1}\right)=\frac{1}{\sum _{i=1}^{k}\mathit{\text{Edge}}\text{\_}\mathit{\text{Weight}}({t}_{1},{g}_{i})}$. Recall that, in our case, an edge weight reflects the binding reliability between a TF and its target gene; as such, a TF that interacts with many genes with high reliability will have a small TF weight, and is thus more likely to be included in the solution of the t-TF cover problem.
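This weighting scheme can be sketched in a few lines of Python. The function and the reliability values below are illustrative, not taken from the paper's data; only the formula (reciprocal of the summed edge weights) comes from the definition above.

```python
# Illustrative sketch of the TF weight defined above: the reciprocal of
# the sum of a TF's edge (binding-reliability) weights.
def tf_weight(edge_weights):
    """edge_weights: binding reliabilities between one TF and its target genes."""
    return 1.0 / sum(edge_weights)

# A TF binding many genes with high reliability gets a small weight, so the
# weight minimization prefers to include it in the t-TF cover.
reliable = tf_weight([0.9, 0.8, 0.95])   # many strong bindings -> small weight
sparse = tf_weight([0.3])                # one weak binding -> large weight
assert reliable < sparse
```

The inversion is what turns "highly reliable, well-connected TF" into "cheap to include" under weight minimization.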
When t = 1, the solution is a minimum set of TFs such that each gene is covered by at least one TF, i.e., a minimum set of TFs that covers all the co-regulated genes. Such a result is not interesting because we aim to find cooperative TFs. By constraining t to be greater than 1, we seek a set of TFs such that each gene is covered by more than one TF in the set. By construction, a solution will preferentially include TFs that co-cover as many genes in the target set as possible, with as high reliability as possible. From a biological viewpoint, the more genes a set of TFs co-covers, the more likely the TFs act cooperatively to regulate those genes. Thus, finding an optimal solution for the t-TF cover problem in this setting is a biologically sensible approach for identifying TF modules.
The problem of finding the t-TF cover is NP-hard, as it is equivalent to two well-known NP-hard problems, the set multicover problem and the t-cover hitting set problem [11, 12]. The t-TF cover problem can easily be reduced to the t-cover hitting set problem. The formal definition of the weighted t-cover hitting set problem is as follows:
weighted t-cover hitting set: Given a universal set X of m elements, a weight function w(x) : X → R^{+}, an integer t, and a family $\mathcal{T}=\{{S}_{1},\dots ,{S}_{n}\}$ of subsets of X, where the size of any subset in $\mathcal{T}$ is at least t, find a subset H ⊆ X of minimum weight such that every subset in $\mathcal{T}$ has at least t elements in H, where the weight of H is defined as $\sum _{x\in H}w\left(x\right)$. We denote an instance of the problem as $(X,\mathcal{T},w,t)$ and call H the minimum t-cover hitting set.^{a}
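To make the definition concrete, the following tiny brute-force solver enumerates all subsets of X and keeps the lightest feasible one. It is exponential in |X| and is only an illustration of the problem statement, not our algorithm; the instance is made up.

```python
from itertools import combinations

# Brute-force illustration of the weighted t-cover hitting set problem:
# find a minimum-weight H such that every subset in T contains >= t
# elements of H. Exponential in |X|; for exposition only.
def min_t_cover_hitting_set(X, T, w, t):
    best, best_w = None, float("inf")
    for r in range(len(X) + 1):
        for H in combinations(sorted(X), r):
            Hs = set(H)
            if all(len(S & Hs) >= t for S in T):   # every subset t-covered?
                weight = sum(w[x] for x in Hs)
                if weight < best_w:
                    best, best_w = Hs, weight
    return best, best_w

# Toy instance: with t = 2, no single element suffices; {"b", "c"} hits
# both subsets twice at total weight 2.0.
X = {"a", "b", "c", "d"}
T = [{"a", "b", "c"}, {"b", "c", "d"}]
w = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 3.0}
H, cost = min_t_cover_hitting_set(X, T, w, t=2)
```
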
As mentioned above, the weighted t-cover hitting set problem is NP-hard, and the previous best algorithm for the unweighted case has a time complexity of O((t + 1)^{n}nm) [12]. In our application, n is the number of co-regulated genes and m is the number of TFs; these numbers are usually large enough to render the existing algorithm impractical for our problem. Thus, designing an efficient exact algorithm for the weighted t-cover hitting set problem is the main goal, and therefore the main contribution, of this study.
Identifying repeatedly used TF modules
For each perturbation instance, a set of differentially expressed genes (considered to be co-regulated genes) can be determined. Using the method in the previous subsection, we can obtain a set of cooperative TFs for each perturbation instance. At this stage, a biologist may want to find modules of TFs that are repeatedly used under various experimental conditions, as such modules are more likely to be part of signal transduction pathways repeatedly involved in responses to different perturbations.
We refer to a set of TFs sharing a common set of target genes as a “hard” TF clique. The formal definition is as follows: a subset of TFs T forms a “hard” TF clique if there exists a subset of genes G such that T and G induce a complete bipartite subgraph in the protein-DNA interaction graph. From a biological viewpoint, the more common target genes are shared by the TFs, the more likely the TFs truly cooperate to regulate the genes. Thus, we weight a hard clique by the number of common target genes covered by its TFs, reflecting the “goodness” of the clique. In such a setting, one way to find TF modules that are repeatedly used across perturbation instances is to find hard TF cliques that occur in multiple instances, where the overall weight of a hard clique is the sum of its weights over all instances. The higher the weight of a clique, the more likely the TFs in the clique function as a module. However, as microarray data usually contain noise and errors, it is difficult to consistently find hard cliques across multiple instances. To address this issue, we introduce a new formulation: search for a TF module whose member TFs tend to cooperatively regulate gene expression in multiple, but not necessarily all, perturbation instances.
First, we use the cooperative TFs found in each instance to build one TF-TF relation graph for each case of t = 1, 2, 3, 4. In such a graph, each TF is a node; a weighted edge between a pair of TFs is added if the TFs are found to cooperate in one or more instances, where the edge weight is the sum of the numbers of common target genes over all instances. Hence, the higher the weight of an edge, the more cooperation instances the two TFs share. We then search for all 3-cliques and 4-cliques in such a TF-TF graph and sort them by weight, where the weight of a clique is the sum of all its edge weights. Since we have relaxed the requirement so that the TFs in these cliques need not form a hard clique in every individual instance, we refer to this approach as finding “soft” TF cliques. At this stage, we limit the search to 3-cliques and 4-cliques because our data [9, 10] indicated that only 15% of genes in yeast are connected to more than 5 TFs.
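The soft-clique search on the weighted TF-TF graph can be sketched as below. The TF names and edge weights are invented for illustration (they are not our computed results); the procedure simply enumerates TF triples, keeps those whose three pairwise edges all exist, and ranks them by total edge weight, as described above.

```python
from itertools import combinations

# Sketch of the "soft" 3-clique search: given a weighted TF-TF relation
# graph (edge weight = summed common-target counts over instances),
# enumerate all triangles and rank them by total edge weight.
def weighted_3_cliques(edges):
    """edges: dict mapping frozenset({tf1, tf2}) -> weight."""
    tfs = sorted({tf for e in edges for tf in e})
    cliques = []
    for trio in combinations(tfs, 3):
        pairs = [frozenset(p) for p in combinations(trio, 2)]
        if all(p in edges for p in pairs):          # all three edges present
            cliques.append((sum(edges[p] for p in pairs), trio))
    return sorted(cliques, reverse=True)            # heaviest clique first

# Made-up example graph; only {Dig1, Mcm1, Ste12} closes into a triangle,
# with weight 12 + 7 + 9 = 28.
edges = {
    frozenset({"Ste12", "Dig1"}): 12,
    frozenset({"Ste12", "Mcm1"}): 7,
    frozenset({"Dig1", "Mcm1"}): 9,
    frozenset({"Ste12", "Tec1"}): 5,
}
ranked = weighted_3_cliques(edges)
```

A 4-clique version is the same enumeration over quadruples with six pairwise edges.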
The clique finding problem is a well-known NP-hard problem. However, since we only need to find cliques of sizes 3 and 4, we can search for them in a graph with at most 214 vertices (the total number of TFs in our case) in a reasonable time. In the Results section, we will compare the “soft” 3-cliques for t = 1, 2, 3, 4 against the hard 3-cliques for t = 3, and the “soft” 4-cliques for t = 1, 2, 3, 4 against the hard 4-cliques for t = 4.
Exact algorithm for the t-cover problem
Finding an exact optimal solution for an NP-hard problem in practical time is usually difficult; it is not uncommon for such a program to run for months or years. Hence, heuristic or greedy algorithms are often designed to solve NP-hard problems approximately. However, heuristic or greedy algorithms cannot guarantee the quality of their solutions. Therefore, in order to design an exact algorithm, we investigated the characteristics of our problem and found that the degrees of many genes are very small (i.e., they are connected to only a small number of TFs in the protein-DNA interaction graph [9, 10]). More specifically, about 70% of genes have degree at most 3, about 85% have degree at most 5, and about 96% have degree at most 10. This characteristic enabled us to design an efficient exact algorithm, presented in Figure 2, for the problem. In addition to its efficiency, our algorithm is the first exact algorithm that can solve the weighted t-cover hitting set problem.
Before proving the correctness and time complexity of the algorithm, we give its basic idea, which is based on the dynamic programming technique. When we expand the subsolutions, if two subsolutions H_{1} and H_{2} hit exactly the same group of subsets in $\mathcal{T}$, we prove that keeping either one of them is sufficient. Hence, if $|\mathcal{T}| = n$, then we keep at most (t + 1)^{n} different subsolutions (note: there are n subsets, each of which can be hit by 0, 1, 2, …, or at least t elements of a subsolution; hence, in total, there are at most (t + 1)^{n} cases). This is the main part of the time complexity and the space complexity. In the algorithm, we also sort $\mathcal{T}$ so that the sizes of the subsets in $\mathcal{T}$ are ordered from smallest to largest (ties are broken arbitrarily). If the sizes of many subsets in $\mathcal{T}$ are bounded, say the sizes of the first k subsets {S_{1}, S_{2}, …, S_{ k }} are bounded by d, we also sort X such that the first $|{\cup}_{i=1}^{j}{S}_{i}|$ elements are exactly ${\cup}_{i=1}^{j}{S}_{i}$ for all j = 1, 2, …, k. Hence, the first $|{\cup}_{i=1}^{k}{S}_{i}|$ elements are ${\cup}_{i=1}^{k}{S}_{i}$. The algorithm then adds the elements of the sorted X into the subsolutions in order (i.e., it first tries to add the first element of X to the subsolutions, then the second element, and so on), so that first S_{1} is hit by at least t elements of each subsolution, then S_{1} and S_{2} are, and so on. It is easy to see that once the first $|{\cup}_{i=1}^{k}{S}_{i}|$ elements of the sorted X have been considered, all of {S_{1}, S_{2}, …, S_{ k }} are hit by at least t elements of each subsolution.
At that point, the number of subsolutions is bounded by ${2}^{|{\cup}_{i=1}^{k}{S}_{i}|}$ (all possible combinations of the first $|{\cup}_{i=1}^{k}{S}_{i}|$ elements of the sorted X). After that, as we only need to remember the hitting statuses of the remaining n - k subsets in $\mathcal{T}$, the number of subsolutions is bounded by (t + 1)^{n-k}. We will show that if the sizes of many subsets in $\mathcal{T}$ are bounded, both ${2}^{|{\cup}_{i=1}^{k}{S}_{i}|}$ and (t + 1)^{n-k} are much smaller than (t + 1)^{n}.
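The bookkeeping described above can be sketched in Python. This is a simplified illustration under our own naming, not the paper's Algorithm 1 (which appears in Figure 2): subsolutions sharing a hit vector (capped at t) are merged, keeping only the lighter one, and a subsolution is discarded once some fully processed subset is hit fewer than t times.

```python
# Dynamic-programming sketch of the exact algorithm's idea: process the
# sorted elements one by one; for each capped hit vector, keep only the
# minimum-weight subsolution; prune subsolutions that under-hit a subset
# all of whose elements have already been processed.
def weighted_t_cover_dp(X_sorted, T_sorted, w, t):
    n = len(T_sorted)
    # hit vector -> (weight, chosen elements); start from the empty subsolution
    Q = {(0,) * n: (0.0, frozenset())}
    seen = set()
    for u in X_sorted:
        seen.add(u)
        Q_new = {}
        for hv, (wt, H) in Q.items():
            take_hv = tuple(min(t, c + (u in S))       # cap counts at t
                            for c, S in zip(hv, T_sorted))
            for hv2, wt2, H2 in ((hv, wt, H),                    # skip u
                                 (take_hv, wt + w[u], H | {u})): # take u
                if hv2 not in Q_new or wt2 < Q_new[hv2][0]:
                    Q_new[hv2] = (wt2, H2)
        # any subset contained in the processed prefix must already be t-hit
        done = [j for j, S in enumerate(T_sorted) if S <= seen]
        Q = {hv: v for hv, v in Q_new.items() if all(hv[j] == t for j in done)}
    return min(Q.values(), key=lambda v: v[0]) if Q else None

# Toy instance, subsets sorted by size and X sorted so set-union prefixes
# come first (S1, then S2 \ X1, then S3 \ X2).
X = ["b", "c", "d", "a"]
T = [{"b", "c"}, {"b", "c", "d"}, {"a", "c", "d"}]
w = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 3.0}
best = weighted_t_cover_dp(X, T, w, t=2)   # ({a, b, c}, weight 4.0)
```

The pruning step is exactly what shrinks the state space when small subsets come first: once S_{1} is fully processed, only subsolutions with c_{1} = t survive.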
Let X = {u_{1}, u_{2}, …, u_{ m }} and $\mathcal{T}=\{{S}_{1},{S}_{2},\dots ,{S}_{n}\}$. We define ${\mathcal{T}}_{X[1:i]}=\{S \mid S\in \mathcal{T}\text{ and }S\subseteq \{{u}_{1},{u}_{2},\dots ,{u}_{i}\}\}$ for 1 ≤ i ≤ m. Let H be a subset of X. We define hit(H) = [c_{1}, c_{2}, …, c_{ n }] and $\mathit{\text{weight}}\left(H\right)=\sum _{u\in H}w\left(u\right)$, where c_{ i } = hit(H)[S_{ i }] = min(t, |S_{ i } ∩ H|) for 1 ≤ i ≤ n; i.e., c_{ i } records how many elements of S_{ i } are in H (note: if some S_{ i } already has at least t elements in a partial solution, we can remove S_{ i } from the problem and no longer need to track its covering status; hence, there is no need to remember any |S_{ i } ∩ H| larger than t). The following lemmas are needed in the proof of the main theorem.
Lemma 0.1
Let H_{1}, H_{2}, H^{′} be three subsets of X such that H_{1} ∩ H^{′} = ∅ and H_{2} ∩ H^{′} = ∅. If hit(H_{1}) = hit(H_{2}), then hit(H_{1} ∪ H^{′}) = hit(H_{2} ∪ H^{′}).
Proof
As hit(H_{1}) = hit(H_{2}), for any ${S}_{i}\in \mathcal{T}$, hit(H_{1})[S_{ i }] = hit(H_{2})[S_{ i }], i.e., min(t, |S_{ i } ∩ H_{1}|) = min(t, |S_{ i } ∩ H_{2}|). Furthermore, because H_{1} ∩ H^{′} = ∅ and H_{2} ∩ H^{′} = ∅, we have that, for any ${S}_{i}\in \mathcal{T}$, min(t, |S_{ i } ∩ (H_{1} ∪ H^{′})|) = min(t, |S_{ i } ∩ H_{1}| + |S_{ i } ∩ H^{′}|) = min(t, |S_{ i } ∩ H_{2}| + |S_{ i } ∩ H^{′}|) = min(t, |S_{ i } ∩ (H_{2} ∪ H^{′})|). Therefore, hit(H_{1} ∪ H^{′}) = hit(H_{2} ∪ H^{′}) and the lemma is proved. □
Lemma 0.1 guarantees that if two subsolutions have the same hit vector, then keeping only the one with the smaller weight is sufficient.
Lemma 0.2
Let ${H}^{\ell}=\{{u}_{{i}_{1}},{u}_{{i}_{2}},\dots ,{u}_{{i}_{\ell}}\}$, whose elements are in the same order as in the sorted X in Algorithm 1 (i.e., if j_{1} < j_{2} with respect to the index of H^{ℓ}, then ${i}_{{j}_{1}}<{i}_{{j}_{2}}$ with respect to the index of X), be the minimum t-cover hitting set, and let ${H}_{j}^{\ell}=\{{u}_{{i}_{1}},{u}_{{i}_{2}},\dots ,{u}_{{i}_{j}}\}$, 1 ≤ j ≤ ℓ. For any 1 ≤ j ≤ ℓ, if there is an $H\subseteq \{{u}_{1},{u}_{2},\dots ,{u}_{{i}_{j}}\}$ such that $\mathit{\text{hit}}\left(H\right)=\mathit{\text{hit}}\left({H}_{j}^{\ell}\right)$, then $\mathit{\text{weight}}\left(H\right)\ge \mathit{\text{weight}}\left({H}_{j}^{\ell}\right)$.
Proof
Let ${H}^{\prime}=\{{u}_{{i}_{j+1}},{u}_{{i}_{j+2}},\dots ,{u}_{{i}_{\ell}}\}$. Then H ∩ H^{′} = ∅. If $\mathit{\text{weight}}\left(H\right)<\mathit{\text{weight}}\left({H}_{j}^{\ell}\right)$, then by Lemma 0.1, H ∪ H^{′} would be a t-cover hitting set with a smaller weight than that of H^{ℓ}, a contradiction. Hence, the lemma holds. □
Lemma 0.2 shows that Algorithm 1 always keeps a subsolution that will lead to a full solution of minimum weight. We now present and prove the main theorem.
Theorem 0.3
The weighted t-cover hitting set problem can be solved in O((t + 1)^{n}mnt) time and O((t + 1)^{n}nt) space, where m is the size of the ground set and n is the number of subsets in the given instance. If, furthermore, the instance has at least $\frac{n}{1+d/{\log}_{2}(t+1)}$ subsets whose sizes are bounded from above by d, then the problem can be solved in $O\left({\left({(t+1)}^{d/(d+{\log}_{2}(t+1))}\right)}^{n}\mathit{\text{mnt}}\right)$ time and $O\left({\left({(t+1)}^{d/(d+{\log}_{2}(t+1))}\right)}^{n}\mathit{\text{nt}}\right)$ space.
Proof
We first prove the correctness of the algorithm.
Given an instance $(X,\mathcal{T},w,t)$ of the weighted t-cover hitting set problem, let X = {u_{1}, u_{2}, …, u_{ m }}, where X is sorted as in Algorithm 1 such that the order of the elements of X is S_{1}, S_{2} ∖ X_{1}, …, S_{ n } ∖ X_{n-1}, where ${X}_{j}={\cup}_{i=1}^{j}{S}_{i}$. Let ${H}^{\ell}=\{{u}_{{i}_{1}},{u}_{{i}_{2}},\dots ,{u}_{{i}_{\ell}}\}$, whose elements are in the same order as in the sorted X (i.e., if j_{1} < j_{2} with respect to the index of H^{ℓ}, then ${i}_{{j}_{1}}<{i}_{{j}_{2}}$ with respect to the index of X), be the minimum t-cover hitting set. Let ${H}_{j}^{\ell}=\{{u}_{{i}_{1}},{u}_{{i}_{2}},\dots ,{u}_{{i}_{j}}\}$ for all 0 ≤ j ≤ ℓ, where ${H}_{0}^{\ell}=\varnothing$.
To prove correctness, we claim that when the for loop in step 2 of Algorithm 1 is at iteration i = i_{ j }, for all 1 ≤ j ≤ ℓ (note: ${u}_{{i}_{j}}$ is the j-th element of H^{ℓ} and the i_{ j }-th element of X), there exists a P = (hit(H), H) in ${\mathcal{Q}}_{\mathit{\text{old}}}$ (loop in step 3) such that $\mathit{\text{hit}}\left(H\right)=\mathit{\text{hit}}\left({H}_{j-1}^{\ell}\right)$ and $\mathit{\text{weight}}\left(H\right)=\mathit{\text{weight}}\left({H}_{j-1}^{\ell}\right)$. We prove this claim by induction on j.

Induction basis. In the case of j = 1, since ${\mathcal{T}}_{X[1:i]}=\varnothing$ for any i < i_{1} (otherwise, H^{ℓ} could not be a solution), no subsolution is removed in step 5.1 during the loops with i < i_{1} in step 2. Hence, when i = i_{1}, P = (hit(∅), ∅) is in ${\mathcal{Q}}_{\mathit{\text{old}}}$. The claim holds.

Induction step. Suppose the claim is true for all j < q ≤ ℓ. Hence, when i = i_{q-1} in the loop of step 2, there exists a P = (hit(H), H) in ${\mathcal{Q}}_{\mathit{\text{old}}}$ such that $\mathit{\text{hit}}\left(H\right)=\mathit{\text{hit}}\left({H}_{q-2}^{\ell}\right)$ and $\mathit{\text{weight}}\left(H\right)=\mathit{\text{weight}}\left({H}_{q-2}^{\ell}\right)$. Then, by Lemma 0.1, $\mathit{\text{hit}}(H\cup \{{u}_{{i}_{q-1}}\})=\mathit{\text{hit}}({H}_{q-2}^{\ell}\cup \{{u}_{{i}_{q-1}}\})=\mathit{\text{hit}}\left({H}_{q-1}^{\ell}\right)$ and $\mathit{\text{weight}}(H\cup \{{u}_{{i}_{q-1}}\})=\mathit{\text{weight}}\left({H}_{q-1}^{\ell}\right)$. Therefore, ${P}^{\ell}=(\mathit{\text{hit}}({H}_{q-1}^{\ell}),{H}_{q-1}^{\ell})$ will be saved into ${\mathcal{Q}}_{\mathit{\text{new}}}$ unless there is another P^{′} = (hit(H^{′}), H^{′}) in ${\mathcal{Q}}_{\mathit{\text{new}}}$ such that $\mathit{\text{hit}}\left({H}^{\prime}\right)=\mathit{\text{hit}}\left({H}_{q-1}^{\ell}\right)$ and $\mathit{\text{weight}}\left({H}^{\prime}\right)=\mathit{\text{weight}}\left({H}_{q-1}^{\ell}\right)$. By Lemma 0.2, if any P^{′} = (hit(H^{′}), H^{′}) with $\mathit{\text{hit}}\left({H}^{\prime}\right)=\mathit{\text{hit}}\left({H}_{q-1}^{\ell}\right)$ and $\mathit{\text{weight}}\left({H}^{\prime}\right)=\mathit{\text{weight}}\left({H}_{q-1}^{\ell}\right)$ is already saved into ${\mathcal{Q}}_{\mathit{\text{new}}}$, it will not be replaced. Furthermore, for any $S\in {\mathcal{T}}_{X[1:i]}$ with i < i_{ q }, hit(H^{′})[S] = t (otherwise, H^{ℓ} could not be a solution, as no element in $\{{u}_{{i}_{q}},\dots ,{u}_{{i}_{\ell}}\}$ covers S). Hence, P^{′} = (hit(H^{′}), H^{′}) is not removed during the loops with i < i_{ q } in step 2. Thus, P^{′} = (hit(H^{′}), H^{′}) is in ${\mathcal{Q}}_{\mathit{\text{old}}}$ when i = i_{ q } in the loop of step 2, i.e., the claim is still true for j = q.
Therefore, when j = ℓ, a (hit(H), H) is saved into ${\mathcal{Q}}_{\mathit{\text{new}}}$ such that $\mathit{\text{hit}}\left(H\right)=\mathit{\text{hit}}\left({H}_{\ell}^{\ell}\right)=\mathit{\text{hit}}\left({H}^{\ell}\right)$ and weight(H) = weight(H^{ℓ}), i.e., we find the minimum t-cover hitting set. The correctness of Algorithm 1 is proved.
Next, we consider the time and space complexity of the algorithm. Step 2 loops |X| = m times. Step 3 loops $|{\mathcal{Q}}_{\mathit{\text{old}}}|$ times. As ${\mathcal{Q}}_{\mathit{\text{old}}}$ only remembers distinct combinations of [c_{1}, c_{2}, …, c_{ n }], and each c_{ i } is between 0 and t, it is obvious that $|{\mathcal{Q}}_{\mathit{\text{old}}}|\le {(t+1)}^{n}$. Steps 4.1 to 4.4 take O(nt) time. Steps 4.6 to 4.11 can be finished in $O({\log}_{2}((t+1)^{n})) = O(n {\log}_{2}(t+1))$ time if we use an AVL tree to implement ${\mathcal{Q}}_{\mathit{\text{new}}}$ and ${\mathcal{Q}}_{\mathit{\text{old}}}$. Hence, the total time complexity is O((t + 1)^{n}mnt).
In the case that $\mathcal{T}$ has at least $\frac{n}{1+d/{\log}_{2}(t+1)}$ subsets whose sizes are bounded from above by d, then when $i=\frac{\mathit{\text{dn}}}{1+d/{\log}_{2}(t+1)}$, both ${\mathcal{Q}}_{\mathit{\text{old}}}$ and ${\mathcal{Q}}_{\mathit{\text{new}}}$ have at most ${2}^{\frac{\mathit{\text{dn}}}{1+d/{\log}_{2}(t+1)}}={\left({(t+1)}^{\frac{1}{1+{\log}_{2}(t+1)/d}}\right)}^{n}$ elements. Furthermore, when $i=\frac{\mathit{\text{dn}}}{1+d/{\log}_{2}(t+1)}$, for any P = (hit(H), H) in ${\mathcal{Q}}_{\mathit{\text{old}}}$ or ${\mathcal{Q}}_{\mathit{\text{new}}}$, writing hit(H) = [c_{1}, c_{2}, …, c_{ n }], we have c_{ j } = t for $1\le j\le \frac{n}{1+d/{\log}_{2}(t+1)}$. Hence, when $i>\frac{\mathit{\text{dn}}}{1+d/{\log}_{2}(t+1)}$, the elements of ${\mathcal{Q}}_{\mathit{\text{old}}}$ or ${\mathcal{Q}}_{\mathit{\text{new}}}$ have at most ${(t+1)}^{n-\frac{n}{1+d/{\log}_{2}(t+1)}}$ combinations of hit(H), i.e., the size of ${\mathcal{Q}}_{\mathit{\text{old}}}$ or ${\mathcal{Q}}_{\mathit{\text{new}}}$ is always bounded from above by ${(t+1)}^{n-\frac{n}{1+d/{\log}_{2}(t+1)}}={\left({(t+1)}^{\frac{1}{1+{\log}_{2}(t+1)/d}}\right)}^{n}$. Therefore, the total time complexity is $O\left({\left({(t+1)}^{\frac{1}{1+{\log}_{2}(t+1)/d}}\right)}^{n}\mathit{\text{mnt}}\right)$.
It is obvious that the space complexity is $O(|{\mathcal{Q}}_{\mathit{\text{old}}}| \cdot \max$ (lengths of elements in ${\mathcal{Q}}_{\mathit{\text{old}}}$)$) = O(|{\mathcal{Q}}_{\mathit{\text{new}}}| \cdot \max$ (lengths of elements in ${\mathcal{Q}}_{\mathit{\text{new}}}$)$)$. The lengths of the elements in both ${\mathcal{Q}}_{\mathit{\text{old}}}$ and ${\mathcal{Q}}_{\mathit{\text{new}}}$ are bounded from above by O(nt). Therefore, in the general case, the space complexity is O((t + 1)^{n}nt), and in the case where the sizes of many subsets in $\mathcal{T}$ are bounded from above by d, the space complexity is $O\left({\left({(t+1)}^{d/(d+{\log}_{2}(t+1))}\right)}^{n}\mathit{\text{nt}}\right)$. □
Algorithm 1 reports only one solution of minimum weight, even when an application instance has multiple minimum-weight solutions. The setting of TF weights increases the probability that any minimum-weight solution includes most of the correct TFs regulating the differentially expressed genes. However, in some applications, we may also want to study other top-weight solutions (due to data errors, the actual solution may not have the minimum weight). By modifying the algorithm so that, for each distinct hit vector, the k best-weight subsolutions are saved, the new algorithm can output the k best-weight solutions. It is easy to prove that the time and space complexity of the new algorithm increase only by a factor of k.
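The k-best modification amounts to keeping a short list per hit vector instead of a single entry. The helper below is our own illustration of that merge step (the function name and data are invented), showing where the extra factor of k in time and space comes from.

```python
import heapq

# Sketch of the k-best modification: for one hit vector, merge candidate
# subsolutions and keep only the k smallest-weight ones.
def keep_k_best(candidates, k):
    """candidates: list of (weight, subsolution) pairs sharing a hit vector."""
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])

pool = [(4.0, {"b", "c", "a"}), (5.0, {"b", "c", "d"}),
        (7.0, {"a", "b", "c", "d"})]
top2 = keep_k_best(pool, 2)   # keeps the subsolutions of weight 4.0 and 5.0
```

Since every hit-vector bucket now holds at most k entries, each step processes at most k times as many subsolutions, giving the factor-of-k overhead stated above.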
Before we finish this section, we briefly summarize the time complexity of our algorithm and compare it with the previously reported one [12]:

If there are at least $\frac{n}{1+d/{\log}_{2}(t+1)}$ subsets whose sizes are bounded from above by d, then the time complexity of our algorithm is $O\left({\left({(t+1)}^{d/(d+{\log}_{2}(t+1))}\right)}^{n}\mathit{\text{mnt}}\right)$, while the time complexity of the previous best algorithm is always Ω((t + 1)^{n}mn) [12] (and it only works for the unweighted case). As d/(d + log_{2}(t + 1)) < 1, ${\left({(t+1)}^{d/(d+{\log}_{2}(t+1))}\right)}^{n}$ is much less than (t + 1)^{n}. For example, if we let d = 5 (note: 85% of the genes in our case have degree at most 5) and t = 2, 3, 4, our algorithm is bounded by O(2.303^{n}mn), O(2.692^{n}mn), and O(3.002^{n}mn) respectively, while the previous best algorithm is bounded by Ω(3^{n}mn), Ω(4^{n}mn), and Ω(5^{n}mn) respectively. Suppose n = 30; then our algorithm is at least 1393 times faster if t = 2, 48131 times faster if t = 3, and 1108459 times faster if t = 4, compared with the previous best algorithm.

The time complexity shown above is only a worst-case upper bound; in practice, the actual running time is usually much better. In fact, we can further improve the running time by removing a gene from the graph whenever its degree is less than t, so the value of n can be greatly reduced.
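This preprocessing step can be sketched as follows (our illustration, with made-up gene and TF names): a gene bound by fewer than t TFs can never be t-covered, so it is dropped before the exact algorithm runs, shrinking n.

```python
# Preprocessing sketch: drop genes whose degree (number of binding TFs)
# is below t, since they cannot be covered by t TFs in any solution.
def drop_low_degree_genes(gene_to_tfs, t):
    return {g: tfs for g, tfs in gene_to_tfs.items() if len(tfs) >= t}

genes = {"FUS1": {"Ste12", "Dig1", "Mcm1"},   # degree 3: kept for t = 2
         "ORF1": {"Ste12"}}                   # degree 1: dropped for t = 2
kept = drop_low_degree_genes(genes, t=2)      # only FUS1 survives
```
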
Results
When applied to the protein-DNA interaction graph derived from ChIP-chip experiments [9, 10] and the results of the microarray experiments from Hughes et al. [8], our algorithms identify TF modules at two levels: First, given the genes differentially expressed in each perturbation experiment, we find a set of cooperative TFs at the perturbation-instance level in a context-specific manner. Second, we combine the context-specific TFs to find TF modules that are repeatedly utilized at the system level. In the following subsections, we show that our approach produces biologically sensible results at both levels.
TFs for sets of co-regulated genes
For each yeast genetic/pharmacological perturbation experiment, we identified the genes that were differentially expressed. We then applied our weighted t-cover hitting set algorithm to each bipartite component induced by connecting the differentially expressed genes and TFs, to identify a set of cooperative TFs from the component. In addition, we ran our algorithm with t set to 2, 3, and 4 to investigate the impact of this parameter. Empirically, we found that setting t = 2 was sufficient to force our algorithm to find a set of cooperative TFs. We recommend setting t = 2 as a starting parameter and exploring other settings based on the degrees of the genes in the organism of interest. After inspecting the resulting TF modules, we found many of them to be already well known. Due to page limitations, we discuss just one of the well-known modules identified by our algorithm below.
The pheromone response pathway of S. cerevisiae, which consists of more than 20 proteins [13], is a well-studied signal transduction pathway in yeast. When this pathway is activated by pheromone, a well-studied transcription program is initiated, which is known to be cooperatively regulated by the TFs Ste12p and Dig1p/Dig2p [13], or Ste12p, Dig1p/Dig2p, and Mcm1p [9]. In the experiments by Hughes et al., 12 genes encoding proteins in the pathway were perturbed, and many canonical pheromone response genes were differentially expressed. Hence, one may expect Ste12p and Dig1p, or Ste12p, Dig1p, and Mcm1p, to be involved in mediating these responses. Indeed, our results show that, when we set t = 2, Ste12p and Dig1p were returned as members of the set of cooperative TFs identified in all 12 perturbations, and Ste12p, Dig1p, and Mcm1p were found in the sets of cooperative TFs in 9 of the 12 perturbation experiments. In comparison, when we set t = 1, Ste12p and Dig1p were found in 5 instances, and Ste12p, Dig1p, and Mcm1p in only 1 instance. These results indicate that our algorithm is capable of identifying cooperative TFs from individual experiments in a context-specific manner.
We further studied the coverage of 7 robust pheromone response genes (i.e., genes whose expression levels change significantly in almost all 12 perturbations of genes on the pheromone response pathway), namely FAR1, FUS1, GPA1, SST2, STE2, STE6, and TEC1, to investigate how they were covered by the TFs returned by the t-cover hitting set algorithm. Figure 3 shows the coverage of these genes by the TFs in the results returned by the t-TF cover algorithm. As expected, when t = 1, the algorithm failed to find cooperative TFs and returned Ste13p as the regulating TF. On the other hand, when we set t = 2, the algorithm returned all three TFs, Ste12p, Dig1p, and Mcm1p, as members of the solution covering these genes, which form a dense graph.
The examples above show that, given a set of genes known to be co-regulated, our program can find the correct TFs that regulate them when t is set to 2 or 3. Additional examples further indicate that the algorithm performs as expected.
Finding TF modules involved in multiple instances
Using the t-cover hitting set results (for t = 1, 2, 3, 4) from 300 microarray datasets, we constructed a TF-TF relation graph and searched for 3-cliques and 4-cliques in the graph. The cliques are ranked according to their weights: the larger the weight, the higher the rank. Our assumption is that, because high-ranking cliques consist of TFs deemed by our algorithm to function cooperatively in multiple instances, they may truly function as partners in real biological settings. We evaluated this assumption by looking at two factors: 1) whether members of the cliques are known to function as a module based on the literature, and 2) whether the members of the cliques have shown physical or genetic interactions in previous experiments.
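The clique search and ranking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the edge weights and TF names are hypothetical, and we assume a clique's weight is the sum of its edge weights (e.g., the number of perturbation instances in which a TF pair co-occurred in a hitting-set solution).

```python
from itertools import combinations

def ranked_triangles(edge_weight):
    """Enumerate 3-cliques in a weighted undirected graph and rank them
    by total edge weight; higher weight means higher rank."""
    # Recover the node set from the edge dict {frozenset({u, v}): weight}.
    nodes = sorted({n for e in edge_weight for n in e})
    cliques = []
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset({a, b}), frozenset({a, c}), frozenset({b, c})]
        # A triple is a 3-clique only if all three edges exist.
        if all(e in edge_weight for e in edges):
            cliques.append((sum(edge_weight[e] for e in edges), (a, b, c)))
    cliques.sort(key=lambda t: -t[0])
    return cliques

# Hypothetical TF-TF co-occurrence counts across perturbation instances
w = {
    frozenset({"Swi4p", "Swi6p"}): 9,
    frozenset({"Swi4p", "Mbp1p"}): 8,
    frozenset({"Swi6p", "Mbp1p"}): 7,
    frozenset({"Dig1p", "Ste12p"}): 12,
    frozenset({"Dig1p", "Mcm1p"}): 5,
    frozenset({"Ste12p", "Mcm1p"}): 6,
}
```

The same enumerate-and-filter pattern extends to 4-cliques by checking all six pairs of a 4-node subset.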
1. High scores for known cooperative TFs: We first inspected the top 20 high-ranking 3-cliques identified in the TF-TF graph derived from the results of the t-cover hitting set with t = 2. We found that the majority of them are well-known cooperative TF modules. For example, the TFs participating in the cell cycle checkpoint [9], Swi4p, Swi6p, and Mbp1p, ranked 2nd, and, interestingly, the 3-clique including Dig1p, Mcm1p, and Ste12p (the pheromone-responding TF module) ranked 4th. Similarly, many of the high-ranking 4-cliques constitute TF modules that are also supported by existing knowledge. However, it is interesting to note that the highest-ranking 3-clique consists of Dig1p, Ste12p, and Tec1p, for which we could not find any published results reporting that they work together; thus it would be an interesting case to investigate whether our approach has found previously unknown interactions.
We further investigated whether the input data, i.e., the TF modules returned by the t-cover hitting set under different settings of t, have an impact on the resulting TF cliques. We found that the rankings and the members of the cliques were similar when t was set to 2, 3, and 4. However, the results for t = 1 and for hard cliques were totally different. For example, when the TF-TF graph is built using the results with t = 1, the first 3-clique containing Swi4p, Swi6p, and Mbp1p ranks 254th. Furthermore, if we rank this commonly used TF module by finding common "hard" cliques, the highest-ranking 3-clique containing Dig1p, Mcm1p, and Ste12p ranks 273rd. Thus, the results indicate that, by setting t > 1 and finding "soft" cliques, our method enhances the capability of finding TF modules that are repeatedly utilized in multiple conditions and are thus likely to play key roles in cellular signal transduction systems. Due to page limitations, we list the top 100 soft 3-cliques and 4-cliques in the supplement.
2. Interactions between proteins inside the cliques: From a biological point of view, if a set of TFs work together to regulate transcription, there should be physical or genetic interactions among those TFs. If the results of the t-cover hitting set algorithm truly capture such cooperation, one would expect the TFs in the repeatedly utilized cliques to have a high probability of interacting. The following section evaluates the TF cliques by analyzing their protein-protein interactions, where we define a pair of TFs to interact if they have a physical interaction, a genetic interaction determined by synthetic lethality [14], or both.
We constructed multiple TF-TF relation graphs using the results from the t-cover hitting set algorithm with t = 1, 2, 3, 4 to assess the impact of the setting of t. From each graph, we identified the top 100 4-cliques and assessed the percentage (i.e., the empirical probability) of cliques containing at least a given number of interactions. As a control, we also randomly sampled 100 groups of 4 TFs for comparison. The results are plotted as a cumulative distribution in Figure 4. From the figure, we can see that the majority of randomly picked TF groups have no interaction, with less than 20% of groups containing one or more interactions. Similarly, the curve for the cliques derived from the results with t = 1 (a setting not optimized for finding cooperation among TFs) lies close to the curve of the random groups. On the other hand, with t = 2, 3, 4, over 50% of 4-cliques contain at least 3 interactions. These results indicate that when one sets t > 1, the t-cover hitting set algorithm indeed identifies TFs that cooperatively regulate transcription; the results also indicate that setting t = 2 is sufficient to find cooperative TFs.
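The evaluation above amounts to counting pairwise interactions inside each 4-TF group and asking what fraction of groups reach a given count. A minimal sketch, with an entirely hypothetical interaction set standing in for the BioGRID-derived data:

```python
from itertools import combinations

def interaction_count(group, interactions):
    """Count interacting TF pairs (physical or genetic) within one group."""
    return sum(frozenset(p) in interactions for p in combinations(group, 2))

def frac_at_least(groups, interactions, k):
    """Fraction of groups containing at least k pairwise interactions,
    i.e., one point of the cumulative distribution in Figure 4."""
    return sum(interaction_count(g, interactions) >= k for g in groups) / len(groups)

# Hypothetical interaction set: unordered TF pairs known to interact
interactions = {
    frozenset({"Swi4p", "Swi6p"}),
    frozenset({"Swi6p", "Mbp1p"}),
    frozenset({"Dig1p", "Ste12p"}),
}
# Two illustrative 4-TF groups (e.g., top-ranking 4-cliques)
clique_groups = [
    ("Swi4p", "Swi6p", "Mbp1p", "Stb1p"),
    ("Dig1p", "Ste12p", "Mcm1p", "Tec1p"),
]
```

Evaluating `frac_at_least` over k = 0, 1, 2, ... for both the top cliques and the random control groups yields the two curves compared in the text.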
To illustrate the difference in performance between the "soft" and "hard" cliques, we also identified the top 100 "hard" 4-cliques obtained directly from the t-cover hitting set for individual perturbation instances by setting t = 4, and studied the interactions among the TFs in these cliques. The results show that the "hard" cliques outperform both the random TF groups and the "soft" cliques derived from the t = 1 results, indicating that the t-cover hitting set algorithm is capable of revealing cooperation among TFs. However, the "hard" cliques consistently underperform the "soft" cliques derived by integrating results across multiple instances (except in the case of t = 1), which indicates that the information integration approach further enhances the quality of the identified TF modules.
Conclusion
In this paper, we have developed graph-based approaches to address the problem of finding cooperative TF modules at two levels. First, given a set of co-regulated genes, we find a set of TFs that cooperatively regulate the genes in a context-specific manner. Second, given a collection of context-specific TF modules, we find the TFs that tend to function cooperatively in multiple instances at the systems level; the idea behind this is that if two TFs work together in multiple instances, they are more likely to belong to the same TF module. For the first part, we cast the task as a weighted t-cover hitting set problem and developed an exact algorithm to solve it. The main contribution of this paper is that, by taking advantage of the knowledge of the limited gene degrees, we have developed a very efficient exact algorithm capable of solving the problem at hand in practical time. For the second part, we cast the task as a clique-finding problem, and our approach produced results that are biologically sensible and generate new biological hypotheses. Our graph-based approaches are significantly different from statistics-based approaches, hence providing a new perspective on studying transcriptional regulation [2].
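To make the weighted t-cover hitting set formulation concrete, the following is a naive exhaustive reference solver on a hypothetical instance (gene and TF names are illustrative). It is not the paper's algorithm, which exploits the bounded degree of genes to achieve practical running time; this sketch only pins down the objective being optimized.

```python
from itertools import combinations

def min_weight_t_cover(genes, tf_weight, binds, t):
    """Brute-force exact solver for the weighted t-cover hitting set:
    find a smallest-weight set of TFs such that every gene is bound by
    at least t TFs in the set. Exponential in the number of TFs; shown
    only to define the objective, not as a practical algorithm."""
    tfs = list(tf_weight)
    best = None
    for r in range(1, len(tfs) + 1):
        for subset in combinations(tfs, r):
            # Each gene must be covered by at least t selected TFs.
            covered = all(sum(g in binds[tf] for tf in subset) >= t
                          for g in genes)
            if covered:
                weight = sum(tf_weight[tf] for tf in subset)
                if best is None or weight < best[0]:
                    best = (weight, set(subset))
    return best

# Hypothetical instance: 2 differentially expressed genes, 4 candidate TFs
genes = {"g1", "g2"}
binds = {"T1": {"g1", "g2"}, "T2": {"g1"}, "T3": {"g2"}, "T4": {"g1", "g2"}}
tf_weight = {"T1": 1, "T2": 1, "T3": 1, "T4": 2}
```

With t = 2, the solver must cover each gene twice, so it prefers {T1, T4} (weight 3) over any single TF, illustrating how raising t forces cooperative solutions.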
Endnote
^{a} In applications, if any subset has size less than t, we add dummy elements to bring its size up to t.
References
 1.
Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, Moor B: Toucan: Deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res. 2003, 31: 1753-1764.
 2.
Chang L, Nagarajan R, Magee J, Milbrandt J, Stormo G: A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles. Genome Res. 2006, 16: 405-413.
 3.
Cole S, Yan W, Galic Z, Arevalo J, Zack J: Expression-based monitoring of transcription factor activity: The TELiS database. Bioinformatics. 2005, 21: 803-810.
 4.
Sui S, Mortimer J, Arenillas D, Brumm J, Walsh C, Kennedy B, et al.: oPOSSUM: Identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res. 2005, 33: 3154-3164.
 5.
Hu J, Hu H, Li X: MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs. Nucleic Acids Res. 2008, 36: 4488-4497.
 6.
Loo P, Marynen P: Computational methods for the detection of cis-regulatory modules. Briefings Bioinf. 2009, 10: 509-524. 10.1093/bib/bbp025.
 7.
Zhou Q, Wong W: CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA. 2004, 101: 12114-12119. 10.1073/pnas.0402858101.
 8.
Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, Kidd M, King A, Meyer M, Slade D, Lum P, Stepaniants S, Shoemaker D, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend S: Functional Discovery via a Compendium of Expression Profiles. Cell. 2000, 102: 109-126.
 9.
Huang S, Fraenkel E: Integrating Proteomic, Transcriptional, and Interactome Data Reveals Hidden Components of Signaling and Regulatory Networks. Sci Signaling. 2009, 2: ra40. 10.1126/scisignal.2000350.
 10.
Yeger-Lotem E, Riva L, Su L, Gitler A, Cashikar A, King O, Auluck P, Geddie M, Valastyan J, Karger D, Lindquist S, Fraenkel E: Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nat Genet. 2009, 41: 316-323.
 11.
Björklund A, Husfeldt T, Koivisto M: Set partitioning via Inclusion-Exclusion. SIAM J Comput. 2006, 39: 546-563.
 12.
Hua Q, Yu D, Lau F, Wang Y: Dynamic programming based algorithms for set multicover and multiset multicover problems. Theor Comput Sci. 2010, 411: 2467-2474. 10.1016/j.tcs.2010.02.016.
 13.
Gustin M, Albertyn J, Alexander M, Davenport K: MAP Kinase Pathways in the Yeast Saccharomyces cerevisiae. Microbiol Mol Biol Rev. 1998, 64: 1264-1300.
 14.
Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: A General Repository for Interaction Datasets. Nucleic Acids Res. 2006, 34: D535-D539.
Acknowledgements
We thank the authors of Yeger-Lotem et al., who provided us with the protein-DNA interaction graph data. Special thanks to Dr. Shouhuai Xu from the Department of Computer Science at the University of Texas at San Antonio, who helped us derive the algorithm, and to Nova M. Smith, who helped us with proofreading.
This work was supported in part by the National Institutes of Health under Grant 5R01LM010114.
A preliminary version of this article appeared in the Proceedings of the ACM Conference on Bioinformatics, Computational Biology & Biomedicine 2011, Chicago.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Both SL and XL analyzed biological data, established and modified the mathematical model, and drafted the manuscript. SL made the major contribution in designing the algorithm. Both authors read and approved the final manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lu, S., Lu, X. Using graph models to find transcription factor modules: the hitting set problem and an exact algorithm. Algorithms Mol Biol 8, 2 (2013). doi:10.1186/1748-7188-8-2
Keywords
 Time Complexity
 Bipartite Graph
 Minimum Weight
 Exact Algorithm
 Multiple Instance