Finding an exact optimal solution for an NP-hard problem in practical time is usually difficult; it is not uncommon for such a program to run for months or even years. Hence, heuristic or greedy algorithms are often designed to solve NP-hard problems approximately. However, heuristic and greedy algorithms cannot guarantee the quality of their solutions. Therefore, in order to design an exact algorithm, we investigated the characteristics of our problem and found that the degrees of many genes were very small (i.e., they were connected to only a small number of TFs in the protein-DNA interaction graph [9, 10]). More specifically, about 70% of genes had degree at most 3, about 85% of genes had degree at most 5, and about 96% of genes had degree at most 10. This characteristic enabled us to design an efficient exact algorithm, presented in Figure 2, for the problem. In addition to its efficiency, our algorithm is the first exact algorithm that can solve a weighted t-cover hitting set problem.
Before proving the correctness and time complexity of the algorithm, we give its basic idea, which is based on the dynamic programming technique. When we expand the sub-solutions, if two sub-solutions $H_1$ and $H_2$ hit exactly the same group of subsets in $\mathcal{S}$, we prove that keeping either one of the two sub-solutions is sufficient. Hence, if $|\mathcal{S}| = n$, then we keep at most $(t+1)^n$ different sub-solutions (note: there are $n$ subsets, and each subset can be hit by $0, 1, 2, \ldots$, or at least $t$ elements of a sub-solution; hence, there are at most $(t+1)^n$ cases in total). This is the main part of the time complexity and the space complexity. In the algorithm, we also sort $\mathcal{S}$ such that the sizes of the subsets in $\mathcal{S}$ are ordered from smallest to largest (when there is a tie, an arbitrary order suffices). If the sizes of many subsets in $\mathcal{S}$ are bounded, say the sizes of the first $k$ subsets $\{S_1, S_2, \ldots, S_k\}$ are bounded by $d$, we also sort $X$ such that the elements of $S_j$, for all $j = 1, 2, \ldots, k$, come first. Hence, the first (at most $kd$) elements of $X$ are those of $S_1 \cup S_2 \cup \cdots \cup S_k$. In the algorithm, we add the elements of the sorted $X$ into the sub-solutions in order (i.e., first we try to add the first element of $X$ into the sub-solutions, then the second element, and so on), so that first we make $S_1$ be hit by at least $t$ elements of each sub-solution, then we make both $S_1$ and $S_2$ be hit by at least $t$ elements of each sub-solution, and so on. It is easy to see that once we have considered the first $|S_1 \cup S_2 \cup \cdots \cup S_k|$ elements of the sorted $X$, all of $\{S_1, S_2, \ldots, S_k\}$ are hit by at least $t$ elements of each sub-solution. At that time, the number of sub-solutions is bounded by $2^{kd}$ (all possible combinations of the first at most $kd$ elements of the sorted $X$). After that, as we only need to remember the hitting statuses of the remaining $n-k$ subsets in $\mathcal{S}$, the number of sub-solutions is bounded by $(t+1)^{n-k}$. We will show that if the sizes of many subsets in $\mathcal{S}$ are bounded, both $2^{kd}$ and $(t+1)^{n-k}$ will be much smaller than $(t+1)^n$.
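Algorithm-1 itself is given in Figure 2. To make the idea above concrete, here is a minimal Python sketch of the same dynamic program, written by us purely for illustration (the function and variable names are our own, and the pruning step corresponds to step 5.1 discussed below); it keeps one lightest sub-solution per distinct hitting status:

```python
def min_t_cover_hitting_set(X, weights, S, t):
    """Illustrative sketch: keep one lightest sub-solution per distinct
    hitting status.  X: list of elements; weights: dict element -> weight;
    S: list of sets over X; t: required cover multiplicity."""
    S = sorted(S, key=len)                 # smallest subsets first
    n = len(S)
    # Sort X so that the elements of S[0], then S[1] - S[0], ... come first.
    order, seen = [], set()
    for Si in S:
        for u in Si:
            if u not in seen:
                seen.add(u)
                order.append(u)
    order += [u for u in X if u not in seen]   # elements in no subset

    # sig[i] = min(t, |S_i ∩ H|); map each signature to its lightest (w, H).
    pool = {(0,) * n: (0, frozenset())}
    for idx, u in enumerate(order):
        new_pool = dict(pool)              # skipping u is always allowed
        for sig, (w, H) in pool.items():
            nsig = tuple(min(t, sig[i] + (u in S[i])) for i in range(n))
            nw = w + weights[u]
            if nsig not in new_pool or nw < new_pool[nsig][0]:
                new_pool[nsig] = (nw, H | {u})
        # Prune (cf. step 5.1): once every element of S_i has been
        # considered, a sub-solution with sig[i] < t can never be completed.
        passed = set(order[:idx + 1])
        done = [i for i in range(n) if S[i] <= passed]
        pool = {sig: v for sig, v in new_pool.items()
                if all(sig[i] == t for i in done)}
    return pool.get((t,) * n)              # (weight, H), or None if infeasible
```

For instance, `min_t_cover_hitting_set([1, 2, 3, 4], {1: 1, 2: 1, 3: 2, 4: 3}, [{1, 2}, {2, 3, 4}], t=2)` returns `(4, frozenset({1, 2, 3}))`: both subsets are hit twice at the minimum total weight 4.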
Let $X = \{u_1, u_2, \ldots, u_m\}$ and $\mathcal{S} = \{S_1, S_2, \ldots, S_n\}$. We define $w_i$ to be the weight of $u_i$ for $1 \le i \le m$. Let $H$ be a subset of $X$. We define $hit(H) = [c_1, c_2, \ldots, c_n]$ and $weight(H) = \sum_{u_i \in H} w_i$, where $c_i = hit(H)[S_i] = \min(t, |S_i \cap H|)$ for $1 \le i \le n$, i.e., $c_i$ remembers how many elements of $S_i$ are in $H$ (note: if some $S_i$ already has at least $t$ elements in a partial solution, we can remove $S_i$ from the problem and do not need to remember its covering status any further; hence, there is no need to remember any $|S_i \cap H|$ that is larger than $t$). The following lemmas are needed in the proof of the main theorem.
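As a concrete illustration (ours, not the paper's), the vector $hit(H)$ can be computed directly from this definition:

```python
def hit(H, S, t):
    """hit(H)[i] = min(t, |S_i ∩ H|): the number of elements of S_i in H,
    capped at t, since counts above t never need to be distinguished."""
    H = set(H)
    return tuple(min(t, len(Si & H)) for Si in S)

# Example: S_1 = {1,2,3}, S_2 = {2,4}, S_3 = {3,4,5}, H = {2,3}, t = 2.
print(hit({2, 3}, [{1, 2, 3}, {2, 4}, {3, 4, 5}], 2))   # -> (2, 1, 1)
```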
Lemma 0.1
Let $H_1$, $H_2$, $H'$ be three subsets of $X$ such that $H_1 \cap H' = \emptyset$ and $H_2 \cap H' = \emptyset$. If $hit(H_1) = hit(H_2)$, then $hit(H_1 \cup H') = hit(H_2 \cup H')$.
Proof
As $hit(H_1) = hit(H_2)$, for any $S_i \in \mathcal{S}$, $hit(H_1)[S_i] = hit(H_2)[S_i]$, i.e., $\min(t, |S_i \cap H_1|) = \min(t, |S_i \cap H_2|)$. Furthermore, because $H_1 \cap H' = \emptyset$ and $H_2 \cap H' = \emptyset$, we have that, for any $S_i \in \mathcal{S}$, $\min(t, |S_i \cap (H_1 \cup H')|) = \min(t, |S_i \cap H_1| + |S_i \cap H'|) = \min(t, |S_i \cap H_2| + |S_i \cap H'|) = \min(t, |S_i \cap (H_2 \cup H')|)$ (note that $\min(t, a+b) = \min(t, a'+b)$ whenever $\min(t, a) = \min(t, a')$). Therefore, $hit(H_1 \cup H') = hit(H_2 \cup H')$ and the lemma is proved. □
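Lemma 0.1 is also easy to sanity-check numerically. The snippet below (our illustration, reusing the hit function sketched earlier) verifies it on random instances:

```python
import random

def hit(H, S, t):          # as defined above, repeated to run standalone
    return tuple(min(t, len(Si & H)) for Si in S)

def check_lemma_0_1(trials=1000, t=2):
    """If hit(H1) = hit(H2) and H' is disjoint from both, then adding H'
    must preserve the equality of the hit vectors (Lemma 0.1)."""
    for _ in range(trials):
        X = list(range(10))
        S = [set(random.sample(X, random.randint(1, 6))) for _ in range(4)]
        Hp = set(random.sample(X, 3))                  # plays the role of H'
        rest = [u for u in X if u not in Hp]
        H1 = set(random.sample(rest, 3))
        H2 = set(random.sample(rest, 3))
        if hit(H1, S, t) == hit(H2, S, t):
            assert hit(H1 | Hp, S, t) == hit(H2 | Hp, S, t)

check_lemma_0_1()
```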
Lemma 0.1 guarantees that if any two sub-solutions cover $\mathcal{S}$ in the same way, then keeping the one with the smaller weight is enough.
Lemma 0.2
Let $H_\ell = \{u_{i_1}, u_{i_2}, \ldots, u_{i_\ell}\}$, whose elements are in the same order as in the sorted $X$ in Algorithm-1 (i.e., if $j_1 < j_2$ with respect to the index of $H_\ell$, then $i_{j_1} < i_{j_2}$ with respect to the index of $X$), be the minimum t-cover hitting set, and let $H_j = \{u_{i_1}, \ldots, u_{i_j}\}$, $1 \le j \le \ell$. For any $1 \le j \le \ell$, if there is an $H \subseteq \{u_1, u_2, \ldots, u_{i_j}\}$ such that $hit(H) = hit(H_j)$, then $weight(H) \ge weight(H_j)$.
Proof
Let $H' = H_\ell \setminus H_j = \{u_{i_{j+1}}, \ldots, u_{i_\ell}\}$. Then $H \cap H' = \emptyset$ and $H_j \cap H' = \emptyset$. If $weight(H) < weight(H_j)$, then by Lemma 0.1, $hit(H \cup H') = hit(H_j \cup H') = hit(H_\ell)$, so $H \cup H'$ would be a t-cover hitting set with a smaller weight than the weight of $H_\ell$, which causes a contradiction. Hence, the lemma is correct. □
Lemma 0.2 shows that Algorithm-1 always keeps a sub-solution that will lead to the full solution with minimum weight. Now, let us present and prove the main theorem.
Theorem 0.3
The weighted t-cover hitting set problem can be solved in $O((t+1)^n mnt)$ time and in $O((t+1)^n nt)$ space, where $m$ is the size of the ground set and $n$ is the number of subsets in the given instance. If, furthermore, the instance has at least $k = \frac{\log_2(t+1)}{d+\log_2(t+1)}\, n$ subsets whose sizes are upper bounded by $d$, then the problem can be solved in $O\!\left((t+1)^{\frac{d}{d+\log_2(t+1)} n} mnt\right)$ time and in $O\!\left((t+1)^{\frac{d}{d+\log_2(t+1)} n} nt\right)$ space.
Proof
We first prove the correctness of the algorithm.
Given an instance of the weighted t-cover hitting set problem, let $X = \{u_1, u_2, \ldots, u_m\}$, where $X$ is sorted as shown in Algorithm-1 such that the order of the elements of $X$ is as in $S_1, S_2 - X_1, \ldots, S_n - X_{n-1}$, where $X_i = \bigcup_{j=1}^{i} S_j$. Let $H_\ell = \{u_{i_1}, u_{i_2}, \ldots, u_{i_\ell}\}$, whose elements are in the same order as in the sorted $X$ (i.e., if $j_1 < j_2$ with respect to the index of $H_\ell$, then $i_{j_1} < i_{j_2}$ with respect to the index of $X$), be the minimum t-cover hitting set. Let $H_j = \{u_{i_1}, \ldots, u_{i_j}\}$ for all $0 \le j \le \ell$, where $H_0 = \emptyset$.
To prove correctness, we claim that when the for loop in step 2 of Algorithm-1 is at loop $i = i_j$, for all $1 \le j \le \ell$ (note: $u_{i_j}$ is the $j$-th element of $H_\ell$ and the $i_j$-th element of $X$), there exists a $P = (hit(H), H)$ in $\mathcal{H}$ (the loop in step 3) such that $hit(H) = hit(H_{j-1})$ and $weight(H) = weight(H_{j-1})$. We prove this claim by mathematical induction on $j$.
-
Induction basis. In the case of $j = 1$: for any $i < i_1$, no $S \in \mathcal{S}$ satisfies $S \subseteq \{u_1, \ldots, u_i\}$ (else $H_\ell$, none of whose elements is among $u_1, \ldots, u_i$, could not be a solution), so no sub-solution will be removed in step 5.1 in the loops of $i < i_1$ of step 2. Hence, when $i = i_1$, $P = (hit(\emptyset), \emptyset)$ is in $\mathcal{H}$, with $hit(\emptyset) = hit(H_0)$ and $weight(\emptyset) = weight(H_0)$. The claim is correct.
-
Induction step. Suppose that the claim is true for all $j < q \le \ell$. Hence, when $i = i_{q-1}$ in the loop of step 2, there exists a $P = (hit(H), H)$ in $\mathcal{H}$ such that $hit(H) = hit(H_{q-2})$ and $weight(H) = weight(H_{q-2})$. Then by Lemma 0.1, $hit(H \cup \{u_{i_{q-1}}\}) = hit(H_{q-1})$, and $weight(H \cup \{u_{i_{q-1}}\}) = weight(H_{q-1})$. Therefore, $(hit(H \cup \{u_{i_{q-1}}\}), H \cup \{u_{i_{q-1}}\})$ will be saved into $\mathcal{H}'$ unless there is another $P' = (hit(H'), H')$ in $\mathcal{H}'$ such that $hit(H') = hit(H_{q-1})$ and $weight(H') \le weight(H_{q-1})$. By Lemma 0.2, if any $P' = (hit(H'), H')$ such that $hit(H') = hit(H_{q-1})$ and $weight(H') = weight(H_{q-1})$ is already saved into $\mathcal{H}'$, it will not be replaced. Furthermore, for any $S \in \mathcal{S}$ with $S \subseteq \{u_1, \ldots, u_i\}$ for some $i < i_q$, we have $hit(H')[S] = t$ (otherwise, it would contradict that $H_\ell$ is a solution, as no element of $H_\ell \setminus H_{q-1}$ can cover $S$). Hence, $P' = (hit(H'), H')$ will not be removed in the loops of $i < i_q$ of step 2. Thus, $P' = (hit(H'), H')$ will be in $\mathcal{H}$ when $i = i_q$ in the loop of step 2, i.e., the claim is still true when $j = q$.
Therefore, when $j = \ell$, we will save a $(hit(H), H)$ into $\mathcal{H}$ such that $hit(H) = hit(H_\ell)$ and $weight(H) = weight(H_\ell)$, i.e., we will find the minimum t-cover hitting set. The correctness of Algorithm-1 is proved.
Next, we consider the time complexity and space complexity of the algorithm. Step 2 loops $|X| = m$ times. Step 3 loops $|\mathcal{H}|$ times. As $\mathcal{H}$ only remembers different combinations of $[c_1, c_2, \ldots, c_n]$ and each $c_i$ is between $0$ and $t$, it is obvious that $|\mathcal{H}| \le (t+1)^n$. Steps 4.1 to 4.4 take $O(nt)$ time. Steps 4.6 to 4.11 can be finished in $O(\log_2 (t+1)^n) = O(n \log_2(t+1))$ time if we use an AVL tree to implement $\mathcal{H}$ and $\mathcal{H}'$. Hence the total time complexity is $O((t+1)^n mnt)$.
In the case that $\mathcal{S}$ has at least $k = \frac{\log_2(t+1)}{d+\log_2(t+1)}\, n$ subsets whose sizes are bounded from above by $d$, when $i \le |X_k| \le kd$, both $\mathcal{H}$ and $\mathcal{H}'$ have at most $2^{kd} = (t+1)^{\frac{d}{d+\log_2(t+1)} n}$ elements. Furthermore, when $i > |X_k|$, for any $P = (hit(H), H)$ in $\mathcal{H}$ or in $\mathcal{H}'$, if we let $hit(H) = [c_1, c_2, \ldots, c_n]$, then $c_j = t$ for all $1 \le j \le k$. Hence, when $i > |X_k|$, all elements in $\mathcal{H}$ or in $\mathcal{H}'$ have at most $(t+1)^{n-k} = (t+1)^{\frac{d}{d+\log_2(t+1)} n}$ combinations of $hit(H)$, i.e., the size of $\mathcal{H}$ or $\mathcal{H}'$ is always bounded from above by $(t+1)^{\frac{d}{d+\log_2(t+1)} n}$. Therefore, the total time complexity is $O\!\left((t+1)^{\frac{d}{d+\log_2(t+1)} n} mnt\right)$.
It is obvious that the space complexity is $O(|\mathcal{H}| \cdot (\text{lengths of the elements in } \mathcal{H}) + |\mathcal{H}'| \cdot (\text{lengths of the elements in } \mathcal{H}'))$. The lengths of the elements in both $\mathcal{H}$ and $\mathcal{H}'$ are bounded from above by $O(nt)$. Therefore, in the general case, the space complexity is $O((t+1)^n nt)$, and in the case that the sizes of many subsets in $\mathcal{S}$ are bounded from above by $d$, the space complexity is $O\!\left((t+1)^{\frac{d}{d+\log_2(t+1)} n} nt\right)$. □
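As an implementation note (ours, not from the paper): the proof charges $O(n \log_2(t+1))$ per lookup for an AVL tree keyed by the hit vector; in a language with built-in hashing, a dictionary keyed by the signature tuple plays the same role, as sketched below, at expected $O(n)$ per operation (the key itself has length $n$):

```python
# A signature-keyed store standing in for the AVL trees of Algorithm-1:
# it keeps one lightest sub-solution per distinct hit vector.
best = {}                                  # tuple hit(H) -> (weight, H)

def save(sig, w, H):
    """Keep only the lighter sub-solution per signature; this is exactly
    the merging step that Lemma 0.1 justifies."""
    if sig not in best or w < best[sig][0]:
        best[sig] = (w, H)
```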
Algorithm-1 reports only one solution with the minimum weight, even though in applications the problem may have multiple solutions with the minimum weight. The setting of the weights of the TFs increases the probability that any minimum-weight solution includes most of the correct TFs that regulate the differentially expressed genes. However, in some applications, we may also want to study the other top-weight solutions (because of errors in the data, the actual solution may not have the minimum weight). By modifying the algorithm so that, for each distinct way of covering, it saves the $k$ top-weight sub-solutions, the new algorithm can output the $k$ top-weight solutions. It is also easy to prove that the time complexity and space complexity of the new algorithm increase only by a factor of $k$.
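A minimal sketch of this modification (our illustration, not the paper's code): each distinct hit signature keeps a bounded heap of its $k$ lightest sub-solutions instead of a single one:

```python
import heapq
from itertools import count

_tie = count()   # tie-breaker so the heap never compares the sets themselves

def save_topk(pool, sig, w, H, k):
    """pool maps a hit signature to a max-heap (by weight, negated) of at
    most k sub-solutions; the heaviest entry is evicted when a lighter
    candidate arrives, so each covering pattern keeps its k best."""
    heap = pool.setdefault(sig, [])
    if len(heap) < k:
        heapq.heappush(heap, (-w, next(_tie), H))
    elif -w > heap[0][0]:                # candidate lighter than the heaviest
        heapq.heapreplace(heap, (-w, next(_tie), H))
```

Since each signature now stores at most $k$ entries, both the number of stored sub-solutions and the work per element grow by at most a factor of $k$, matching the claim above.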
Before we finish this section, we briefly summarize the time complexity of our algorithm and compare it with the previously reported one [12]:
-
If there are at least $\frac{\log_2(t+1)}{d+\log_2(t+1)}\, n$ subsets whose sizes are upper bounded by $d$, then the time complexity of our algorithm is $O\!\left((t+1)^{\frac{d}{d+\log_2(t+1)} n} mnt\right)$, while the time complexity of the previous best algorithm is always $\Omega((t+1)^n mn)$ [12] (and it only works for the unweighted case). As $d/(d+\log_2(t+1)) < 1$, $(t+1)^{\frac{d}{d+\log_2(t+1)} n}$ is much less than $(t+1)^n$. For example, if we let $d = 5$ (note: 85% of genes in our case have degree at most 5) and $t = 2, 3, 4$, our algorithm is bounded by $O(2.303^n mn)$, $O(2.692^n mn)$, or $O(3.002^n mn)$, respectively, while the previous best algorithm is bounded by $\Omega(3^n mn)$, $\Omega(4^n mn)$, or $\Omega(5^n mn)$, respectively. Suppose $n = 30$; then our algorithm is at least 1393 times faster if $t = 2$, 48131 times faster if $t = 3$, or 1108459 times faster if $t = 4$, compared with the previous best algorithm.
-
The time complexity shown above is only a worst-case upper bound; in most cases, the actual running time is much better. In fact, we can further improve the running time by removing a gene from the graph whenever we find that the gene's degree is less than $t$ (sketched below); thus the value of $n$ can be greatly reduced.
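For concreteness, here is a hedged sketch of that preprocessing (the gene-to-TF adjacency format is our assumption), together with a quick check of the speedup figures quoted above:

```python
def prune_low_degree_genes(gene_to_tfs, t):
    """Drop every gene regulated by fewer than t TFs: such a gene can never
    be hit t times, so it is removed from the instance, reducing n.
    gene_to_tfs: dict mapping a gene to the set of TFs connected to it."""
    return {g: tfs for g, tfs in gene_to_tfs.items() if len(tfs) >= t}

# Quick check of the quoted speedups for n = 30: previous bound ~ (t+1)^n * m * n
# versus ours ~ b^n * m * n * t, with b = 2.303, 2.692, 3.002 for t = 2, 3, 4.
for t, b in [(2, 2.303), (3, 2.692), (4, 3.002)]:
    print(t, round((t + 1) ** 30 / (b ** 30 * t)))   # ≈1393, ≈48131, ≈1108459
```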