Definitions
Definition 1
Time Series Microarray Gene Expression Dataset). We can model a time series microarray gene expression dataset (D) as a G × C × T matrix and each element of D (d
i
j
k
) corresponds to the expression value of gene i over jth sample/experimental condition across time point k, where i ∈ (g1,g2,...,g
G
), j ∈ (c1,c2,...,c
C
) and k ∈ (t1,t2,...,t
T
).
Definition 2
Tricluster). A tricluster is defined as a submatrix M(I,J,K) = [m
i
j
k
], where i ∈ I, j ∈ J and k ∈ K. The submatrix M represents a subset of genes (I) that are coexpressed over a subset of conditions (J) across a subset of time points (K).
Definition 3
(Perfect Shifting Tricluster). A Tricluster M(I,J,K) = m
i
j
k
, where i ∈ I, j ∈ J and k ∈ K, is called a perfect shifting tricluster if each element of the submatrix M is represented as: m
i
j
k
= Γ + α
i
+ β
j
+ η
k
, where Γ is a constant value for the tricluster, α
i
, β
j
and η
k
are shifting factors of ith gene, jth samples/experimental condition and kth time point, respectively. As the noise is present in microarray datasets, the deviation from actual value and expected value of each element in the dataset also exists. For this deviation, every tricluster is not a perfect one.
Cheng and Church proposed an algorithm for retrieving large and maximal biclusters that have mean squared residue score (MSR) below a threshold δ in 2D microarray gene expression dataset. They also showed that MSR of a perfect δ-bicluster and perfect shifting bicluster is zero (S = δ = 0) [2, 6]. Now extending this idea, here we present a novel definition of Mean Squared Residue score for 3D microarray gene expression datasets. The MSR (S) of a perfect shifting tricluster becomes also zero, where each element m
i
j
k
= Γ + α
i
+ β
j
+ η
k
. For delineating new MSR score (S), at first we need to define the residue score:
Let the mean of i th gene (m
i
j
k
): , the mean of j th sample/experimental condition (m
i
j
k
): , the mean of k th time point (m
i
j
k
): , and the mean of tricluster (m
i
j
k
): . Now the mean of the tricluster can be considered as the value of constant i.e. Γ=m
I
J
K
. We can define the shifting factor for the ith gene (α
i
) as the difference between m
i
j
k
and m
i
j
k
i.e. α
i
=m
i
J
K
-m
I
J
K
. Similarly, we can define shifting factor for the jth condition (β
j
) as β
j
=m
I
j
K
-m
I
J
K
and shifting factor for the kth time point (η
k
) can be defined as η
k
=m
I
J
k
-m
I
J
K
. Hence we can define each element of a perfect shifting tricluster as m
i
j
k
=Γ+α
i
+β
j
+η
k
=m
I
J
K
+(m
i
J
K
-m
I
J
K
)+(m
I
j
K
-m
I
J
K
)+(m
I
J
k
-m
I
J
K
)=(m
i
J
K
+m
I
j
K
+m
I
J
k
-2m
I
J
K
). But usually noise is evident in microarray gene expression dataset. Therefore to evaluate the difference between the actual value of an element (m
i
j
k
) and its expected value, obtained from above equation, the term “residue” can be used [6]. Thus the residue of a tricluster (r
i
j
k
) can be defined as follows: r
i
j
k
=m
i
j
k
-(m
i
J
K
+m
I
j
K
+m
I
J
k
-2m
I
J
K
)=(m
i
j
k
-m
i
J
K
-m
I
j
K
-m
I
J
k
+2m
I
J
K
).
Definition
4 (Mean Squared Residue). We define the term Mean Squared Residue MSR(I,J,K) or S of a tricluster M(I,J,K) to estimate the quality of a tricluster i.e. the level of coherence among the elements of a tricluster as follows:
(1)
Lower residue score represents larger coherence and better quality of a tricluster.
Proposed method
δ-TRIMAX aims to find largest and maximal triclusters in a 3D microarray gene expression dataset. It is an extension of Cheng and Church biclustering algorithm [2] that deals with 2-D microarray datasets. In contrast, our algorithm is capable to mine 3D gene expression dataset. There is always a submatrix in an expression dataset that has a perfect MSR(I,J,K) or S score i.e. S = 0 and this submatrix is each element of the dataset. But as mentioned above, our algorithm finds maximal triclusters having S score under a threshold δ, hence we have used a greedy heuristic approach to find triclusters. Our algorithm therefore starts with the entire dataset containing all genes, all samples/experimental conditions and all time points.
Algorithm 1 (δ-TRIMAX)
Input. D, a matrix that represents 3D microarray gene expression dataset, λ> 1, an input parameter for multiple node deletion algorithm, δ≥ 0, maximum allowable MSR score.
Output. All possible δ-triclusters.
Initialization. Missing elements in D ← random numbers, D’ ← D
Repeat
a. D’1 ← Results of Algorithm 2 on D’ using delta and λ. If the no. of genes (conditions/samples and/or no. of time points) is 50 (This value can be choosen experimentally. Large value increases the execution time of the algorithm as it then executes more number of iterations.), then do not apply Algorithm 2 on genes (conditions/samples and/or time points).
b. D’2 ← Results of Algorithm 3 on D’1 usingδ.
c. D’3 ← Results of Algorithm 4 on D’2.
d. Return D’3 and replace the elements that exist in D’ and D’3 with random numbers.
Until(No gene is found for δ-tricluster)
Initially, our algorithm removes genes or conditions or time points from the dataset to accomplish largest diminishing of score S; this step is described in the following section in which a node corresponds to a gene or experimental condition or time point in the 3D microarray gene expression dataset.
Algorithm 2 (Multiple node deletion)
Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold, λ> 1, threshold for multiple node deletion. The value of λ has been set experimentally to optimize the speed and performance (to avoid falling into local optimum) of the algorithm.
Output. M
i
j
k
, a δ-tricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.
Initialization. I ←{set of all genes }, J ←{set of all experimental conditions/ samples } and K ←{set of all time points } and to M(I,J,K) ← D
Repeat
Calculate m
i
j
k
, ∀ i ∈ I; m
i
j
k
, ∀ j ∈ J; m
i
j
k
, ∀ k ∈ K; m
i
j
k
and S.
If S ≤δ return M(I,J,K)
Else
Delete genes i ∈ I that satisfy the following inequality
Recalculate m
i
j
k
, ∀ i ∈ I; m
i
j
k
, ∀ j ∈ J; m
i
j
k
, ∀ k ∈ K; m
i
j
k
and S
Delete samples/experimental conditions j ∈ J that satisfy the following inequality
Recalculate m
i
j
k
, ∀ i ∈ I; m
i
j
k
, ∀ j ∈ J; m
i
j
k
, ∀ k ∈ K; m
i
j
k
and S
Delete time points k ∈ K that satisfy the following inequality
End if
Until(There is no change in I, J and/or K)
The complexity of this algorithm is O(max(m,n,p)) where m, n and p are the number of genes, samples and time points in the 3D microarray dataset.
In the second step, we delete one node at each iteration from the resultant submatrix, produced by Algorithm 2, until the score S of the resultant submatrix is less than or equal to δ. This step results in a δ-tricluster.
Algorithm 3 (Single node deletion)
Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold.
Output. M
i
j
k
, a δ-tricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.
Initialization. I ←{set of all genes in D }, J ←{set of experimental conditions/samples in D } and K ←{set of time points in D } and to M(I,J,K) ← D
Calculate m
i
j
k
, ∀ i ∈ I; m
i
j
k
, ∀ j ∈ J; m
i
j
k
, ∀ k ∈ K; m
i
j
k
and S.
While S >δ
Detect gene i ∈ I that has the highest score
Detect sample/experimental condition j ∈ J that has the highest score
Detect time point k ∈ K that has the highest score
Delete gene or sample/experimental condition or time point that has highest μ score and modify I or J or K. Recalculate m
i
j
k
, ∀ i ∈ I; m
i
j
k
, ∀ j ∈ J; m
i
j
k
, ∀ k ∈ K; m
i
j
k
and S.
End while
Return M(I,J,K)
The complexity of first and second steps is O(mnp) as those will iterate (m+n+p) times. The complexity of selection of best genes, samples and time points is O(log m + log n + log p). So it is suggested to use algorithm II before algorithm 3.
As the goal of our algorithm is to find maximal triclusters, having MSR score (S) below the threshold δ, the resultant tricluster M(I,J,K) may not be the largest one. That means some genes and/or experimental conditions/samples and/or time points may be added to the resultant tricluster T produced by node deletion algorithm, so that the MSR score of new tricluster T’ produced after node addition does not exceed the MSR score of T. Now the third step of our algorithm is described below.
Algorithm 4 (Node addition)
Input. D, a matrix of real numbers that represents δ-tricluster, having a subset of genes (I), a subset of experimental conditions/samples (J) and a subset of time points (K).
Output. M, a δ-tricluster, consisting of a subset of genes (I’), a subset of samples/experimental conditions (J’) and a subset of time points (K’), such that I ⊂ I ′, J ⊂ J ′, K ⊂ K’ and MSR(I ′,J ′,K’) ≤ MSR of D.
Initialization. M(I,J,K) ← D
Repeat
Calculate m
i
j
k
, ∀ i; m
i
j
k
, ∀ j; m
i
j
k
, ∀ k; m
i
j
k
and S. Add genes i ∉ I that satisfy the following inequality
Recalculate m
i
j
k
, ∀ j; m
i
j
k
, ∀; m
i
j
k
and S
Add samples/experimental conditions j ∉ J that satisfy the following inequality
Recalculate m
i
j
k
, ∀ i; m
i
j
k
, ∀ k; m
i
j
k
and S
Add time points k ∉ K that satisfy the following inequality
Until(There is no change in I, J and/or K)
I’ ← I, J’ ← J, K’ ← K Return I’, J’, K’
The complexity of this algorithm is O(mnp) as each step iterates (m+n+p) times.
Tricluster eigengene
To find tricluster eigengene we applied singular value decomposition method (SVD) on the expression data of each tricluster [7]. For instance, represents the expression matrix of ith tricluster, where g, c and t represent the number of genes, samples and time points of ith tricluster. Now we apply SVD on the data matrix (normalized to mean=0 and variance=1). Now, the SVD of ith tricluster can be represented as,
where U and V are the orthogonal matrices. Ui is a g ∗ (c ∗ t) matrix with orthonormal columns, Vi is a (c ∗ t) × (c ∗ t) orthogonal matrix and Di is (c ∗ t) × (c ∗ t) diagonal matrix of singular values.
Assuming that singular values in matrix Di are arranged in non-decreasing order, we can represent eigengene of ith tricluster by the first column of matrix Vi, i.e.