 Research
 Open Access
 Published:
A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series
Algorithms for Molecular Biology volume 4, Article number: 8 (2009)
Abstract
Background
The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NPhard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.
Methods
In this work, we propose eCCCBiclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.
Results
We present results in real data showing the effectiveness of eCCCBiclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.
Discussion
The identification of coregulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.
Availability
A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in http://kdbio.inescid.pt/software/ecccbiclustering.
Background
Time series gene expression data, obtained from microarray experiments performed in successive instants of time, can be used to study a wide range of biological problems [1], and to unravel the mechanistic drivers characterizing cellular responses [2]. Being able to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses of many interacting components, should provide the basis for understanding evolving but complex biological processes, such as disease progression, growth, development, and drug responses [2]. In this context, several machine learning methods have been used in the analysis of gene expression data [3]. Recently, biclustering [4–6], a nonsupervised approach that performs simultaneous clustering on the gene and condition dimensions of the gene expression matrix, has been shown to be remarkably effective in a variety of applications. The advantages of biclustering in the discovery of local expression patterns, described by a coherent behavior of a subset of genes in a subset of the conditions under study, have been extensively studied and documented [4–8]. Recently, Androulakis et al. [2] have emphasized the fact that biclustering methods hold a tremendous promise as more systemic perturbations are becoming available and the need to develop consistent representations across multiple conditions is required. Madeira et al. [9] have also described the use of biclustering as critical to identify the dynamics of biological systems as well as the different groups of genes involved in each biological process. However, most formulations of the biclustering problem are NPhard [10], and almost all the approaches presented to date are heuristic, and for this reason, not guaranteed to find optimal solutions [6]. In a few cases, exhaustive search methods have been used [7, 11], but limits are imposed on the size of the biclusters that can be found [7] or on the size of the dataset to be analyzed [11], in order to obtain reasonable runtimes. Furthermore, the inherent difficulty of this problem when dealing with the original realvalued expression matrix and the great interest in finding coherent behaviors regardless of the exact numeric values in the matrix, has led many authors to a formulation based on a discretized version of the expression matrix [7–9, 12–23]. Unfortunately, the discretized versions of the biclustering problem remain, in general, NPhard. Nevertheless, in the case of time series expression data the interesting biclusters can be restricted to those with contiguous columns leading to a tractable problem. The key observation is the fact that biological processes are active in a contiguous period of time, leading to increased (or decreased) activity of sets of genes that can be identified as biclusters with contiguous columns. This fact led several authors to point out the relevance of biclusters with contiguous columns and their importance in the identification of regulatory mechanisms [9, 20, 22, 24].
In this work, we propose eCCCBiclustering, a biclustering algorithm specifically developed for time series expression data analysis, that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the expression matrix. The polynomial time complexity is obtained by manipulating a discretized version of the original expression matrix and by using efficient string processing techniques based on suffix trees. These approximate patterns allow a given number of errors, per gene, relatively to an expression profile representing the expression pattern in the bicluster. We also propose several extensions to the core eCCCBiclustering algorithm. These extensions improve the ability of the algorithm to discover other relevant expression patterns by being able to deal with missing values directly in the algorithm and by taking into consideration the possible existence of anticorrelated and scaled expression patterns. Different ways to compute the errors allowed in the approximate patterns (restricted errors, alphabet range weighted errors and pattern length adaptive errors) can also be used. Finally, we propose a statistical test that can be used to score the biclusters discovered (by extending the concept of statistical significance of an expression pattern [9] to cope with approximate expression patterns) and a method to filter highly overlapping, and, therefore, redundant, biclusters. We report results in real data showing the effectiveness of the approach and its relevance in the process of identifying regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. We also show the superiority of eCCCBiclustering when compared with state of the art biclustering algorithms, specially developed for time series gene expression data analysis such as CCCBiclustering [9, 22].
Related Work: Biclustering algorithms for time series gene expression data
Although many algorithms have been proposed to address the general problem of biclustering [5, 6], and despite the known importance of discovering local temporal patterns of expression, to our knowledge, only a few recent proposals have addressed this problem in the specific case of time series expression data [9, 20, 22, 24]. These approaches fall into one of the following two classes of algorithms:

1.
Exhaustive enumeration: CCCBiclustering [9, 22] and qclustering [20].

2.
Greedy iterative search: CCTSB algorithm [24].
These three biclustering approaches work with a single time series expression matrix and aim at finding biclusters defined as subsets of genes and subsets of contiguous time points with coherent expression patterns. CCCBiclustering and qclustering work with a discretized version of the expression matrix while the CCTSBalgorithm works with the original realvalued expression matrix. In additional file 1: related_work we describe in detail these algorithms and identify their strengths and weaknesses. Based on their characteristics, we decided to compare the performance of eCCCBiclustering with that of CCCBiclustering, but not with that of the qclustering and CCTSB algorithms. The decision to exclude the last two algorithms from the comparisons is mainly based on existing analysis of these algorithms [9], and is basically related with complexity issues, in the case of qclustering, and on poor results on real data obtained by the heuristic approach used by the CCTSB algorithm.
Biclusters in discretized gene expression data
Let A' be an R row by C column gene expression matrix defined by its set of rows (genes), R, and its set of columns (conditions), C. In this context, represents the expression level of gene i under condition j. In this work, we address the case where the gene expression levels in matrix A' can be discretized to a set of symbols of interest, Σ, that represent distinctive activation levels. After the discretization process, matrix A' is transformed into matrix A, where A_{ ij }∈ Σ represents the discretized value of the expression level of gene i under condition j (see Figure 1 for an illustrative example).
Given matrix A we define the concept of bicluster and the goal of biclustering as follows:
Definition 1 (Bicluster) A bicluster is a submatrix A_{ IJ }defined by I ⊆ R, a subset of rows, and J ⊆ C, a subset of columns. A bicluster with only one row or one column is called trivial.
The goal of biclustering algorithms is to identify a set of biclusters B_{ k }= (I_{ k }, J_{ k }) such that each bicluster satisfies specific characteristics of homogeneity. These characteristics vary in different applications [6]. In this work we will deal with biclusters that exhibit coherent evolutions:
Definition 2 (CCBicluster) A column coherent bicluster A_{ IJ }is a bicluster such that A_{ ij }= A_{ lj }for all rows i, l ∈ I and columns j ∈ J.
Finding all maximal biclusters satisfying this coherence property is known to be an NPhard problem [10].
CCBiclusters in discretized gene expression time series
Since we are interested in the analysis of time series expression data, we can restrict the attention to potentially overlapping biclusters with arbitrary rows and contiguous columns [9, 20, 22, 24]. This fact leads to an important complexity reduction and transforms this particular version of the biclustering problem into a tractable problem. Previous work in this area [9, 22] has defined the concept of CCBiclusters in time series expression data and the important notion of maximality:
Definition 3 (CCCBicluster) A contiguous column coherent bicluster A_{ IJ }is a subset of rows I = {i_{1}, ..., i_{ k }} and a subset of contiguous columns J = {r, r + 1, ..., s  1, s} such that A_{ ij }= A_{ lj }, for all rows i, l ∈ I and columns j ∈ J. Each CCCBicluster defines a string S that is common to every row in I for the columns in J.
Definition 4 (rowmaximal CCCBicluster) A CCCBicluster A_{ IJ }is rowmaximal if we cannot add more rows to I and maintain the coherence property referred in Definition 3.
Definition 5 (leftmaximal and rightmaximal CCCBicluster) A CCCBicluster A_{ IJ }is leftmaximal/rightmaximal if we cannot extend its expression pattern S to the left/right by adding a symbol (contiguous column) at its beginning/end without changing its set of rows I.
Definition 6 (maximal CCCBicluster) A CCCBicluster A_{ IJ }is maximal if no other CCCBicluster exists that properly contains A_{ IJ }, that is, if for all other CCCBiclusters A_{ LM }, I ⊆ L ∧ J ⊆ M ⇒ I = L ∧ J = M.
Lemma 1 Every maximal CCCBicluster is right, left and rowmaximal.
Figure 2 shows the maximal CCCBiclusters with at least two rows (genes) present in the discretized matrix in Figure 1. CCCBiclusters with only one row, even when maximal, are trivial and uninteresting from a biological point of view and are thus discarded.
Maximal CCCBiclusters and generalized suffix trees
Consider the discretized matrix A obtained from matrix A' using the alphabet Σ. Consider also the matrix obtained by preprocessing A using a simple alphabet transformation, that appends the column number to each symbol in the matrix (see Figure 3), and considers a new alphabet Σ' = Σ × {1, ..., C}, where each element Σ' is obtained by concatenating one symbol in Σ and one number in the range {1, ..., C}. We present below the two Lemmas and the Theorem describing the relation between maximal CCCBiclusters with at least two rows and nodes in the generalized suffix tree built from the set of strings obtained after alphabet transformation [9, 22]. Figure 4 illustrates this relation using the generalized suffix tree obtained from the rows in the discretized matrix after alphabet transformation in Figure 3 together with the maximal CCCBiclusters with at least two rows (B1 to B6) already showed in Figure 2.
Lemma 2 Every rightmaximal, rowmaximal CCCBicluster with at least two rows corresponds to one internal node in T and every internal node in T corresponds to one rightmaximal, rowmaximal CCCBicluster with at least two rows.
Lemma 3 An internal node in T corresponds to a leftmaximal CCCBicluster iff it is a MaxNode.
Definition 7 (MaxNode) An internal node v in T is called a MaxNode iff it satisfies one of the following conditions:

a)
It does not have incoming suffix links.

b)
It has incoming suffix links only from nodes u_{ i } such that, for every node u_{ i }, the number of leaves in the subtree rooted at u_{ i } is inferior to the number of leaves in the subtree rooted at v.
Theorem 1 Every maximal CCCBicluster with at least two rows corresponds to a MaxNode in the generalized suffix tree T, and each MaxNode defines a maximal CCCBicluster with at least two rows.
Note that this theorem is the base of CCCBiclustering [9, 22], which finds and reports all maximal CCCBiclusters using three main steps:

1.
All internal nodes in the generalized suffix tree are marked as "Valid", meaning each of them identifies a rowmaximal, rightmaximal CCCBicluster with at least two nodes according to Lemma 2.

2.
All internal nodes identifying non leftmaximal CCCBiclusters are marked as "Invalid" using Theorem 1, discarding all rowmaximal, rightmaximal CCCBiclusters which are not leftmaximal.

3.
All maximal CCCBiclusters, identified by each node marked as "Valid", are reported.
Methods
In this section we propose eCCCBiclustering, an algorithm designed to find and report all maximal CCCBiclusters with approximate expression patterns (eCCCBiclusters) using a discretized matrix A and efficient string processing techniques. We first define the concepts of eCCCBicluster and maximal eCCCBicluster. We then formulate two problems: (1) finding all maximal eCCCBiclusters and (2) finding all maximal eCCCBiclusters satisfying row and column quorum constraints. We discuss the relation between maximal eCCCBiclusters and generalized suffix trees highlighting the differences between this relation and that of maximal CCCBiclusters and generalized suffix tree, discussed in the previous section. We then discuss and explore the relation between the two problems above and the Common Motifs Problem [25, 26]. We describe eCCCBiclustering, a polynomial time algorithm designed to solve both problems and sketch the analysis of its computational complexity. We present extensions to handle missing values, discover anticorrelated and scaled expression patterns, and consider alternative ways to compute approximate expression patterns. Finally, we propose a scoring criterion for eCCCBiclusters combining the statistical significance of their expression patterns with a similarity measure between overlapping biclusters.
CCCBiclusters with approximate expression patterns
The CCCBiclusters defined in the previous section are perfect, in the sense that they do not allow errors in the expression pattern S that defines the CCCBicluster. This means that all genes in I share exactly the same expression pattern in the time points in J. Being able to find all maximal CCCBiclusters using efficient algorithms is useful to identify potentially interesting expression patterns and can be used to discover regulatory modules [9]. However, some genes might not be included in a CCCBicluster of interest due to errors. These errors may be measurement errors, inherent to microarray experiments, or discretization errors, introduced by poor choice of discretization thresholds or inadequate number of discretization symbols. In this context, we are interested in CCCBiclusters with approximate expression patterns, that is, biclusters where a certain number of errors is allowed in the expression pattern S that defines the CCCBicluster. We introduce here the definitions of eCCCBicluster and maximal eCCCBicluster preceded by the notion of eneighborhood:
Definition 8 ( e Neighborhood) The eNeighborhood of a string S of length S, defined over the alphabet Σ with Σ symbols, N(e, S), is the set of strings S_{ i }, such that: S = S_{ i } and Hamming(S, S_{ i }) ≤ e, where e is an integer such that e ≥ 0. This means that the Hamming distance between S and S_{ i }is no more than e, that is, we need at most e symbol substitutions to obtain S_{ i }from S.
Lemma 4 The eNeighborhood of a string S, N(e, S), contains elements.
Definition 9 ( e CCCBicluster) A contiguous column coherent bicluster with e errors per gene, eCCCBicluster, is a CCCBicluster A_{ IJ }where all the strings S_{ i }that define the expression pattern of each of the genes in I are in the eNeighborhood of an expression pattern S that defines the eCCCBicluster: S_{ i }∈ N (e, S), ∀i ∈ I. The definition of 0CCCBicluster is equivalent to that of a CCCBicluster.
Definition 10 (maximal e CCCBicluster) An eCCCBicluster A_{ IJ }is maximal if it is rowmaximal, leftmaximal and rightmaximal. This means that no more rows or contiguous columns can be added to I or J, respectively, maintaining the coherence property in Definition 9.
Given these definitions we can now formulate the problem we solve in this work:
Problem 1 Given a discretized expression matrix A and the integer e ≥ 0 identify and report all maximal eCCCBiclusters .
Similarly to what happened with CCCBiclusters, eCCCBiclusters with only one row should be overlooked. A similar problem is that of finding and reporting only the maximal eCCCBiclusters satisfying predefined row and column quorum constraints:
Problem 2 Given a discretized expression matrix A and three integers e ≥ 0, q_{ r }≥ 2 and q_{ c }≥ 1, where q_{ r }is the row quorum (minimum number of rows in I_{ k }) and q_{ c }is the column quorum (minimum number of columns in J_{ k }), identify and report all maximal eCCCBiclusters such that, I_{ k }and J_{ k }have at least q_{ r }rows and q_{ c }columns, respectively.
Figure 5 shows all maximal eCCCBiclusters with at least rows (genes), which are present in the discretized matrix in Figure 1, when one error per gene is allowed (e = 1). Figure 6 shows all maximal eCCCBiclusters identified using row and column constraints. In this case, the maximal 1CCCBiclusters having at least three rows and three columns (q_{ r }= q_{ c }= 3) are shown. Also clear in these figures is the fact that, when errors are allowed (e > 0), different expression patterns S can define the same eCCCBicluster. Furthermore, when e > 0, an eCCCBicluster can be defined by an expression pattern S, which does not occur in the discretized matrix in the set of contiguous columns in the eCCCBicluster.
Maximal eCCCBiclusters and generalized suffix trees
In the previous section we showed that each internal node in the generalized suffix tree, constructed for the set of strings corresponding to the rows in the discretized matrix after alphabet transformation, identifies exactly one CCCBicluster with at least two rows (maximal or not) (see Lemma 2). We also showed that each internal node corresponding to a MaxNode (see Definition 7) in the generalized suffix tree identifies exactly one maximal CCCBicluster and that each maximal CCCBicluster is identified by exactly one MaxNode (see Lemma 3 and Theorem 1). This also implies that a maximal CCCBicluster is identified by one expression pattern, which is common to all genes in the CCCBicluster within the contiguous columns in the bicluster. Moreover, all expression patterns identifying maximal CCCBiclusters always occur in the discretized matrix and thus correspond to a node in the generalized suffix tree (see Figure 4).
When errors are allowed, one eCCCBicluster (e > 0) can be identified (and usually is) by several nodes in the generalized suffix tree, constructed for the set of strings corresponding to the rows in the discretized matrix after alphabet transformation, and one node in the generalized suffix tree may be related with multiple eCCCBiclusters (maximal or not) (see Figure 7). Moreover, a maximal eCCCBicluster can be defined by several expression patterns (see Figure 5 and Figure 6). Upon all this, a maximal eCCCBicluster can be defined by an expression pattern not occurring in the expression matrix and thus not appearing in the generalized suffix tree (see Figure 6 and Figure 7).
Furthermore we cannot obtain all maximal eCCCBiclusters using the set of maximal CCCBiclusters by: 1) extending them with genes by looking for their approximate patterns in the generalized suffix tree, or 2) extending them with e contiguous columns (see Figure 5 and Figure 8). It is also clear from Figure 8 that extending maximal CCCBiclusters can in fact lead to the discovery of non maximal eCCCBiclusters. For the reasons stated above we cannot use the same searching strategy used to find maximal CCCBiclusters when looking for maximal eCCCBiclusters (e > 0). We therefore need to explore the relation between finding eCCCBiclusters and the Common Motifs Problem, as explained below.
Finding eCCCBiclusters and the common motifs problem
There is an interesting relation between the problem of finding all maximal eCCCBiclusters, discussed in this work, and the well known problem of finding common motifs (patterns) in a set of sequences (strings). For the first problem, and to our knowledge, no efficient algorithm has been proposed to date. For the latter problem (Common Motifs Problem), several efficient algorithms based on string processing techniques have been proposed to date [25, 26]. The Common Motifs Problem is as follows [26]:
Common Motifs Problem Given a set of N sequences S_{ i }(1 ≤ i ≤ N) and two integers e ≥ 0 and 2 ≤ q ≤ N, where e is the number of errors allowed and q is the required quorum, find all models m that appear in at least q distinct sequences of S_{ i }.
During the design of eCCCBiclustering, we used the ideas proposed in SPELLER [26], an algorithm to find common motifs in a set of N sequences using a generalized suffix tree T. The motifs searched by SPELLER correspond to words, over an alphabet Σ, which must occur with at most e mismatches in 2 ≤ q ≤ N distinct sequences. Since these words representing the motifs may not be present exactly in the sequences (see SPELLER for details), a motif is seen as an "external" object and called model. In order to be considered a valid model, a given model m of length m has to verify the quorum constraint: m must belong to the eneighborhood of a word w in at least q distinct sequences.
In order to solve the Common Motifs Problem, SPELLER builds a generalized suffix tree T for the set of sequences S_{ i }and then, after some further preprocessing, uses this tree to "spell" the valid models. Valid models verify two properties [26]:

1.
All the prefixes of a valid model are also valid models.

2.
When e = 0, spelling a model leads to one node v in T such that L(v) ≥ q, where L(v) denotes the number of leaves in the subtree rooted at v.
When e > 0, spelling a model leads to a set of nodes v_{1}, ..., v_{ k }in T for which , where L(v_{ j }) denotes the number of leaves in the subtree rooted at v_{ j }.
In these settings, and since the occurrences of a model are in fact nodes of the generalized suffix tree T, these occurrences are called nodeoccurrences [26]. The goal of SPELLER is thus to identify all valid models by extending them in the generalized suffix tree and to report them together with their set of nodeoccurrences. We present here an adaptation of the definition of nodeoccurrence used in SPELLER. In SPELLER, a nodeoccurrence is defined by a pair (v, v_{ err }) and not by a triple (v, v_{ err }, p), as in this work. For clarity, SPELLER was originally exemplified [26] in an uncompacted version of the generalized suffix tree, that is, a trie (although it was proposed to work with a generalized suffix tree). However, and as pointed out by the authors, when using a generalized suffix tree, as in our case, we need to know at any given step in the algorithm whether we are at a node or in an edge between nodes v and v'. We use p to provide this information, and redefine nodeoccurrence as follows:
Definition 11 (nodeoccurrence) A nodeoccurrence of a model m is a triple (v, v_{ err }, p), where v is a node in the generalized suffix tree T and v_{ err }is the number of mismatches between m and the stringlabel of v computed using Hamming(m, stringlabel(v)). The integer p ≥ 0 identifies a position/point in T such that:
1. If p = 0: we are exactly at node v.
2. If p > 0: we are in E(v), the edge between father_{ v }and v, in a point p between two symbols in label(E(v)) such that 1 ≤ p < label(E(v)).
Consider a model m, a symbol α in the alphabet Σ, a node v in T, its father father_{ v }, the edge between father_{ v }and v, E(v), the edgelabel of E(v), label(E(v)) and its edgelength, label(E(v)). The modified version of SPELLER described below is based on the following Lemmas (adapted from SPELLER):
Lemma 5 (v, v_{ err }, 0) is a nodeoccurrence of a model m' = mα, if, and only if:
1. Match:
(father_{ v }, v_{ err }, 0) is a nodeoccurrence of m and label(E(v)) = α.
The edgelabel of E(v) has only one symbol and this symbol is α.
or
(v, v_{ err }, label(E(v)) 1) is a nodeoccurrence of m and label(E(v)) [label(E(v))] = α.
The last symbol in label(E(v)) is α.
2. Substitution:(father_{ v }, v_{ err }1, 0) is a nodeoccurrence of m and label(E(v)) = β ≠ α.
The edgelabel of E(v) has only one symbol and this symbol is not α.
or
(v, v_{ err } 1, label(E(v))  1) is a nodeoccurrence of m and label(E(v)) [label(E(v))] = β ≠ α.
The last symbol in label(E(v)) is not α.
Lemma 6 (v, v_{ err }, 1) is a nodeoccurrence of a model m' = mα, if, and only if:
1. Match:(father_{ v }, v_{ err }, 0) is a nodeoccurrence of m and label(E(v))[1] = α.
2. Substitution: (father_{ v }, v_{ err } 1, 0) is a nodeoccurrence of m and label(E(v))[1] = β ≠ α.
Lemma 7 (v, v_{ err }, p), 2 ≤ p < label(E(v) is a nodeoccurrence of a model m' = mα, if, and only if:
1. Match:(v, v_{ err }, p  1) is a nodeoccurrence of m and label(E(v) [p] = α.
2. Substitution:(v, v_{ err } 1, p  1) is a nodeoccurrence of m and label(E(v)) [p] = β ≠ α.
Consider now the discretized matrix A obtained from matrix A' using the alphabet Σ. We preprocess A using the same alphabet transformation used in CCCBiclustering. Remember that we append the column number to each symbol in the matrix and consider a new alphabet Σ' = Σ × {1, ..., C} (see Figure 3). We will now show that SPELLER can be adapted to extract all rightmaximal eCCCBiclusters from this transformed matrix A by building a generalized suffix tree for the set of R strings S_{ i }obtained from each row in A and use it to "spell" the valid models using the symbols in the new alphabet Σ'.
Given the set of R strings S_{ i }, the number of allowed errors e ≥ 0 and the quorum constraint 2 ≤ q ≤ R, the goal is now to find the set of all rightmaximal valid models m, identifying expression patterns that are present in at least q distinct rows starting and ending at the same columns. Note that the valid models identified by the original SPELLER algorithm are already rowmaximal. However they may be non rightmaximal, non leftmaximal, and start at different positions in the sequences. Under these settings, the set of nodeoccurrences of each valid model m and the model itself in our modified version of SPELLER identifies one rowmaximal, rightmaximal eCCCBicluster with q rows and a maximum of C contiguous columns. Furthermore, it is possible to find all rightmaximal eCCCBiclusters by fixing the quorum constraint, used to specify the number of rows/genes necessary to identify a model as valid, to the value q = 2. In this context, and in order to be able to solve not only Problem 1 but also Problem 2, we adapted SPELLER to consider not only a row constraint, 2 ≤ q_{ r }≤ R, but also an additional column constraint, 1 ≤ q_{ c }≤ C.
Figure 7 shows the generalized suffix tree used by our modified version of SPELLER when it is applied to the discretized matrix after alphabet transformation in Figure 3. We can also see in this figure the five maximal 1CCCBiclusters B1, B2, B3, B4 and B5, already shown in Figure 6, identified by five valid models, when e = 1 and the values q_{ r }and q_{ c }, specifying the row and column constraints, respectively, are set to 3. The maximal 1CCCBiclusters B1 to B5 are defined, respectively, by the following valid models: m = [D1 U2 D3 U4 N5] (three nodeoccurrences labeled with B1); m = [D2 D3 U4] (three nodeoccurrences labeled with B2), m = [D3 U4 N5] (four nodeoccurrences labeled with B3), m = [N2 D3 U4] (four nodeoccurrences labeled with B4) and m = [U2 D3 U4 D5] (four nodeoccurrences labeled with B5). It is also possible to observe in this figure that, when e > 0, a model can be valid without being right/leftmaximal and that several valid models may identify the same eCCCBicluster. For example, m = [D1 U2 D3] is valid but it is not rightmaximal, m = [D3 U4 D5] is also valid but it is not leftmaximal, and finally the models m = [D1 U2 D3 U4 N5] and m = [N1 U2 D3 U4 D5] are both valid but identify the same 1CCCBicluster B1. Figure 4 shows the generalized suffix tree used when e = 0, q_{ r }= 2 and q_{ c }= 1. Since no errors are allowed the generalized suffix tree is the same as the one used by CCCBiclustering and the maximal 0CCCBiclusters identified correspond in fact to the maximal CCCBiclusters in Figure 2.
In the next section we describe the details of the modified version of SPELLER that we used to identify all rightmaximal eCCCBiclusters. However, and for clarity, we summarize here the main differences between the original version of SPELLER and the modified version (procedure computeRightMaximalBiclusters in the next section), which we use as the first step of the eCCCBiclustering algorithm. While reading the differences listed below have in mind that in order to be maximal, an eCCCBicluster must be rowmaximal, rightmaximal and leftmaximal. Moreover, all the approximate patterns identifying genes in an eCCCBicluster must start and end at the same columns.

1.
In SPELLER a nodeoccurrence is defined by a pair (v, v_{ err }) since (for clarity) the algorithm was exemplified using a trie and not a generalized suffix tree, as explained above. As such we redefined the original concept of nodeoccurrence to use the triple (v, v_{ err }, p) (see Definition 11), adapted the three original Lemmas in SPELLER to use the new definition of nodeoccurrence (see Lemma 5, Lemma 6 and Lemma 7), and rewrote SPELLER to use a generalized suffix tree.

2.
In SPELLER a model can be valid without being right/leftmaximal. As such all models satisfying the quorum constraint are stored for further reporting. This means that the valid models reported by SPELLER are only rowmaximal. We only store valid models that cannot be extended to the right without loosing genes, that is valid models which are both rowmaximal are rightmaximal. This implied modifying the original procedure storeModel in SPELLER in order to include the procedure checkRightMaximality (see procedure spellModels in the next section, for details).

3.
In SPELLER the nodeoccurrences of a valid model can start in any position in the sequences. In our modified version of this algorithm all nodeoccurrences of a valid model must start in the same position (same column in the discretized matrix) in order to guarantee that they belong to an eCCCBicluster. As such we modified the construction of the generalized suffix tree used in SPELLER in order to be constructed using the set of strings corresponding to the set of rows in the discretized matrix after alphabet transformation. We also modified all the procedures used in SPELLER for model extension. Note that it is not possible to modify SPELLER in order to check if a valid model that is rightmaximal is also leftmaximal. This is so since we can only guarantee that a model is/is not leftmaximal once we have computed all valid models corresponding to rightmaximal eCCCBiclusters. This justifies why we need to discard valid models which are not leftmaximal in the next step of the algorithm and did not integrate this step in our modified version of SPELLER.
In this context, we also show in the next section that the proposed eCCCBiclustering algorithm will need three steps to identify all maximal eCCCBiclusters without repetitions: a first step to identify all rightmaximal eCCCBiclusters (for this we use the modified version of SPELLER), a second step to discard all rightmaximal eCCCBiclusters which are not leftmaximal, and finally a third step to discard repetitions, that is maximal valid models identifying the same maximal eCCCBicluster.
Note that the original SPELLER algorithm does not eliminate repetitions (different valid models with the same set of nodeoccurrences). Furthermore, we also cannot integrate the elimination of valid models corresponding to the same rightmaximal eCCCBiclusters in our modified version of SPELLER since we need the set of all valid models corresponding to rightmaximal eCCCBiclusters in order to discard valid models which are not leftmaximal in the second step of eCCCBiclustering.
eCCCBiclustering: Finding and reporting all maximal eCCCBiclusters in polynomial time
This section presents eCCCBiclustering, a polynomial time biclustering algorithm for finding and reporting all maximal CCCBiclusters with approximate patterns (eCCCBiclusters), and describes its main steps. Algorithm 1 is designed to solve Problem 2: identify and report all maximal eCCCBiclusters such that I_{ k }and J_{ k }have at least q_{ r }rows and q_{ c }columns, respectively. The proposed algorithm is easily adapted to solve problem 1 (identify and report all maximal eCCCBiclusters without quorum constraints) by fixing the values of q_{ r }and q_{ c }to the values two and one, respectively. The proposed algorithm is based on the following steps (described in detail below):
[Step 1] Computes all valid models corresponding to rightmaximal eCCCBiclusters. Uses the discretized matrix A after alphabet transformation, the quorum constraints q_{ r }and q_{ c }, a generalized suffix tree and a modified version of SPELLER.
[Step 2] Deletes all valid models not corresponding to leftmaximal eCCCBiclusters. Uses all valid models computed in Step 1 and a trie.
[Step 3] Deletes all valid models representing the same eCCCBiclusters. Uses all valid models corresponding to maximal eCCCBiclusters (both left and right) computed in Step 2 and a hash table. Note that this step is only needed when e > 0.
[Step 4] Reports all maximal eCCCBiclusters.
Algorithm 1: eCCCBiclustering
Input : A, Σ, e, q_{ r }, q_{ c }
Output: Maximal eCCCBiclusters.
1 {S_{1}, ..., S_{R}} ← alphabetTransformation(A, Σ)
2 modelsOcc ← {}
3 computeRightMaximalBiclusters(Σ, e, q_{ r }, q_{ c }, {S_{1}, ..., S_{R}}, modelsOcc)
4 deleteNonLeftMaximalBiclusters(modelsOcc)
5 if e > 0 then
6 deleteRepeatedBiclusters(modelsOcc)
7 reportMaximalBiclusters(modelsOcc)
Detailed discussions can be found in additional file 2: algorithmic_complexity_details.
Computing valid models corresponding to rightmaximal eCCCBiclusters
In step 1 of eCCCBiclustering we compute all valid models m together with their nodeoccurrences Occ_{ m }corresponding to rightmaximal eCCCBiclusters. The details are shown in the procedure computeRightMaximalBiclusters below, which corresponds to a modified version of SPELLER.
Procedure computeRightMaximalBiclusters
Input: Σ, e, q_{ r }, q_{ c }, {S_{1}, ..., S_{R}}, modelsOcc
/* The value of modelsOcc is updated. */
1 T_{ right }← constructGeneralizedSuffixTree({S_{1}, ..., S_{R}})
2 addNumberOfLeaves(T_{ right }) /* Adds L(v) to each node v in T_{ right }. */
3 if e ≠ 0 then
4 addColorArray(T_{ right })
/* Adds colors_{ v }to every node v in T_{ right }: colors_{ v }[i] = 1, if there is a leaf in the subtree rooted at v that is a suffix of S _{ i }; colors_{ v }[i] = 0, otherwise. */
5 m ← "" /* model m is a string [m [1] ... m [length_{ m }1]] */
6 length_{ m }← 0
7 father_{ m }← "" /* father_{ m }is a string [m[1] ... m [length_{ m }1]] */
8 ← 0
9 Occ_{ m }← {} /* List of nodeoccurrences (v, v_{ err }, p) */
10 addNodeOccurrence(Occ_{ m }, (root(T_{ right }), 0, 0))
11 Ext_{ m }← {} /* Ext_{ m }is the set of possible symbols α to extend the model m. */
12 if e = 0 then
13 forall edges E(v_{ i }) leaving from node root(T_{ right }) to a node v_{ i }do
14 if label(E(v_{ i }))[1]is not a string terminator then
15 addSymbol(Ext_{ m }, label(E(v_{ i }))[1])
16 else
17 forall symbols in Σ' do
/* Σ' must be in lexicographic order. */
18 addSymbol(Ext_{ m }, Σ' [i])
19 length_{ m }← 0
20 spellModels(Σ, e, q_{ r }, q_{ c }, modelsOcc, T_{ right }, m, length_{ m }, Occ_{ m }, Ext_{ m }, father_{ m }, )
In this procedure we use the transformed matrix A as input and store the results in the list modelsOcc, which stores triples with the following information (m, genesOcc_{ m }, numberOfGenesOcc_{ m }), where m is the model, genesOcc_{ m }is a bit vector containing the distinct genes in the nodeoccurrences of m, Occ_{ m }, and numberOfGenesOcc_{ m }is the number of bits set to 1 in genesOcc_{ m }and, therefore, the number of genes where the model occurs. This information is computed using the procedure spellModels described below, which corresponds to a modified version of the procedure with the same name used in SPELLER).
Procedure spellModels
/* Called recursively. Stores rightmaximal e CCCBiclusters in modelsOcc. */
Input : Σ, e, q_{ r }, q_{ c }, modelsOcc, T_{ right }, m, length_{ m }, Occ_{ m }, Ext_{ m }, father_{ m },
/* The value of modelsOcc is updated. *//* C is the length of the longest model */
1 keepModel(q_{ r }, q_{ c }, modelsOcc, T_{ right }, m, length_{ m }, Occ_{ m }, father_{ m },
2 if length_{ m }≤ C then
3 forall symbols α in Ext _{ m } do
4 if α is not a string terminator then
5 maxGenes ← 0/* Sum of L(v) for all nodeoccurrences (v, v_{ err }, p) in Occ_{ mα } */
6 minGenes← ∞/* Minimum L(v) in all nodeoccurrences (v, v_{ err }, p) in Occ_{ mα } */
7 Colors_{ mα }← {}
8 if e > 0 then
9 Colors_{ mα }[i] ← 0, 1 ≤ i ≤ R
/* colors_{ mα }[i] = 1, if there is a nodeoccurrence of m in S_{ i }; */
/* colors_{ mα }[i] = 0, otherwise */
10 Ext_{ mα }← {}
11 Occ_{ mα }← {}
12 forall nodeoccurrences (v, v_{ err }, p) in Occ_{ m }do
/* If p = 0 we are at node v. Otherwise, we are at edge E(v) between nodes father(v) and v at point p> 0. */
13 if p = 0 then
14 extendFromNodeWithoutErrors(Σ, e, T_{ right }, (v, v_{ err }, p), m, α, Occ_{ mα }, Colors_{ mα }, Ext_{ mα }, maxGenes, minGenes)
15 if (v_{ err }<e) then
16 extendFromNodeWithErrors(Σ, e, T_{ right }, (v, v_{ err }, p), m, α, Occ_{ mα }, Colors_{ mα }, Ext_{ mα }, maxGenes, minGenes)
17 else
18 extendFromEdgeWithoutErrors(T_{ right }, Σ, e, (v, v_{ err }, p), m, α, m, Occ_{ mα }, Colors_{ mα }, Ext_{ mα }, maxGenes, minGenes)
19 if x_{ err }<e then
20 extendFromEdgeWithErrors(Σ, e, T_{ right }, (v, v_{ err }, p), m, α, Occ_{ mα }, Colors_{ mα }, Ext_{ mα }, maxGenes, minGenes)
21 if modelHasQuorum(maxGenes, minGenes, Colors_{ mα }, q_{ r }) then
22 spellModels(Σ, e, q_{ r }, q_{ c }, modelsOcc, T_{ right }, mα, length_{ m }+ 1, Occ_{ mα }, Ext_{ mα }, father_{ mα }, numberOfGenesOcc_{ m })
The recursive procedure spellModels (modified to extract valid models corresponding to rightmaximal eCCCBiclusters) is now able to:

1.
Use a generalized suffix tree T_{ right }and define nodeoccurrences as triples (v, v_{ err }, p), where p is used throughout the algorithm to find out whether we are at node v (p = 0) or in an edge E(v) between nodes v and father_{ v }(p > 0).

2.
Check if a valid model m corresponds to a rightmaximal eCCCBicluster. This is performed using the procedure checkRightMaximality inside the procedure keepModel. This procedure deletes from the list of stored models, modelsOcc, a valid model m when the result of its extension with a symbol α, mα, is also a valid model and the set of nodeoccurrences of mα, Occ_{ mα }, has as many genes as the set of nodeoccurrences of its father m, Occ_{ m }. When this is the case, m no longer corresponds to a rightmaximal eCCCBicluster since its expression pattern can be extended to the right with the symbol α without losing genes.

3.
Restrict the extensions of a given model m, Ext_{ m }, to the level of the model in the generalized suffix tree (column of the last symbol in m). When we are extending a model m with a symbol α (eventually extracting a valid model mα), the column number of the last symbol in m, m [length_{ m }], is C(m [length_{ m }]), where C(m [length_{ m }]) ∈ {1, ..., C}, and errors are still allowed, α can only be one of the symbols in the set , where corresponds to the subset of elements in Σ' whose column is equal to C(m [length_{ m }])) + 1. For example, if Σ = {D, N, U} and the model m = [D1] is being extended, the possible symbols α with which m can be extended to mα must be in = {D 2. N 2, U 2}. In the same way, if m = [D2 U3], the possible symbols α with which m can be extended to mα are in = {D 4, N 4, U 4}.
The algorithmic details of the procedures and functions called in the recursive procedure spellModels are described in additional file 2: algorithmic_complexity_details.
Deleting valid models not corresponding to leftmaximal eCCCBiclusters
In step 2 of eCCCBiclustering (details in procedure deleteNonLeftMaximalBiclusters below), we remove from the valid models stored in modelsOcc (identifying rightmaximal eCCCBiclusters) those not corresponding to leftmaximal eCCCBiclusters. These models are removed from modelsOcc by first building a trie with the reverse patterns of all (rightmaximal) models m and storing the number of genes in numberOfGenesOcc_{ m }in its corresponding node in the trie. After this, it is sufficient to mark as "non leftmaximal" any node in the trie that has at least one child with as many genes as itself. This is easily achieved by performing a depthfirst search (dfs) of the trie and computing, for each node, the maximum value amongst the values of numberOfGenesOcc_{ m }stored in its children. The models whose corresponding node in the trie is marked as "non leftmaximal" are then removed from modelsOcc.
Procedure deleteNonLeftMaximalBiclusters
Input: modelsOcc
/* The value of modelsOcc is updated. */
1 T_{ left }← createTrie ()
/* Array which will store references to nodes in T_{ left }*/
2 R_{ nodes }← {}
3 foreach model and occurrences (m, genesOcc_{ m }, numberOfGenesOcc_{ m }) in modelsOcc do
4 m_{ r }← ReverseModel(m)
5 nodeRepresentingModel ← addReverseModelToTrie(T_{ left }, m_{ r })
/* Each node in T_{ left }stores two integers: 1) the number of genes in the model it represents, genes_{ v }(0 if it does not represent the end of a model); and 2) the maximum number of genes in the subtree rooted at v, (computed later). Both these values are initialized with 0. */
6 addNumberOfGenes(nodeRepresentingModel,numberOfGenesOcc_{ m })
7 addReferenceToNode(R_{ nodes }, nodeRepresentingModel)
8 forall nodes v in T _{ left } do
/* Performed using a depthfirst search (dfs) */
9 if genes_{ v }> 0 then
/* Node v represents a model and is potentially leftmaximal. */
10 Mark v as "leftmaximal"
11 else
12 Mark v as "non leftmaximal"
13 Compute the maximum number of genes in the subtree rooted at v
14 foreach node v in T _{ left } do
/* Performed using a depthfirst search (dfs) */
15 if genes_{ v }> 0 and genes_{ v }= then
16 Mark v as "non leftmaximal"
17 p_{ modelsOcc }← 0
18 foreach model and occurrences (m, genesOcc_{ m }, numberOfGenesOcc_{ m }) in modelsOcc do
19 if R_{ nodes }[p_{ modelsOcc }] is marked as "nonleft maximal" then
20 deleteModelAndOccurrences(modelsOcc, m)
21 p_{ modelsOcc }← p_{ modelsOcc }+ 1
Deleting valid models representing the same eCCCBiclusters
When errors are allowed, different valid models may identify the same eCCCBicluster. Step 3 of eCCCBiclustering, described in detail in procedure deleteRepeatedBiclusters below, uses a hash table to remove from modelsOcc all the valid models that, although maximal (left and right), identify repeated eCCCBiclusters. This is needed because all valid models m with the same first and last columns and the same set of genes represent the same maximal eCCCBicluster.
Procedure deleteRepeatedBiclusters
Input: modelsOcc
/* The value of modelsOcc is updated. */
1 H ← createHashTable()
2 foreach model and occurrences (m, genesOcc_{ m }, numberOfGenesOcc_{ m }) in modelsOcc do
3 firstColumn_{ m }= C(m[1])
4 lastColumn_{ m }= C(m [length_{ m }])
5 key ← createKey(firstColumn, lastColumn, genesOcc_{ m })
6 value ← (firstColumn, lastColumn, genesOcc_{ m })
7 if containsKey(H, key) then
8 value_{ key }← getValue(H, key)
9 if value = value_{ key }then
/* H already has a value representing the same eCCCBicluster */
10 deleteModelAndOccurrences(modelsOcc, m)
11 else
12 insertKeyValue(key, value)
13 else
14 insertKeyValue(key, value)
Reporting all maximal eCCCBiclusters
After the three main steps of eCCCBiclustering the list modelsOcc stores all valid models corresponding to maximal eCCCBiclusters satisfying the quorum constraints q_{ r }and q_{ c }. In this context, the reporting procedure reportMaximalBiclusters, described below, lists these eCCCBiclusters using the information stored in the model m (needed to identify the expression pattern and the columns in each eCCCBicluster) and the bit vector genesOcc (needed to identify the genes in the eCCCBicluster).
Procedure reportMaximalBiclusters
Input: modelsOcc
1 foreach model and occurrences (m, genesOcc_{ m }, numberOfGenesOcc_{ m }) in modelsOcc do
2 firstColumn_{ m }= C(m[1])
3 lastColumn_{ m }= C(m [length_{ m }])
4 print(m, firstColumn_{ m }, lastColumn_{ m }, genesOcc_{ m })
eCCCBiclustering: Complexity analysis
In this section we sketch an analysis of the complexity of eCCCBiclustering. For a detailed complexity analysis see additional file 2: algorithmic_complexity_details.
Given a discretized matrix A with R rows and C columns, the alphabet transformation performed using the procedure alphabetTransformation takes O(RC) time.
The complexity of computing all valid models corresponding to rightmaximal eCCCBiclusters using procedure computeRightMaximalBiclusters takes O(R^{2}C^{1 + e}Σ^{e}) operations. The construction of T_{ right }and the computation of L(v) for all its nodes takes O(RC) time each, using Ukkonen's algorithm with appropriate data structures, and a dfs, respectively. The increase in the alphabet size from Σ to CΣ due to the alphabet transformation does not affect the O(RC) construction and manipulation of the generalized suffix tree [9]. When e > 0, adding the color array to all nodes in T_{ right }takes O(R^{2}C) time. Initializing Ext_{ m }takes O(CΣ) and spellModels is O(R^{2}C^{1 + e}Σ^{e}). The complexity of this step of the algorithm is bounded by the complexity of spellModels and is thus O(R^{2}C^{1+e}Σ^{e}). The complexity of deleting from modelsOcc all valid models that are not leftmaximal using procedure deleteNonLeftMaximalBiclusters is O(RC^{2+e}Σ^{e}). Since the number of models in modelsOcc is O(RC^{1+e}Σ^{e}) and the size of the models is O(C), the trie T_{ left }can be constructed and manipulated in O(RC^{2 + e}Σ^{e}).
The complexity of deleting from modelsOcc all models representing the same eCCCBiclusters with procedure deleteRepeatedBiclusters takes O(R^{2}C^{1 + e}Σ^{e}). Since computing the hash key for each of the O(RC^{1 + e}Σ^{e}) models in modelsOcc takes O(R) time, the overall complexity of this step is O(R^{2}C^{1 + e}Σ^{e}).
Since the number of genes in genesOcc_{ m }is O(R) and computing the first and last column of the valid model m takes constant time, reporting all maximal eCCCBiclusters using procedure reportMaximalBiclusters is O(R^{2}C^{1+e}Σ^{e}).
Therefore, the asymptotic complexity of the proposed eCCCBiclustering algorithm is O(max (R^{2}C^{1+e}Σ^{e}, RC^{2 + e}Σ^{e})). However, in most cases of interest R >>C and the complexity becomes O(R^{2}C^{1+e}Σ^{e}). Moreover, when e = 0, CCCBiclustering [9, 22] can be used to obtain O(RC).
Extensions to handle missing values, anticorrelated and scaled expression patterns
In this section we present extensions to eCCCBiclustering able to handle missing values and discover anticorrelated (opposite patterns) and scaled (patterns with different expression rates) expression patterns. In the subsections below we consider the illustrative example in Figure 9, corresponding to a modified version of the example in Figure 1. We now assume that some expression values are missing.
Handling missing values
Since eCCCBiclustering cannot deal with missing values directly, genes with missing values have to be removed, or missing values have to be filled, as a preprocessing step. In this section we present extensions that enable direct processing of the expression matrix with missing values. Our goal is to consider all available time points and thus always include the expression pattern of a gene as input to the extended version of the algorithm. Nevertheless genes with more than a predefined percentage of missing values can still be discarded in a preprocessing step.
Dealing with missing values in eCCCBiclustering is straightforward and can be performed in two ways:

1.
Considering missing values as valid errors.

2.
"Jumping over" missing values.
In order to consider missing values as valid errors we modify eCCCBiclustering as follows:

The initialization of Ext_{ m }in procedure computeRightMaximalBiclusters must include the symbol used for missing value, when e > 0, and ignore all edges descending from the root starting with this symbol, when e = 0.

The extension of a model m with a symbol α in spellModels must take into account the following: α can either be, or not be, the symbol used for missing value, depending on whether we are performing an extension without errors or performing an extension with errors, respectively.
For details, see procedures extendFromNodeWithoutErrors and extendFromEdgeWithoutErrors, in case of extensions without errors, or procedures extendFromNodeWithErrors and extendFromEdgeWithErrors, in case of extensions with errors. These procedures are called in spellModels and described in additional file 2: algorithmic_complexity_details.
Consider the illustrative example in Figure 9, where some gene expression values are missing.
Figure 10 shows the generalized suffix tree T_{ right }and the two maximal 1CCCBiclusters (B1 and B2) identified by two valid models when e = 1, q_{ r }= q_{ c }= 3 and missing values are considered as valid errors.
In order to "jump over" missing values we modify eCCCBiclustering as follows:

After alphabet transformation, we construct the generalized suffix tree T_{ right }, used in procedure computeRightMaximalBiclusters, using the set of strings without missing values , where r_{ i }is the number of contiguous sets of symbols without missing values in row i. The set of substrings of each string S_{ i }(gene i), , is inserted in T using the same terminator $i.
Consider, for example, the string corresponding to the expression pattern of gene G2 in the illustrative example in Figure 9. In this case, and in order to "jump over" the missing value in the time points C3 and C5, we insert in T_{ right }two strings corresponding to each of the two contiguous sets of symbols without missing values in the expression pattern of G2: = [D1 U2 $2] and = [U4 $2]. Note that the same terminator $2 is used for all the substrings of row i: and .
Figure 11 shows the generalized suffix tree T_{ right }constructed for the matrix after alphabet transformation in Figure 9 together with the four maximal 1CCCBiclusters (B1, B2, B3 and B4), identified by four valid models, when e = 1, q_{ r }= 3, q_{ c }= 2 and the algorithm "jumps over" missing values.
The asymptotic complexity of both versions of this extended version of eCCCBiclustering remains O(max (R^{2}C^{1+e}Σ^{e}, RC^{2+e}Σ^{e})). When e = 0, a modified version of CCCBiclustering [27] can be used to achieve the linear time complexity O(RC), if repeated CCCBiclusters are not filtered. In order to eliminate repetitions, the asymptotic complexity is now O(R^{2}C).
Handling anticorrelated expression patterns
Given the importance of anticorrelation relationships in the study of transcription regulation using time series expression data we present here the extension of eCCCBiclustering to extract maximal eCCCBiclusters with signchanges, that is maximal eCCCBiclusters allowing genes with opposite expression patterns. We first define formally the concepts of opposite expression pattern, eCCCBicluster with signchanges, and maximal eCCCBicluster with signchanges:
Definition 12 ( e CCCBicluster with SignChanges) An eCCCBicluster with signchanges A_{ IJ }is an eCCCBicluster where all the strings S_{ i }that define the expression pattern of each of the genes in I are either in the eNeighborhood of the expression pattern S that defines the eCCCBicluster, or in the eneighborhood of its opposite expression pattern S^{1}: S_{ i }∈ N (e; S) or S_{ i }∈ N (e, S^{1}), ∀i ∈ I.
Definition 13 (Maximal e CCCBicluster with SignChanges) An eCCCBicluster with signchanges A_{ IJ }is maximal if it is rowmaximal, leftmaximal and rightmaximal. This means that no more rows or contiguous columns can be added to I or J, respectively, maintaining the coherence property in Definition 12.
In order to discover maximal eCCCBiclusters with signchanges we modify eCCCBiclustering as follows:

We construct the generalized suffix tree T_{ right }, used in procedure computeRight MaximalBiclusters, for the set of strings S_{ i }∈ {S_{1}, ..., S_{R}} obtained after alphabet transformation and insert in T_{ right }the set of opposite patterns of these strings . Since we use string terminators {$1, ..., $R} for the expression patterns S_{ i }and {$(R + 1,..., $(2R)} for their opposite patterns it is easy to compute the color arrays in T_{ right }in O(R) time and space. Notet hat we still use a color array with a maximum of R bits and not 2R bits.

When the extension to "jump over" missing values is considered, we construct T_{ right }for the set of strings and their opposite patterns .
Figure 12 shows the generalized suffix tree T_{ right }and the three maximal 1CCCBiclusters (B1, B2 and B3), identified by three valid models, when e = 1, q_{ r }= 3 and q_{ c }= 2. In this example, the extension "jump over" missing values was used to handle missing values.
The asymptotic complexity of this extended version of eCCCBiclustering remains O(max (R^{2}C^{1+e}Σ^{e}, RC^{2+e}Σ^{e})). Note however that, although the asymptotic complexity does not change the constant of proportionality is higher. When e = 0, a modified version of CCCBiclustering [27] can again be used to achieve the linear time complexity O(RC), if repeated CCCBiclusters are not filtered. However, removing repeated CCCBiclusters takes O(R^{2}C).
Handling scaled expression patterns
Since different genes can have different expression rates, we propose eCCCBiclustering with scaled expression patterns. These extensions allow the shifting of gene expression patterns up to K symbols up and down, in order to potentially find maximal eCCCBiclusters that would not be found due to different gene expression rates. The value of K is an integer between 1 and (Σ  1), where Σ is the set of symbols used to discretize the original expression matrix, in lexicographic order.
In the general case, and in order to shift the expression pattern of the genes K symbols up and down we consider a pair of K symbol alphabets: Σ^{↑} and Σ^{↓}. These alphabets make it possible to shift all the symbols in Σ the desired K symbols up and down. Assuming the three alphabets Σ, Σ^{↑} and Σ^{↓} are in lexicographic order and thus their symbols respect the ordering Σ^{↓} [1] < ... < Σ^{↓} [K] < Σ [1] < ... < Σ [Σ] <Σ^{↑}[1] < ... < Σ^{↑} [K], the alphabet = Σ^{↓} ∪ Σ ∪ Σ^{↑} is also in lexicographic order.
For illustration purposes, consider Σ = {D, N, U}, K = (Σ  1) = 2, and the illustrative example in Figure 9. In this case, and in order to shift the expression pattern of the genes K = 2 symbols up and down, we need to consider, for example, the K = 2 symbol alphabets Σ^{↑} = {V, W} and Σ^{↓} = {B, C}. The three symbols in Σ are then shifted K = 2 symbols up and down using the following three pairs of alphabets: = {N, U} and = {B, C}; = {U, V} and = {C, D}; and = {V, W} and = {D, N}. Thus, = {B, C, D, N, U, V, W}, in this specific case.
We define eCCCBicluster with scaled patterns and the notion of maximality as follows:
Definition 14 ( e CCCBicluster with Scaled Patterns) An eCCCBicluster with scaled patterns
A_{ IJ }is an eCCCBicluster where all the strings S_{ i }that define the expression pattern of each of the genes in I are either in the eNeighborhood of the expression pattern S, that defines the eCCCBicluster, or in the eneighborhood of the patterns resulting from shifting its expression pattern S K symbols up, , or K symbols down, , where K is an integer and K ∈ [1, ..., Σ  1]. This means S_{ i }∈ <N (e, S) ∨ S_{ i }∈ N (e, S^{↑}) ∨ S_{ i }∈ N (e, S^{↓}), ∀i ∈ I.
Definition 15 (Maximal e CCCBicluster with Scaled Patterns) An eCCCBicluster with scaled patterns A_{ IJ }is maximal if it is rowmaximal, leftmaximal and rightmaximal. This means that no more rows can be added to the set of rows I and no contiguous columns can be added to the set of columns J while maintaining the coherence property in Definition 14.
In order to discover eCCCBiclusters with scaled patterns we modify eCCCBiclustering as follows:

We construct the generalized suffix tree T_{ right }, used in procedure computeRight MaximalBiclusters, for the set of strings S_{ i }= {S_{1}, ..., S_{R}} and insert in T_{ right }the patterns resulting from shifting the expression pattern S_{ i }K symbols up and down.
Since we use string terminators $1, ..., $R for the expression patterns S_{ i }and $(R + 1), ..., $(R + 2 × K × R) for shifted patterns it is easy to compute the colors arrays in T_{ right }in O(R) time and space.

When the extension to "jump over" missing values is considered, we construct T_{ right }for the set of strings together with their corresponding set of shifted patterns K symbols up and down.
The asymptotic complexity of eCCCBiclustering with scaled patterns is O(KR^{2}C^{1+e}Σ^{e}). When e = 0, a modified version of CCCBiclustering [27] can be used to obtain O(KRC), or O(KR^{2}C) if repetitions are discarded.
Alternative ways to compute approximate expression patterns
In this section we describe alternative ways to compute the errors allowed in the approximate patterns, which can reveal to be more suitable depending on the specific problem under study. The proposed eCCCBiclustering algorithm can be modified in order to cope with the three different kinds of errors described below: restricted errors, alphabet range weighted errors, and pattern length adaptive errors.
Restricted errors
The eCCCBiclustering algorithm allows general errors, that is, substitutions of the symbols A_{ ij }in the eCCCBicluster A_{ IJ }by any symbol in the alphabet but A_{ ij }. Considering approximate expression patterns having this kind of errors is specially relevant to minimize the negative effect of measurement errors, generally occurring during the microarray experiments, in the ability of the algorithm to identify relevant expression patterns. However, if we are specially interested in minimizing the also problematic effects of potential discretization errors, introduced due to poor choice of discretization thresholds or number of symbols, we can consider restricted errors, that is, substitutions of the symbols A_{ ij }by the lexicographically closer symbols (neighbors) in .
In general, when restricted errors are considered, the allowed substitutions for any symbol A_{ ij }are in the set , where is the position of A_{ ij }in and z is a value in that specifies the number of neighbors both to the left and to the right of that are considered valid errors. Note that this set with the allowed symbols to substitute the symbol in A_{ ij }has a maximum of (2z) elements. Furthermore, the exact number of elements depends both on the number of considered neighbors, z, and on the position of A_{ ij }in the alphabet , p. If then the errors are not restricted. For example, when general errors are allowed, Σ = {D, N, U}, and m = [U2 D3 U4 D5], D5 can be substituted by N5 and U5 in = {D 5, N 5, U 5} leading to the 1CCCBicluster B5 = ({G1, G2, G4},{C2–C5}) in Figure 7. However, if only restricted errors with z = 1 are allowed, D5 can only be substituted by {N 5} leading to 1CCCBicluster B = ({G1, G2},{C2–C5}).
Alphabet range weighted errors
When the alphabet Σ used to discretize the data has many symbols, we can either restrict the errors allowed in the approximate patterns to a neighborhood around the symbol, or to consider alphabet range weighted errors. In the last case, we weight the errors according to the percentage of the total alphabet range they correspond to. For example, if Σ has 10 symbols, an error consisting of a substitution between symbols Σ[1] and Σ[3] should get a weight of 2/9 ~ 0.22 and not a weight of 1 (as happens to all errors in the definition of eCCCBicluster). This means that in general an error from symbol Σ[i] to symbol Σ [j], considering that Σ is in lexicographic order and i <j, is weighted as , where . Since Σ  1 is the maximum amplitude error, , when i = 1 and j = Σ. Furthermore, , each time i = j and no error occurred. In these settings, a nodeoccurrence can be extended with errors if the weighted sum of the errors already found is less than e.
Pattern length adaptive errors
The definition of an eCCCBicluster A_{ IJ }states that the expression pattern S_{ i }of each gene in I must be in the eNeighborhood of an expression pattern S that defines the eCCCBicluster. This implies that the maximum number of errors e is fixed, and, as such, it does not take into account the length of the expression pattern of each individual eCCCBicluster B_{ k }. Since allowing e errors in an expression pattern of a few columns is not the same as allowing e errors in longer expression patterns, we propose the use of pattern length adaptive errors (the longer the pattern the more errors can be considered during the model extension process). In this context, we believe that a valuable extension to eCCCBiclustering is to allow not a fixed number of errors e per gene in , but an adaptive number of errors, per gene and per eCCCBicluster B_{ k }, which is computed using a percentage of the number of columns in . As such the goal is to modify the algorithm in order to find and report all maximal δCCCBiclusters (defined below) instead of all maximal eCCCBiclusters using Lemma 8 and Lemma 9 (described below).
Definition 16 ( δ CCCBicluster) A contiguous column coherent bicluster A_{ IJ }with pattern length adaptive errors, δCCCBicluster, is a CCCBicluster where all the strings S_{ i }that define the expression pattern of each of the genes in I are in the δ Neighborhood of an expression pattern S that defines the δCCCBicluster: S_{ i }∈ N (δ, S), ∀i ∈ I. The value of δ is computed using a percentage of the number of columns in J: δ = αJ, where α ∈ [0, ..., 1].
Lemma 8 A δCCCBicluster A_{ IJ }is an eCCCBicluster with e = δ = αJ errors per gene in I.
Lemma 9 The δNeighborhood of a string S, N(δ, S), where δ = αJ, α ∈ [0, ..., 1] and J ∈ [1, ..., C], contains elements, where ϵ = αC.
Scoring eCCCBiclusters using statistical significance and similarity measures
Since applying biclustering to real gene expression matrices can produce hundreds or even thousands of biclusters, an objective evaluation of the quality of the biclusters discovered is crucial. In fact, the inspection of biclustering results can be prohibitive without an efficient scoring approach which enables sorting and filtering the results according to a statistical scoring criterion. The statistical significance of the results can then be combined with measures of biological significance in order to produce a set of interesting and potentially useful biclusters, both from the statistical and biological point of view. For eCCCBiclusters, we propose the use of a scoring criterion, which combines two criteria:

1.
Statistical significance of expression patterns.

2.
Similarity measure between overlapping eCCCBiclusters.
In this work, we extend the concept of statistical significance of perfect expression patterns, proposed for CCCBiclusters [9], in order to compute the statistical significance of approximate expression patterns. Based on this scoring criterion, the pvalue of each eCCCBicluster is computed and those not passing a Bonferroni corrected statistical significance test at a predefined level are discarded. Biclusters are then sorted by increasing order of their pvalue and, when several of them overlap more than a predefined threshold, only the most significant are kept.
Note that although we use the stringent Bonferroni correction for multiple testing at the 1% level to guarantee that only the highly significant eCCCBiclusters are considered for further analysis, other less conservative statistical corrections can be used, thus considering more eCCCBiclusters as highly significant. Other significance levels can also be used. Moreover, even though we also use a stringent threshold in the overlapping filter (only eCCCBiclusters overlapping less than 25% are further analyzed), other overlapping thresholds can be used, thus allowing the analysis of a larger number of eCCCBiclusters, although potentially increasing the number of redundant biclusters.
In what follows we describe how to compute the statistical significance of an eCCCBicluster using its expression pattern and describe the similarity score used to compare eCCCBiclusters.
Statistical significance
We proposed to measure the statistical significance of an eCCCBicluster B of size I × J, where I is the set of genes and J is the set of contiguous timepoints, and expression pattern p_{ B }, against the null hypothesis, H_{0}, that assumes that the expression values of genes evolve independently.
Under the null hypothesis, it is possible to compute, using reasonable simplifying assumptions, the probability of an eCCCBicluster of the considered size and expression pattern occurring by chance in an expression matrix with R genes and C time points. The value of this probability is obtained by computing the tail of the binomial distribution, which gives the probability of an event with probability p occurring k or more times in n independent trials: .
The statistical significance of an eCCCBicluster B is thus the value of pvalue(B), which is computed by obtaining the probability of a random occurrence under H_{0} of the expression patterns in the eNeighborhood of the expression pattern p_{ B }defining the eCCCBicluster, N (e, p_{ B }), k = I  1 times in n = R  1 independent trials, where I is the number of genes in B and R is the total number of genes in the gene expression matrix. This is performed using the simplifying assumption that the probability of occurrence of a specific expression pattern in the eNeighborhood of the pattern p_{ B }, N (e, p_{ B }), is adequately modeled by a first order Markov Chain, with state transition probabilities obtained from the values in the corresponding columns in the matrix. In the general case,
where N(e, p_{ B }) and N (e, p_{ B }) [i] are, respectively, the number of patterns and the i th pattern in the eNeighborhood of the pattern p_{ B }. As an example, consider the computation of P (N (e, p_{ B })) when B is the eCCCBicluster B1 = ({G 1, G 2, G 4}, {C 1  C 4}) in Figure 7 with p_{ B }= [D1 U2 D3 U4]. Since, in this case, eCCCBiclustering was applied using e = 1, we have to compute P(N(1, [D1 U2 D3 U4])), which has, in this case, the following set of elements: {[D1 U2 D3 U4], [N1 U2 D3 U4], [U1 U2 D3 U4], [D1 D2 D3 U4], [D1 N2 D3 U4], [D1 U2 N3 U4], [D1 U2 U3 U4], [D1 U2 D3 D4], [D1 U2 D3 N4]}.
In this context, the value of P(N(1, [D1 U2 D3 U4])) is computed as follows: P(N(1, [D1 U2 D3 U4])) = P([D1 U2 D3 U4]) + P([N1 U2 D3 U4]) + P([U1 U2 D3 U4]) + P([D1 D2 D3 U4]) + P([D1 N2 D3 U4]) + P([D1 U2 N3 U4]) + P([D1 U2 U3 U4]) + P([D1 U2 D3 D4]) + P([D1 U2 D3 N4]).
By using a first order Markov Chain, P([D1 U2 D3 U4]), for example, is computed as follows:
where , , and . The values D 1, D 1U 2, U 2, U 2D 3, D 3, D 3U 4 correspond, respectively, to the number of occurrences of symbol D1, the number of transitions from D1 to U2, the number of occurrences of symbol U2, the number of transitions from D1 to U2, the number of occurrences of symbol D3 and the number of transitions from D3 to U4. The remainder conditional probabilities needed to compute P(N(1, p_{ B })) are computed in a similar way.
When missing values are considered as valid errors, N(e, p_{ B }) is computed using the alphabet Σ '∪ mv', where mv is the symbol used for missing value and each element mv' is obtained by concatenating m and one number in the range {1, ..., C}, that is, mv' = {mv} × {1, ..., C}.
When only restricted errors are allowed, N(e, p_{ B }) is not computed using all the symbols in Σ'. The allowed substitutions for each symbol in p_{ B }are the z neighbors, both to the left and to the right of Σ'[p] that are considered as valid errors, where p is the position of the symbol p_{ B }[k] in Σ'.
In the case of eCCCBiclusters with signchanges we compute the statistical significance of B, using the pvalue(B), by obtaining the probability of a random occurrence under H_{0} of the expression patterns in the eNeighborhoods of the patterns in the p_{ B }and , K = I1 times in n = R 1 independent trials. We compute P(N(e, p_{ B }) ∪ N(e, )) as follows:
where and N(e, p_{ B }), N(e, ), N(e, p_{ B })[i] and N (e, ) [i] are, respectively, the number of patterns and the i th pattern in the eNeighborhood of the pattern p_{ B }and , respectively.
In the case of eCCCBiclusters with scaled patterns we compute the statistical significance of B, using the pvalue(B), computed by obtaining the probability of a random occurrence under H_{0} of the patterns in the eNeighborhood of the pattern p_{ B }, and its scaled patterns, (p_{ B })^{↑} and (p_{ B })^{↓}, k = I  1 times in n = R  1 independent trials, where I is the number of genes in B and R is the total number of genes in the matrix.
We compute P(N(e, p_{ B }) ∪ N(e, (p_{ B })^{↑})) ∪ N(e, (p_{ B })^{↓})) as follows:
where shift ∈ {1, ..., K}, and K is the value used in eCCCBiclustering with scaled patterns to shift the expression patterns K symbols up and down.
Similarity measure
In order to compute the similarity measure between two eCCCBiclusters, B_{1} = (I_{1}, J_{1}) and B_{2} = (I_{2}, J_{2}), we use the Jaccard Index. In this work, this score is used to measure the overlap between two eCCCBiclusters both in terms of genes and conditions and is defined as follows:
where B_{11} = {(i, j): (i, j) ∈ B_{1} ∧ (i, j) ∈ B_{2}}, B_{10} = {(i, j): (i, j) ∈ B_{1} ∧ (i, j) ∉ B_{2}}, and B_{01} = {(i, j): (i, j) ∉ B_{1} ∧ (i, j) ∈ B_{2}}, for the genes i ∈ I_{1} ∪ I_{2} and the conditions j ∈ J_{1} ∪ J_{2}.
Similarly, the gene similarity and condition similarity can be computed, respectively, as follows:
Note that, in practice, and since B_{1} = I_{1} × J_{1} and B_{2} = I_{2} × J_{2}, the similarity score as defined above can be computed easily using the fact that B_{1} ∩ B_{2} = I_{1} ∩ I_{2} × J_{1} ∩ J_{2} and B_{1} ∪ B_{2} = B_{1} + B_{2}  B_{1} ∩ B_{2}.
Results and discussion
In this section we present and discuss the results obtained when applying the proposed eCCCBiclustering algorithm to real time series gene expression data. We also compare the performance of the proposed algorithm to that of CCCBiclustering [9]. We first describe the dataset used to test the ability of the algorithm to find biologically relevant expression patterns in real data and to perform the comparison with CCCBiclustering. This dataset describes the transcriptional responses of Saccharomyces cerevisiae to heat stress. We then show an application of eCCCBiclustering to the discovery of transcriptional regulatory modules. Finally, we present the comparison with CCCBiclustering. All the results presented are based on the analysis of Gene Ontology annotations obtained using the GOToolbox database [28], together with information about transcriptional regulations available in the YEASTRACT database [29].
Dataset
We used a dataset from Gasch et al. [30], concerning the Saccharomyces cerevisiae response to heat shock. This dataset comprises seven different time points along the first hour of exposure to 37°C (0, 0, 0, 5, 15, 30 and 60 minutes) and corresponds to the experiment identified as "heat shock 2" in the original group of datasets described by the authors. Since the first three time points are replicates of the steady state, we computed an average of three replicates of time zero and used a dataset with five time points. From the original 6152 ORFS we removed those with missing values and the ones that no longer existed in SGD (Saccharomyces Genome Database). For the remaining 6142 genes we obtained the correspondence between ORFS and gene names using the YEASTRACT database [29]. Since both eCCCBiclustering and CCCBiclustering work with a discretized matrix we have then discretized this dataset using the discretization technique proposed by Ji and Tan [20, 31]. The discretized matrix A is obtained in two steps. In the first step, A' is transformed into an A" = R × (C  1) matrix of variations (see Equation 1). Once matrix A" is generated, the final discretized matrix A, also with R rows and C 1 columns, is obtained in a second step by binning the values of the transformed matrix considering a threshold t > 0.
The expression matrix A' was standardized to zero mean and unit standard deviation, gene by gene, before the discretization process, and the discretization threshold t was set to the value of the standard deviation (t = 1). We refer to this preprocessed and discretized dataset as DiscretizedHeatShock.
Application of eCCCBiclustering to the identification of transcriptional regulatory modules
To assess the biological relevance of eCCCBiclusters in real data we applied eCCCBiclustering to the DiscretizedHeatShock dataset. We allowed only one error (e = 1) and considered only errors in the 1neighborhood of the symbols in the alphabet Σ = {D, N, U}. Note that this corresponds to applying one of the eCCCBiclustering extensions we propose in this work (eCCCBiclustering with restricted errors). By restricting the errors to the 1neighborhood of the symbols in the alphabet Σ = {D, N, U}, our goal is to avoid the impact of a poor choice of the discretization thresholds in the ability of the algorithm to find all genes with coherent expression patterns. As such, the errors D <  > N and N <  > U are allowed but the error D <  > U is not permitted.
With this settings, 1CCCBiclustering found 170 maximal nontrivial 1CCCBiclusters. For these 170 1CCCBiclusters we computed the pvalue using the method described in the previous section. Only 47 1CCCBiclusters were considered as statistically significant, at the 1% level, after applying the Bonferroni correction for multiple testing. All the 1CCCBiclusters not passing this statistical test were discarded. The remainder 47 were then sorted in ascending order of the statistical pvalue previously computed. See additional file 3: 1_ccc_biclusters for a summary of these 47 eCCCBiclusters.
In order to avoid the analysis of highly overlapping 1CCCBiclusters, we computed the similarities between the sorted 1CCCBiclusters using the Jaccard similarity score and filtered the 1CCCBiclusters with similarity above 25%. The filtering process removed 35 of the 47 1CCCBiclusters originally selected. Figure 13 shows the expression patterns of the 12 1CCCBiclusters that remain.
These 12 highly significant and nonredundant 1CCCBiclusters were then analyzed using the Gene Ontology annotations using the GOToolbox database [28], together with information about transcriptional regulations available in the YEASTRACT database [29].
Figure 14 shows a summary of these top 12 1CCCBiclusters (expression patterns, number of genes and contiguous time points) together with information about functional enrichment relatively to terms in the Gene Ontology. To perform the analysis for functional enrichment, we considered only the "Biological Process" ontology and terms above level 2. We used the pvalues obtained using the hypergeometric distribution to assess the overrepresentation of a specific GO term. In order to consider an eCCCBicluster to be highly significant, we require its genes to show highly significant enrichment in one or more of the "Biological Process" ontology terms by having a Bonferroni corrected pvalue below 0.01. An eCCCBicluster is considered as significant if at least one of the GO terms analyzed is significantly enriched by having a (Bonferroni corrected) pvalue in the interval [0.01, 0.05[. Note that, although we only consider as functionally enriched the terms with Bonferroni corrected pvalues below 0.01 (for high statistical significance), or below 0.05 (for statistical significance), the pvalues presented in the text are without correction, as it is common practice in the literature.
It is worth noting that all the 1CCCBiclusters analyzed have in general a large number of GO terms enriched (after Bonferroni correction), and all of them have at least one term whose pvalue is highly significant (see Figure 14, for details). This means all the 1CCCBiclusters identified are biologically relevant as reported by functional enrichment analysis performed using the Gene Ontology.
Figure 15 and Figure 16 show a detailed analysis of the Gene Ontology annotations together with information about transcriptional regulations available in the YEASTRACT database, for the 1CCCBiclusters with transcriptional upregulation patterns and 1CCCBiclusters with transcriptional downregulation patterns, respectively. When the 1CCCBicluster has more than 10 terms enriched or its genes are coregulated by more than 10 transcription factors (TFs), only the 10 terms with lower pvalue or the 10 transcription factors regulating the higher percentage of the genes in the 1CCCBicluster are listed. The GO terms marked with * only passed the statistical test at the 5% level.
Comparison with CCCBiclustering: perfect versus approximate expression patterns
To assess the biological relevance of eCCCBiclusters in real data, and test our thesis regarding the potential superiority of this approach relatively to finding CCCBiclusters with perfect expression patterns, we compared the results of eCCCBiclustering to those of CCCBiclustering in the DiscretizedHeatShock dataset, as recently published by Madeira et al. [9].
In order to perform this comparison we reproduced the results in [9] using a prototype implementation of CCCBiclustering coded in Java and made available by the authors in http://www.inescid.pt/kdbio/software/cccbiclustering. We have also reproduced the biological analysis of CCCBiclustering results since the data in the two databases (GoToolbox and YEASTRACT) used by the authors for this purpose was updated since the results in [9] were published.
Our intuition, when performing this comparison, is that allowing a small number of errors, per gene, in the perfect expression patterns identifying the CCCBiclusters (0CCCBiclusters) discovered by CCCBiclustering should improve the biological significance of the biclusters by considering genes with approximate expression patterns and thus minimizing the effect of possible discretization errors.
Note that, in the specific case of allowing 1 error in the pattern of a CCCBicluster one of the following three situations can happen: (1) the 1CCCBicluster is equal to the CCCBicluster; (2) one or more genes, excluded from the CCCBicluster due to a single error are added to the 1CCCBicluster; (3) the pattern of the 0CCCBicluster is extended (by adding one contiguous column at its beginning/end) leading to a 1CCCBicluster with at least as many genes as the CCCBicluster and one additional contiguous column.
In this context, we believe the improvement in the biological significance of the results obtained by eCCCBiclustering should be twofold:

1.
The functional enrichment of the eCCCBiclusters should improve not only regarding the pvalues of the GO terms enriched but also in terms of the number of GO terms enriched.

2.
The number of genes regulated by relevant transcription factors in 1CCCBiclusters (TFs) should be higher than the number of genes regulated by the same TFs in the corresponding CCCBiclusters.
The validation of the two points above will, in our opinion, demonstrate that eCCCBiclustering is not only able to recover genes with approximate expression patterns, that are potentially lost when only perfect expression patterns are considered, but also that the recovered genes are, in fact, biologically relevant to the problem under study.
CCCBiclustering discovered 167 maximal nontrivial CCCBiclusters, which were then sorted in ascending order according to a statistical pvalue similar to that we proposed here for eCCCBiclusters. From these only 25 CCCBiclusters were considered as highly significant at the 1% level after applying the Bonferroni correction for multiple testing. In order to avoid the analysis of highly overlapping CCCBiclusters, we have also computed the similarities between the sorted CCCBiclusters using the Jaccard similarity score and filtered the CCCBiclusters with similarity greater than 25%. The filtering process removed 9 of the 25 CCCBiclusters originally selected. See additional file 4: ccc_biclusters for a summary of these 25 CCCBiclusters. See also additional file 5: 1_ccc_biclusters_vs_ccc_biclusters for a detailed comparison between the 47 highly significant 1CCCBiclusters discovered by 1CCCBiclustering restricted to errors in the 1neighborhood of the symbols in the alphabet Σ = {D, N, U} and the 16 highly significant CCCBiclusters found by CCCBiclustering and analyzed by Madeira et al. [9]. It is clear from this table that most of the 47 1CCCBiclusters discovered by the 1CCCBiclustering algorithm are highly overlapping with one or more of the top 16 CCCBiclusters identified by the CCCBiclustering algorithm. Figure 17 shows a summary of the remaining 16 CCCBiclusters analyzed according to the Gene Ontology (GO) annotations obtained using the GoToolBox [28], together with information about transcriptional regulations available in the YEASTRACT database [29], as performed above for 1CCCBiclustering results. See additional file 6: ccc_biclusters_biological_validation for a detailed analysis of the GO terms enriched and transcriptional regulations of these top 16 CCCBiclusters.
Note that, unlike what happened with the top 12 1CCCBiclusters discovered by 1CCCBiclustering (Figure 14), the top 16 CCCBiclusters discovered by CCCBiclustering (Figure 17) have in general a small number of GO terms enriched (after Bonferroni correction), and several of them are not functionally enriched (after Bonferroni correction). This means some of the CCCBiclusters identified by the CCCBiclustering algorithm may not be biologically relevant according with the GO analysis.
Figure 18 shows the relationship between the top 12 1CCCBiclusters discovered by 1CCCBiclustering in Figure 14 (CCCBiclusters with approximate patterns allowing one error per gene relatively to the expression pattern identifying the 1CCCBicluster) and the top 16 CCCBiclusters discovered by CCCBiclustering in Figure 17 (CCCBiclusters with perfect expression patterns). It is clear from this figure that, apart from two 1CCCBiclusters (IDs 68 and 122), all other 1CCCBiclusters correspond to the extension of one or several of the 16 CCCBiclusters by adding genes with approximate expression patterns. The CCCBicluster with ID 124 was extended not only with genes with approximate patterns but also with a contiguous column at the left of its expression pattern.
It is worth noting that all the resulting 1CCCBiclusters have a larger number of GO terms functionally enriched. Moreover, even when the CCCBiclusters are not functionally enriched, the 1CCCBiclusters obtained by considering approximate expression patterns instead of perfect patterns are always functionally enriched.
In order to show that the number of genes regulated by relevant TFs has increased in the 1CCCBiclusters when compared with the same number in the corresponding CCCBiclusters, we used a set of relevant CCCBiclusters chosen by Madeira et al. among the top 16 CCCBiclusters in Figure 17. From these top 16 CCCBiclusters the authors selected 6 CCCBiclusters, which were then analyzed in more detail using the Gene Ontology annotations together with information about transcriptional regulation available in the YEASTRACT database. These selected CCCBiclusters describe either transcriptional upregulation (CCCBiclusters with IDs 39, 27 and 14) or downregulation patterns (CCCBiclusters with IDs 147, 151 and 124). For these 6 CCCBiclusters the authors identified relevant transcription factors (TFs) according to their expression pattern and relevant GO terms. For example, the heatshock factor Hsf1p, together with the transcription factors Msn2p and Msn4p, known regulators of the general stress response in yeast, and the transcription factor Rpn4p, known stimulator of the proteosome genes, involved in the degradation of denatured or unnecessary proteins in stressed yeast cell [9], were identified by the authors as relevant TFs in CCCBiclusters 39, 27 and 14. Note that apart from CCCBicluster 14, whose corresponding 1CCCBiclusters were removed during the application of the overlapping filter (see additional file 5: 1_ccc_biclusters_vs_ccc biclusters), all these selected CCCBiclusters have at least one corresponding 1CCCBicluster in the top 12 (as it is also shown in Figure 18).
In this context, we decided to compare the number of genes regulated by each relevant TF identified in each of the selected CCCBiclusters and the number of genes regulated by the same TFs in the corresponding 1CCCBiclusters. Remember that, if relevant genes (not included in these CCCBiclusters due to a single error) were recovered and included in the corresponding 1CCCBiclusters, the number of genes regulated by relevant TFs should increase in the 1CCCBicluster.
Figure 19 shows, for each of the 5 selected CCCBiclusters considered (CCCBiclusters with IDs 39, 27, 147, 151 and 124), the set of relevant transcription factors together with the number of regulated genes and compares these numbers with those obtained for the same TFs in the corresponding 1CCCBicluster(s). It is clear from this figure that the number of genes regulated by relevant TFs always increases in the corresponding 1CCCBiclusters. These results support our idea that eCCCBiclustering is able to recover genes with relevant expression patterns, that were missed due to small errors, and are in fact, biologically relevant to the problem under study. For example, in CCCBicluster 39, the relevant transcription factors Sok2p, Arr1p, Hsf1p, Rpn4p and Msn2p, regulated, respectively, 70, 37, 36, 32 and 32 of the 258 genes in the CCCBicluster. In the corresponding 1CCCBicluster (ID 79) with 849 genes, these key transcription factors regulate, respectively, 189, 115, 142, 123 and 147 genes.
These results demonstrate that allowing the discovery of CCCBiclusters with approximate patterns (eCCCBiclusters), rather than restricting the analysis to CCCBiclusters with perfect expression patterns, can in fact improve the biological significance of the obtained results. These results also show, the superiority of the proposed eCCCBiclustering, algorithm, when compared with the CCCBiclustering approach, in the identification of biologically relevant temporal patterns of expression.
Conclusion and future work
In this work we proposed eCCCBiclustering, a new biclustering algorithm specifically developed for time series gene expression data analysis, that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the expression matrix. These approximate patterns allow a given number of errors, per gene, relatively to an expression profile representing the expression pattern in the eCCCBicluster. We described the algorithmic details of eCCCBiclustering, analyzed its computational complexity, and proposed extensions to improve the ability of the algorithm to discover other relevant expression patterns by being able to deal with missing values and allowing anticorrelated and scaled expression patterns. We also discussed different ways to compute the errors allowed in the approximate expression patterns. Finally, we described a scoring criterion based on a statistical test, used to sort eCCCBiclusters by increasing value of the probability that they have appeared by a random coincidence of events. Coupled with a similarity measure, used to filter highly overlapping eCCCBiclusters, this scoring criterion effectively identifies not only statistically but also biologically relevant eCCCBiclusters, which can then be useful to identify regulatory modules.
The results show the effectiveness of the approach and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. Moreover, the comparison performed with a state of the art biclustering algorithm specifically developed for time series gene expression data analysis demonstrated the superiority of eCCCBiclustering in discovering statistically and biologically relevant temporal patterns of expression.
As short term future work, we plan to extend the algorithm to detected timelagged regulations between genes and temporal patterns of expression in multiple time series gene expression matrices. The proposed algorithm can be easily extended to discover eCCCBiclusters with timelags, enabling the discovery of important timelagged regulations between genes, such as activation and inhibition, as well as temporal programs of expression, in which genes are activated one by one in a predefined order. Moreover, extending the algorithm to identify local temporal patterns of expression using multiple datasets should enable the discovery of conserved expression patterns and potentially help in the identification of common regulatory modules within and acrossspecies. Our medium and long term research will be related with the use of the information about coherent expression patterns and coregulation in the identification of regulatory modules, potentially helpful in the challenging area of inferring regulatory networks. This will require the development of efficient inference methods able to integrate heterogeneous data such as gene expression data, sequence data, and textual information scattered in scientific literature.
References
BarJoseph Z: Analyzing time series gene expression data. Bioinformatics. 2004, 20 (16): 24932503.
Androulakis IP, Yang E, Almon RR: Analysis of TimeSeries Gene Expression Data: Methods, Challenges and Opportunities. Annual Review of Biomedical Engineering. 2007, 9: 205228.
McLachlan GJ, Do K, Ambroise C: Analysing microarray gene expression data. 2004, Wiley Series in Probability and Statistics
Cheng Y, Church GM: Biclustering of Expression Data. In Proc of the 8th International Conference on Intelligent Systems for Molecular Biology. 2000, 93103.
Mechelen IV, Bock HH, Boeck PD: Twomode clustering methods: a structured overview. Stat Methods Med Res. 2004, 13 (5): 363394.
Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform. 2004, 1 (1): 2445.
Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002, 18 (Suppl 1): S136S144.
BenDor A, Chor B, Karp R, Yakhini Z: Discovering Local Structure in Gene Expression Data: The OrderPreserving Submatrix Problem. J Comput Biol. 2002, 10 (34): 373384.
Madeira SC, Teixeira MC, SáCorreia I, Oliveira AL: Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 21 Mar. 2008, 10.1109/TCBB.2008.34. IEEE Computer Society Digital Library. IEEE Computer Society
Peeters R: The maximum edge biclique problem is NPcomplete. Discrete Applied Mathematics. 2003, 131 (3): 651654.
Yang E, Foteinou PT, King K, Yarmush ML, Androulakis I: A novel nonoverlapping biclustering algorithm for network generation using living cell array data. Bioinformatics. 2007, 23 (17): 23062313.
Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. Pac Symp Biocomput. 2003, 7788.
Koyuturk M, Szpankowski W, Grama A: Biclustering GeneFeature Matrices for Statistically Significant Dense Patterns. In Proc of the 8th International Conference on Research in Computational Molecular Biology. 2004, 480484.
Liu J, Wang W, Yang J: Biclustering in gene expression data by tendency. In Proc of the 3rd International IEEE Computer Society Computational Systems Bioinformatics Conference. 2004, 182193.
Liu J, Wang W, Yang J: A framework for ontologydriven subspace clustering. In Proc of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 623628.
Liu J, Wang W, Yang J: Gene ontology friendly biclustering of expression profiles. In Proc of the 3rd IEEE Computational Systems Bioinformatics Conference. 2004, 436447.
Liu J, Wang W, Yang J: Mining Sequential Patterns from Large Data Sets. Kluwer. 2005, 18:
Lonardi S, Szpankowski W, Yang Q: Finding Biclusters by Random Projections. In Proc of the 15th Annual Symposium on Combinatorial Pattern Matching. 2004, 102116.
Sheng Q, Moreau Y, Moor BD: Biclustering microarray data by Gibbs sampling. Bioinformatics. 2003, 19 Suppl 2: ii196ii205.
Ji L, Tan K: Identifying timelagged gene clusters using gene expression data. Bioinformatics. 2005, 21 (4): 509516.
Wu C, Fu Y, Murali TM, Kasif S: Gene expression module discovery using Gibbs sampling. Genome Informatics. 2004, 15: 239248.
Madeira SC, Oliveira AL: A Linear Time Biclustering Algorithm for Time Series Gene Expression Data. Proc of the 5th Workshop on Algorithms in Bioinformatics. 2005, 3952. Springer Verlag, LNCS/LNBI 3692
Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22 (10): 12821283.
Zhang Y, Zha H, Chu CH: A TimeSeries Biclustering Algorithm for Revealing CoRegulated Genes. In Proc of the 5th IEEE International Conference on Information Technology: Coding and Computing. 2005, 3237.
Gusfield D: Algorithms on strings, trees, and sequences. 1997, Computer Science and Computational Biology Series, Cambridge University Press
Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. Proc of Latin'98. 1998, 111127. Springer Verlag, LNCS 1380
Madeira SC: Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis. PhD thesis. 2008, Instituto Superior Técnico, Technical University of Lisbon
Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional investigation of gene datasets based on Gene Ontology. Genome Biology. 2004, 5 (12): R101http://burgundy.cmmt.ubc.ca/GOToolBox/
Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, SaCorreia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Research. 2006, 34: D446D451. http://www.yeastract.com/
Gasch AP, Spellman PT, Kao CM, CarmelHarel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Molecular Biology of the Cell. 2000, 11: 42414257.
Ji L, Tan K: Mining gene expression data for positive and negative coregulated gene clusters. Bioinformatics. 2004, 20 (16): 27112718.
Madeira SC, Oliveira AL: An Efficient Biclustering Algorithm for finding Genes with Similar Patterns in TimeSeries Expression Data. Proc of the 5th Asia Pacific Bioinformatics Conference, Series in Advances in Bioinformatics and Computational Biology. 2007, 5: 6780. Imperial College Press
Acknowledgements
Parts of this work have appeared previously in [32]. However, this manuscript describes algorithmic and complexity details not included in the conference version of the paper. We also present a detailed comparison with other algorithms with similar goals highlighting their strengths and weaknesses. The extensions to allow missing values, anticorrelation, scaling, alphabet range weighted errors and pattern length adaptive errors are original. The proposed approach to score eCCCBiclusters using a statistical significance criterion and a similarity measure, only superficially mentioned in the conference paper, was improved and used in the experimental results. All the experimental results are new. The algorithm was applied to a new dataset to identify transcriptional regulatory modules. Moreover, the superiority of CCCBiclusters with approximate expression patterns (eCCCbiclusters) relatively to CCCBiclusters (perfect expression patterns) was demonstrated using two biological criteria: stronger evidence of functional enrichment (regarding the pvalues of the GO terms enriched and the number of GO terms enriched) and increased number of genes regulated by relevant transcription factors.
This work was partially supported by projects ARN – Algorithms for the Identification of Genetic Regulatory Networks, PTDC/EIA/67722/2006, and Dyablo – Models for the Dynamic Behavior of Biological Networks, PTDC/EIA/71587/2006, funded by FCT, Fundafição para a Ciência e Tecnologia.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SCM and ALO designed the eCCCBiclustering algorithm together with the proposed extensions and defined the scoring criterion for eCCCBiclusters based on the statistical significance of their expression patterns and similarity with other overlapping biclusters. SCM coded the prototype implementation of the algorithm in Java and wrote the first draft of the manuscript. SCM and ALO worked together towards the final version of the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
13015_2008_65_MOESM1_ESM.pdf
Additional file 1: e CCCBiclustering: Related work on biclustering algorithms for time series gene expression data. Supplementary material describing related work on biclustering algorithms for time series gene expression data analysis. We describe in detail three state of the art biclustering approaches specifically designed to identify biclusters in gene expression time series and identify their strengths and weaknesses. We also explain and justify why we decided to compare the performance of eCCCBiclustering with that of CCCBiclustering, but not with that of the qclustering and CCTSB algorithms. (PDF 141 KB)
13015_2008_65_MOESM2_ESM.pdf
Additional file 2: e CCCBiclustering: Algorithmic and complexity details. Supplementary material describing algorithmic and complexity details of eCCCBiclustering. (PDF 207 KB)
13015_2008_65_MOESM3_ESM.pdf
Additional file 3: Highly significant 1CCCBiclusters. Table showing a summary of the 47 1CCCBiclusters passing the Bonferroni correction for multiple testing at the 1% level when 1CCCBiclustering restricted to errors in the 1neighborhood of the symbols in the alphabet Σ = {D, N, U} was applied to the DiscretizedHeatShock dataset. (PDF 32 KB)
13015_2008_65_MOESM4_ESM.pdf
Additional file 4: Highly significant CCCBiclusters. Table showing a summary of the 25 CCCBiclusters passing the Bonferroni correction for multiple testing at the 1% level when CCCBiclustering was applied to the DiscretizedHeatShock dataset. (PDF 33 KB)
13015_2008_65_MOESM5_ESM.pdf
Additional file 5: Highly significant 1CCCBiclusters versus highly significant CCCBiclusters. Table showing a comparison between the 47 highly significant 1CCCBiclusters discovered by 1CCCBiclustering restricted to errors in the 1neighborhood of the symbols in the alphabet Σ = {D, N, U} and the 16 highly significant CCCBiclusters found by CCCBiclustering (after the applying the overlapping filter) and analyzed by Madeira et al. [9]. Both sets of biclusters were identified when the algorithm was applied to the DiscretizedHeatShock dataset. (PDF 42 KB)
13015_2008_65_MOESM6_ESM.pdf
Additional file 6: GO terms enriched and transcriptional regulations of the top 16 CCCBiclusters. Table showing a detailed analysis of the GO terms enriched and transcriptional regulations of the top 16 CCCBiclusters discovered with CCCBiclustering. When the set of genes in the CCCBicluster have more than 10 transcription factors or more than 10 GO terms enriched, only the top 10 of each are shown. We only show the GO terms passing the Bonferroni correction for multiple testing at either the 1% level (highly significant) or the 5% level (significant). The pvalues marked with * only passed the test at the 5% level. The pvalues presented in the table are without correction as it is common practice in the literature. (PDF 45 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Madeira, S.C., Oliveira, A.L. A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms Mol Biol 4, 8 (2009). https://doi.org/10.1186/1748718848
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1748718848
Keywords
 Gene Ontology
 Valid Model
 Discretized Matrix
 Approximate Pattern
 Relevant Transcription Factor