We present our approach in four stages. First, we define some useful notation. Second, we introduce the PQ tree, a data structure that is fundamental to our approach. Third, we present our layout algorithm. Finally, we discuss its implementation and the web interface for querying the computed layout.

### 3.1 Definitions

We denote the input matrix by *D* and use *R* and *C* to denote the set of rows and columns of *D*, respectively. A *layout* \mathcal{L}(\mathcal{R}, \mathcal{C}) of the matrix *D* is a two-dimensional matrix specified as follows:

1. \mathcal{R} is the ordered list of rows of \mathcal{L} with the property that each element of \mathcal{R} is an element of *R*; a row in *R* can appear multiple times in \mathcal{R}.

2. \mathcal{C} is the ordered list of columns of \mathcal{L} with the property that each element of \mathcal{C} is an element of *C*; a column in *C* can appear multiple times in \mathcal{C}.

3. \mathcal{L}_{ij}, the element in the *i* th row and the *j* th column of \mathcal{L}, is equal to *D*_{i'j'}, where *i'* is the row of *D* corresponding to the *i* th element of \mathcal{R} and *j'* is the column of *D* corresponding to the *j* th element of \mathcal{C}.

The *size* of \mathcal{L} is |\mathcal{R}||\mathcal{C}|. It is appropriate to consider \mathcal{L} to be a layout of *D* since \mathcal{L} specifies an order for the rows and columns of *D*. We do not require that every row or column of *D* appear in \mathcal{L}. In the example in Figure 1(b), the layout does not contain the column titled "< 35F" that is in the original matrix. The layout does not contain any repeated rows or columns either.
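The definition above can be made concrete with a minimal Python sketch. The function names `build_layout` and `layout_size` and the toy matrix are illustrative assumptions, not part of our software; rows and columns are identified by their indices in *D*.

```python
def build_layout(D, row_order, col_order):
    """Materialize the layout matrix for ordered lists of row/column indices.

    row_order and col_order index into D; an index may repeat or be
    omitted, as the definition of a layout permits.
    """
    return [[D[i][j] for j in col_order] for i in row_order]


def layout_size(row_order, col_order):
    """Size of the layout: number of rows times number of columns,
    counting repeated entries."""
    return len(row_order) * len(col_order)


# Toy data (hypothetical): a 2 x 3 input matrix.
D = [[1, 2, 3],
     [4, 5, 6]]
rows = [1, 0, 1]   # row 1 appears twice
cols = [2, 0]      # column 1 is omitted
layout = build_layout(D, rows, cols)
```

Here the layout has three rows and two columns, so its size is 6 even though *D* itself has only two rows.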

Given subsets *R'* ⊆ *R* and *C'* ⊆ *C*, we define a *bicluster B*(*R'*, *C'*) to be the sub-matrix of *D* spanned by the rows in *R'* and the columns in *C'*. This simple definition is sufficient for this paper. An algorithm that computes biclusters in gene expression data will use a more complex definition relevant to the patterns to be detected. A bicluster *B*(*R'*, *C'*) is *contiguous* in a layout \mathcal{L}(\mathcal{R}, \mathcal{C}) if and only if the elements of *R'* (respectively, *C'*) appear consecutively at least once in \mathcal{R} (respectively, \mathcal{C}). We say that the layout \mathcal{L}(\mathcal{R}, \mathcal{C}) is *valid* with respect to a set of biclusters *S* if every bicluster *B* ∈ *S* is contiguous in \mathcal{L}(\mathcal{R}, \mathcal{C}). For example, the layout in Figure 1(b) is valid with respect to the bicluster ({7/04/2004, 7/03/2004, 7/02/2004}, {> 60F, Daylight > 10 h, Cloudy, Rainy}) since the bicluster spans rows four to six and columns two to five in the layout. We now formally define the *bicluster layout* problem: Given a matrix *D* and a set *S* of biclusters in *D*, find a layout \mathcal{L} of *D* such that \mathcal{L} is valid with respect to *S* and \mathcal{L} has the smallest size among all valid layouts of *D*.
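As an illustration of contiguity and validity, the following Python sketch checks both conditions directly from the definitions. The helper names and the row and column labels are hypothetical. A subset "appears consecutively" if some window of the ordered list consists of exactly its elements.

```python
def appears_consecutively(subset, ordered):
    """True if some window of len(subset) items in `ordered` is exactly `subset`."""
    target = set(subset)
    k = len(target)
    if k == 0:
        return True
    return any(set(ordered[s:s + k]) == target
               for s in range(len(ordered) - k + 1))


def is_contiguous(bicluster, row_order, col_order):
    """A bicluster (rows, cols) is contiguous if both its rows and its
    columns appear consecutively at least once in the layout."""
    rows, cols = bicluster
    return (appears_consecutively(rows, row_order)
            and appears_consecutively(cols, col_order))


def is_valid(biclusters, row_order, col_order):
    """A layout is valid if every bicluster is contiguous in it."""
    return all(is_contiguous(b, row_order, col_order) for b in biclusters)


# Toy layout (hypothetical labels); column "c1" is repeated.
row_order = ["r1", "r2", "r3"]
col_order = ["c1", "c2", "c3", "c1"]
b1 = ({"r2", "r3"}, {"c3", "c1"})   # contiguous thanks to the repeated "c1"
b2 = ({"r1", "r3"}, {"c1"})         # rows r1 and r3 are never adjacent
```

In this toy layout `b1` is contiguous only because column `c1` is repeated, which is exactly the situation in which a layout needs repeated rows or columns.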

### 3.2 The PQ tree

Booth and Lueker [22] developed a data structure called the PQ tree, which they used to compute a column ordering that proves that a binary matrix *M* has the COP. To define the PQ tree, it is convenient to reformulate the COP problem as follows: Let *U* be the set of columns of *M*. Let *r* be the number of rows in *M*. For each *i*, 1 ≤ *i* ≤ *r*, define the set *S*_{i} to be the set of columns in *U* that have a one in row *i*. We seek a permutation of the elements of *U* that satisfies *r* restrictions, where *restriction i*, 1 ≤ *i* ≤ *r*, requires that the elements of *S*_{i} be consecutive in the permutation.

A PQ tree can represent all legal permutations of *U* that satisfy the restrictions {*S*_{i}, 1 ≤ *i* ≤ *r*}. Each leaf of the PQ tree corresponds to a column in *U*. The PQ tree contains two types of internal nodes: P-nodes and Q-nodes. The children of a P-node can be permuted in any way while still satisfying the restrictions. A valid permutation of the children of a Q-node is either the order in which they appear in the PQ tree or the reversal of this order. A PQ tree supports the REDUCE operation. This operation inserts a restriction *S* into a PQ tree *T*, modifying *T* such that *T* satisfies *S* in addition to all the previous restrictions inserted into *T*. The REDUCE operation fails if there are no legal permutations of *U* that can satisfy *S* and the previously inserted restrictions. The operation takes time linear in |*S*|. Figure 3 displays a PQ tree on four elements {*a*, *b*, *c*, *d*} after two REDUCE operations: REDUCE(*T*, {*a*, *c*}) and REDUCE(*T*, {*b*, *c*}). Inserting the restriction {*c*, *d*} into the tree next will result in a failed REDUCE operation.
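To make the set of legal permutations tangible, the sketch below enumerates them by brute force for the four-element example. This is not a PQ tree and does not implement REDUCE: it takes exponential time and only spells out which permutations a PQ tree would represent compactly. The function names are illustrative assumptions.

```python
from itertools import permutations


def is_consecutive(perm, restriction):
    """True if the elements of `restriction` occupy adjacent positions in `perm`."""
    if not restriction:
        return True
    positions = [i for i, x in enumerate(perm) if x in restriction]
    return (len(positions) == len(restriction)
            and positions[-1] - positions[0] + 1 == len(positions))


def legal_permutations(universe, restrictions):
    """All permutations of `universe` satisfying every restriction
    (brute force, suitable for tiny examples only)."""
    return [p for p in permutations(sorted(universe))
            if all(is_consecutive(p, r) for r in restrictions)]


U = {"a", "b", "c", "d"}
after_two = legal_permutations(U, [{"a", "c"}, {"b", "c"}])
after_three = legal_permutations(U, [{"a", "c"}, {"b", "c"}, {"c", "d"}])
```

After the first two restrictions, *c* must sit between *a* and *b*, so only four permutations remain. Adding {*c*, *d*} leaves none, because *c* already has both neighbours fixed; this corresponds to the failed REDUCE operation described above.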

To solve the COP problem, start with an empty PQ tree *T*. For each *i*, 1 ≤ *i* ≤ *r*, invoke the operation REDUCE(*T*, *S*_{i}). To obtain an ordering that satisfies the restrictions, perform a depth-first traversal of *T* starting at the root. At each internal node of *T*, visit the children of the node in any order that is valid for the type of the node. At a leaf node of *T*, append the column corresponding to the leaf to the required ordering.

### 3.3 The bicluster layout algorithm

We are now ready to describe our algorithm for the bicluster layout problem. To minimize the size of \mathcal{L}, we can minimize the length of \mathcal{R} and the length of \mathcal{C} independently. Therefore, we construct the layout \mathcal{L} by determining \mathcal{R} and \mathcal{C} independently. In the rest of this section, we describe the algorithm to construct \mathcal{C}, the ordered list of the columns in the layout \mathcal{L}. We can compute \mathcal{R}, the ordered list of rows in the layout, analogously.

We describe the algorithm in two stages. We first transform the problem of constructing \mathcal{C} to a generalization of the COP problem. We then present an algorithm to solve this transformed problem. This transformation allows us to describe our algorithm in terms of operations on PQ trees. The PQ tree cannot solve this generalization directly since the matrix we construct may not have the COP.

We start by constructing a new binary matrix *M* that represents the columns of the biclusters in *S*. Each column of *M* corresponds to a column of the input matrix *D*. *M* contains one row for each bicluster in *S*; thus, *M* has *n* rows, where *n* = |*S*| is the number of biclusters. The entry *M*_{ij} is 1 if the *i* th bicluster in *S* contains the column *j* in *D*; otherwise, *M*_{ij} is 0. We can now reformulate the problem of constructing \mathcal{C} as follows: find the shortest linear ordering \mathcal{C} of the columns of *M* such that \mathcal{C} can contain repeated columns of *M* and, for every row of *M*, the columns containing the ones in that row appear consecutively at least once in \mathcal{C}.
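The construction of *M* can be sketched in a few lines of Python. The function name `build_column_matrix` and the column labels are assumptions made for illustration; each bicluster is represented only by its set of columns.

```python
def build_column_matrix(bicluster_columns, all_columns):
    """Return the binary matrix M with one row per bicluster.

    M[i][j] is 1 if bicluster i contains column all_columns[j], else 0.
    """
    return [[1 if c in cols else 0 for c in all_columns]
            for cols in bicluster_columns]


# Toy input (hypothetical): two biclusters over four columns of D.
all_columns = ["c1", "c2", "c3", "c4"]
bicluster_columns = [{"c1", "c2"}, {"c2", "c3", "c4"}]
M = build_column_matrix(bicluster_columns, all_columns)
num_ones = sum(map(sum, M))   # total number of ones in M
```

For this toy input, *M* has two rows and five ones in total.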

Before describing the algorithm, we define some more notation. The leaves of each PQ tree constructed by the algorithm correspond to a subset of the columns of *M*. We use *C*_{T} to denote the set of columns in a PQ tree *T*. Given two PQ trees *T* and *T'*, let *σ*(*T*, *T'*) denote the set similarity \frac{|C_{T} \cap C_{T'}|}{|C_{T} \cup C_{T'}|} between the columns in *T* and *T'*. Our algorithm executes the following steps:

1. For each row *i* of *M*, 1 ≤ *i* ≤ *n*, construct a PQ tree *T*_{i} and insert the restriction corresponding to row *i* of *M* into *T*_{i}. Let \mathcal{T} be the set of these *n* PQ trees.

2. For every pair 1 ≤ *i* < *j* ≤ *n*, compute the set similarity *σ*(*T*_{i}, *T*_{j}).

3. Compute Σ, the list of values in {*σ*(*T*_{i}, *T*_{j}), 1 ≤ *i* < *j* ≤ *n*} sorted in descending order.

4. Repeat the following steps until Σ is empty:

(a) Remove the largest element from Σ. Let *T* and *T'* be the PQ trees in \mathcal{T} with this similarity value.

(b) Set *T"* = *T*.

(c) For each restriction *r* inserted into *T'*, invoke the operation REDUCE(*T"*, *r*). If any REDUCE operation fails, go to Step 4a.

(d) Delete *T* and *T'* from \mathcal{T}.

(e) For each tree *U* ∈ \mathcal{T}, insert *σ*(*U*, *T"*) into Σ.

(f) Insert *T"* into \mathcal{T}.

5. For each PQ tree *T* in \mathcal{T}, traverse *T* to compute a valid permutation of the columns in *C*_{T}.

6. Output the column layout formed by concatenating (in any order) the permutations computed in Step 5.
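The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not our C++ implementation: each "tree" is stored as a pair (column set, list of restrictions), and a brute-force search over permutations (`find_order`, a hypothetical helper) stands in for both REDUCE and the tree traversal, so the sketch is exponential and suitable only for tiny inputs. A heap replaces the sorted list Σ, and entries that refer to already-merged trees are skipped, which is a choice the steps do not spell out. Biclusters are assumed to have non-empty column sets.

```python
import heapq
from itertools import count, permutations


def consecutive(perm, restriction):
    """True if `restriction` occupies adjacent positions in `perm`."""
    pos = [i for i, x in enumerate(perm) if x in restriction]
    return pos[-1] - pos[0] + 1 == len(pos)


def find_order(columns, restrictions):
    """Brute-force stand-in for REDUCE plus traversal: one ordering of
    `columns` satisfying all restrictions, or None if none exists."""
    for perm in permutations(sorted(columns)):
        if all(consecutive(perm, r) for r in restrictions):
            return list(perm)
    return None


def similarity(a, b):
    """Set similarity |a & b| / |a | b| between two column sets."""
    return len(a & b) / len(a | b)


def column_layout(bicluster_columns):
    # Step 1: one "tree" per bicluster.
    trees = {i: (frozenset(c), [frozenset(c)])
             for i, c in enumerate(bicluster_columns)}
    next_id = count(len(trees))
    # Steps 2-3: similarities for all pairs i < j, largest first.
    heap = [(-similarity(trees[i][0], trees[j][0]), i, j)
            for i in trees for j in trees if i < j]
    heapq.heapify(heap)
    # Step 4: repeatedly try to merge the most similar pair.
    while heap:
        _, i, j = heapq.heappop(heap)
        if i not in trees or j not in trees:
            continue  # assumption: ignore entries for deleted trees
        cols = trees[i][0] | trees[j][0]
        restrictions = trees[i][1] + trees[j][1]
        if find_order(cols, restrictions) is None:
            continue  # the merge fails, as a failed REDUCE would
        del trees[i], trees[j]
        k = next(next_id)
        for u, (ucols, _) in trees.items():
            heapq.heappush(heap, (-similarity(ucols, cols), u, k))
        trees[k] = (cols, restrictions)
    # Steps 5-6: order each tree's columns and concatenate.
    layout = []
    for cols, restrictions in trees.values():
        layout.extend(find_order(cols, restrictions))
    return layout


result = column_layout([{"a", "c"}, {"b", "c"}, {"c", "d"}])
```

For the three biclusters in this example, any two can be merged but the third then conflicts, so two trees remain and column *c* appears twice in the five-column result.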

The algorithm starts by storing each row of *M* in a separate PQ tree in the set \mathcal{T} (Step 1). Next, the algorithm performs a series of REDUCE operations to hierarchically cluster the rows of *M*. Inductively, the restrictions inserted into each PQ tree in \mathcal{T} correspond to a set of rows of *M* with the property that the submatrix of *M* spanned by these rows has the COP. To decide which two sets of rows to merge next, in Step 4a, the algorithm picks the two PQ trees *T* and *T'* in \mathcal{T} that are the most similar and attempts to merge them. To effect the merger, the algorithm inserts the restrictions of *T'* into a copy *T"* of *T* (Steps 4b and 4c). If this step succeeds, the algorithm deletes *T* and *T'* from \mathcal{T}, inserts the similarities between the new PQ tree *T"* and each of the remaining PQ trees in \mathcal{T} into Σ, and inserts *T"* into \mathcal{T} (Steps 4d–4f). In Step 4c, the failure of a REDUCE operation means that the restrictions in *T* are not compatible with the restrictions imposed by *T'*. Hence, the submatrix of *M* induced by the union of rows in *T* and in *T'* does not have the COP. An example of such a situation is when *T* corresponds to the tree in Figure 3(c) and *T'* contains the restriction {*c*, *d*}. In this case, the algorithm aborts the merger of *T* and *T'* and moves on to the next most similar pair of PQ trees. Due to such conflicts, \mathcal{T} may contain more than one PQ tree when the algorithm completes. Finally, generating the required layout is a simple matter of traversing each PQ tree in \mathcal{T} (Step 5) as described in Section 3.2 and concatenating the resulting permutations into a single order (Step 6). A column of *M* appears as many times in this order as there are PQ trees in \mathcal{T} that include this column.

We now analyze the running time of the algorithm. Let *m* be the number of ones in the matrix *M*. As stated earlier, the number of biclusters in the input is *n*. In Step 1, computing the PQ trees takes *O*(*m*) time. Computing the similarity between a pair of PQ trees takes *O*(*c*) time, where *c* is the number of columns of *M*. Thus, in Steps 2 and 3, computing and sorting the *O*(*n*^{2}) similarity values takes *O*(*cn*^{2} + *n*^{2} log *n*) time. We execute Step 4 *O*(*n*^{2}) times. The running time of each iteration is proportional to the size of the new PQ tree constructed. A naive upper bound on this size is *m*, the total number of columns in all the biclusters. Hence, the total running time of Step 4 is *O*(*mn*^{2}). Finally, traversing all the PQ trees in \mathcal{T} and concatenating the permutations takes *O*(*m*) time. Keeping in mind that *c* ≤ *m*, the total running time of the algorithm is *O*(*mn*^{2} + *n*^{2} log *n*). The space used by the algorithm is *O*(*m* + *n*^{2}), with *O*(*m*) space taken to store all the biclusters and the PQ trees and *O*(*n*^{2}) required for Σ, the sorted list of similarities.

### 3.4 Implementation and web interface

We implemented the layout algorithm in C++ and tested it on a 2.8 GHz Pentium computer running the Fedora Core 3 operating system. Our software contains two executable programs. The first executable, layout, implements the layout algorithm. It takes a text file describing the biclusters as input and outputs the layout in a simple textual format that specifies the order of the rows and columns in the layout and the corners of each bicluster in the layout. The second executable, drawlayout, uses the computed layout and the original data set as input and produces an image corresponding to the layout.

If the input data contains a large number of biclusters, the layout may contain too many rows and/or columns for the user to navigate with ease. To alleviate this problem, we have also developed a simple web-based interface that allows the user to upload a file containing computed biclusters and a file containing the original data, and query the layout with the names of rows and columns. The interface invokes layout and drawlayout on the biclusters that contain the query rows/columns and highlights the matching biclusters, rows, and columns in the resulting layout. The interface allows the user to specify whether the data is real-valued or binary, whether the layout should contain only the matching biclusters, and whether the query should be a conjunction or disjunction of the search terms.