BicPAM: Pattern-based biclustering for biomedical data analysis
- Rui Henriques^{1} and
- Sara C Madeira^{1}
https://doi.org/10.1186/s13015-014-0027-z
© Henriques and Madeira; licensee BioMed Central. 2014
Received: 18 January 2014
Accepted: 12 November 2014
Published: 16 December 2014
Abstract
Background
Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task to study a wide set of biomedical problems, where molecular units or patients are meaningfully related with a set of properties. The challenging combinatorial nature of this task led to the development of approaches with restrictions on the allowed type, number and quality of biclusters. In contrast, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters and their underlying contributions remain dispersed.
Methods
The proposed BicPAM biclustering approach integrates existing principles made available by state-of-the-art pattern-based approaches with two new contributions. First, BicPAM is the first efficient attempt to exhaustively mine non-constant types of biclusters, including additive and multiplicative coherencies in the presence or absence of symmetries. Second, BicPAM provides strategies to effectively compose different biclustering structures and to handle arbitrary levels of noise inherent to data and to discretization procedures.
Results
Results show BicPAM’s superiority against its peers and its ability to retrieve unique types of biclusters of interest, to efficiently deliver exhaustive solutions and to successfully recover planted biclusters in datasets with varying levels of missing values and noise. Its application over gene expression data leads to unique solutions with heightened biological relevance.
Conclusions
BicPAM integrates existing dispersed efforts towards pattern-based biclustering and provides the first critical strategies to efficiently discover exhaustive solutions of biclusters with shifting, scaling and symmetric assumptions with varying quality and underlying structures. Additionally, BicPAM dynamically adapts its behavior to mine data with different levels of missing values and noise.
Introduction
Biclustering, a local approach for clustering, seeks to find sub-matrices (biclusters), that is, subsets of rows with a highly correlated expression pattern across a subset of columns. Biclustering has been extensively applied in gene expression data analysis [1], since small groups of genes can participate in multiple cellular processes or pathways of interest that may be only active in a subset of the conditions under analysis. Biclustering has also been applied to group mutations and copy number variations [2], to analyze biological networks [3], and to study translational [4], chemical [5] or nutritional data [6].
Biclustering involves hard combinatorial optimization. In particular, its complexity increases when rows and columns are allowed to participate in more than one bicluster (non-exclusive structure) and in no bicluster at all (non-exhaustive structure). Hence most existing algorithms are either based on greedy or stochastic approaches [1],[2],[7],[8], potentially producing sub-optimal solutions, or on finding a constrained number, structure or type of biclusters [1],[2],[9].
State-of-the-art attempts to tackle biclustering using pattern mining techniques allow for exhaustive and flexible searches and show solid levels of efficiency [10],[11]. Because pattern mining research is driven by scalability requirements [12], it opens a critical direction for biclustering. Interestingly, the existing pattern-based approaches for biclustering, such as BiModule [13], DeBi [10], RAP [14] and GenMiner [15], provide complementary principles of interest for this field. However, these principles are not yet integrated. Additionally, existing approaches only discover biclusters with constant profiles [10],[13],[14], and are not able to handle missing values or medium to high levels of noise. This work targets these limitations by proposing a pattern-based biclustering approach, BicPAM, that combines the strengths of state-of-the-art pattern-based approaches with two critical novel contributions:
flexible exhaustive solutions: arbitrary number of (potentially overlapping) biclusters with additive, multiplicative and symmetric assumptions using multiple ranges of values;
biclustering behavior dynamically adapted to deal with varying levels of noise and missing values.
To our knowledge, this is the first biclustering approach that is able to support and combine each of these two contributions. The importance of these contributions is shown experimentally over synthetic and biological data. Additionally, experimental results on both synthetic and real datasets demonstrate the efficiency and effectiveness of the pattern-based biclustering algorithms proposed in BicPAM.
The paper is organized as follows. Background covers essential concepts from biclustering and pattern mining, and surveys the contributions from existing pattern-based biclustering approaches. BicPAM: pattern-based biclustering describes the proposed algorithms. In Results, we assess BicPAM’s performance on synthetic and real data. Finally, the contributions and implications of this work are synthesized.
Background
This section introduces fundamental concepts of biclustering and pattern mining, and surveys the related work on pattern-based biclustering.
Definition 1.
Given a matrix, $A=(X,Y)$, with a set of rows $X=\{x_1,..,x_n\}$, columns $Y=\{y_1,..,y_m\}$, and elements $a_{ij}\in \mathbb{R}$ relating row $i$ and column $j$:
A bicluster $B=(I,J)$ is an $r\times s$ submatrix of $A$, where $I=(i_1,..,i_r)\subset X$ is a subset of rows and $J=(j_1,..,j_s)\subset Y$ is a subset of columns;
The biclustering task is to identify a set of biclusters $\mathcal{B}=\{B_1,..,B_p\}$ such that each bicluster $B_k=(I_k,J_k)$ satisfies specific criteria of homogeneity, where $I_k\subset X$, $J_k\subset Y$, and $k\in \mathbb{N}$.
Approaches to solve the biclustering task either explicitly or implicitly rely on a merit function to define the homogeneity criteria. An illustrative function is the variance of the bicluster’s values. Merit functions guarantee either intra-bicluster homogeneity, the overall homogeneity of the output set of biclusters (inter-bicluster homogeneity), or both. When combined within specific search procedures, merit functions define the type, quality and structure of biclustering solutions [1].
Merit functions can be defined to locally maximize greedy iterative searches [7],[8],[16]–[19], to combine row- and column-based clusters [20]–[22], to exploit matrices recursively [23], and to stochastically model the target solution [6],[24]. In exhaustive searches, which commonly rely on constrained formulations, merit functions are the heuristics that guide the space exploration [9],[25].
Figure 1 presents different types and structures of biclusters. Biclusters can follow constant or more flexible models, with coherency on rows or columns [1]. Biclusters under an additive-multiplicative model, also referred to as shifting-scaling biclusters, can be discovered using merit functions based on δ-offsets of noise [17],[25], on vector-angle cosines [21], or on generative models of linear dependencies [2]. Biclusters with symmetries can be discovered by differential biclustering methods [9],[26] and by a few others [14]. Additionally, plaid [6] and order-preserving [19] types of biclusters have also been tackled [27],[28]. Multiple biclustering structures have been proposed [1], with some approaches constraining them to exhaustive, exclusive, non-overlapping structures, and others allowing more flexible structures with arbitrarily positioned overlapping biclusters [7].
Pattern mining
Patterns are itemsets, rules or substructures that appear in a dataset with frequency no less than a specified threshold. Finding patterns is critical to derive relations from data.
Definition 2.
Let $\mathcal{L}$ be a finite set of items, and $P$ be an itemset $P\subseteq \mathcal{L}$. A transaction $t$ is a pair $(t_{id},P)$ with $id\in \mathbb{N}$. An itemset database $D$ over $\mathcal{L}$ is a finite set of transactions $\{t_1,..,t_n\}$.
Definition 3.
A transaction $(t_{id},P)$ contains $P'$, denoted $P'\subseteq (t_{id},P)$, if $P'\subseteq P$. The coverage $\Phi_P$ of an itemset $P$ is the set of all transactions in $D$ in which the itemset $P$ occurs: $\Phi_P=\{t\in D\mid P\subseteq t\}$. The support of an itemset $P$ in $D$, denoted $\mathit{sup}_P$, can either be absolute, being its coverage size $|\Phi_P|$, or relative, given by $|\Phi_P|/|D|$.
Definition 4.
Given an itemset database $D$ and a minimum support threshold $\theta$, the frequent itemset mining (FIM) problem consists of computing the set $\{P\mid P\subseteq \mathcal{L},\ \mathit{sup}_P\ge \theta \}$.
A frequent itemset is an itemset with $\mathit{sup}_P\ge \theta$. An accepted pattern is a frequent itemset that satisfies any other constraints placed over $D$.
To illustrate these concepts, consider the following itemset database, $D_{ex}$ = {(t_1,{B,E,G}), (t_2,{A,B,C,E,H,J}), (t_3,{A,B,D,H,J}), (t_4,{D,H,J}), (t_5,{A,H,J}), (t_6,{A,G})}. We have $|\mathcal{L}|=|\{A,..,J\}|=10$, $\Phi_{\{B,J\}}=\{t_2,t_3\}$ and a relative support $\mathit{sup}_{\{B,J\}}=|\{t_2,t_3\}|/6=1/3$. For an absolute threshold $\theta=4$, the FIM task returns {{A},{H},{J},{H,J}}.
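The example can be checked with a brute-force enumeration. The sketch below is not one of the miners discussed in this paper: it simply tests every candidate itemset built from the items that occur in $D_{ex}$, and the names `D_EX`, `coverage` and `frequent_itemsets` are illustrative.

```python
from itertools import combinations

# Example database D_ex: transaction id -> itemset.
D_EX = {
    "t1": {"B", "E", "G"},
    "t2": {"A", "B", "C", "E", "H", "J"},
    "t3": {"A", "B", "D", "H", "J"},
    "t4": {"D", "H", "J"},
    "t5": {"A", "H", "J"},
    "t6": {"A", "G"},
}


def coverage(itemset, db):
    """Return the ids of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in db.items() if set(itemset) <= items}


def frequent_itemsets(db, theta):
    """Enumerate all non-empty itemsets with absolute support >= theta."""
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = len(coverage(candidate, db))
            if support >= theta:
                result[frozenset(candidate)] = support
    return result


if __name__ == "__main__":
    print(sorted(coverage({"B", "J"}, D_EX)))
    for itemset, support in frequent_itemsets(D_EX, 4).items():
        print(sorted(itemset), support)
```

The enumeration is exponential in the number of items, which is why practical miners prune the search space instead of testing every candidate.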
Since FIM was first proposed [29], multiple extensions have followed, ranging from scalable data mining methodologies to multiple condensed and approximated pattern representations.
Definition 5.
Given an itemset database $D$, a support threshold $\theta$, and the coverage function $\Phi :{2}^{\mathcal{L}}\to {2}^{D}$ that maps an itemset $P$ to its set of supporting transactions:
A frequent itemset $P$ is an itemset that satisfies $|\Phi(P)|\ge \theta$;
A closed frequent itemset is a frequent itemset with no superset with the same support $\left(\forall_{P'\supset P}\,|\Phi(P')|<|\Phi(P)|\right)$;
A maximal frequent itemset is a frequent itemset with all supersets being infrequent, $\forall_{P'\supset P}\,|\Phi(P')|<\theta$.
A frequent itemset is maximal if all its supersets are infrequent, while it is closed if it is not a subset of an itemset with the same support. Considering the previously introduced itemset database $D_{ex}$, a given threshold $\theta=3$ and $|P|\ge 2$, there is one maximal frequent itemset ({A,H,J}) and there are two closed frequent itemsets ({A,H,J} and {H,J}).
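The closed and maximal itemsets of the example can be verified in the same brute-force manner. This is a minimal sketch under the definitions above, with illustrative function names; it is not an efficient closed or maximal pattern miner.

```python
from itertools import combinations

# Transactions of the example database D_ex (ids omitted).
D_EX = [
    {"B", "E", "G"}, {"A", "B", "C", "E", "H", "J"}, {"A", "B", "D", "H", "J"},
    {"D", "H", "J"}, {"A", "H", "J"}, {"A", "G"},
]


def supports(db, theta, min_size=1):
    """Map each frequent itemset with |P| >= min_size to its absolute support."""
    items = sorted(set().union(*db))
    found = {}
    for size in range(min_size, len(items) + 1):
        for cand in combinations(items, size):
            sup = sum(1 for t in db if set(cand) <= t)
            if sup >= theta:
                found[frozenset(cand)] = sup
    return found


def closed_and_maximal(frequent):
    """Split frequent itemsets into the closed ones and the maximal ones."""
    closed, maximal = set(), set()
    for p, sup in frequent.items():
        # A superset with equal support is itself frequent, so it is enough
        # to inspect the frequent proper supersets of p.
        supersets = [q for q in frequent if p < q]
        if all(frequent[q] < sup for q in supersets):
            closed.add(p)
        if not supersets:
            maximal.add(p)
    return closed, maximal


if __name__ == "__main__":
    closed, maximal = closed_and_maximal(supports(D_EX, theta=3, min_size=2))
    print("closed:", [sorted(p) for p in closed])
    print("maximal:", [sorted(p) for p in maximal])
```

Here {A,H} and {A,J} are frequent but not closed, because {A,H,J} has the same support of 3.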
Definition 6.
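Consider two itemsets $P\in {2}^{\mathcal{L}}$ and $P'\in {2}^{\mathcal{L}}$, where $P'\subseteq P$, and a predicate $M$. $M$ is monotonic when $M(P')\Rightarrow M(P)$ (every superset of a satisfying itemset also satisfies $M$), and $M$ is anti-monotonic when $\neg M(P')\Rightarrow \neg M(P)$ (every superset of a violating itemset also violates $M$). The minimum support constraint $\mathit{sup}_P\ge \theta$ is anti-monotonic.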
These properties are the basis of FIM, either for candidate generation or pattern growth methods, with horizontal or vertical data formats.
Pattern-based biclustering
The homogeneity criteria (Definition 1) in pattern-based approaches for biclustering are obtained through support and confidence-correlation metrics. Pattern-based approaches allow for an efficient and exhaustive space search that produces an arbitrarily high number of biclusters within a flexible structure.
Definition 7.
Given a matrix A and a minimum support threshold θ, a set of biclusters ∪_{ k }B_{ k }, where B_{ k }=(I_{ k },J_{ k }), can be derived from the set of frequent itemsets ∪_{ k }P_{ k } by either mapping $({I}_{k},{J}_{k})=({\Phi}_{{P}_{k}},{P}_{k})$ to compose biclusters with coherency on rows, or by mapping $({I}_{k},{J}_{k})=({P}_{k},{\Phi}_{{P}_{k}})$ to compose biclusters with coherency on columns.
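The first mapping of Definition 7 can be sketched as follows. The encoding of each item as a (column index, symbol) pair and the small discretized matrix are assumptions made for illustration; they are not taken from a specific method.

```python
def to_transactions(matrix):
    """Encode each row as a set of (column index, symbol) items."""
    return [{(j, v) for j, v in enumerate(row)} for row in matrix]


def bicluster_from_itemset(itemset, transactions):
    """Map an itemset P to (I, J) = (coverage of P, columns named in P)."""
    rows = [i for i, t in enumerate(transactions) if itemset <= t]
    cols = sorted(j for j, _ in itemset)
    return rows, cols


# Hypothetical discretized matrix with three symbols (0, 1, 2).
MATRIX = [
    [1, 2, 0, 1],
    [1, 2, 0, 2],
    [0, 2, 0, 1],
    [1, 2, 0, 0],
]

if __name__ == "__main__":
    pattern = {(0, 1), (1, 2), (2, 0)}
    print(bicluster_from_itemset(pattern, to_transactions(MATRIX)))
```

For the pattern shown, rows 0, 1 and 3 share the same symbols on columns 0 to 2, so they form one bicluster. The second mapping of Definition 7 follows by applying the same functions to the transposed matrix.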
Although the state-of-the-art pattern-based biclustering methods follow this general behavior, they have varying structural specificities that affect both the efficiency and the quality of the output. Two classes of PM-based biclustering approaches can be considered: approaches that directly apply pattern miners over discrete matrices, and approaches that target numeric matrices by customizing the support metric. To our knowledge, BiModule [13], DeBi [10], the approach of Bellay et al. [30] and GenMiner [15] are the state-of-the-art methods for this first class of PM-based biclustering. BiModule [11],[13] allows for a parameterized multi-value itemization of the input matrix to discover constant biclusters derived from (closed) frequent patterns using the LCM miner [31]. DeBi [10] derives biclusters from (maximal) frequent patterns mined over binarized matrices using the MAFIA miner [32], and places key post-processing principles to adjust biclusters in order to guarantee their statistical significance. Bellay et al. [30] use the Apriori miner with additional principles to evaluate the functional coherency of the discovered biclusters against the background noise. GenMiner [15] includes external knowledge within the input matrix to derive biclusters from association rules that relate annotations (external grouping of rows or columns) with computed clusters of rows and columns from (closed) frequent patterns using CLOSE [33]. Although other biclustering approaches seize contributions from these previous works [34],[35], we do not refer to them as PM-based approaches if the core mining task does not rely on FIM.
The itemization step is optional for the second class of methods [36]. To our knowledge, RAP [14], RCB discovery [36] and ET-bicluster [37] are the state-of-the-art methods in this context. RAP [14] plugs an adapted range-based metric to mine constant biclusters on rows (or columns), while RCB discovery targets biclusters with constant values overall [36]. ET-bicluster extends the previous approaches to discover noisy biclusters, although an exhaustive enumeration of biclusters is not guaranteed [37]. Alternative support metrics with dedicated Apriori-based searches have also been reported in the literature [38]–[40].
BicPAM: pattern-based biclustering
The homogeneity criteria can be intentionally defined to search for specific types and structures of biclusters and to affect their quality. The bicluster type depends on the allowed coherency patterns and on their orientation (row, column or overall), the solution structure depends on the number, size and positioning of biclusters, and, finally, the quality defines the allowed noise associated with a single bicluster or with a set of biclusters.
BicPAM is introduced in the three following sections. First, we describe the core steps of BicPAM (BicPAM outline). Second, we go further and incorporate new methods to deal with missing values and arbitrarily high input levels of noise (Affecting the quality of pattern-based biclusters). Finally, we propose further algorithmic solutions for the discovery of biclusters allowing symmetries and following additive and multiplicative assumptions (Allowing more flexible types of biclusters).
BicPAM outline
This section describes the structural behavior of BicPAM by surveying principles for the mining, mapping and closing steps. These principles are either derived from the existing pattern-based approaches for biclustering or from advances in the pattern mining field.
Mining step
Understandably, non-constrained settings, where the number of biclusters and their properties are not known a priori, require efficient searches. Pattern mining approaches have been tuned during the last decades to be computationally efficient. Therefore, their adequate use for biclustering is critical and depends essentially on three points described below: 1) the adopted pattern-based approach to biclustering, 2) the target pattern representation, and 3) the search strategy.
1) Pattern-based Approach
Definition 8.
Let $\mathcal{L}$ be a set of ordinal items. A bicluster is a sub-matrix $(I,J)\subseteq A$ with its elements $a_{ij}\in \mathcal{L}$ defining a pattern profile. A constant bicluster can follow: i) an overall constant assumption where $a_{ij}=c$ and $c\in \mathcal{L}$, ii) a column-based constant assumption where $a_{ij}=c_j$ and $c_j\in \mathcal{L}$, or iii) a row-based constant assumption where $a_{ij}=c_i$ and $c_i\in \mathcal{L}$.
Pattern-based biclustering under a constant assumption is the ordinary case. DeBi [10], BiModule [13] and GenMiner [15] only target this type of biclusters. These approaches either rely on Frequent Itemset Mining (FIM) or on association rules, which contrasts with traditional approaches [9],[18]. The support threshold defines the minimum number of rows in a bicluster. In the context of gene expression, a low support is critical since high expression coherency is only observed for small groups of genes and conditions. Additionally, a post-pruning of the frequent itemsets can be performed to filter out itemsets below a minimum number of columns or above a maximum number of rows and columns.
From the point of view of an itemized database, the FIM-based biclusters are perfect biclusters, that is, they do not allow value variations in any of their elements. In contrast, from the point of view of the input real-value matrix, these biclusters can handle noise since two elements with the same item may be numerically distant. The number of items can be used to control the noise tolerance. However, regardless of the number of items, a common drawback appears when two elements have similar real values but different items assigned. We refer to this drawback as the items-boundary problem.
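The items-boundary problem can be made concrete with a toy discretization. The equal-width binning used below is only an assumption for illustration (it is not claimed to be the discretization adopted by BicPAM), and `discretize` is a hypothetical helper.

```python
def discretize(value, low, high, n_items):
    """Map a real value in [low, high] to one of n_items equal-width items."""
    width = (high - low) / n_items
    index = int((value - low) / width)
    return min(max(index, 0), n_items - 1)  # clamp values on or beyond the edges


if __name__ == "__main__":
    # Four items over [-1, 1]: the boundaries fall at -0.5, 0.0 and 0.5.
    for value in (0.05, 0.45, -0.02):
        print(value, "->", discretize(value, -1.0, 1.0, 4))
```

With these settings, 0.05 and 0.45 differ by 0.40 yet share an item, whereas 0.05 and -0.02 differ by only 0.07 and fall on opposite sides of the boundary at 0.0.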
BiModule [11] and DeBi [10] are representative FIM-based biclustering approaches. Since their running time is comparable to greedy algorithms, they offer a simplistic way to deal with noise and overlapping structures [13]. However, the items-boundary problem can lead to the partitioning of large biclusters into smaller ones (with many being filtered as they no longer satisfy the support criterion).
In order to mine frequent itemsets with different properties, the notion of support of an itemset can be redefined. RAP [14] uses a customized anti-monotonic range support merit function. A FIM-based algorithm is used to discover range patterns from real-value matrices without the need for discretization. However, efficiency is strongly penalized.
An additional option to pattern-based biclustering is to derive biclusters from association rules. An association rule, an implication between two itemsets, can affect the properties of the corresponding bicluster as it constrains the levels of confidence among rows. Optionally, correlation metrics can be adopted to augment the confidence-support metrics with new interestingness criteria. GenMiner [15] uses association rules to compose biclusters. However, the adoption of association rules is only preferred over FIM-based approaches when knowledge regarding the dependencies across rows (or columns) is available.
BicPAM uses frequent itemsets as the default pattern-based option to biclustering. Range-based approaches are only selected for small-to-medium datasets. Finally, in the presence of domain knowledge (such as functional groups of genes or dependencies on conditions), BicPAM relies on association rules to compose biclustering solutions.
2) Pattern Representations
The target pattern representation depends essentially on: 1) the selected type and structure of biclusters, and 2) the post-processing needs. Efficiency is not a strong criterion, since only subtle gains are observed for methods that target constrained representations, such as closed and maximal representations.
The use of all frequent itemsets leads to biclustering solutions with a high number of biclusters, many of them potentially redundant because they are contained in another bicluster, which can degrade the performance of the mining and closing steps. In contrast, the use of maximal itemsets leads to biclusters with the number of columns maximized. Maximal itemsets for biclustering are adopted in DeBi [10]. Such flattened biclusters are particularly interesting when there is an extension step to be performed to include new rows for the discovered biclusters. However, since both vertical and smaller biclusters are avoided, maximal-based biclusters lead to incomplete solutions as they are just a subset of all valid biclusters.
Finally, by using closed itemsets, we allow for overlapping biclusters only if a reduction on the number of columns from a specific bicluster results in a higher number of rows. Note that to obtain maximal biclusters – biclusters that cannot be extended without the need of removing rows and columns – closed patterns need to be used instead of maximal patterns. FIM-based BiModule [13] and rule-based GenMiner [15] use closed itemsets as the means to compose biclusters.
3) Search Strategies
The definition of the search setting depends essentially on: 1) the fit of the search with the target biclustering task, and 2) the chosen implementation.
The choice of whether to use Apriori-based [41], pattern-growth [42] or combined approaches [43] mainly depends on the dataset density and on the fixed support thresholds. Dense matrices under low support thresholds benefit from pattern-growth or combined methods. The choice between a mining method with a vertical or a horizontal data format [43] depends essentially on the type of biclusters being targeted. If we want to find constant values across rows or on both dimensions, we usually benefit from using searches over horizontal data formats [35]. This is particularly true for most gene expression (GE) matrices, where the total number of genes largely exceeds the total number of conditions. If we want to find constant values across columns (when n>m), a vertical data format should be the choice, as the performance of searches based on horizontal formats degrades exponentially with an increasing number of items.
Several algorithms were developed for each of these search strategies. However, their properties should be carefully assessed, as they are most of the time optimized towards specific kinds of datasets. In the DeBi [10], BiModule [11] and GenMiner [15] biclustering tasks, MAFIA [32], LCM [31] and CLOSE [33] are, respectively, the algorithmic choices.
BicPAM makes available a variant of FP-Growth that traces the set of transactions per frequent pattern [44] (default option), Charm [45], AprioriTID [41] and Eclat [43]. Finally, Carpenter [46] and Cobbler [47] are additional algorithmic choices in BicPAM to compose biclusters with a large number of columns and for large-scale datasets.
Mapping step
Normalization techniques are often required to enhance differences across rows and/or columns. Alternative methods have been reported [34],[48]. BicPAM allows the normalization criteria to be applied in the context of a row, a column or the overall matrix. Additionally, it makes available a zero-mean value to allow for symmetries and to provide a simple setting for the approximation of probabilistic distributions. In the presence of missing and outlier elements, a masking bitmap can be used in order to exclude them from the computation of the mean and dispersion metrics.
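A row-wise, zero-mean normalization with a masking bitmap can be sketched as follows. The scaling by the standard deviation and the convention of returning `None` for masked cells are assumptions of this sketch, and `normalize_rows` is an illustrative name rather than a BicPAM routine.

```python
from statistics import mean, pstdev


def normalize_rows(matrix, mask):
    """Zero-mean, unit-deviation scaling per row.

    Cells whose mask entry is False (missing values or outliers) are excluded
    from the mean and dispersion and are returned as None.
    """
    out = []
    for row, row_mask in zip(matrix, mask):
        kept = [v for v, keep in zip(row, row_mask) if keep]
        if not kept:  # nothing to normalize in a fully masked row
            out.append([None] * len(row))
            continue
        mu = mean(kept)
        sigma = pstdev(kept) or 1.0  # avoid dividing by zero on flat rows
        out.append([(v - mu) / sigma if keep else None
                    for v, keep in zip(row, row_mask)])
    return out


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0, 100.0]]
    keep = [[True, True, True, False]]  # the outlier 100.0 is masked out
    print(normalize_rows(data, keep))
```

Without the mask, the outlier would pull the row mean to 26.5 and compress the three regular values into nearly the same normalized value.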
Discretization is an additional key step for pattern-based methods relying on itemset databases. Although discretization may imply loss of information, it alleviates the noise dilemma [26] and it is the cost to pay for exhaustive searches. BicPAM makes available multiple discretization options with key implications on the target solution. Two axes are considered: 1) the number of items (also referred to as symbols) and 2) the target method to map the normalized real-value matrix into an itemset database. Increasing the number of items is commonly used to improve quality, but it reduces the average size of biclusters and the number of biclusters produced. A sensitivity analysis on the impact of choosing different numbers of items was performed in Bidens [34] and BiModule [13].
Closing step
Similarly to mining and mapping options, post-processing criteria can be used to minimize the two challenges of the noise dilemma. One challenge results from a too restrictive noise tolerance, commonly associated with considering a high number of items, which leads to many small-sized biclusters. The other challenge is related to heightened levels of noise allowance, commonly appearing in binarized partitions and under relaxed levels of support or confidence. To handle these challenges and to treat the problem of the explosion of valid biclusters (commonly connected with overlapping biclusters), BicPAM enables the use of criteria structured according to three stages: 1) extension, 2) merging and 3) filtering.
1) Extension Options
Three optional and non-exclusive strategies can be used to extend the discovered biclusters such that the resulting solution still satisfies some pre-defined homogeneity criteria. The first strategy consists of the use of statistical tests to include rows or columns in each bicluster, as proposed in DeBi [10]. The second strategy relies on traditional approaches and on their merit functions for further extensions as long as the solution satisfies either the intra- or inter-bicluster homogeneity criteria. Finally, we propose a third strategy that uses patterns discovered under more relaxed criteria (such as lower support-confidence thresholds) to guide the extension step. When considering lower supports, new columns and rows can be added to the original frequent patterns. Similarly, more relaxed association rules, with less restrictive ways to group the antecedent-consequent, can be used to guide the extension step. The simple thresholds, statistical tests or merit functions that verify whether the bicluster is valid can be computed using either the discretized matrix (item matchings) or, more interestingly, the distances from the original real-value matrix.
2) Merging Options
Merging operations serve two goals: noise allowance and overall biclustering structure manipulation. The first goal is driven by the observation that when two biclusters share a significant area it is probable that their merging composes a larger bicluster still respecting some homogeneity criteria. Commonly, such decomposition is related with the items-boundary problem or with a missing value. The simplest criterion to allow the merging is either to rely on the overlapping area (as a percentage of the smaller bicluster), to compute the overall noisy percentage after the merging, or to use advanced homogeneity criteria (potentially relying on the real-values provided by the input matrix). State-of-the-art procedures to efficiently merge pattern-based biclusters include [49],[50].
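The simplest of these criteria, the overlapping area as a percentage of the smaller bicluster, can be sketched as below. Biclusters are represented as (rows, columns) pairs of sets and are assumed to be non-empty; the 0.7 threshold and the function names are illustrative and do not reproduce the procedures in [49],[50].

```python
def area(bicluster):
    """Number of elements of a (rows, columns) bicluster."""
    rows, cols = bicluster
    return len(rows) * len(cols)


def overlap_ratio(b1, b2):
    """Shared area as a fraction of the smaller bicluster's area."""
    shared = len(b1[0] & b2[0]) * len(b1[1] & b2[1])
    return shared / min(area(b1), area(b2))


def merge_if_overlapping(b1, b2, min_overlap=0.7):
    """Return the union bicluster when the overlap criterion holds, else None."""
    if overlap_ratio(b1, b2) >= min_overlap:
        return (b1[0] | b2[0], b1[1] | b2[1])
    return None


if __name__ == "__main__":
    b1 = ({1, 2, 3, 4}, {0, 1, 2})
    b2 = ({2, 3, 4, 5}, {0, 1, 2})
    print(overlap_ratio(b1, b2))
    print(merge_if_overlapping(b1, b2))
```

A merged bicluster may contain noisy elements outside both originals, so in practice the merge would be followed by a check of the overall noise percentage or of another homogeneity criterion.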
3) Filtering Options
Filtering is possible at two levels: 1) at the bicluster level, and 2) at the row-column level. The first type of filtering is required to remove duplicates and biclusters that are contained in larger biclusters. The existence of biclusters included in larger biclusters is a necessary result of the extension-merging options and it is a common problem when adopting mining approaches that do not rely on condensed pattern representations. Both DeBi [10] and BiModule [13] provide alternative heuristics to efficiently perform this type of filtering.
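Bicluster-level filtering can be illustrated with a simple pairwise containment check. This quadratic sketch is not the heuristic used by DeBi or BiModule; `filter_contained` is a hypothetical helper operating on (rows, columns) pairs.

```python
def filter_contained(biclusters):
    """Drop duplicates and biclusters whose rows and columns lie inside another one."""
    # Freeze the sets so duplicates collapse while the input order is preserved.
    unique = list(dict.fromkeys((frozenset(r), frozenset(c)) for r, c in biclusters))
    kept = []
    for rows, cols in unique:
        contained = any(
            (rows, cols) != other and rows <= other[0] and cols <= other[1]
            for other in unique
        )
        if not contained:
            kept.append((rows, cols))
    return kept


if __name__ == "__main__":
    found = [
        ({1, 2, 3}, {0, 1}),
        ({1, 2}, {0, 1}),     # contained in the first bicluster
        ({1, 2, 3}, {0, 1}),  # duplicate of the first bicluster
        ({4, 5}, {2}),
    ]
    print(filter_contained(found))
```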
The second type of filtering can be used to exclude rows or columns from a particular bicluster in order to intensify its homogeneity. This is usually the case when a low number of items is considered, leading to highly noise-tolerant biclusters. For this purpose, similarly to extension options, we can rely on three strategies: 1) use statistical tests on each row and column of a particular bicluster in order to identify removals, 2) rely on existing greedy-iterative approaches and maximize their merit functions (which can imply a reduction in the size of biclusters), and 3) discover patterns under more restrictive conditions (such as higher support and confidence thresholds).
Affecting the quality of pattern-based biclusters
BicPAM options with impact on the solution quality include:
Mining step options, including the approach, the support-confidence thresholds, and the pattern representations;
Mapping step options, including the number of items and the normalization-discretization techniques;
Closing step options, including the selected extension, merging and filtering approaches, and their criteria thresholds (percentage of noise, overlapping degree, statistical significance levels).
Below, we describe new strategies that BicPAM makes available to handle varying levels of missing values and input noise, and to compose multiple structures of biclusters.
Handling missing values
The input matrices can have an arbitrarily high number of missing values, as is common in GE matrices. A non-treated missing value may result in the loss of a critical row and of a column across one or more biclusters. Three different strategies can be applied to treat missing values: 1) removal, 2) replacement, and 3) handling as a special value. The simplest method is to remove the containing row or column (usually the dimension with smaller size).
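The three strategies can be contrasted on a toy matrix. In this sketch, missing entries are encoded as `None`, removal is shown for rows only, replacement uses the row mean, and the special value is the symbol "?"; all of these are assumptions chosen for illustration rather than BicPAM's actual options.

```python
MISSING = None  # assumed encoding of a missing entry


def remove_rows_with_missing(matrix):
    """Strategy 1 (removal): drop every row that contains a missing entry."""
    return [row for row in matrix if MISSING not in row]


def replace_with_row_mean(matrix):
    """Strategy 2 (replacement): impute each missing entry with its row mean."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not MISSING]
        fill = sum(observed) / len(observed) if observed else 0.0
        filled.append([fill if v is MISSING else v for v in row])
    return filled


def mark_as_special_item(matrix, special="?"):
    """Strategy 3 (special value): keep the entry but give it a dedicated symbol."""
    return [[special if v is MISSING else v for v in row] for row in matrix]


if __name__ == "__main__":
    data = [[1.0, None, 3.0], [4.0, 5.0, 6.0]]
    print(remove_rows_with_missing(data))
    print(replace_with_row_mean(data))
    print(mark_as_special_item(data))
```

Removal discards the two observed values of the first row along with the missing one, which illustrates why a non-treated or bluntly removed missing value can cost a whole row of a bicluster.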
Handling varying levels of noise
Producing alternative biclustering structures
Since the number of biclusters is neither fixed nor dependent on the satisfaction of a local coverage criterion, pattern-based approaches provide a heightened flexibility for the composition of different biclustering structures. A pattern-based solution is non-exhaustive, non-exclusive and allows overlaps. The task of composing different structures has been poorly addressed in the literature, and rather seen as the byproduct of biclustering methods [1]. Below, we introduce a set of principles to compose multiple structures made available in BicPAM.
For an exhaustive structure (either overall, across rows or across columns), biclusters can be incrementally merged following, for instance, a hierarchical criterion based on the proximity and the area of biclusters, until the whole matrix is covered. If the goal is an exclusive structure (either overall, across rows or across columns), a simple strategy is to merge biclusters in order to reduce overlapping across one or both dimensions and, additionally, to filter biclusters that share rows or columns following a relevance criterion (such as size or noise level) until exclusivity is guaranteed. Closing options can be specifically used to produce other alternative structures with sharp usability (no need to change the core tasks of pattern-based approaches).
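The filtering half of the exclusive-structure strategy can be sketched as a greedy selection. Using the bicluster area as the relevance criterion and enforcing exclusivity across rows only are assumptions of this sketch; the preceding merging stage is not shown.

```python
def make_row_exclusive(biclusters):
    """Keep the largest biclusters first; drop any that shares a row with one already kept."""
    ranked = sorted(biclusters, key=lambda b: len(b[0]) * len(b[1]), reverse=True)
    kept, used_rows = [], set()
    for rows, cols in ranked:
        if used_rows.isdisjoint(rows):
            kept.append((rows, cols))
            used_rows |= set(rows)
    return kept


if __name__ == "__main__":
    found = [
        ({1, 2, 3}, {0, 1, 2}),  # area 9
        ({3, 4}, {0, 1}),        # area 4, shares row 3 with the first bicluster
        ({5, 6}, {1, 2, 3}),     # area 6
    ]
    print(make_row_exclusive(found))
```

The same function yields a column-exclusive structure when each bicluster is passed as (columns, rows), and an overall exclusive structure if both dimensions are tested.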
Allowing more flexible types of biclusters
Below, we extend BicPAM to consider more flexible expression patterns: additive, multiplicative and symmetric coherency.
Coherency under additive-multiplicative assumption
Definition 9.
A bicluster $(I,J)$ follows an additive model if $a_{ij}=c+\alpha_i+\beta_j+\eta_{ij}$, where $c$ is the typical value within the bicluster, $\alpha_i$ is the adjustment for row $i\in I$, $\beta_j$ is the adjustment for column $j\in J$ and $\eta_{ij}$ is the noise associated with the element. A bicluster $(I,J)$ follows a multiplicative model if $a_{ij}=c'\times \alpha'_i\times \beta'_j+\eta_{ij}$, which is equivalent to the additive model when $c=\log c'$, $\alpha_i=\log \alpha'_i$ and $\beta_j=\log \beta'_j$.
We propose two pattern-based strategies for the discovery of biclusters with non-constant models of coherency. The first strategy is to use local normalization procedures to correct row- or column-based differences and then to map the task into the search for constant biclusters.
An additive alignment over a target column $y_j$ can be computed by adding, for each element on the row $x_i$, the difference between the maximum of the column and the discretized value, $\max(y_j)-a_{ij}$. A multiplicative alignment over a target column $y_j$ can be computed by adding, for each element on the row $x_i$, the least common multiple between the maximum of the column and the discretized value, $\mathrm{lcm}(\max(y_j),a_{ij})$. The resulting number of items under an additive assumption is, in the worst case, double the number of items initially considered. The number of final items under a multiplicative model is the size of the lcm combinations across the initial items. As illustrated in Figure 8, a distance-based δ-error can be considered to gather close items in the multiplicative model, since the resulting large number of items lowers the probability of finding coherent biclusters.
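The additive alignment can be sketched for integer items as follows. The toy matrix and the name `additive_alignment` are illustrative, and the multiplicative (lcm-based) variant is not covered by this sketch.

```python
def additive_alignment(items, target_col):
    """Shift every row so that its value in target_col equals the column maximum."""
    col_max = max(row[target_col] for row in items)
    return [[v + (col_max - row[target_col]) for v in row] for row in items]


# Hypothetical discretized matrix with four items (0..3).
ITEMS = [
    [0, 1, 3, 2],
    [1, 2, 0, 3],  # equals row 0 shifted by +1 on columns 0, 1 and 3
    [2, 3, 1, 0],  # equals row 0 shifted by +2 on columns 0 and 1
]

if __name__ == "__main__":
    for row in additive_alignment(ITEMS, target_col=0):
        print(row)
```

After aligning on column 0, the three rows agree on columns 0 and 1, and rows 0 and 1 also agree on column 3, so a search for constant biclusters recovers the additive ones. Aligned values here lie between 0 and 6, which matches the worst-case doubling of the number of items.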
Coherency under symmetrical assumption
A critical, but less studied, type of bicluster is the bicluster with coherent values under a symmetrical assumption, also referred to in the literature as a bicluster with sign-changes [1]. Two rows or columns from a bicluster allowing symmetries may have similar absolute values differing in sign. Such biclusters can simultaneously capture activation and repression mechanisms within a biological process.
Definition 10.
A bicluster (I,J) following a symmetric model has either: i) symmetries on rows $\hat{a}_{ij}=c_{i}\times a_{ij}$, where c_{ i }∈{−1,1} is the symmetry factor for each row of the bicluster and $a_{ij}\in\mathbb{R}$ is a bicluster element defined according to a constant, additive or multiplicative model, or ii) symmetries on columns $\hat{a}_{ij}=c_{j}\times a_{ij}$, where c_{ j }∈{−1,1} is the column symmetry factor and $a_{ij}\in\mathbb{R}$ is an element of a bicluster with coherent values.
For the purpose of finding biclusters with symmetries, the normalization should satisfy the zero-mean criterion. Additionally, if the number of considered items is odd, one item is its own symmetric and must be handled separately.
The combination of this strategy with the search for biclusters under an additive or multiplicative model can be expensive (m×m iterations). Therefore, BicPAM makes available an additional option that combines the sign with the additive or multiplicative adjustments for every column (or row). This model (combined sign and coherent model) is not equivalent to the previous model (sign plus coherent model), since it assumes that additions or multiplications are not absolute but depend on the sign of the activity slope. Here, the value adjustment of a particular element is also affected by the sign, which can lead to an additional number of items. This strategy is illustrated in the bottom example of Figure 9.
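To make the row-symmetry case of Definition 10 concrete, the sketch below applies a symmetry factor c_{ i }∈{−1,1} to each row of a zero-mean itemized matrix. It is an illustration under stated assumptions, not BicPAM's procedure: the function name `sign_align`, the choice of deriving c_{ i } from the sign of a target column, and the five-item alphabet {−2,…,2} are all hypothetical.

```python
def sign_align(matrix, target_col):
    """Flip each row whose value in target_col is negative.

    Returns the aligned rows and the symmetry factor c_i in {-1, 1} used for
    each row. A row with item 0 in the target column is left unchanged: with
    an odd number of items, 0 is its own symmetric and its sign is undefined,
    so it would need the special handling mentioned in the text.
    """
    aligned, factors = [], []
    for row in matrix:
        c = -1 if row[target_col] < 0 else 1
        factors.append(c)
        aligned.append([c * value for value in row])
    return aligned, factors

# Zero-mean items in {-2, -1, 0, 1, 2}; the second row mirrors the others.
items = [
    [2, -1, 0],
    [-2, 1, 0],
    [2, -1, 0],
]

aligned, factors = sign_align(items, target_col=0)
print(aligned)   # rows become identical after the sign correction
print(factors)
```

Once the rows are sign-aligned, the symmetric bicluster can be searched for as a constant one, which mirrors the mapping idea used for the additive and multiplicative models.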
BicPAM algorithm and complexity analysis
The algorithmic basis of BicPAM is described in Algorithm 1. Although BicPAM follows a plug-and-play style, default and data-driven parameterizations are made available. In particular, lines 40-44 and 37 describe BicPAM behavior in the absence of user-driven parameterizations. This is performed by either relying on estimation procedures or on convergence criteria based on thresholds such as the relative area covered by biclusters or the minimum number of biclusters.
The computational complexity of BicPAM is bounded by the pattern mining task and the computation of similarities among biclusters. For this analysis, we cover the major computational bottlenecks of each of the three major steps of BicPAM. Within the mapping step, outlier detection, normalization, discretization, and noise correction procedures (such as the assignment of multiple items) are linear in the size of the matrix, Θ(n m). The optional distribution fitting tests and parameter estimations used to dynamically select an adequate discretization procedure are also Θ(n m). These tests and estimations rely on the calculation of approximated statistical ratios [54]. Handling missing values by removing the respective elements or by replacing them with a special dedicated item is also Θ(n m). However, when an imputation method is selected, the complexity is upper bounded by Θ(h n m), where h is the number of missing values. In the BicPAM implementation, the nearest neighbor rows and columns are computed for the estimation of each missing value.
The cost of the mining step depends on two factors: the complexity of the pattern miner and the need for iterations for the discovery of non-constant profiles. The cost of the pattern mining task depends essentially on: the number and size of transactions (γ n m, where γ≥1 captures the increase in size related to the noise and missing-value handlers), the frequency distribution of items ($\{\mathcal{L}\times Y\}\to\mathbb{N}$), the minimum support θ, the pattern representation and the selected mining procedure. A detailed analysis of this complexity has been attempted in the literature [55] and is out of the scope of this paper. The reader should also keep in mind that there have been proposals to guarantee the scalability of pattern miners by resorting to partitioning and approximation methods [12]. Let $\Theta(\wp(\gamma,n,m,|\mathcal{L}|,\theta))$, or simply Θ(℘), be the complexity of a pattern mining task. When the core mining procedure needs to be applied iteratively, the overall search is bounded by Θ(d×℘), where $d=\min\left(\binom{n}{2},m\right)$ when allowing symmetries, $d=\min\left(\binom{n}{|\mathcal{L}|},m\right)$ when allowing shifts, and $d=\min\left(\binom{n}{\sharp\mathrm{lcm}},m\right)$ when allowing scaling factors.
The cost of the closing step depends essentially on two factors: the complexity of computing similarities among biclusters (required for merging and filtering biclusters) and the complexity of extending and reducing biclusters. To compute similarities, a tree structure is created where each node represents a gene and each leaf corresponds to a bicluster. Only biclusters sharing a branch over a threshold based on the input overlapping degree are candidates for merging and filtering. Filtering a bicluster results in the removal of its leaf and dedicated nodes. Merging two biclusters results in the combination of the target branches. These tasks have an average complexity of $\Theta\left(\binom{k}{k/2}\bar{r}\bar{s}\right)$, where k is the number of biclusters and $\bar{r}\bar{s}$ their average size. Extending biclusters relies on quick tests based on the coherency of each new column or row, and therefore the complexity of this task is respectively $\Theta\left(k'\bar{r}m\right)$ or $\Theta\left(k'n\bar{s}\right)$, where k^{′} is the number of biclusters after merging and filtering. Removing rows or columns from biclusters is $\Theta\left(k'\bar{r}\bar{s}\right)$.
In this context, the complexity of BicPAM is bounded by $\Theta\left(hnm+d\wp+\binom{k}{k/2}\bar{r}\bar{s}+k'(\bar{r}m+n\bar{s})\right)$, which for settings resulting in a large number of biclusters after the mining step (k≫k^{′}) is approximately $\Theta\left(d\wp+\binom{k}{k/2}\bar{r}\bar{s}\right)$.
Results
In this section, we present an extensive experimental evaluation showing that BicPAM is effective and computationally efficient. BicPAM was implemented in Java (JVM version 1.6.0-24). The following experiments were run on an Intel Core i3 1.80 GHz with 6 GB of RAM.
The results were collected and analyzed in four steps. Section “Comparison of biclustering approaches in synthetic data” compares the performance of BicPAM against state-of-the-art biclustering approaches. In Section “Performance analysis in synthetic data”, BicPAM’s behavior is extensively assessed in synthetic datasets with varying size, noise, sparsity and background distributions. The biological relevance of BicPAM’s results is analyzed in Section “Results in real data”. Finally, Section “Comparison of pattern-based biclustering approaches” goes further on comparing BicPAM and its pattern-based peers. Below, we describe the evaluation metrics and data settings used.
In the absence of hidden biclusters, merit functions can be used as long as they are not biased towards the merit criteria used within the approaches under comparison. Examples include the commonly used mean squared residue (MSR) [62] and its extension [16], or the Pearson correlation coefficient [59], which is sensitive to shifting-scaling properties. Finally, domain-specific evaluations can be used by computing statistical enrichment p-values in biological contexts [10],[63].
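As an illustration of the first merit function, the sketch below computes the mean squared residue in its standard form, where the residue of an element is r_{ ij }=a_{ ij }−a_{ iJ }−a_{ Ij }+a_{ IJ } (row mean, column mean and overall mean of the bicluster). The function name and the two toy biclusters are assumptions made for this example.

```python
def mean_squared_residue(bicluster):
    """Mean squared residue of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(bicluster), len(bicluster[0])
    row_means = [sum(row) / n_cols for row in bicluster]
    col_means = [sum(bicluster[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = bicluster[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)

# A perfectly additive bicluster (rows differ by constant shifts).
additive = [[1.0, 2.0, 4.0],
            [2.0, 3.0, 5.0],
            [4.0, 5.0, 7.0]]

# The same bicluster with one perturbed element.
noisy = [[1.0, 2.0, 4.0],
         [2.0, 3.0, 5.0],
         [4.0, 5.0, 9.0]]

print(mean_squared_residue(additive))
print(mean_squared_residue(noisy))
```

The additive bicluster scores (numerically) zero while the perturbed one does not, which shows why MSR is biased towards shifting coherency and should not be used to compare approaches that optimize it directly.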
Data settings. Gene expression data and two sets of synthetic datasets were used to evaluate BicPAM performance. The first set corresponds to the datasets generated by Hochreiter et al. [2]. These datasets simulate specific characteristics of gene expression data, such as heavy tail properties, using three settings: multiplicative models and additive models under signals according to N(±2,0.5^{2}) and N(±4,0.5^{2}) distributions [64]. Each setting has 100 datasets with 1000 genes, 100 conditions and 10 planted biclusters.
Table 1. Properties of the generated set of synthetic datasets
Matrix size (♯rows ×♯cols) | 100 × 30 | 500 × 60 | 1000 × 100 | 2000 × 200 | 4000 × 400 |
---|---|---|---|---|---|
Nr. of hidden biclusters | 3 | 5 | 10 | 15 | 20 |
Nr. columns in biclusters | [5,7] | [6,8] | [6,10] | [6,14] | [6,20] |
Nr. rows in biclusters | [10,20] | [15,30] | [20,40] | [40,70] | [60,100] |
Area of biclusters | 9.0% | 2.6% | 2.4% | 2.1% | 1.3% |
The generated matrices were parameterized according to a pre-specified number of items ($|\mathcal{L}|\in\{5,10,20\}$) and to an input bicluster type assumption (constant, additive, multiplicative and/or symmetric). The number of rows and columns for each bicluster followed a Uniform distribution over the ranges presented in Table 1. We allow for overlapping biclusters, which can hinder the recovery of the original planted biclusters. Finally, a noise factor was randomly added over the background values. This noise factor is up to ±15% of the range of values (e.g. a_{ ij }←a_{ ij }+U(−1.5,1.5) when 10 items are available).
For each of these settings we instantiated 40 matrices: 20 matrices with background values following a Uniform distribution, $\mathrm{U}(1,|\mathcal{L}|)$, and 20 matrices with background values generated according to a Gaussian distribution, $\mathrm{N}\left(\frac{|\mathcal{L}|}{2},\frac{|\mathcal{L}|}{6}\right)$. The performance of BicPAM is an average across these 40 matrices.
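The data settings above can be approximated with the following sketch. It is not the generator used in the experiments: the function names are hypothetical, the second Gaussian parameter is assumed to be a standard deviation, only a single constant bicluster (one item per column, shared by all its rows) is planted, and the noise is applied to every element for simplicity.

```python
import random

def background(n_rows, n_cols, n_items, kind, rng):
    """Background values drawn from U(1, |L|) or N(|L|/2, |L|/6)."""
    if kind == "uniform":
        return [[rng.uniform(1, n_items) for _ in range(n_cols)]
                for _ in range(n_rows)]
    # Assumption: |L|/6 is used as the standard deviation.
    return [[rng.gauss(n_items / 2, n_items / 6) for _ in range(n_cols)]
            for _ in range(n_rows)]

def plant_constant(matrix, row_range, col_range, n_items, rng):
    """Plant one constant bicluster whose size is uniform over the given ranges."""
    rows = rng.sample(range(len(matrix)), rng.randint(*row_range))
    cols = rng.sample(range(len(matrix[0])), rng.randint(*col_range))
    pattern = [rng.randint(1, n_items) for _ in cols]
    for i in rows:
        for j, item in zip(cols, pattern):
            matrix[i][j] = float(item)
    return rows, cols

def add_noise(matrix, n_items, level, rng):
    """Add U(-level*|L|, +level*|L|) noise, e.g. U(-1.5, 1.5) for 10 items."""
    bound = level * n_items
    for row in matrix:
        for j in range(len(row)):
            row[j] += rng.uniform(-bound, bound)

rng = random.Random(42)                      # fixed seed for reproducibility
data = background(100, 30, 10, "uniform", rng)
rows, cols = plant_constant(data, (10, 20), (5, 7), 10, rng)   # 100 x 30 setting
add_noise(data, 10, 0.15, rng)
print(len(data), len(data[0]), len(rows), len(cols))
```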
Comparison of biclustering approaches in synthetic data
We selected five state-of-the-art approaches able to discover biclusters under additive-multiplicative assumptions: FABIA with sparse prior option [2], Bexpa [66], ISA [67], Plaid [6] and OPSM [19]. Additionally, we considered CC [62], Samba [9], xMotifs [18], and three pattern-based biclustering approaches: BiModule [13], DeBi [10] and RAP [14]. Although the last six biclustering approaches use more simplistic homogeneity criteria, their inclusion is critical to study the biological significance of BicPAM’s solutions and to test BicPAM’s performance improvements when considering biclusters with constant models.
We used the following software to run these methods: R packages fabia [68] and biclust [69], BicAT [70], (Evo-)Bexpa [66] and Expander [71]. The specified number of biclusters for FABIA (with and without sparse equation), Bexpa, CC and ISA (number of starting points) was the number of hidden biclusters plus 10%: $|\mathscr{H}|\times 1.1$. Note that this required specification can be used to guide the search space exploration against other biclustering approaches and optimistically bias FABIA consensus (FC) levels. The default number of iterations for OPSM was varied from 10 to 200 iterations. The remaining methods were executed with default parameterizations. For this comparison, BicPAM was parameterized with closed patterns discovered using discretization methods with three distinct sets of items (|Σ|∈{3,5,7}), under a simple merging option (70% overlap) and filtering of biclusters based on an overlapping area over 30% against a larger bicluster. Additionally, two items were assigned to values near item-boundaries, leading to an increase in the size of transactions of 8-11%. The support threshold was incrementally decreased by 10% and the search stopped when the discovered biclusters covered a minimum area of the input matrix (>5% ×|X|×|Y|).
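The stopping criterion used for the support threshold can be sketched as a simple loop. This is an assumption-laden illustration: `mine` stands for any frequent-pattern miner returning biclusters as (rows, columns) pairs, `toy_miner` is a stand-in written for this example, the starting support and the lower floor are hypothetical, and the 10% decrease is interpreted multiplicatively.

```python
def mine_until_covered(mine, n_rows, n_cols, min_area=0.05,
                       start_support=0.9, decay=0.9, floor=0.01):
    """Lower the support by 10% per iteration until biclusters cover > min_area."""
    support = start_support
    while True:
        biclusters = mine(support)
        covered = set()
        for rows, cols in biclusters:
            covered.update((i, j) for i in rows for j in cols)
        enough_area = len(covered) > min_area * n_rows * n_cols
        # Stop when enough of the matrix is covered or the support is exhausted.
        if enough_area or support * decay < floor:
            return support, biclusters
        support *= decay

def toy_miner(support):
    """Stand-in miner: a single 10 x 4 bicluster becomes frequent at support <= 0.5."""
    return [(range(10), range(4))] if support <= 0.5 else []

final_support, found = mine_until_covered(toy_miner, n_rows=20, n_cols=10)
print(final_support, len(found))
```

Counting covered elements with a set avoids double counting the area shared by overlapping biclusters.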
Performance analysis in synthetic data
In this section we study the efficiency limits of BicPAM. Then we assess the ability of BicPAM to discover different types of biclusters for data with varying regularities. Finally, we go further on understanding the impact of using different strategies related with the mining, mapping and closing steps.
Efficiency limits
Recovery of (non-)constant biclusters
Table 2. FC and MS levels of BicPAM in different settings (mean and variance from 20 datasets)
Metric | Coherency | 100 × 30 (Normal) | 100 × 30 (Uniform) | 500 × 60 (Normal) | 500 × 60 (Uniform) | 1000 × 100 (Normal) | 1000 × 100 (Uniform) | 2000 × 200 (Normal) | 2000 × 200 (Uniform) |
---|---|---|---|---|---|---|---|---|---|
FC | Constant | 0.862 ±0.017 | 0.930 ±0.014 | 0.884 ±0.018 | 0.956 ±0.007 | 0.909 ±0.017 | 0.949 ±0.006 | 0.907 ±0.014 | 0.948 ±0.011 |
FC | Additive | 0.782 ±0.021 | 0.831 ±0.008 | 0.834 ±0.014 | 0.888 ±0.007 | 0.845 ±0.018 | 0.897 ±0.007 | 0.827 ±0.015 | 0.887 ±0.006 |
FC | Multiplicative | 0.762 ±0.028 | 0.794 ±0.013 | 0.790 ±0.019 | 0.825 ±0.014 | 0.785 ±0.020 | 0.840 ±0.011 | 0.767 ±0.020 | 0.819 ±0.015 |
$\mathit{MS}(\mathcal{B},\mathscr{H})$ | Constant | 0.923 ±0.018 | 0.974 ±0.007 | 0.931 ±0.012 | 0.968 ±0.005 | 0.935 ±0.010 | 0.984 ±0.005 | 0.944 ±0.011 | 0.987 ±0.008 |
$\mathit{MS}(\mathcal{B},\mathscr{H})$ | Additive | 0.895 ±0.017 | 0.945 ±0.006 | 0.925 ±0.012 | 0.963 ±0.003 | 0.913 ±0.008 | 0.981 ±0.007 | 0.917 ±0.011 | 0.974 ±0.006 |
$\mathit{MS}(\mathcal{B},\mathscr{H})$ | Multiplicative | 0.902 ±0.019 | 0.958 ±0.014 | 0.906 ±0.015 | 0.953 ±0.009 | 0.910 ±0.015 | 0.941 ±0.008 | 0.886 ±0.019 | 0.948 ±0.010 |
$\mathit{MS}(\mathscr{H},\mathcal{B})$ | Constant | 0.956 ±0.013 | 0.984 ±0.006 | 0.960 ±0.007 | 0.981 ±0.004 | 0.961 ±0.004 | 0.996 ±0.002 | 0.957 ±0.009 | 0.993 ±0.002 |
$\mathit{MS}(\mathscr{H},\mathcal{B})$ | Additive | 0.955 ±0.012 | 0.997 ±0.001 | 0.959 ±0.006 | 0.997 ±0.002 | 0.955 ±0.004 | 0.995 ±0.002 | 0.957 ±0.007 | 0.995 ±0.003 |
$\mathit{MS}(\mathscr{H},\mathcal{B})$ | Multiplicative | 0.937 ±0.015 | 0.966 ±0.008 | 0.924 ±0.012 | 0.968 ±0.008 | 0.923 ±0.010 | 0.963 ±0.009 | 0.927 ±0.013 | 0.974 ±0.007 |
Mining options
Three main observations can be retrieved from this analysis. First, the use of maximal patterns for biclustering should be avoided as it gives preference to biclusters with a large number of columns and discards biclusters with a subset of these columns (even when they have a larger number of rows). Understandably, this penalizes the $\mathit{\text{MS}}(\mathcal{\mathscr{H}},\mathcal{\mathcal{B}})$ levels. $\mathit{\text{MS}}(\mathcal{\mathcal{B}},\mathcal{\mathscr{H}})$ scores are not so affected as each maximal bicluster is covered by a planted bicluster. Second, the use of simple patterns for biclustering can degrade the $\mathit{\text{MS}}(\mathcal{\mathcal{B}},\mathcal{\mathscr{H}})$ in comparison with closed patterns. This score penalizes the discovery of biclusters contained in larger planted biclusters, even when the discovered biclusters have a heightened homogeneity. Third, the search for closed and maximal patterns is slightly more efficient than the search for simple patterns as a result of additional pruning procedures. These observations support the use of closed patterns. Furthermore, they correspond to maximal biclusters, which are in general the aim of effective biclustering algorithms [1],[13],[73].
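The difference between simple (all frequent), closed and maximal patterns can be seen on a toy example. The sketch below uses brute-force enumeration, which is only viable for tiny inputs and is not how BicPAM or its underlying miners operate. The transactions, item labels and function names are assumptions: each transaction stands for a row, and each item for a (column, discretized value) pair.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

def closed(frequent):
    """Patterns with no proper superset having the same support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == frequent[q] for q in frequent)}

def maximal(frequent):
    """Patterns with no frequent proper superset."""
    return {p: s for p, s in frequent.items()
            if not any(p < q for q in frequent)}

transactions = [
    {"y1=A", "y2=B", "y3=C"},
    {"y1=A", "y2=B", "y3=C"},
    {"y1=A", "y2=B"},
    {"y1=A", "y2=B"},
]

all_patterns = frequent_itemsets(transactions, min_support=2)
closed_patterns = closed(all_patterns)
maximal_patterns = maximal(all_patterns)
print(len(all_patterns), len(closed_patterns), len(maximal_patterns))
```

Here the maximal representation keeps only the three-column pattern supported by two rows and drops the two-column pattern supported by four rows, while the closed representation keeps both. This matches the first observation above: maximal patterns favor biclusters with many columns and discard sub-patterns with more rows.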
Mapping options
When analyzing the results in Figure 22, three observations can be retrieved. First, $\mathit{\text{MS}}(\mathcal{\mathcal{B}},\mathcal{\mathscr{H}})$ under the baseline strategy (remove the missings) significantly decreases from 97% to near 70% when the percentage of missings reaches 10%. Although this solution is easily implemented in BicPAM (removing an element from respective transactions), the majority of existing biclustering algorithms only allow for removals on the columns or the rows where a missing occurs (impracticable even in the presence of a few missings as illustrated). Second, the ability to retrieve the planted biclusters increases when considering the nearest 2-3 values against the strategies that consider the closest value only or all the possible values (relaxed strategy). This is justified by two factors: 1) when estimating more than one value for a missing, there is an increased chance to recover the original value and, therefore, of not damaging a planted bicluster; 2) when considering all the possible values for a missing, there is an increased amount of noise that is added and can lead to the emergence of false biclusters. Third, although inserting multiple values to replace a missing is an attractive option in terms of accuracy, its efficiency is penalized as the itemized matrix becomes denser (consistent with the number of discovered biclusters). Still, when considering only the closest 2-3 values, scalability is maintained for levels of noise up to 10%.
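The three ways of handling a missing element compared above (removal, nearest values, and all possible values) can be sketched at the transaction level. This is a hypothetical illustration: the function name, the use of `None` for a missing element, and the assumption that a real-valued estimate of the missing value is already available (e.g. from an imputation step) are not taken from BicPAM's code.

```python
def row_to_transaction(row, n_items, strategy="remove", estimates=None, k=2):
    """Turn a discretized row into a set of (column, item) pairs.

    row       -- items in 1..n_items, or None for a missing element
    strategy  -- "remove": a missing element contributes no item
                 "nearest": add the k items closest to the estimated value
                 "relaxed": add every possible item for the missing element
    estimates -- mapping column -> estimated value (used by "nearest" only)
    """
    transaction = set()
    for j, item in enumerate(row):
        if item is not None:
            transaction.add((j, item))
        elif strategy == "relaxed":
            transaction.update((j, c) for c in range(1, n_items + 1))
        elif strategy == "nearest":
            ranked = sorted(range(1, n_items + 1),
                            key=lambda c: abs(c - estimates[j]))
            transaction.update((j, c) for c in ranked[:k])
    return transaction

row = [3, None, 1]            # the second element is missing
estimates = {1: 2.4}          # assumed imputed value for column 1

removed = row_to_transaction(row, 5, "remove")
nearest = row_to_transaction(row, 5, "nearest", estimates, k=2)
relaxed = row_to_transaction(row, 5, "relaxed")
print(len(removed), len(nearest), len(relaxed))
```

The transaction grows from 2 to 4 to 7 items across the three strategies, which is the densification of the itemized matrix that explains the efficiency penalty noted in the third observation.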
Closing options
We planted additional levels of noise to evaluate the closing options. This was performed by changing the values of specific elements by a randomly distant value (distance >25% of the domain range). The percentage of noisy elements was varied from 0 to 10%. We used the 1000×100 setting, Charm and a total of 10 items.
Finally, the use of filtering strategies can also lead to an enhanced ability to recover the planted biclusters. Although the filtering of biclusters with weak homogeneity impacts accuracy, this analysis targets the removal of rows and columns (on each bicluster) that do not satisfy a specific homogeneity threshold. Figure 24(b) illustrates the impact of removing potentially false rows and columns assuming a level of planted noise of 2%. The impact is only significant when considering a low-to-medium number of items, since in these cases filtering is able to correct the errors related to the large ranges of values per item that lead to false biclusters. Similarly to the merging option, the matching score increases relative to the baseline case (a homogeneity degree of 0%) for homogeneity thresholds up to 75%, where homogeneity is given by 1−MSR [62]. Beyond this upper threshold the match scores decrease, since the homogeneity criterion becomes too restrictive.
Results in real data
To assess the performance of BicPAM in real data, we compared the biological significance of BicPAM’s solutions against state-of-the-art biclustering solutions using three distinct gene expression datasets [74],[75]: 1) dlblc dataset (660 genes, 180 conditions) to study responses to chemotherapy [76], 2) hughes dataset (6300 genes, 300 conditions) to characterize nucleosome occupancy [77], and 3) gasch dataset (6152 genes, 176 conditions) to measure Yeast responses to environmental stimuli [78]. For the gasch dataset, we considered the multiple time points per condition and averaged the replicates of the steady state. The missing values were not removed since BicPAM can cope with them. For the state-of-the-art biclustering approaches, we maintained the parameterizations used in the previous section. In particular, pattern-based approaches were parameterized with multiple levels of expression ($|\mathcal{L}|\in\{4,\ldots,7\}$). BicPAM’s output includes constant, additive, multiplicative and symmetric biclusters, discovered under different closing options. The selected closing options were: merging (70% overlap); relaxed merging (55% overlap) with filtering of rows; and tight merging (90% overlap) with extensions on rows that appear in another bicluster sharing a minimum of 50% of the conditions. In what follows, we analyze the results focusing on the following three points: 1) functional enrichment, 2) transcriptional regulation, and 3) coherence.
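The effect of the overlap threshold on the merging options listed above can be illustrated with a small sketch. It is a simplification under stated assumptions: overlap is measured here as the shared area divided by the area of the smaller bicluster, merging takes the union of rows and columns, and a single greedy pass is used. BicPAM's tree-based procedure described earlier is not reproduced, and the function names are hypothetical.

```python
def overlap(b1, b2):
    """Shared area relative to the smaller bicluster (assumed definition)."""
    rows1, cols1 = b1
    rows2, cols2 = b2
    shared = len(rows1 & rows2) * len(cols1 & cols2)
    smaller = min(len(rows1) * len(cols1), len(rows2) * len(cols2))
    return shared / smaller

def merge_pass(biclusters, threshold):
    """Greedily merge each bicluster into the first one it overlaps enough with."""
    merged = []
    for rows, cols in biclusters:
        for k, (m_rows, m_cols) in enumerate(merged):
            if overlap((rows, cols), (m_rows, m_cols)) >= threshold:
                merged[k] = (m_rows | rows, m_cols | cols)
                break
        else:
            merged.append((rows, cols))
    return merged

b1 = (set(range(10)), {0, 1, 2, 3})
b2 = (set(range(9)) | {10}, {0, 1, 2, 3})    # shares 9 of 10 rows with b1
b3 = (set(range(20, 25)), {5, 6})            # disjoint from b1 and b2

relaxed_result = merge_pass([b1, b2, b3], threshold=0.70)
tight_result = merge_pass([b1, b2, b3], threshold=0.95)
print(len(relaxed_result), len(tight_result))
```

With a 70% threshold the two overlapping biclusters are combined into one with 11 rows, while a tighter threshold keeps them apart. Lower thresholds thus yield fewer and larger biclusters at the cost of admitting more noisy elements.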
Functional enrichment
The biological relevance of the biclusters from the different biclustering solutions was obtained using the Gene Ontology (GO) annotations computed by GoToolBox [79]. To discover the enriched GO terms, we computed the p-values obtained using the hypergeometric distribution to assess the over-representation of a specific term. In order to consider a bicluster to be significant, we require its genes to show enrichment in one or more of the “biological process” ontology terms by having a (Bonferroni corrected) p-value below 0.05.
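The over-representation test described above can be sketched with the upper tail of the hypergeometric distribution followed by a Bonferroni correction. The gene counts and the number of tested terms below are made-up values for illustration, and the function names are hypothetical; GoToolBox's own implementation is not reproduced.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population -- number of genes in the dataset
    annotated  -- genes in the dataset annotated with the GO term
    sample     -- genes in the bicluster
    hits       -- genes in the bicluster annotated with the GO term
    """
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / comb(population, sample)

def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical example: 6000 genes, 100 annotated with the term,
# a bicluster of 80 genes of which 10 carry the annotation.
p = enrichment_p_value(6000, 100, 80, 10)
corrected = bonferroni(p, n_tests=500)       # 500 tested terms is an assumption
print(p, corrected, corrected < 0.05)
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point product would hit for p-values as small as those reported for the yeast datasets.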
Table 3. Comparing the biological relevance and novelty of different biclustering solutions
Dataset | Approach | ♯Bics | Avg. ♯Genes × ♯Conds | ♯Bics sig. enriched | Coverage and exclusivity of enriched GO terms |
---|---|---|---|---|---|
dlblc (human genome) | BicPAM | 56 | 83 × 7 | 43 (77%) | Highest number of exclusively enriched terms (partial list in Table 4). |
dlblc (human genome) | BiModule | 322 | 62 × 4 | 79 (25%) | Absence of closing options leads to redundant and less significant terms. |
dlblc (human genome) | DeBi | 31 | 73 × 6 | 21 (68%) | Loss of relevant terms due to the inability to discover all maximal biclusters. |
dlblc (human genome) | CC | 10 | 41 × 33 | 5 (50%) | Exclusive bicluster related with circulatory & cardiovascular system development. |
dlblc (human genome) | ISA | 72 | 23 × 8 | 8 (11%) | Exclusive bicluster for extracellular structure organization and heparin binding. |
dlblc (human genome) | Plaid | 3 | 12 × 49 | 1 (33%) | Majority of genes modeled in a single background bicluster with general terms. |
dlblc (human genome) | Fabia | 10 | 79 × 35 | 6 (60%) | Small bicluster with superior enrichment of antigen binding functions. |
dlblc (human genome) | Bexpa | 10 | 16 × 87 | 2 (20%) | Small sets of genes supported by large number of conditions. |
dlblc (human genome) | Samba | 100 | 17 × 6 | 18 (18%) | Dedicated terms for antigen processing, peptide cross-linking and disassembly. |
dlblc (human genome) | OPSM | 12 | 128 × 5 | 5 (42%) | High variance of ♯ genes and ♯ conditions; some of the biclusters with low ♯ genes (coherency across high ♯ conditions) have exclusive significantly enriched terms. |
hughes (yeast genome) | BicPAM | 47 | 360 × 7 | 38 (81%) | Exclusive enriched terms due to flexible coherency and post-processing criteria. |
hughes (yeast genome) | BiModule | 219 | 285 × 4 | 43 (20%) | Terms with lower sig. than terms from noise-tolerant BicPAM solutions. |
hughes (yeast genome) | DeBi | 28 | 317 × 7 | 21 (75%) | Terms observed across very small sets of conditions (≤5) are not enriched. |
hughes (yeast genome) | CC | 10 | 228 × 58 | 6 (60%) | GO terms covered by BicPAM constant biclusters. |
hughes (yeast genome) | ISA | 8 | 120 × 4 | 5 (63%) | Small biclusters with exclusive significance GO terms: spindle pole and karyogamy. |
hughes (yeast genome) | Plaid | 8 | 78 × 39 | 3 (38%) | One bicluster with higher significance for fungal-type cell wall assembly. |
hughes (yeast genome) | Fabia | 10 | 210 × 49 | 5 (50%) | Higher significance observed for actin cortical patch and oxidoreductase GO-terms. |
hughes (yeast genome) | Bexpa | 72 | 42 × 49 | 1 (10%) | Low number of enriched terms (probably due to the low ♯ genes per bicluster). |
hughes (yeast genome) | Samba | 120 | 18 × 9 | 11 (9%) | Enriched terms covered by pattern-based biclustering solutions. |
hughes (yeast genome) | OPSM | 6 | 531 × 4 | 3 (50%) | Exclusive bicluster for the negative regulation of metabolic processes. |
gasch (yeast genome) | BicPAM | 149 | 411 × 8 | 123 (83%) | Large diversity of highly significant GO-terms (partial list in Table 4). |
gasch (yeast genome) | BiModule | 653 | 287 × 4 | 159 (24%) | Large but incomplete set of GO-terms as it excludes non-constant biclusters. |
gasch (yeast genome) | DeBi | 82 | 310 × 6 | 61 (74%) | Significance of terms differs slightly from BicPAM due to the handling of noise. |
gasch (yeast genome) | CC | 10 | 203 × 79 | 7 (70%) | Enriched terms appear in BicPAM solutions with higher significance. |
gasch (yeast genome) | ISA | 23 | 292 × 22 | 18 (78%) | Enriched terms covered by pattern-based biclustering solutions. |
gasch (yeast genome) | Plaid | 6 | 48 × 12 | 3 (50%) | Biclusters (apart from background layer) with lower enrichments than peers. |
gasch (yeast genome) | Fabia | 10 | 310 × 41 | 8 (80%) | Bicluster with higher sig. for specific proteasome complexes. |
gasch (yeast genome) | Bexpa | 10 | 63 × 29 | 3 (33%) | The few biclusters with deviation in size (higher ♯ genes) are significant. |
gasch (yeast genome) | OPSM | 16 | 212 × 8 | 11 (69%) | One bicluster with higher significance for pre-ribosome functions. |
Table 4. Summary of the biological relevance of BicPAM’s biclusters
Dataset | Closing option | ♯Bics | Avg. area | ♯Filtered bics | ♯Highly sig. bics | ♯Sig. bics |
---|---|---|---|---|---|---|
dlblc | merging | 4803 | 81 × 7 | 28 | 22 | 5 |
dlblc | relaxed merging + reductions | 980 | 83 × 9 | 24 | 19 | 3 |
dlblc | tight merging + extensions | 7652 | 79 × 6 | 27 | 25 | 2 |
hughes | merging | 6311 | 432 × 6 | 36 | 19 | 12 |
hughes | relaxed merging + reductions | 1259 | 492 × 7 | 22 | 12 | 8 |
hughes | tight merging + extensions | 9210 | 398 × 5 | 39 | 22 | 11 |
gasch | merging | 27031 | 392 × 8 | 89 | 66 | 12 |
gasch | relaxed merging + reductions | 2177 | 486 × 11 | 67 | 49 | 11 |
gasch | tight merging + extensions | 52123 | 367 × 7 | 92 | 79 | 9 |
Table 5. Terms highly enriched in BicPAM’s biclusters
Dataset | ID | Terms | Bicluster with best p-value | ♯Genes |
---|---|---|---|---|
dlblc | Dl1 | translational elongation; cytosolic part; translational initiation | 4.49E-5 | 81 |
dlblc | Dl2 | Golgi apparatus; MHC protein complex | 5.40E-5 | 83 |
dlblc | Dl3 | defense response; receptor activity; single organism signaling; vacuole; cell communication | 4.91E-5 | 162 |
dlblc | Dl4 | immune response; response to interferon-gamma | 1.06E-4 | 58 |
dlblc | Dl5 | immune system process | 1.27E-4 | 52 |
dlblc | Dl6 | response to interferon-gamma; cellular response to chemical stimulus; response to cytokine stimulus | 0.001 | 60 |
dlblc | Dl7 | membrane-enclosed lumen; cell division; cell cycle process | 2.92E-12 | 81 |
dlblc | Dl8 | small molecule binding; catalytic activity; cell cycle process | 6.14E-8 | 108 |
hughes | H1 | mitochondrion organization; organellar ribosome; mitochondrial matrix; mitochondrial translation | 2.70E-39 | 416 |
hughes | H2 | cell periphery; cell wall constituent; oxidoreductase activity; cell wall organization; sexual sporulation | 1.73E-4 | 370 |
hughes | H3 | ribonucleoprotein complex biogenesis; nucleus | 3.61E-30 | 426 |
hughes | H4 | cellular amino acid metabolic/biosynthetic process; carboxylic acid metabolic/biosynthetic process | 1.3E-25 | 581 |
hughes | H5 | organonitrogen compound metabolic process; sulfur compound metabolic process | 1.62E-4 | 504 |
hughes | H6 | macromolecular complex; intracell. non-membrane-bounded organelle; membrane-enclosed lumen | 4.80E-14 | 512 |
gasch | G1 | nitrogen compound metabolic proc.; carboxylic/organic amino acid processes; structural cytoskeleton | 1.84E-16 | 434 |
gasch | G2 | cellular carbohydrate metabolic process; cytoplasm | 2.01E-7 | 265 |
gasch | G3 | generation of precursor metabolites and energy; tricarboxylic acid cycle | 1.16E-14 | 954 |
gasch | G4 | endomembrane system; retrotransposon nucleocapsid; pore; viral procapsid maturation | 4.34E-6 | 102 |
gasch | G5 | nucleolus; ncRNA metabolic process | 1.03E-61 | 611 |
gasch | G6 | intracell. non-membrane-bounded organelle; structural molecule activity | 5.33E-76 | 293 |
gasch | G7 | cytosolic part; ribosomal subunit | 1.61E-88 | 460 |
gasch | G8 | membrane-enclosed lumen; nuclear lumen; intracell. organelle lumen | 1.17E-47 | 263 |
gasch | G9 | mitochondrion organization; mitochondrial part; cytoplasmic part; protein complex biogenesis | 2.06E-26 | 592 |
gasch | G10 | cellular response to oxidative stress; generation of precursor metabolites and energy | 2.37E-4 | 296 |
gasch | G11 | binding; nuclear part; preribosome | 2.87E-11 | 508 |
gasch | G12 | cellular process involved in reproduction | 0.001 | 435 |
gasch | G13 | macromolecular complex; cell part; structural molecule activity | 6.05E-29 | 1442 |
gasch | G14 | vacuolar transport; chromosome | 5.09E-7 | 606 |
gasch | G15 | regulation of cellular (macromolecule) biosynthetic process; protein modification process | 2.28E-13 | 1019 |
gasch | G16 | organic substance catabolic process; carbohydrate metabolic process; cytoplasm | 1.02E-15 | 648 |
gasch | G17 | ribonucleoprotein complex biogenesis (general) | 1.08E-94 | 784 |
Table 6. Illustrative set of biclusters with different properties and heightened biological relevance (p-values after Bonferroni correction)
Dataset | ID | Pattern | Items | Closing options |
---|---|---|---|---|
dlblc | B1 | FAABFFF | A-F | Merging with tight overlapping |
dlblc | B2 | AAABCA | A-C | Extensions allowed (with tight merging) |
dlblc | B3 | AAA/../EEE | A-E | Reducing with high homogeneity |
hughes | B4 | EEECEE | A-E | Merging allowed |
hughes | B5 | CCDCBCBCC | A-E | Merging with relaxed overlapping |
hughes | B6 | AAAAA/../G..G | A-G | Merging with tight overlapping |
gasch | B7 | AAAGGGA | A-G | Merging with tight overlapping |
gasch | B8 | AAABACCCAA | A-E | Merging allowed |

ID | Type | ♯Genes | ♯Conds | ♯p-values <0.01 | ♯p-values in [0.01,0.05] | Best p-value |
---|---|---|---|---|---|---|
B1 | constant | 83 | 7 | 41 | 21 | 1.97E-10 |
B2 | constant | 153 | 8 | 9 | 1 | 2.27E-12 |
B3 | multiplicative | 119 | 5 | 5 | 18 | 4.12E-8 |
B4 | constant | 581 | 6 | 12 | 7 | 1.31E-25 |
B5 | constant | 654 | 10 | 16 | 4 | 1.31E-17 |
B6 | additive | 476 | 6 | 12 | 10 | 1.92E-6 |
B7 | multiplicative | 483 | 7 | 57 | 10 | 1.24E-81 |
B8 | additive | 521 | 10 | 17 | 5 | 4.57E-12 |
Table 4 shows the number of biologically significant biclusters found by BicPAM when using closing strategies. In this analysis, a bicluster is considered to be highly significant if it has at least one enriched term with a corrected p-value below 0.01. To complement this analysis, Table 5 lists some of the most significant biological processes associated with these enriched terms for each data setting [80].
Table 6 shows an illustrative set of the found pattern-based biclusters with statistical relevance. Such biclusters could hardly be discovered by peer biclustering methods, since many of them include conditions with multiple degrees of expression (B1, B2 and B5) and non-constant profiles (B8). All of these biclusters have heightened biological significance as observed by the number of highly enriched terms after Bonferroni correction. Interestingly, we also observe that different closing options lead to biclusters with different shapes, even when the number of items is the same (B4 and B5).
Table 7. Enriched GO terms of three illustrative BicPAM biclusters
ID | Dataset | Top 4 GO Terms (p-value) |
---|---|---|
B1 | dlblc | Immune response (2.32E-10); immune system process, defense response (<1E-6); cytokine-mediated signaling pathway (1.33E-7); Golgi apparatus (1.19E-7). |
B4 | hughes | Carboxylic acid biosynthetic process (1.3E-25) and metabolic process (6.12E-16); organonitrogen compound biosynthetic process (2.23E-18) and metabolic process (2.71E-13). |
B7 | gasch | Ribonucleoprotein biogenesis and assembly (1.24E-81); cytosolic part (1.22E-57); intracell. non-membrane-bounded organelle (1.31E-65); ncRNA metabolic process (1.82E-52). |
Transcriptional regulation
To complement the results on functional enrichment, we analyzed the highly enriched transcription factors (TFs) using the TFCONES database [83] (human genome) and Yeastract database [84] (yeast genome) using a corrected hyper-geometric statistical test.
Consider the illustrative biclusters provided in Table 7. Some of the enriched transcription factors regulating the genes in bicluster B1 (associated with immune system responses in the human genome) include: the HCLS1 gene, which plays a key role in regulating clonal expansion and deletion in lymphoid cells [85]; the IRF1 protein, which acts as a tumor suppressor and plays a role not only in antagonism of tumor cell growth but also in stimulating an immune response against tumor cells [85]; and the TRIM22 antiviral protein, which is involved in cell innate immunity [83]. Other highly enriched TFs that regulate proliferation and transformation (tumor suppressors) are ANP32A and RUNX3 [85]. The TFs regulating the genes in bicluster B4 have p-values below 1E-15 after correction, each regulating from 50% to 95% of the genes in the bicluster. They are associated with regulatory functions consistent with the enriched terms. These TFs are involved in histidine biosynthesis (Bas1p), amino acid biosynthesis (Gcn4p) and cyclic AMP receptor protein regulation (Sok2p), and include other TFs related to the regulation of carboxylic acid and organonitrogen compounds [86]. Consider now bicluster B7 from gasch. Some of the enriched TFs include Sfp1p, Mga2p, Ace2p, Tup1p, Spt10p and Swi5p (p-values below 1E-15), each regulating 55%-97% of B7’s genes. These factors are known to be involved in stress responses as they regulate cooling and oxygen levels (Mga2p), repair cellular damage (Sfp1p and Spt10p), remodel chromatin (Tup1p) and regulate cell wall protection (Swi5p and Ace2p) [86]–[88]. Finally, consider bicluster B8, whose genes coherently regulate heat, nitrogen depletion and diauxic shifts. Sfp1p, Bas1p, Ste12p and Tec1p were the most significant TFs in this bicluster (p-values <1E-7). Sfp1p controls expression of ribosome biogenesis genes in response to stress and DNA-damage response [86]. Bas1p regulates gene expression for biosynthesis pathways such as pathways related to histidine metabolism, which responds to environmental stimuli (e.g. nitrogen) affecting pH calibration [86]. Finally, Ste12p and Tec1p act together to regulate genes related to invasive growth, whose production is expected under such stress conditions [86].
Table 8. Analysis of the TFs of the putative regulatory modules given by the BicPAM biclusters provided in Table 5 for the human genome (dlblc dataset) and the yeast genome (gasch dataset)
Dataset | Bic. ID (Table 5) | Highly enriched TFs |
---|---|---|
dlblc | Dl1 | BCL11A, LZTS1, GTF2I, HCLS1, HDAC1, MBD4, MEF2B, NCOA3, STAT6 |
dlblc | Dl2 | ANP32A, HCLS1, IRF1, MNDA, NCOA1, RUNX3, STAT1, TRIM22, TRIP10 |
dlblc | Dl3 | BCL3, TRIM22, ANP32A, ARID5B, CEBPB, CREG1, IRF1, PFDN5, STAT1 |
dlblc | Dl4 | ANP32A, IRF1, NCOA1, STAT1, TRIM22 |
dlblc | Dl5 | CREG1, IRF1, TRIM22, ANP32A, STAT1 |
dlblc | Dl6 | ANP32A, IRF1, NCOA1, STAT1, TRIM22 |
dlblc | Dl7 | BCL6, BCL6B, HIF1A, ILF2, POU2AF1, SERTAD1, TCF3 |
dlblc | Dl8 | DR1, DRAP1, HIF1A, ILF2, NCOA3, SERTAD1, TMF1, ZNFN1A1 |
gasch | G1 | Gcn4p, Sfp1p, Ace2p, Tec1p, Ste12p, Ash1p |
gasch | G2 | Sfp1p, Msn2p, Bas1p, Tec1p, Sok2p, Abf1p, Ash1p, Cst6p |
gasch | G3 | Sfp1p, Tec1p, Ste12p, Msn2p, Bas1p, Sok2p, Msn4p, Gcn4p |
gasch | G4 | Snf6p, Tec1p, Ste12p, Rap1p, Sin4p, Abf1p, Snf2p, Ash1p |
gasch | G5 | Sfp1p, Ace2p, Cst6p, Tup1p, Msn2p, Spt10p, Spt20p |
gasch | G6 | Hsf1p, Spt23p, Mga2p, Sfp1p, Spt10p, Msn2p, Gcr1p, Gcn4p |
gasch | G7 | Sfp1p, Swi5p, Tup1p, Spt10p, Spt20p, Gcr1p, Sin3p, Mga2p |
gasch | G8 | Sfp1p, Swi5p, Cst6p, Tup1p, Spt20p, Ash1p, Spt10p |
gasch | G9 | Yap1p, Ace2p, Sfp1p, Msn2p, Ash1p, Msn4p, Abf1p |
gasch | G10 | Sfp1p, Msn2p, Msn4p, Cst6p, Abf1p, Sok2p, Bas1p |
gasch | G11 | Snf6p, Tup1p, Snf2p, Cst6p, Sin4p, Rap1p, Swi3p, Hap2p |
gasch | G12 | Yap1p, Tec1p, Msn2p, Msn4p, Ste12p, Sok2p |
gasch | G13 | Snf6p, Tup1p, Abf1p, Snf2p, Cst6p, Sin4p |
gasch | G14 | Sfp1p, Tec1p, Ste12p, Bas1p, Sok2p, Yrm1p |
gasch | G15 | Ace2p, Sfp1p, Tec1p, Ste12p, Ash1p, Bas1p, Gcn4p, Sok2p |
gasch | G16 | Cin5p, Gcn4p, Msn4p, Sfp1p, Msn2p, Tec1p, Ste12p, Sok2p |
gasch | G17 | Sfp1p, Ace2p, Cst6p, Snf6p, Rap1p, Tup1p, Spt10p, Swi5p |
Consider the enriched TFs provided in Table 8 for a sample set with 8 distinct biclusters found by BicPAM in the dlblc dataset. Different groups of TFs were identified, each associated with a specific chemotherapy outcome. Some of the TFs acting as putative tumor suppressors include: ANP32A, LZTS1 (protein-coding gene silenced in rapidly metastasizing and metastatic tumor cells), RUNX3 (protein that binds to the core site of leukemia virus, also frequently silenced in cancer), HCLS1 (antigen receptor signaling and deletion in lymphoid cells), IRF1 (protein that stimulates immune responses and regulates tumor cell differentiation), HIF1A (gene responsible for tumor angiogenesis and the pathophysiology of ischemic disease), HDAC1 (complex interacting with retinoblastoma tumor-suppressor proteins), and TCF3 (protein regulating lymphopoiesis, as its deletion is associated with lymphoblastic and acute leukemia malignancies) [83],[85]. Other TFs that regulate cell proliferation include the STAT families, CREG1, MEF2B, ARID5B, and BCL3 [85]. Understandably, we also observed the B-cell lymphoma protein (BCL6 and its paralog coding gene BCL6B) and other leukemia-related disease genes involved in lymphoma pathogenesis, such as BCL11A [83]. Complementarily, immune responses are associated with TRIM22 antiviral proteins, CEBPB, the NFATC2 complex, and GTF2I, which activates immunoglobulin heavy-chain transcription upon B-lymphocyte activation [85].
Finally, consider the enriched TFs provided in Table 8 for a sample set with 17 distinct biclusters found by BicPAM in the gasch dataset. Since a large number of enriched TFs was identified, Table 8 only provides an illustrative set containing TFs regulating over 50% of the genes associated with each bicluster. Although the enriched TFs regulate very distinct processes (see Table 5), most TFs are activated in stress conditions, namely: Yap1p, Cin5p and Hap2p during oxidative stress [86]; Gcn4p, Msn2p and Msn4p during amino acid starvation [86]; Hsf1p during variable heat shock elements including hyperthermia [86]; Sfp1p during DNA damage [84]; and Spt23p and Mga2p during cooling [87]. The stress conditions are associated with invasive growth (regulated by Tec1p, Ste12p, Ash1p and Sok2p), and with the need for chromatin remodeling (regulated by Snf6p, Snf2p, Spt20p, Tup1p and Swi3p) and DNA repair (regulated for instance by Abf1p and Spt10p) [84],[86].
Coherence
Comparison of pattern-based biclustering approaches
In the previous sections, we provided substantial empirical evidence for the improvements in BicPAM's performance in comparison with peer pattern-based methods such as BiModule, DeBi and RAP. First, Figures 10 and 11 show the unique ability of BicPAM to discover non-constant biclusters (>50 percentage points in MS and FC against BiModule, DeBi and RAP). Second, Figure 12 shows improvements in the discovery of constant biclusters related with BicPAM's ability to deal with the items-boundary problem and to adequately postprocess biclustering solutions. Additionally, BicPAM's ability to combine solutions discovered under multiple levels of expression and to discover all the maximal biclusters (closed pattern representations) surpasses specific drawbacks found in some of the existing methods. Third, the incorporation of scalability principles and of minimalist FP-trees (Figure 20) guarantees its competitive computational complexity even when procedures to handle noise and adapt the biclustering structures are used. Fourth, Figures 22 to 24 show significant performance improvements of BicPAM due to its exclusive ability to deal with medium-to-high levels of missing values and noise. Finally, the biological relevance of BicPAM's solutions against the solutions provided by the peer methods is assessed in Table 4 and further supported in subsequent analyses. In particular, we show that BicPAM's solutions cover the (enriched) biological processes associated with peer pattern-based solutions (Table 6). Moreover, they enable the discovery of unique and biologically meaningful biclusters (Tables 5 and 6) such as the four illustrative biclusters in Figure 25.
Conclusion
A new approach for flexible and robust pattern-based biclustering (BicPAM) is proposed with the goal of performing exhaustive searches to discover biclustering solutions with multiple coherencies under relaxed conditions (arbitrary number and structure of biclusters) with heightened efficiency. BicPAM is the result of integrating existing dispersed contributions on pattern-based biclustering with new critical methods to deal with more flexible expression profiles and to handle varying levels of missing values and noise.
BicPAM goes beyond the constant assumption made by existing pattern-based approaches, and extends the biclustering task to new types of biclusters, including additive and multiplicative assumptions that can accommodate symmetries. It is the first attempt to model these coherencies under a pattern-based approach. This is critical since pattern-based searches are exhaustive, support flexible structures of biclusters, and consider multiple levels of expression (instead of differential expression).
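To make the additive and multiplicative assumptions concrete, the sketch below checks whether a small numeric submatrix is perfectly coherent under each model, taking the additive model to mean that every row equals a reference row plus a row-specific shift and the multiplicative model to mean that every row equals a reference row times a row-specific factor. The function names and the tolerance parameter are hypothetical, the symmetric variants are not treated specially, and this is a verification of a given submatrix, not BicPAM's mining procedure.

```python
def is_additive(rows, tol=1e-9):
    """True if every row equals the first row plus a row-specific constant."""
    base = rows[0]
    for row in rows[1:]:
        shift = row[0] - base[0]
        if any(abs((x - b) - shift) > tol for x, b in zip(row, base)):
            return False
    return True


def is_multiplicative(rows, tol=1e-9):
    """True if every row equals the first row times a row-specific factor."""
    base = rows[0]
    if any(b == 0 for b in base):
        raise ValueError("reference row must not contain zeros")
    for row in rows[1:]:
        scale = row[0] / base[0]
        if any(abs(x / b - scale) > tol for x, b in zip(row, base)):
            return False
    return True


# Rows differ by shifts of +2 and -1 from the first row.
additive_example = [[1, 3, 2], [3, 5, 4], [0, 2, 1]]
# Rows differ by factors of 2 and 0.5 from the first row.
multiplicative_example = [[1, 2, 4], [2, 4, 8], [0.5, 1, 2]]
```

A constant bicluster (identical rows) passes both checks, with shift 0 and factor 1, which is why these models generalize the constant assumption.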
Additionally, BicPAM is able to surpass the common drawbacks related with discretization procedures, since it is able to assign multiple items over a single element to tackle the items-boundary problem. In this way, the transactional database derived from the input matrix can have more items than the number of elements in the original matrix.
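The idea of assigning more than one item to an element can be sketched as follows. This is an illustrative assumption, not BicPAM's actual discretization: it uses equal-width bins over a known value range and a hypothetical `margin` parameter, and a value lying within that fraction of a bin boundary also receives the neighbouring symbol.

```python
def items_for_value(value, lo, hi, n_symbols, margin=0.1):
    """Return the symbol(s) assigned to one matrix element.

    Equal-width bins over [lo, hi] are assumed. A value within `margin`
    (as a fraction of the bin width) of a boundary also gets the symbol
    of the adjacent bin.
    """
    width = (hi - lo) / n_symbols
    pos = (value - lo) / width
    symbol = max(0, min(int(pos), n_symbols - 1))
    items = [symbol]
    frac = pos - symbol
    if frac < margin and symbol > 0:
        items.append(symbol - 1)
    elif frac > 1 - margin and symbol < n_symbols - 1:
        items.append(symbol + 1)
    return sorted(items)


def row_to_transaction(row, lo, hi, n_symbols, margin=0.1):
    """Encode one matrix row as a transaction of (column, symbol) items."""
    return [
        (col, sym)
        for col, value in enumerate(row)
        for sym in items_for_value(value, lo, hi, n_symbols, margin)
    ]


# The first value (1.95) sits close to the boundary at 2.0 and yields two items.
transaction = row_to_transaction([1.95, 2.5, 0.2], lo=0.0, hi=4.0, n_symbols=4)
```

In this example the three-element row produces a four-item transaction, which illustrates how the transactional database can hold more items than the original matrix has elements.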
BicPAM relies on dynamic parameterizations for a tuned performance across different settings, including pattern representations, strategies to handle missing values, and postprocessing options for the post-handling of noise and composition of flexible structures. Although the default options are dynamically derived based on the properties of the target dataset, they can also be defined by the user without the need to adapt the core mining task.
Results on both synthetic and real datasets show BicPAM's ability to find optimal solutions over matrices with more than 10,000 rows and up to 400 columns. The assessment of BicPAM's performance against peer pattern-based approaches and other state-of-the-art biclustering algorithms supports its heightened flexibility and robustness to noise. Additionally, we observed that the majority of the biclusters discovered by BicPAM in gene expression datasets are functionally relevant and could not be discovered by other biclustering approaches. The analysis of their transcriptional regulation showed significant and meaningful associations.
Software availability
The datasets and BicPAM executables are available at http://web.ist.utl.pt/rmch/software/bicpam/.
Endnote
^{a} Clustering metrics measure the ability to correctly group rows (or columns), that is, of attaining high intra-cluster similarity and low inter-cluster similarity. Entropy and F-measure metrics are the common choice [56],[57]. F-measure can be further decomposed in terms of recall (coverage of found rows by a hidden cluster) and precision (absence of rows present in other hidden clusters).
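A minimal sketch of the F-measure decomposition for one found set of rows against one hidden cluster is given below. It assumes the standard set-based definitions (precision as the fraction of found rows that belong to the hidden cluster, recall as the fraction of hidden rows that were found); the function name is hypothetical and the exact formulation used in [56],[57] may differ.

```python
def row_f_measure(found_rows, hidden_rows):
    """Return (precision, recall, F-measure) for two collections of row ids."""
    found, hidden = set(found_rows), set(hidden_rows)
    overlap = len(found & hidden)
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(hidden) if hidden else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Two of the four found rows belong to a hidden cluster of six rows.
precision, recall, f_measure = row_f_measure({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```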
Declarations
Acknowledgments
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under projects Pest-OE/EEI/LA0021/2013 and DataStorm (EXCL/EEI-ESS/0257/2012), and the doctoral grant SFRH/BD/75924/2011 to RH.
Authors’ Affiliations
References
- Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 2004, 1: 24-45. 10.1109/TCBB.2004.2.
- Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Göhlmann HWH, Shkedy Z, Clevert DA: FABIA: factor analysis for bicluster acquisition. Bioinformatics. 2010, 26 (12): 1520-1527. 10.1093/bioinformatics/btq227.
- Bebek G, Yang J: PathFinder: mining signal transduction pathway segments from protein-protein interaction networks. BMC Bioinformatics. 2007, 8: 335. 10.1186/1471-2105-8-335.
- Ding C, Zhang Y, Li T, Holbrook SR: Biclustering protein complex interactions with a biclique finding algorithm. ICDM. 2006, IEEE Computer Society, Washington, DC, USA, 178-187.
- Liu J, Wang W: OP-Cluster: clustering by tendency in high dimensional space. ICDM. 2003, IEEE Computer Society, Washington, DC, USA, 187.
- Lazzeroni L, Owen A: Plaid models for gene expression data. Statistica Sinica. 2002, 12: 61-86.
- Odibat O, Reddy C: A generalized framework for mining arbitrarily positioned overlapping co-clusters. SDM. 2011, SIAM, Arizona, USA, 343-354.
- Zhang L, Chen C, Bu J, Chen Z, Cai D, Han J: Locally discriminative coclustering. Knowl Data Eng IEEE Trans. 2012, 24 (6): 1025-1035. 10.1109/TKDE.2011.71.
- Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002, 18: 136-144. 10.1093/bioinformatics/18.suppl_1.S136.
- Serin A, Vingron M: DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol Biol. 2011, 6: 1-12. 10.1186/1748-7188-6-18.
- Okada Y, Okubo K, Horton P, Fujibuchi W: Exhaustive search method of gene expression modules and its application to human tissue data. IAENG IJ Comp Sci. 2007, 34: 119-126.
- Han J, Cheng H, Xin D, Yan X: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 2007, 15: 55-86. 10.1007/s10618-006-0059-1.
- Okada Y, Fujibuchi W, Horton P: A biclustering method for gene expression module discovery using closed itemset enumeration algorithm. IPSJ Transactions on Bioinformatics. 2007, 48 (SIG5): 39-48.
- Pandey G, Atluri G, Steinbach M, Myers CL, Kumar V: An association analysis approach to biclustering. KDD. 2009, ACM, New York, NY, USA, 677-686.
- Martinez R, Pasquier C, Pasquier N: GenMiner: mining informative association rules from genomic data. BIBM. 2007, IEEE CS, Silicon Valley, USA, 15-22.
- Yang J, Wang W, Wang H, Yu P: Delta-clusters: capturing subspace correlation in a large data set. In ICDE. San Jose, USA; 2002:517–528.
- Califano A, Stolovitzky G, Tu Y: Analysis of gene expression microarrays for phenotype classification. In Proc. Int. Conf. Intell. Syst. Mol. Biol. San Jose, USA; 2000:75–85.
- Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. In Pacific Symposium on Biocomputing. Lihue, Hawaii, USA; 2003:77–88.
- Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. RECOMB. 2002, ACM, New York, NY, USA, 49-57.
- Getz G, Levine E, Domany E: Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences. 2000, 97 (22): 12079-12084. 10.1073/pnas.210134797.
- Tang C, Zhang L, Ramanathan M, Zhang A: Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. BIBE. 2001, IEEE Computer Society, Washington, DC, USA, 41.
- Busygin S, Jacobsen G, Krämer E, Ag C: Double conjugated clustering applied to leukemia microarray data. ICDM IW on clustering high dimensional data. 2002, IEEE, Brussels, Belgium.
- Hartigan JA: Direct clustering of a data matrix. Journal of the American Statistical Association. 1972, 67 (337): 123-129. 10.1080/01621459.1972.10481214.
- Sheng Q, Moreau Y, Moor BD: Biclustering microarray data by Gibbs sampling. In ECCB. Paris, France; 2003:196–205.
- Wang H, Wang W, Yang J, Yu PS: Clustering by pattern similarity in large data sets. SIGMOD. 2002, ACM, New York, NY, USA, 394-405.
- Carmona-Saez P, Chagoyen M, Rodriguez A, Trelles O, Carazo J, Pascual-Montano A: Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics. 2006, 7: 1-16. 10.1186/1471-2105-7-1.
- Henriques R, Madeira SC: BiP: effective discovery of overlapping biclusters using flexible plaid models. BIOKDD, ACM SIGKDD. 2014, ACM, New York, NY, USA.
- Henriques R, Madeira S: BicSPAM: flexible biclustering using sequential patterns. BMC Bioinformatics. 2014, 15: 130. 10.1186/1471-2105-15-130.
- Agrawal R, Imieliński T, Swami A: Mining association rules between sets of items in large databases. SIGMOD Rec. 1993, 22 (2): 207-216. 10.1145/170036.170072.
- Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown G, Andrews B, Boone C, Kumar V, Myers C: Putting genetic interactions in context through a global modular decomposition. Genome Res. 2011, 21 (8): 1375-1387. 10.1101/gr.117176.110.
- Uno T, Kiyomi M, Arimura H: LCM ver.3: collaboration of array, bitmap and prefix tree for frequent itemset mining. OSDM. 2005, ACM, Chicago, Illinois, 77-86.
- Burdick D, Calimlim M, Gehrke J: MAFIA: a maximal frequent itemset algorithm for transactional databases. ICDE. 2001, IEEE CS, Heidelberg, Germany, 443-452.
- Pasquier N, Bastide Y, Taouil R, Lakhal L: Efficient mining of association rules using closed itemset lattices. Inf Syst. 1999, 24: 25-46. 10.1016/S0306-4379(99)00003-4.
- Mahfouz M, Ismail M: BIDENS: Iterative density based biclustering algorithm with application to gene expression analysis. World Acad. of Science, Eng. and Tech., Volume 37. 2009, WASET.org, Riverside, USA, 342-348.
- Alves R, Rodríguez-Baena DS, Aguilar-Ruiz JS: Gene association analysis: a survey of frequent pattern mining from gene expression data. Brief Bioinformatics. 2010, 11 (2): 210-224. 10.1093/bib/bbp042.
- Atluri G, Bellay J, Pandey G, Myers C, Kumar V: Discovering coherent value bicliques in genetic interaction data. In BIOKDD: ACM; 2000.
- Gupta R, Rao N, Kumar V: Discovery of error-tolerant biclusters from noisy gene expression data. BMC Bioinformatics. 2011, 12 (12): 1-17. 10.1186/1471-2105-12-S12-S1.
- Huang Y, Xiong H, Wu W, Sung SY: Mining quantitative maximal hyperclique patterns: a summary of results. Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, PAKDD’06. 2006, Heidelberg: Springer-Verlag, Berlin, 552-556.
- Steinbach M, Tan PN, Xiong H, Kumar V: Generalizing the notion of support. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04. 2004, ACM, New York, NY, USA, 689-694.
- Han EH, Karypis G, Kumar V: Min-apriori: An algorithm for finding association rules in data with continuous attributes. Department of Computer Science. University of Minnesota, Minneapolis 1997.
- Agrawal R, Srikant R: Fast algorithms for mining association rules in large databases. VLDB. 1994, Morgan Kaufmann, San Francisco, USA, 487-499.
- Han J, Pei J, Yin Y: Mining frequent patterns without candidate generation. SIGMOD Rec. 2000, 29 (2): 1-12. 10.1145/335191.335372.
- Zaki MJ, Gouda K: Fast vertical mining using diffsets. KDD. 2003, ACM, New York, NY, USA, 326-335.
- Henriques R, Madeira SC, Antunes C: F2G: efficient discovery of full-patterns. ECML/PKDD nfMCP. 2013, Springer Verlag, Prague.
- Zaki MJ, Hsiao CJ: Efficient algorithms for mining closed itemsets and their lattice structure. IEEE TKDE. 2005, 17 (4): 462-478.
- Pan F, Cong G, Tung AKH, Yang J, Zaki MJ: Carpenter: finding closed patterns in long biological datasets. KDD. 2003, ACM, Washington, DC, USA, 637-642.
- Pan F, Tung A, Cong G, Xu X: COBBLER: combining column and row enumeration for closed pattern discovery. SSDM. 2004, IEEE, Santorini Island, Greece, 21-30.
- de Souto M, de Araujo D, Costa I, Soares R, Ludermir T, Schliep A: Comparative study on normalization procedures for cluster analysis of gene expression datasets. IJCNN. 2008, IEEE, Hong Kong, China, 2792-2798.
- Xin D, Cheng H, Yan X, Han J: Extracting redundancy-aware top-k patterns. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06. 2006, ACM, New York, NY, USA, 444-453.
- Yan X, Cheng H, Han J, Xin D: Summarizing itemset patterns: a profile-based approach. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05. 2005, ACM, New York, NY, USA, 314-323.
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics. 2001, 17 (6): 520-525. 10.1093/bioinformatics/17.6.520.
- Donders A, van der Heijden G, Stijnen T, Moons K: Review: a gentle introduction to imputation of missing values. Clinical epidemiology. 2006, 59 (10): 1087-91. 10.1016/j.jclinepi.2006.01.014.
- Hellem T, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res. 2004, 32 (3): 34+. 10.1093/nar/gnh026.
- http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf (accessed 11 Nov 2014).
- Ramesh G, Maniatty WA, Zaki MJ: Feasible itemset distributions in data mining: theory and application. Symposium on Princ. of data. sys. 2003, ACM Press, San Diego, USA, 284-295.
- Assent I, Krieger R, Muller E, Seidl T: DUSC: dimensionality unbiased subspace clustering. In ICDM; 2007.
- Sequeira K, Zaki M: SCHISM: a new approach to interesting subspace mining. Int J Bus Intell Data Min. 2005, 1 (2): 137-160. 10.1504/IJBIDM.2005.008360.
- Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinf. 2006, 22 (9): 1122-1129. 10.1093/bioinformatics/btl060.
- Bozdağ D, Kumar AS, Catalyurek UV: Comparative analysis of biclustering algorithms. BCB. 2010, ACM, New York, NY, USA, 265-274.
- Patrikainen A, Meila M: Comparing subspace clusterings. IEEE TKDE. 2006, 18 (7): 902-916.
- Munkres J: Algorithms for the assignment and transportation problems. Soc Ind Appl Math. 1957, 5: 32-38. 10.1137/0105003.
- Cheng Y, Church GM: Biclustering of expression data. In Intelligent Systems for Molecular Biology: AAAI Press; 2000:93–103.
- Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with FuncAssociate. Bioinformatics. 2003, 19: 2502-2504. 10.1093/bioinformatics/btg363.
- http://www.bioinf.jku.at/software/fabia/benchmark.html.
- http://web.ist.utl.pt/rmch/software/bicpam/.
- Pontes B, Giráldez R, Aguilar-Ruiz JS: Configurable pattern-based evolutionary biclustering of gene expression data. Algorithms Mol Biol. 2013, 8: 4. 10.1186/1748-7188-8-4.
- Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004, 20 (13): 1993-2003. 10.1093/bioinformatics/bth166.
- http://www.bioinf.jku.at/software/fabia/fabia.html.
- http://cran.r-project.org/web/packages/biclust.
- Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox. Bioinformatics. 2006, 22 (10): 1282-1283. 10.1093/bioinformatics/btl099.
- http://acgt.cs.tau.ac.il/expander.
- http://www.philippe-fournier-viger.com/spmf/.
- Madeira S, Teixeira MNPC, Sá-Correia I, Oliveira A: Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinformatics. 2010, 1: 153-165. 10.1109/TCBB.2008.34.
- http://www.bioinf.jku.at/software/fabia/gene_expression.html.
- http://chemogenomics.stanford.edu/supplements/03nuc/datasets.html.
- Rosenwald A, dlblc team: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002, 346 (25): 1937-1947. 10.1056/NEJMoa012914.
- Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C: A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet. 2007, 39 (10): 1235-1244. 10.1038/ng2117.
- Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11 (12): 4241-4257. 10.1091/mbc.11.12.4241.
- Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol. 2004, 12: 101. 10.1186/gb-2004-5-12-r101.
- http://web.ist.utl.pt/rmch/software/bicpam/.
- Wlodkowic D, Skommer J, McGuinness D, Hillier C, Darzynkiewicz Z: ER–Golgi network–A future target for anti-cancer therapy. Leuk Res. 2009, 33 (11): 1440-1447. 10.1016/j.leukres.2009.05.025.
- Bracken AP, Bond U: Reassembly and protection of small nuclear ribonucleoprotein particles by heat shock proteins in yeast cells. Rna. 1999, 5 (12): 1586-1596. 10.1017/S1355838299991203.
- Lee AP, Yang Y, Brenner S, Venkatesh B: TFCONES: a database of vertebrate transcription factor-encoding genes and their associated conserved noncoding elements. BMC Genomics. 2007, 8: 441. 10.1186/1471-2164-8-441.
- Teixeira M, Monteiro P, Guerreiro J, Gonçalves J, Mira N, dos Santos S, Cabrito T, Palma M, Costa C, Francisco A, Madeira S, Oliveira A, Freitas A, Sá-Correia I: The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res 2014. (Database issue).
- Safran M, Dalah I, Alexander J, Rosen N, Stein TI, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, et al: GeneCards Version 3: the human gene integrator. Database. 2010, 2010: baq020. 10.1093/database/baq020.
- Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, et al: Saccharomyces genome Database: the genomics resource of budding yeast. Nucleic Acids Res 2011:gkr1029.
- Nakagawa Y, Sakumoto N, Kaneko Y, Harashima S: Mga2p is a putative sensor for low temperature and oxygen to induce ole1 transcription in saccharomyces cerevisiae. Biochem Biophys Res Commun. 2002, 291 (3): 707-713. 10.1006/bbrc.2002.6507.
- Doolin MT, Johnson AL, Johnston LH, Butler G: Overlapping and distinct roles of the duplicated yeast transcription factors Ace2p and Swi5p. Mol Microbiol. 2001, 40 (2): 422-432. 10.1046/j.1365-2958.2001.02388.x.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.