GrpClassifierEC: a novel classification approach based on the ensemble clustering space

Background Advances in molecular biology have resulted in large and complicated data sets; therefore, a clustering approach that is able to capture the actual structure and the hidden patterns of the data is required. Moreover, the geometric space may not reflect the actual similarity between the different objects. As a result, in this research we use a clustering-based space that converts the geometric space of the molecular data to a categorical space based on clustering results. We then use this space to develop a new classification algorithm. Results In this study, we propose a new classification method named GrpClassifierEC that replaces the given data space with a categorical space based on ensemble clustering (EC). The EC space is defined by tracking the membership of the points over multiple runs of clustering algorithms. Different points that were included in the same clusters are represented as a single point, and our algorithm assigns all of these points to a single class. The similarity between two objects is defined as the number of times that these objects did not belong to the same cluster. To evaluate the suggested method, we compare its results to the k-nearest neighbors, decision tree and random forest classification algorithms on several benchmark datasets. The results confirm that the suggested new algorithm GrpClassifierEC outperforms the other algorithms. Conclusions Our algorithm can be integrated with many other algorithms. In this research, we use only the k-means clustering algorithm with different k values. In future research, we propose several directions: (1) checking the effect of the clustering algorithm used to build the ensemble clustering space, (2) finding poor clustering results based on the training data, and (3) reducing the volume of the data by combining similar points based on the EC. Availability and implementation The KNIME workflow implementing GrpClassifierEC is available at https://malikyousef.com


Background
Clustering has a broad range of applications in the life sciences and has been used for many years in many fields, from clinical information analysis to phylogeny, genomics and proteomics. The aim of clustering biological data is to uncover the natural structure of the data and find important patterns within the data. Advances in molecular biology have resulted in large and complicated data sets, making clustering vital for information understanding and visualization. In addition, clustering can be a strong method for defining the relationship between different samples (points) [1].
A clustering ensemble attempts to combine many clustering models to produce a result with better consistency and reliability than that of individual clustering algorithms [2]. Cluster ensembles have been shown to be better than any standard clustering algorithm at improving accuracy and robustness across different data collections [3]. However, for clarification purposes, in this study we introduce the ensemble cluster (EC) algorithm, which is different from the known clustering ensemble (CE). The main assumption in this research is that points belonging to the same cluster are more similar to each other than to points from other clusters, even when the latter are closer in Euclidean distance. This is because the clustering algorithms take into account both the geometric space and other statistical parameters.

Open Access. Algorithms for Molecular Biology. Abdallah and Yousef, Algorithms Mol Biol (2020) 15:3. *Correspondence: malik.yousef@gmail.com. The Department of Community Information Systems, Zefat Academic College, 13206 Zefat, Israel.
In this research, the EC transformation algorithm runs a clustering algorithm (or multiple algorithms) several times with different parameter values, where each run produces one categorical dimension (feature) of the new categorical data. For example, running k-means with different values of k, k = 1,…,50, will generate new categorical data with 50 dimensions.
Our current research presents a novel classification model that is based on the Ensemble Cluster (EC) space. The EC space is generated by the EC transformation algorithm (see Algorithm 1 and Fig. 2), which is applied to given data to generate categorical data using one or more clustering algorithms.
For example, for a given point from the original data $X = (x_1, \ldots, x_n)$ with n features, applying the EC transformation using k-means over k = 1,…,50 will generate a new point $X^* = (c_1, \ldots, c_{50})$ in the new categorical space with 50 categorical values. The value of each $c_i$ indicates the cluster label that was assigned to the point in the i-th iteration, $i \in \{1, \ldots, 50\}$. Additionally, we can define a boolean identity function id() over the EC space between two categorical points $X^* = (c_1, \ldots, c_{50})$ and $Y^* = (d_1, \ldots, d_{50})$:

$$id(X^*, Y^*) = \begin{cases} \text{True} & \text{if } c_i = d_i \text{ for all } i \in \{1, \ldots, 50\} \\ \text{False} & \text{otherwise} \end{cases}$$

In other words, two points in the EC space are identical if they were assigned to the same clusters over all the iterations (k = 1,…,50). All the points that fall in the same cluster in every clustering run constitute a single group and are represented by a single point. Our algorithm classifies only the representatives, and all the group members receive the same class label.
In general, one could use any clustering algorithm or a combination of algorithms. However, in our experiments, we use the k-means clustering algorithm with different k values. We have chosen k-means as a first step; as future work, we will examine different algorithms and different combinations to assess their impact on the performance of the algorithm. K-means was chosen for a couple of reasons: first, it is a well-known clustering algorithm; second, we can specify the number of clusters, which is an essential part of our algorithm, and the differentiation between the different k values is large. Interestingly, in our experiments, we observe that not only does the number of data points (size) decrease, but the number of generated (categorical) features also decreases. This reduction is different from traditional feature reduction, which eliminates some of the unneeded features.
Combining clusterings is a more challenging task than combining supervised classifications. Topchy et al. [4] and Strehl et al. [5] addressed this issue by formulating consensus functions that avoid an explicit solution to the correspondence problem. Recent studies have demonstrated that consensus clustering can be found using graph-based, statistical or information-theoretic methods without explicitly solving the label correspondence problem, as mentioned in [6]. Other empirical consensus functions were also considered in [7][8][9].
A clustering-based learning method was proposed in [10]. In that study, several clustering algorithms are run to generate several (unsupervised) models. The learner then utilizes the labeled data to guess labels for entire clusters (assuming that all points in the same cluster have the same label). In this way, the algorithm forms a number of hypotheses. The one that minimizes the PAC-Bayesian bound is chosen and used as the classifier. The authors assume that at least one of the clustering runs will produce a good classifier and that their algorithm will find it.
Clustering ensemble algorithms were also applied to semi-supervised classification [11,12], based on the hypothesis that for noisy data they more accurately reflect the actual similarity between different objects. These studies propose a Co-association Matrix (CM) based on the outputs of different clustering algorithms and use it as a similarity matrix in the regularization framework. Berikov et al. [13] use the same idea in a semi-supervised regression method. They combine graph Laplacian regularization and cluster ensemble methodologies. To accelerate the calculation, they apply a low-rank decomposition of the CM.
Our method differs from these previously published studies. We assume that the groups, which are built from identical points in the categorical space, are relatively pure (i.e., all the points belonging to the same group have the same class).
Abdallah et al. [14,15] developed a distance function based on ensemble clustering, used it within the framework of the k-nearest neighbor classifier, and then improved the selective sampling of unlabeled data to be labeled by an expert. Additionally, Abdallah and Yousef [16] integrated EC within the Decision Tree, K-Nearest Neighbors, and Random Forest classifiers. The results obtained by applying EC on 10 datasets confirmed the hypothesis that embedding the EC space would improve the performance and reduce the feature space dramatically. However, in this research we do not integrate EC with existing algorithms; instead, we suggest a novel classification method based on the categorical space obtained as a result of EC. A recent study by Yousef et al. [17] used EC classification, comparing it to two-class SVM and one-class classifiers applied on sequence plant microRNA data. The results show that K-Nearest Neighbors-EC (KNN-ECC) outperforms all other methods. The results emphasize that the EC procedure contributes to building a stronger model for classification.
In this study we introduce a novel algorithm called GrpClassifierEC that is based on the EC transformation space. Several experiments were conducted in order to evaluate the performance of GrpClassifierEC. We tested it over 10 biological datasets and compared its results to the k-nearest neighbors, decision trees and random forest classification algorithms. The results show that the new algorithm GrpClassifierEC using the ensemble clustering was superior and outperformed the other baseline algorithms on most of the datasets.

The ensemble clustering transformation to categorical space
This section describes the ensemble clustering (EC) transformation that transforms the original data from its original feature space to a categorical space, as illustrated in Fig. 2. The basic algorithm assumes that points belonging to the same cluster are more similar than points that fall in different clusters. In the real world, this assumption may not always hold, as illustrated in the example presented in Fig. 1. In this example, the data includes two classes (circles and diamonds). If we cluster the data into two clusters, then the left cluster will include points from both classes, while the right one will still have all its points from the same class.
Consequently, we decided to run the clustering algorithm several times. Points belonging to the same cluster in all of the multiple runs are considered identical points and define a group that will be classified to the same class.
Let D be a set of labeled points used as training data, and A a set of unlabeled data. First, the GrpClassifierEC algorithm creates a new dataset E that combines D and A (i.e., E = D ∪ A). Then GrpClassifierEC runs the k-means clustering algorithm several times with different values of k (we denote the largest value of k by nmc, the number of clusters) and creates the clustering matrix cMat. cMat is a matrix whose i-th row consists of the clustering results of the i-th point in E. See Table 1 for an example of cMat with 20 points and 10 categorical features. The first column is the result of running k-means with k = 2, while the last column is the result of running k-means with k = 11. The values are the indices of the clusters assigned by k-means. We record the results starting from k = 2.
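The construction of cMat can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the KNIME workflow: the helper names `kmeans_labels` and `ec_transform` are hypothetical, the k-means routine is a plain Lloyd's iteration with random initial centers, and the seed and iteration count are arbitrary choices.

```python
import random

def kmeans_labels(points, k, seed=0, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def ec_transform(points, nmc):
    """Build cMat: row i holds the cluster label of point i for k = 2..nmc."""
    columns = [kmeans_labels(points, k, seed=k) for k in range(2, nmc + 1)]
    return [tuple(f"c{col[i]}" for col in columns) for i in range(len(points))]

# Toy data set E (hypothetical values): 6 points with 2 numeric features.
E = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.2), (9.1, 0.0)]
cMat = ec_transform(E, nmc=4)  # 6 rows, nmc - 1 = 3 categorical features
```

Each row of `cMat` is a tuple of categorical labels, so rows can be compared for equality directly when building groups.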
Applying the EC transformation on $x_i \in E$ will create a new point $x_i^* \in cMat$ with categorical values. The dimension of $x_i^*$ is nmc − 1 (one feature for each k = 2,…,nmc). Therefore, applying the EC transformation to the whole data will generate new categorical data (EC data) that consists of l points with nmc − 1 categorical features.

Table 1 EC space for 20 points and a number of clusters (nmc) of 11. The first column is the point name, the second column is the result of k-means assigning each point to one of two clusters (c0 and c1), the third column is the result of k-means assigning each point to one of 3 clusters, etc.

The new dimension nmc − 1 is usually much smaller than the original data dimension (nmc − 1 ≪ N in Fig. 2). More interestingly, the number of EC data points can also be reduced, as the new EC data contains identical points. We explain this in more detail in the section "Reduction of the data". Points that share the same clusters over all iterations of k-means are represented by the same point in cMat; such points are considered to be one point, and all the identical points together define a group. For example, in Table 1, point 11, point 12 and point 20 have the same categorical values. This means that the vector that represents those 3 points is g = (c0, c2, c2, c2, c4, c5, c6, c5, c5, c4). As a result, we consider those 3 points as a single point g, which we refer to as a unique point. In other words, each group is represented by one unique point.
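The grouping of identical EC points can be sketched in Python as follows. This is only an illustration: the function name `build_groups` and the toy cMat values are hypothetical and are not taken from Table 1.

```python
from collections import defaultdict

def build_groups(cmat):
    """Map each unique categorical row to the indices of the points sharing it."""
    groups = defaultdict(list)
    for idx, row in enumerate(cmat):
        groups[tuple(row)].append(idx)
    return dict(groups)

# Toy cMat with 3 categorical features (hypothetical values).
cmat = [
    ("c0", "c2", "c1"),
    ("c1", "c0", "c3"),
    ("c0", "c2", "c1"),
    ("c1", "c1", "c3"),
    ("c0", "c2", "c1"),
]
groups = build_groups(cmat)

# Group sizes and the fraction of points kept after reduction to unique points.
sizes = sorted(len(members) for members in groups.values())
ratio = len(groups) / len(cmat)
```

In this toy example, points 0, 2 and 4 collapse into one unique point, so 5 points are reduced to 3 groups.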
Note that the set E contains labeled and unlabeled points, and as a result, the groups may contain labeled and unlabeled points. Generally, there are three possible cases for the identical points in the same group:
1. All the labeled points have the same class label; the unlabeled points will be classified with this label.
2. The labeled points have different class labels; here the group's points will be classified as the majority class.
3. None of the points is labeled; in this case, the group is an unclassified group, and the algorithm classifies it based on the nearest labeled group.
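The three cases can be sketched in Python as follows. The text does not specify how the nearest labeled group is measured, so this sketch assumes a Hamming distance in the EC space (the number of clustering runs in which two unique points received different labels); the names `hamming` and `label_groups` are hypothetical.

```python
from collections import Counter

def hamming(u, v):
    """Number of clustering runs in which two EC points received different labels."""
    return sum(a != b for a, b in zip(u, v))

def label_groups(groups, labels):
    """groups: {unique EC row: [point indices]}; labels: {point index: class}."""
    group_label = {}
    for row, members in groups.items():
        known = [labels[i] for i in members if i in labels]
        if known:
            # Cases 1 and 2: a single shared label, or the majority label.
            group_label[row] = Counter(known).most_common(1)[0][0]
    labeled_rows = list(group_label)
    for row in groups:
        if row not in group_label:
            # Case 3 (assumed distance): borrow the label of the nearest labeled group.
            nearest = min(labeled_rows, key=lambda r: hamming(row, r))
            group_label[row] = group_label[nearest]
    return group_label

# Toy example (hypothetical values): point 5 forms a group with no labeled member.
groups = {("c0", "c1"): [0, 1, 2], ("c1", "c1"): [3, 4], ("c1", "c2"): [5]}
labels = {0: "pos", 1: "pos", 2: "neg", 3: "neg"}
result = label_groups(groups, labels)
```

The sketch assumes that at least one group contains a labeled point; otherwise case 3 has nothing to borrow from.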
To this end, we define a purity measurement for a given group in order to evaluate the purity of the grouping process. The purity measurement is based mainly on the probabilities of the labeled objects as follows:

$$\mathrm{purity}(g_i) = \sum_{j=1}^{\#classes} p_j^2$$

where $g_i$ denotes group i, represented by vector $g_i$ in the matrix G, #classes denotes the number of classes in $g_i$, and $p_j$ denotes the probability of class j in group i. As can be seen, $\mathrm{purity}(g_i)$ equals 1 when the group is pure and $1/\#classes$ at the lowest purity, a value that decreases as the number of classes increases.
The k-means algorithm is known to have a time complexity of $O(n^2)$, where n is the input data size. The complexity of the EC transformation is then $O(k \cdot n^2)$, where k is the number of times we run k-means. In fact, this is the most computationally expensive part of the GrpClassifierEC algorithm.
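As a small worked example, the purity of a group can be computed in Python. The sketch assumes that purity is the sum of the squared class probabilities, which is consistent with the extremes stated in the text (1 for a pure group and 1/#classes when the classes are equally frequent); the function name `purity` is illustrative.

```python
from collections import Counter

def purity(labels_in_group):
    """Sum of squared class probabilities among the labeled points of a group."""
    counts = Counter(labels_in_group)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

pure = purity(["pos", "pos", "pos"])          # a single class
two_way = purity(["pos", "neg"])              # two equally frequent classes
three_way = purity(["a", "b", "c"])           # three equally frequent classes
```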

GrpClassifierEC-ensemble clustering based classifier
The GrpClassifierEC pseudocode is presented in Algorithm 2. The input to the classifier is the cMat matrix generated by the EC transformation described in Algorithm 1. The first step of GrpClassifierEC is creating the groups extracted from cMat: groups = {group_i}, i = 1,…,s, where s is the number of groups. The number of groups is influenced by nmc, the number of iterations for which we run k-means. For instance, if we run k-means with nmc = 1, then all the points will be assigned to one cluster, which means that we have just one group that contains all the data points. As seen from Table 2, for the data Cercopithecidae vs Malvacea we have 449 groups with nmc = 30, while for the same data with nmc = 50 we have 593 groups (in Table 3, #EC_Samples is equal to the number of groups). The number of groups increases as nmc increases and might reach the number of points in the data, which means that each group will host a single point in the categorical space.
Groups can have different sizes (the size of a group is the number of categorical points that belong to it). As seen from Table 2, a group can have just one point; in fact, we see 305 different groups (unique points) with size 1 and 68 groups (unique points) with size 2. We also see one group with size 31, which is the maximum size in this specific data.

Table 2 The data Cercopithecidae vs Malvacea with k = 30. The total number of points is 894, which is the sum of column #Points. The total number of unique points is the sum of column "Unique Points", which is 449. #Points is the product of Size and Unique Points. Ratio Unique Points is #Unique Points/Total #Points, while Ratio All is #Points/Total #Points.

Following the step of creating the groups, we suggest our novel approach for classification: randomly selecting one point from each group. The label of the selected point becomes the label of all points that belong to the group. The process of selecting a random point and assigning its label to its group is repeated r times. The GrpClassifierEC classifier produces a list named prd_set that contains the prediction results. Then, in order to calculate the performance, we run a scorer function. The scorer function compares the assigned label and the original label of each point in order to obtain the confusion matrix. Accuracy statistics such as True-Positives, False-Positives, True-Negatives, False-Negatives, Recall, Precision, Sensitivity, Specificity, F-measure, as well as the overall accuracy and Cohen's kappa, are calculated. Table 2 shows the output of the EC procedure with k = 30 applied on the data Cercopithecidae vs Malvacea, which contains 894 examples (points). The table also shows that the EC data has 449 unique points or groups, a 50% reduction in the size of the original data (449/894 = 0.5).
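The random-selection step and the scorer can be sketched in Python as follows. This is a hedged illustration rather than Algorithm 2 itself: it assumes that the random point is drawn from the labeled (training) members of each group, that groups without labeled members are skipped, and that the r repetitions are summarized by their mean accuracy. The names `grp_classify` and `mean_accuracy` are hypothetical; only `prd_set` comes from the text.

```python
import random

def grp_classify(groups, train_labels, r=10, seed=0):
    """Return prd_set: one {point index: predicted label} dict per repetition."""
    rng = random.Random(seed)
    prd_set = []
    for _ in range(r):
        predictions = {}
        for members in groups.values():
            labeled = [i for i in members if i in train_labels]
            if not labeled:
                continue  # groups without labeled points are left out of this sketch
            chosen = rng.choice(labeled)
            # The label of the selected point is given to every group member.
            for i in members:
                predictions[i] = train_labels[chosen]
        prd_set.append(predictions)
    return prd_set

def mean_accuracy(prd_set, true_labels):
    """Average, over repetitions, of the fraction of predicted points labeled correctly."""
    scores = []
    for predictions in prd_set:
        hits = sum(predictions[i] == true_labels[i] for i in predictions)
        scores.append(hits / len(predictions))
    return sum(scores) / len(scores)

# Toy example (hypothetical values): two groups, one labeled point in each.
groups = {("c0",): [0, 1, 2], ("c1",): [3, 4]}
train_labels = {0: "pos", 3: "neg"}
true_labels = {0: "pos", 1: "pos", 2: "pos", 3: "neg", 4: "pos"}
prd_set = grp_classify(groups, train_labels, r=5)
acc = mean_accuracy(prd_set, true_labels)
```

In the toy example, point 4 shares a group with a negative training point but is actually positive, which illustrates how an impure group costs accuracy.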

Reduction of the data
For each group (unique point), we measure its size, which is equal to the number of times this unique point appears in the EC data. For example, in Table 2, we have 305 unique points with size 1. All these points appear once in the new data space. In addition, we have 68 unique points, each of which appears twice in the data and therefore has size 2. There are 22 unique points with size 3; each of these 22 unique points appears 3 times in the data. Note that the labels are not included in the EC data. This means that the points of a group in the EC space can have different labels associated with the original points and still share the same group. Figure 3 shows the distribution of the group size for nmc = 30 and nmc = 50, and clearly indicates that as nmc increases, the number of groups with size 1 also increases. The expectation is that, as we increase the value of nmc, the number of groups of size 1 approaches the number of original points. In other words, each point will be hosted in its own cluster. This raises a scientific question: what is the optimal value of nmc that will improve the performance of the classifier, or more specifically, capture the nature of the data in terms of clusters? Answering this question requires additional future research.

Experiments on numerical datasets
To evaluate the performance of the new classifier GrpClassifierEC, we compared its results to the k-nearest neighbors, decision trees and random forest classification algorithms. We tested it over 10 biological datasets and compared the performance of each algorithm. The results show that the new algorithm using the ensemble clustering was superior and outperformed the other baseline algorithms on most of the datasets.

Table 4 The list of clades used in the study. The first column represents the name of the clade, the second column the number of precursors available on miRBase, and the third column the number of precursors after preprocessing the data.

Datasets
The data consists of microRNA precursor sequences, and each sequence is made up of 4 nucleotide letters {A, U, C, G}. The length of each precursor sequence is about 70 nucleotides. The source of this data is miRBase [18]. Part of the data we used was taken from other studies [19][20][21], including our own study [16]. One simple way of representing sequences that consist of 4 nucleotide letters is by employing the k-mer frequency. The k-mer counts in a given sequence were normalized by the length of the sequence.
Our features include k-mer frequencies, other distance features that were recently suggested by Yousef et al. [19], and secondary features suggested by [22]. Many additional features describing pre-miRNAs have also been proposed [23] and are included in the feature set, which numbers 1038 features.
The main data consists of information from 15 clades (Table 4). The Homo sapiens sequences were taken out of the data of its clade, Hominidae. Homologous sequences were removed from the dataset and only one representative was kept. Each clade can serve as positive examples or as negative examples. Considering all the different combinations of pairs of clades (positive/negative), it is possible to generate 256 datasets. We selected 10 datasets at random, presented in Table 5.
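The k-mer frequency representation can be sketched in Python as follows. The function name `kmer_frequencies` and the example sequence are hypothetical; the normalization divides each k-mer count by the sequence length, as described above.

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Count every k-mer over {A, U, C, G} and divide by the sequence length."""
    counts = {"".join(kmer): 0 for kmer in product("AUCG", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with unexpected characters
            counts[kmer] += 1
    return {kmer: n / len(seq) for kmer, n in counts.items()}

# Toy sequence (hypothetical); real precursors are about 70 nucleotides long.
freqs = kmer_frequencies("AUGAUG", k=2)
```

For k = 2 this yields 16 features per sequence, one for each possible dinucleotide, including those that never occur.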

Implementation
We have implemented GrpClassifierEC in KNIME [24]. We decided to use the free and open-source platform KNIME due to its simplicity and very useful graphical presentations. Additionally, KNIME is a highly integrative tool. The KNIME workflow consists of two parts; the first part performs the EC transformation as described in Algorithm 1. This part is time consuming: for example, it took 13 min to generate the EC matrix for an input file that consists of 1038 features and 1068 points. The run was performed on a laptop with an Intel® Core™ i7 7600U CPU @2.80 GHz 2.90 GHz with 16 GB RAM.

Model performance evaluation
We tested different numbers of EC clusters using the k-means clustering algorithm with nmc values from 10 to 50. For each level, we performed 100 iterations with equal sample size, and then calculated the mean of each of the performance measurements described below. For each established model, we calculated a number of performance measures for the evaluation of the classifier, such as sensitivity, specificity, and accuracy, according to the following formulas (TP: True Positive, FP: False Positive, TN: True Negative, and FN: False Negative classifications):

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
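The three measures named above can be computed from the confusion-matrix counts using their standard definitions, as in the following Python sketch. The function name `performance` and the example counts are hypothetical.

```python
def performance(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a single run on 100 test points.
scores = performance(tp=40, fp=10, tn=45, fn=5)
```

In the evaluation protocol described above, such scores would be computed for each of the 100 iterations and then averaged.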

Results and discussion
We also conducted a study comparing the new classifier GrpClassifierEC with the other known classifiers such as k-nearest neighbors, decision trees and random forest classifiers. The results are presented in Table 3. The results clearly show that the performance of the suggested classifier GrpClassifierEC was superior. Figure 4 shows the performance of different classifiers at different levels of training percentage of the data. The results of EC refer to our own GrpClassifierEC classifier. We see that the performance is not significantly influenced by the size of the training part for the other classifiers while it does increase significantly for the GrpClassifierEC classifier, at the 39% level. In addition, performance can be improved significantly if the training part is increased, as a function of the value of k in the EC transformation.
In terms of data reduction, Tables 3 and 6 demonstrate that the data are reduced to about 56% of the points in the EC space with a k value of 49 and to 39% in the EC space with a k value of 30. The results demonstrate the advantage of our approach in reducing the size of the data when dealing with big data. Tables 3 and 6 show the results of a comparison of the EC classifier with other classifiers applied on the whole feature space (named Regular Classifiers), and the performance of random forest applied on the EC categorical data (EC-RF). Table 3 presents results with a k value of 49, while Table 6 presents results with a k value of 30. Interestingly, the EC classifier outperforms all the other approaches while using just 56% of the data on average (see the ratio column), while the regular classifiers use 80% of the data for training. The EC classifier outperforms the standard approaches by 9% for the DT, 6% for the KNN, 8% for the random forest applied on the EC sample, and 3% for the regular random forest.
The data in Table 6 show that one can reduce the size of the data to a 39% ratio with k = 30 while still obtaining a reasonable result. The EC classifier outperforms DTT, EC-RF and KNN by 5%, 3% and 1%, respectively, while RF outperforms it by 2%. More interestingly, the reduction ratio is an indication of the data redundancy and the similarity of the original data points.

Conclusion
In this paper, we proposed GrpClassifierEC, a novel classifier based on ensemble clustering. Moreover, we demonstrated the advantage of the EC approach in reducing the feature space and also in reducing the data size. Generally speaking, we have shown that we are able to reduce the number of features dramatically, to 5% or 3% (50/1038 = 0.048, 30/1038 = 0.029), and reduce the size of the data to 56% and 39%, and still achieve a similar performance level, or even outperform regular classifiers applied on the original data. However, achieving these results comes at the cost of the additional computation time required by the EC transformation algorithm.
The main assumption was that points within the same cluster share common traits more than points within different clusters. Thus, it may be more beneficial to represent objects based on the clustering space rather than the geometric space.
The approach suggested here is very useful for reducing the sample size and feature size when dealing with big data, while considering the EC data. For future research, we will need to suggest an algorithm that picks the optimal number of clusters and yields improved performance.

Fig. 4 The accuracy of the classifiers (EC, DT, KNN, RF) over different levels of training sample size (1% to 90%)

Table 6 GrpClassifierEC: EC classifier results with a k value of 30 compared to random forest applied on the EC samples and results for regular classifiers applied on the original data. K is the number of clusters. The section "Accuracy Difference" is the accuracy of the EC classifier minus the accuracy of the other classifier. A positive value indicates that the EC classifier is better than the corresponding classifier. EC-RF is a random forest applied on the EC data, RF is a random forest applied on the original data. DTT is decision trees, while KNN is k-nearest neighbors, both applied on the original data.