Reliability-based fuzzy clustering ensemble
Introduction
Clustering is the process of partitioning a set of data-objects (samples) into K subsets based on a similarity (distance) measure, such that the data-objects within each subset are more similar to one another than to the data-objects in other subsets. Each such subset is usually referred to as a cluster, and the set of all clusters is called a clustering. Based on the relationship of each data-object to the clusters, clustering algorithms can be divided into crisp and fuzzy algorithms. In crisp clustering, a data-object belongs to exactly one cluster. In fuzzy clustering, a data-object is assigned to every cluster with a membership degree. Crisp clustering is a special case of fuzzy clustering in which the membership degree of a data-object in one cluster equals one and its membership degree in every other cluster is zero.
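The relationship between crisp and fuzzy clusterings can be illustrated with a small membership matrix. The following is a minimal sketch, not taken from the paper: the helper names `is_fuzzy_partition` and `crisp_to_membership` are hypothetical, and the requirement that each object's memberships sum to one is an assumption borrowed from fuzzy c-means style partitions.

```python
def is_fuzzy_partition(U, tol=1e-9):
    """Return True if every row of U has entries in [0, 1] that sum to 1.

    Assumption (not stated in the text): memberships of one object sum to 1.
    """
    for row in U:
        if any(m < -tol or m > 1 + tol for m in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


def crisp_to_membership(labels, n_clusters):
    """Encode crisp cluster labels as a one-hot membership matrix."""
    return [[1.0 if k == label else 0.0 for k in range(n_clusters)]
            for label in labels]


# Three objects, three clusters: graded memberships (fuzzy clustering).
fuzzy_U = [[0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]

# The same objects with crisp labels: each row holds a single 1.
crisp_U = crisp_to_membership([0, 1, 2], n_clusters=3)
```

Both matrices pass the same validity check, which is one way to see that a crisp clustering is a fuzzy clustering whose membership degrees happen to be 0 or 1.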
Many clustering algorithms have emerged, each using a different similarity criterion and therefore a different objective function. All of these methods depend heavily on the dataset; in other words, no single clustering algorithm performs well on every dataset [1]. To address this problem, researchers have in recent years proposed clustering with the help of an ensemble of clusterings [2], [3], [4], [5]. This technique is named clustering ensemble. Its main aim is to find a likely better and more stable result by aggregating the information extracted from multiple clusterings (also called base-clusterings or members) [6], [7]. The better and more robust result extracted from the base-clusterings is named the consensus clustering (also referred to in this research as the final clustering) [6], [8].
In summary, as shown in Fig. 1, a clustering ensemble consists of two phases [6]: (1) Base-clustering generation: base-clusterings are produced by single clustering algorithms (in this study, "single clustering" is used in contrast to "ensemble clustering"). (2) Base-clustering consolidation: the base-clusterings generated in phase 1 are combined by a consensus function to produce the final clustering. This paper focuses on the second phase and proposes a co-association based fuzzy clustering ensemble method and a graph based fuzzy clustering ensemble method.
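The consolidation phase can be sketched for the simpler crisp case. This is a generic co-association (evidence accumulation style) illustration under stated assumptions, not the consensus function proposed in the paper: the base-clusterings are assumed to be given as label lists, and the function names and the 0.5 threshold are hypothetical choices.

```python
def co_association(base_labelings):
    """Fraction of base-clusterings in which each pair of objects co-occurs."""
    n = len(base_labelings[0])
    m = len(base_labelings)
    C = [[0.0] * n for _ in range(n)]
    for labels in base_labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C


def consensus_by_threshold(C, threshold=0.5):
    """Illustrative consensus function: connected components of the graph
    that links objects whose co-association exceeds the threshold."""
    n = len(C)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and C[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Phase 1 output (assumed given): three base-clusterings of five objects.
base = [[0, 0, 1, 1, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1]]

# Phase 2: aggregate the base-clusterings, then extract a final clustering.
C = co_association(base)
final = consensus_by_threshold(C)
```

Note that label values differ between base-clusterings (the second one swaps 0 and 1); the co-association matrix is insensitive to this because it only records whether two objects share a cluster.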
Although fuzzy clustering is more general than crisp clustering, research on fuzzy cluster ensembles is still at an early stage and relatively few approaches exist. Some existing fuzzy cluster ensemble methods first convert fuzzy clusters into hard clusters and then compute the final clustering through hard consensus functions, which discards the uncertainty information. Therefore, deriving an efficient fuzzy consensus clustering from multiple fuzzy base-clusterings remains a challenging issue.
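The loss of uncertainty information caused by hardening can be made concrete with a small example. This is an illustrative sketch only: hardening by argmax and measuring uncertainty with Shannon entropy are assumptions chosen for the illustration, and the function names are hypothetical.

```python
import math


def harden(U):
    """Assign each object to its highest-membership cluster (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in U]


def row_entropy(row):
    """Shannon entropy (in bits) of one object's membership vector."""
    return -sum(m * math.log2(m) for m in row if m > 0)


U = [[0.51, 0.49],   # nearly undecided object
     [0.99, 0.01]]   # confidently assigned object

hard_labels = harden(U)                     # both objects get the same label
entropies = [row_entropy(r) for r in U]     # but their uncertainty differs
```

After hardening, the two objects are indistinguishable, although the first is almost equally split between the clusters and the second is not. A hard consensus function never sees that difference.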
In an ensemble of voters (learners), assigning a weight to each learner based on its quality can be an effective mechanism for improving the ensemble result. The weights can be set optimally if the accuracy of each learner is known [4]. If the learners are clustering algorithms, however, accuracy is not meaningful [9], so the quality of the clustering obtained by an algorithm can be used as an approximation of that algorithm's accuracy. The quality of the base-clusterings strongly affects the consensus clustering obtained by the ensemble; in other words, low-quality base-clusterings may have a negative influence on the consensus result. Some researchers have investigated quality evaluation and weighting of the base-clusterings to improve the quality of the consensus clustering [10], [11], [12]. A weighting mechanism for fuzzy clustering ensemble was used by Berikov [13]. However, these approaches assume that all clusters in the same base-clustering have the same reliability: they typically treat each base-clustering as a single unit and assign one weight to it regardless of the diversity of the clusters inside it [10], [11], [12], [13]. In this research, reliability is defined as the quantity of certain knowledge of the ensemble about a cluster and is computed from the accretion amount of that cluster by the ensemble. In short, the aforementioned papers apply weighting at the clustering level, not at the cluster level. Because of the inherent complexity of real-world datasets, different clusters in the same clustering may have different reliability. Hence, it is necessary to consider the local diversity of ensembles (the quality of the clusters in the ensemble) and to deal with the different reliability of clusters. Zhong et al. investigated the reliability of crisp clusters by considering the Euclidean distances between data-objects in clusters [14]. That method, however, is not applicable to fuzzy clustering; in addition, it requires access to the original data features, and its efficacy relies heavily on the data distribution of the dataset, whereas the general formulation of cluster ensemble provides no access to the original data features. The key question is therefore how to measure the reliability of fuzzy clusters, without accessing the data features or relying on specific assumptions about the data distribution, and how to weight the clusters accordingly to enhance the accuracy and robustness of the consensus clustering. In other words, the problem is how to compute the reliability of each fuzzy cluster as a fuzzy cluster quality measure and incorporate it into a weighting structure that improves the consensus clustering.
In light of this, we propose a new fuzzy clustering ensemble framework based on ensemble-driven cluster reliability and a local weighting strategy, in which a weight is assigned to each cluster based on its reliability value. The contributions of this article are as follows:
- A method is proposed to estimate the unreliability of fuzzy clusters in relation to a clustering by applying an entropic criterion to the membership degrees of all data-objects to the clusters; it requires no access to the original data features.
- A reliability-driven cluster indicator is proposed to measure the reliability of the fuzzy clusters in the ensemble; it is used as the weight of the fuzzy clusters in the ensemble.
- A method is proposed to compute the reliability-based fuzzy co-association matrix in the fuzzy clustering ensemble.
- Three single clustering algorithms are applied to the obtained co-association matrix, and their effects on the quality of the proposed approach are shown on a variety of datasets.
- A reliability-based graph consensus function is proposed whose time complexity is linear in the number of data-objects.
- Extensive experiments on a variety of datasets indicate that the proposed fuzzy clustering ensemble approach performs better than the state-of-the-art approaches in terms of clustering quality.
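The idea of weighting at the cluster level can be sketched as follows. This is a simplified stand-in, not the indicator or the co-association formula defined in the paper: here a cluster is scored from the binary entropy of its own membership column only, whereas the text describes an ensemble-driven estimate. The function names, the weight `1 - mean entropy`, and the product-based co-association are all assumptions made for illustration.

```python
import math


def binary_entropy(p):
    """Entropy (bits) of one membership degree treated as a Bernoulli variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def cluster_weights(U):
    """Hypothetical weight per cluster: 1 minus the mean binary entropy of
    the memberships in that cluster's column (assumption, not the paper's RDCI)."""
    n, k = len(U), len(U[0])
    return [1.0 - sum(binary_entropy(U[i][c]) for i in range(n)) / n
            for c in range(k)]


def weighted_co_association(ensemble):
    """Average over base-clusterings of sum_c w_c * u_ic * u_jc (assumed form)."""
    n = len(ensemble[0])
    C = [[0.0] * n for _ in range(n)]
    for U in ensemble:
        w = cluster_weights(U)
        for i in range(n):
            for j in range(n):
                C[i][j] += sum(w[c] * U[i][c] * U[j][c]
                               for c in range(len(w))) / len(ensemble)
    return C


# Two base-clusterings of three objects: one decisive, one uninformative.
sharp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
vague = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
C = weighted_co_association([sharp, vague])
```

In this toy setting the clusters of the uninformative base-clustering receive weight zero, so they contribute nothing to the co-association matrix, while the decisive clusters contribute fully. Only membership matrices are used; no data features are required.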
The rest of this work is structured as follows: Sec. 2 reviews related work. Sec. 3 presents the formal background on ensemble clustering. Sec. 4 explains the proposed fuzzy clustering ensemble approach. Sec. 5 reports the experimental results, and Sec. 6 presents the conclusion and future work.
Section snippets
Related work
Considerable research effort has been devoted to crisp clustering ensembles. Here we briefly review the existing work most closely related to fuzzy clustering ensemble.
sCSPA, sHBGF and sMCLA, proposed by Punera and Ghosh [15], can be regarded as the starting points of fuzzy clustering ensemble. sCSPA, the soft (fuzzy) version of CSPA (cluster-based similarity partitioning algorithm) [2], constructs a graph of all data-objects in which edges are weighted by pair-wise similarities. For
Preliminaries
In this section, the general formulation of the dataset and the notation for fuzzy clustering ensemble used in this paper are introduced.
Definition 1 A dataset X contains N data-objects in the form $X = \{x_1, x_2, \ldots, x_N\}$, where each data-object contains M features.
Definition 2 A fuzzy clustering (partition) of dataset X is a two-dimensional matrix of size $N \times K$, where N is the number of data-objects and K is the number of clusters, presented as … so that: … where
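For orientation, the constraints commonly imposed on a fuzzy partition matrix in the fuzzy clustering literature can be written as below. The symbols $U$ and $\mu_{ik}$ are assumptions introduced here and may differ from the notation used in the paper; the sum-to-one constraint is the usual fuzzy c-means convention rather than something confirmed by the text.

```latex
U = [\mu_{ik}]_{N \times K}, \qquad
\mu_{ik} \in [0, 1], \qquad
\sum_{k=1}^{K} \mu_{ik} = 1 \quad \text{for } i = 1, \dots, N
```

Here $\mu_{ik}$ denotes the membership degree of the $i$-th data-object in the $k$-th cluster. A crisp clustering is recovered when every $\mu_{ik}$ is restricted to the values 0 and 1.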
Proposed approach
In this paper, a new fuzzy clustering ensemble approach based on ensemble-driven cluster unreliability estimation and a local weighting strategy is proposed. The main idea is to use a weighting scheme at the fuzzy cluster level, in which high-quality fuzzy clusters in the ensemble have more influence on the final clustering. The fuzzy cluster quality is treated as fuzzy cluster reliability and is defined by applying the concept of entropy over the
Experiments
The goal of the experimental section of this study is to answer the following questions:
- Can the proposed approach compete with the state-of-the-art ensemble clustering algorithms?
- How does changing the input parameters of the proposed approach influence the performance of the final clustering?
Conclusion and future work
This paper proposes a novel fuzzy clustering ensemble approach based on fuzzy-cluster-level weighting. The quantity of certain knowledge of the ensemble about a fuzzy cluster is taken as the cluster reliability. We first estimate the unreliability of fuzzy clusters from the similarity between fuzzy clusters across the entire ensemble, based on an entropic criterion, and then obtain a reliability-driven cluster indicator (RDCI) as the quantity of certain knowledge of the ensemble about the fuzzy
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (59)
- et al. Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis. Neurocomputing (2015)
- et al. Hybrid clustering solution selection strategy. Pattern Recognit. (2014)
- et al. A clustering ensemble: two-level-refined co-association matrix with path-based transformation. Pattern Recognit. (2015)
- et al. Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artif. Intell. Med. (2009)
- et al. Modified differential evolution based fuzzy clustering for pixel classification in remote sensing imagery. Pattern Recognit. (2009)
- et al. Automatic image pixel clustering with an improved differential evolution. Appl. Soft Comput. (2009)
- et al. FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. (1984)
- et al. Positional and confidence voting-based consensus functions for fuzzy cluster ensembles. Fuzzy Sets Syst. (2012)
- et al. A heterogeneous cluster ensemble model for improving the stability of fuzzy cluster analysis. Proc. Comput. Sci. (2016)
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
- On correlation between two fuzzy sets. Fuzzy Sets Syst.
- Similarity based approximate reasoning: fuzzy control. J. Appl. Log.
- Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization. Expert Syst. Appl.
- An impossibility theorem for clustering
- Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res.
- Solving cluster ensemble problems by bipartite graph partitioning
- Combining Pattern Classifiers: Methods and Algorithms
- Cluster ensemble methods: from single clusterings to combined solutions
- A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell.
- Spectral ensemble clustering via weighted k-means: theoretical and practical evidence. IEEE Trans. Knowl. Data Eng.
- Diversity-based weighting schemes for clustering ensembles
- Weighted consensus clustering
- A probabilistic model of fuzzy clustering ensemble. Pattern Recognit. Image Anal.
- Consensus-based ensembles of soft clusterings. Appl. Artif. Intell.
- A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res.
- On information and sufficiency. Ann. Math. Stat.
- A combination scheme for fuzzy clustering. Int. J. Pattern Recognit. Artif. Intell.
- Ensemble fuzzy clustering using cumulative aggregation on random projections. IEEE Trans. Fuzzy Syst.
Cited by (42)
- Semi-supervised fuzzy clustering algorithm based on prior membership degree matrix with expert preference. Expert Systems with Applications (2024)
- Multi-fuzzy clustering validity index ensemble: A Dempster-Shafer theory-based parallel and series fusion. Egyptian Informatics Journal (2023)
- Geometric consistent fuzzy cluster ensemble with membership reconstruction for image segmentation. Digital Signal Processing: A Review Journal (2023)
- An ensemble hierarchical clustering algorithm based on merits at cluster and partition levels. Pattern Recognition (2023)
Citation excerpt: The authors used multi-nominal logistic regression to discover the pattern of clustering results. Bagherinia et al. proposed a reliability-based weighted fuzzy clustering ensemble algorithm [16]. Here, the weight of each cluster is calculated based on its unreliability estimate with an entropic metric.
- A survey of fuzzy clustering validity evaluation methods. Information Sciences (2022)
Citation excerpt: Different fuzzy clustering algorithms adapt to different data sets [125–127]. Therefore, the integration of fuzzy clustering algorithm [128–131] can be introduced, so the combination of multiple clustering algorithms and validity function can enhance the adaptability of validity function, but it does not essentially change the structure of the validity evaluation. Influence of Datasets Structure