Elsevier

Fuzzy Sets and Systems

Volume 413, 15 June 2021, Pages 1-28
Reliability-based fuzzy clustering ensemble

https://doi.org/10.1016/j.fss.2020.03.008

Abstract

In a clustering ensemble, the quality of the base-clusterings influences the consensus clustering. Although some research has been devoted to weighting the base-clusterings, weighting at the fuzzy cluster level has been ignored; more specifically, existing work has paid no attention to the role of cluster reliability in the fuzzy clustering ensemble. In this paper, we propose a new fuzzy clustering ensemble framework, based on fuzzy cluster-level weighting, that requires no access to the features of the data-objects. The reliability of each fuzzy cluster is computed from an estimate of its unreliability and is used as its weight in the ensemble. The unreliability of a fuzzy cluster is estimated from the similarity between fuzzy clusters in the ensemble, using an entropic criterion. In our framework, the final clustering is produced by two types of consensus functions: (1) a reliability-based weighted fuzzy co-association matrix is constructed from the base-clusterings, and then a single traditional clustering algorithm such as hierarchical agglomerative clustering or K-means is applied to the matrix to produce the final clustering; (2) a new graph-based fuzzy consensus function, whose time complexity is linear in the number of data-objects. Experimental results on various standard datasets demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods in terms of evaluation criteria and clustering robustness.

Introduction

Clustering is the process of partitioning a set of data-objects (samples) into a number (K) of subsets based on a similarity (distance) measure, such that the data-objects in each subset are more similar to one another than to the data-objects in other subsets. Each such subset is usually referred to as a cluster, and all clusters together are called a clustering. Based on the relationship of each data-object to the clusters, clustering algorithms can be divided into crisp and fuzzy clustering algorithms. In crisp clustering, a data-object belongs to exactly one cluster. In fuzzy clustering, each data-object is assigned to every cluster with a membership degree. Crisp clustering is a special case of fuzzy clustering in which the membership degree of a data-object in one cluster equals one and its membership degree in every other cluster is zero.
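
As a small illustration of the last point, the following Python sketch checks whether a row of membership degrees is crisp, that is, whether exactly one degree equals one and all others are zero. The function name `is_crisp` and the tolerance `tol` are hypothetical and introduced only for this example.

```python
def is_crisp(row, tol=1e-9):
    """Return True if a membership row is one-hot: one degree is 1, the rest are 0."""
    ones = sum(1 for u in row if abs(u - 1.0) <= tol)
    zeros = sum(1 for u in row if abs(u) <= tol)
    return ones == 1 and ones + zeros == len(row)


fuzzy_row = [0.7, 0.2, 0.1]  # belongs to every cluster to some degree
crisp_row = [0.0, 1.0, 0.0]  # belongs to the second cluster only

print(is_crisp(fuzzy_row), is_crisp(crisp_row))
```

Both rows sum to one, so both are valid fuzzy memberships; only the second is also a crisp assignment.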

In general, various clustering algorithms have emerged, each using a different similarity criterion and therefore a different objective function. All of these methods depend heavily on the dataset; in other words, no clustering algorithm can learn every dataset [1]. Hence, in recent years researchers have proposed clustering with the help of an ensemble of clusterings as a way to address this problem [2], [3], [4], [5]. This technique is named clustering ensemble. Its main aim is to search for a likely better and more stable result by aggregating the information extracted from multiple clusterings (also called base-clusterings or members) [6], [7]. The better and more robust result extracted from the base-clusterings is named the consensus clustering (also referred to in this research as the final clustering) [6], [8].

In summary, as shown in Fig. 1, a clustering ensemble consists of the following two phases [6]: (1) Base-clustering generation: base-clusterings are produced by single clustering algorithms (in this study, "single clustering" is used in contrast to "ensemble clustering"). (2) Base-clustering consolidation: the base-clusterings generated in phase 1 are combined through a consensus function to generate the final clustering. This paper focuses on the second phase and proposes a co-association based fuzzy clustering ensemble method and a graph based fuzzy clustering ensemble method.
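
To make the consolidation phase concrete, here is a minimal Python sketch of a plain, unweighted fuzzy co-association matrix. It assumes the base-clusterings are already available as N x K membership matrices (phase 1 is not shown) and measures the agreement of two data-objects in one base-clustering as the dot product of their membership rows. That agreement measure and the name `fuzzy_co_association` are assumptions of this sketch, not the reliability-based construction proposed in the paper.

```python
def fuzzy_co_association(base_clusterings):
    """Average, over base-clusterings, the dot product of each pair of membership rows."""
    n = len(base_clusterings[0])
    co = [[0.0] * n for _ in range(n)]
    for F in base_clusterings:  # F is an N x K membership matrix
        for s in range(n):
            for t in range(n):
                co[s][t] += sum(a * b for a, b in zip(F[s], F[t]))
    b = len(base_clusterings)
    return [[v / b for v in row] for row in co]


# Two base-clusterings of the same three data-objects, with different K.
F1 = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
F2 = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]

co = fuzzy_co_association([F1, F2])
print(co[0][1] > co[0][2])  # objects 1 and 2 agree more often than objects 1 and 3
```

A single clustering algorithm applied to `co` would then yield the final clustering. Note that building the matrix this way is quadratic in the number of data-objects.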

Despite the greater generality of fuzzy clustering compared with crisp clustering, research on fuzzy cluster ensembles is still at an early stage and relatively few approaches exist. Some of the existing fuzzy cluster ensemble methods first convert the fuzzy clusters into hard clusters and then compute the final clustering through hard consensus functions, which loses the uncertainty information. Therefore, deriving an efficient fuzzy consensus clustering from multiple fuzzy base-clusterings remains a challenging issue.
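
The loss of uncertainty information can be seen in a short Python sketch. It assumes that hardening means assigning each data-object to the cluster with its largest membership degree, and it uses Shannon entropy as one possible measure of the uncertainty in a membership row. Both choices, and the names `harden` and `entropy`, are illustrative assumptions.

```python
import math


def harden(row):
    """Replace a fuzzy membership row by a one-hot row at its largest degree."""
    winner = max(range(len(row)), key=lambda i: row[i])
    return [1.0 if i == winner else 0.0 for i in range(len(row))]


def entropy(row):
    """Shannon entropy (in bits) of a membership row; zero for a crisp row."""
    return -sum(u * math.log2(u) for u in row if u > 0)


confident = [0.98, 0.01, 0.01]  # almost certainly in the first cluster
ambiguous = [0.40, 0.35, 0.25]  # only weakly in the first cluster

print(harden(confident) == harden(ambiguous))  # both become the same hard assignment
print(entropy(confident) < entropy(ambiguous))  # yet their uncertainty differed
```

After hardening, the two data-objects are indistinguishable, although one assignment was far less certain than the other.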

In an ensemble of voters (learners), assigning a weight to each learner based on its quality can be an effective mechanism for improving the result of the ensemble. The weights can be set optimally if the accuracy of each learner is known [4]. When the learners are clustering algorithms, however, accuracy is not meaningful [9], so the quality of the clustering obtained by a clustering algorithm can be used as an approximation of that algorithm's accuracy. In summary, the quality of the base-clusterings strongly affects the consensus clustering (final clustering) obtained by the ensemble; low-quality base-clusterings may have a negative influence on the consensus result. Some researchers have investigated quality evaluation and weighting of the base-clusterings to improve the quality of the consensus clustering [10], [11], [12]. For a method that uses a weighting mechanism in the fuzzy clustering ensemble, we refer to the paper by Berikov [13]. However, these approaches assume that all of the clusters in the same base-clustering have the same reliability: they typically treat each base-clustering as an individual and assign a single weight to it, regardless of the diversity of the clusters inside it [10], [11], [12], [13]. In this research, reliability is defined as the quantity of certain knowledge of the ensemble about the cluster and is computed by the accretion amount of that cluster by the ensemble. Briefly, in the aforementioned papers weighting is applied at the clustering level, not at the cluster level. But due to the inherent complexity of real-world datasets, different clusters in the same clustering may have different reliability. Hence, it is necessary to consider the local diversity of ensembles (the quality of the clusters in the ensemble) and to deal with the different reliability of clusters. Although Zhong et al. investigated the reliability of crisp clusters by considering the Euclidean distances between data-objects in clusters [14], their method is not applicable to fuzzy clustering. In addition, it requires access to the original data features, and its efficacy relies heavily on the data distribution of the dataset, whereas the general formulation of the cluster ensemble problem provides no access to the original data features. The key question, therefore, is how to measure the reliability of fuzzy clusters and weight them accordingly, without access to the data features and without specific assumptions about the data distribution, so as to enhance the accuracy and robustness of the consensus clustering. In other words, the problem to be solved is how to compute the reliability of each fuzzy cluster as a fuzzy cluster quality measure and incorporate it into a weighting structure that boosts the consensus clustering.

In light of this, we propose a new fuzzy clustering ensemble framework based on ensemble-driven cluster reliability and a local weighting strategy; we assign a weight to each cluster based on its reliability value. The contributions of this article are as follows:

  • A method is proposed to estimate the unreliability of fuzzy clusters with respect to a clustering by applying an entropic criterion to the membership degrees of all data-objects in the clusters; it requires no access to the original data features.

  • A reliability-driven cluster indicator is proposed to measure the reliability of the fuzzy clusters in the ensemble and consider it as the weight of the fuzzy clusters in the ensemble.

  • A method is proposed to compute the reliability-based fuzzy co-association matrix in the fuzzy clustering ensemble.

  • Three single clustering algorithms are applied to the obtained co-association matrix, and their effects on the quality of the proposed approach are shown on a variety of datasets.

  • A reliability-based graph consensus function is proposed whose time complexity is linear in the number of data-objects.

  • Extensive experiments on a variety of datasets indicate that the proposed fuzzy clustering ensemble approach performs better than the state-of-the-art approaches in terms of clustering quality.
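
The first two contributions can be sketched in Python under stated assumptions. The snippets shown here do not give the paper's formulas, so everything below is one plausible reading rather than the authors' definition: a fuzzy cluster is a list of N membership degrees, the overlap of two fuzzy clusters is the sum of element-wise minima, the unreliability with respect to one clustering is the entropy of the normalised overlaps, and the reliability is an exponentially decreasing function of the summed unreliability. The names `cluster_unreliability` and `reliability` and the parameter `theta` are hypothetical.

```python
import math


def cluster_unreliability(cluster, clustering):
    """Entropy (bits) of how `cluster` is spread over the clusters of `clustering`.

    `cluster` is a list of N membership degrees and `clustering` is a list of
    such lists. The minimum-based overlap is an assumption of this sketch.
    """
    overlaps = [sum(min(a, b) for a, b in zip(cluster, other)) for other in clustering]
    total = sum(overlaps)
    if total == 0:
        return 0.0
    probs = [o / total for o in overlaps]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def reliability(cluster, ensemble, theta=0.5):
    """Map the summed unreliability over the ensemble to a weight in (0, 1]."""
    h = sum(cluster_unreliability(cluster, clustering) for clustering in ensemble)
    return math.exp(-h / (theta * len(ensemble)))


c = [0.9, 0.9, 0.1, 0.1]                              # cluster under evaluation
P1 = [c, [0.1, 0.1, 0.9, 0.9]]                        # the clustering that contains c
P2 = [[0.8, 0.9, 0.2, 0.1], [0.2, 0.1, 0.8, 0.9]]     # largely agrees with c
P3 = [[0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.9]]     # splits c evenly

print(reliability(c, [P1, P2]) > reliability(c, [P1, P3]))
```

Only membership degrees are used, so no data features are needed. A cluster that the rest of the ensemble splits evenly receives a lower weight than one the ensemble largely reproduces.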

The rest of this work is structured as follows: Sec. 2 presents a review of related work. The formal background knowledge about ensemble clustering is presented in Sec. 3. The proposed fuzzy clustering ensemble approach is explained in Sec. 4. We show the experimental results in Sec. 5 and the conclusion and future work are presented in Sec. 6.

Section snippets

Related work

Considerable research effort has been devoted to the crisp clustering ensemble. Here we briefly review the existing work most related to the fuzzy clustering ensemble.

sCSPA, sHBGF and sMCLA, proposed by Punera and Ghosh [15], can be regarded as the starting points of the fuzzy clustering ensemble. sCSPA, which is the soft (fuzzy) version of CSPA (cluster-based similarity partitioning algorithm) [2], constructs a graph of all data-objects where edges are weighted by pair-wise similarities. For

Preliminaries

In this section, the general formulation of the dataset and some notations of the fuzzy clustering ensemble used in this paper are introduced.

Definition 1

A dataset X contains N data-objects in the form $X=\{x_1,x_2,\ldots,x_N\}$, where each data-object contains M features.

Definition 2

A fuzzy clustering (partition) of dataset X is a two-dimensional matrix of size $N \times K$, where N is the number of data-objects and K is the number of clusters, denoted $F(X)$, such that $\forall t, i$, $t \in \{1,\ldots,N\}$ and $i \in \{1,\ldots,K\}$: $F_i(x_t) \in [0,1]$ and $\sum_{i=1}^{K} F_i(x_t) = 1$, where $F_i(x_t)$
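
The two constraints in Definition 2 can be checked mechanically. The following Python sketch assumes the partition is stored as a list of N rows with K membership degrees each; the function name `is_fuzzy_partition` and the tolerance `tol` are hypothetical.

```python
def is_fuzzy_partition(F, tol=1e-9):
    """Check Definition 2: every degree lies in [0, 1] and every row sums to 1."""
    if not F or any(len(row) != len(F[0]) for row in F):
        return False  # empty or ragged input is not an N x K matrix
    for row in F:
        if any(u < -tol or u > 1 + tol for u in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


print(is_fuzzy_partition([[0.7, 0.2, 0.1], [0.0, 1.0, 0.0]]))  # valid partition
print(is_fuzzy_partition([[0.7, 0.2, 0.2]]))                   # row sums to 1.1
```

The tolerance absorbs floating-point rounding in the row sums.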

Proposed approach

In this paper, a new fuzzy clustering ensemble approach based on ensemble-driven cluster unreliability estimation and a local weighting strategy is proposed. The main idea of the proposed approach is to use a weighting scheme at the fuzzy cluster level in which high-quality fuzzy clusters in the ensemble have more influence on the production of the final clustering. The fuzzy cluster quality is considered as fuzzy cluster reliability, and is defined by applying the concept of entropy over the

Experiments

The goal of the experimental section of this study is to answer the following questions:

  • Can the proposed approach compete with the state-of-the-art ensemble clustering algorithms?

  • How does changing the input parameters of the proposed approach influence the performance of the final clustering?

Conclusion and future work

This paper proposes a novel fuzzy clustering ensemble approach based on fuzzy-cluster-level weighting. The quantity of certain knowledge of the ensemble about the fuzzy cluster is considered as the cluster reliability. We first estimate the unreliability of fuzzy clusters by applying the similarity between fuzzy clusters in the entire ensemble based on an entropic criterion, and then obtain a reliability driven cluster indicator (RDCI) as the quantity of certain knowledge of the ensemble about the fuzzy

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (59)

  • B.B. Chaudhuri et al., On correlation between two fuzzy sets, Fuzzy Sets Syst. (2001)
  • S. Raha et al., Similarity based approximate reasoning: fuzzy control, J. Appl. Log. (2008)
  • T.M. Silva Filho et al., Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization, Expert Syst. Appl. (2015)
  • J.M. Kleinberg, An impossibility theorem for clustering
  • A. Strehl et al., Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. (2002)
  • X.Z. Fern et al., Solving cluster ensemble problems by bipartite graph partitioning
  • L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (2004)
  • A. Fred, Cluster ensemble methods: from single clusterings to combined solutions
  • S. Vega-Pons et al., A survey of clustering ensemble algorithms, Int. J. Pattern Recognit. Artif. Intell. (2011)
  • H. Liu et al., Spectral ensemble clustering via weighted k-means: theoretical and practical evidence, IEEE Trans. Knowl. Data Eng. (2017)
  • A. Topchy, A.K. Jain, W. Punch, Combining multiple weak clusterings, in: Third IEEE Int. Conf. Data Min., IEEE Comput....
  • F. Gullo et al., Diversity-based weighting schemes for clustering ensembles
  • T. Li et al., Weighted consensus clustering
  • V.B. Berikov, A probabilistic model of fuzzy clustering ensemble, Pattern Recognit. Image Anal. (2018)
  • K. Punera et al., Consensus-based ensembles of soft clusterings, Appl. Artif. Intell. (2008)
  • I.S. Dhillon, A divisive information-theoretic feature clustering algorithm for text classification, J. Mach. Learn. Res. (2003)
  • S. Kullback et al., On information and sufficiency, Ann. Math. Stat. (1951)
  • E. Dimitriadou et al., A combination scheme for fuzzy clustering, Int. J. Pattern Recognit. Artif. Intell. (2002)
  • P. Rathore et al., Ensemble fuzzy clustering using cumulative aggregation on random projections, IEEE Trans. Fuzzy Syst. (2018)