Pattern Recognition, Volume 112, April 2021, 107713

A simulated annealing algorithm with a dual perturbation method for clustering

https://doi.org/10.1016/j.patcog.2020.107713

Highlights

  • Existing partitional clustering algorithms still settle upon local optima.

  • We propose a new simulated annealing algorithm with two perturbation methods.

  • We compare our algorithm with existing simulated annealing clustering algorithms.

  • We show our new algorithm produces clusters of higher quality more consistently.

Abstract

Clustering is a powerful tool in exploratory data analysis that partitions a set of objects into clusters with the goal of maximizing the similarity of objects within each cluster. Due to the tendency of clustering algorithms to find suboptimal partitions of data, the approximation method Simulated Annealing (SA) has been used to search for near-optimal partitions. However, existing SA-based partitional clustering algorithms still settle to local optima. We propose a new SA-based clustering algorithm, the Simulated Annealing with Gaussian Mutation and Distortion Equalization algorithm (SAGMDE), which uses two perturbation methods to allow for both large and small perturbations in solutions. Our experiments on a diverse collection of data sets show that SAGMDE performs more consistently and yields better results than existing SA clustering algorithms in terms of cluster quality while maintaining a reasonable runtime. Finally, we use generative art as a visualization tool to compare various partitional clustering algorithms.

Introduction

The ever-growing abundance of data prompts further research into data mining. An important and well-studied technique within the field of data mining is clustering, a type of unsupervised machine learning that involves grouping data based on predefined attributes. Clustering is applicable to a vast array of fields ranging from statistics to engineering to psychology [1]. For example, this tool has applications in medical fields through image segmentation [2] as well as business through market analysis [3].

Clustering is an NP-hard problem [4] that is generally solved by assigning N data points in D dimensions to K clusters with the goal of optimizing these clusters based on a given criterion. There are two main categories of clustering: hierarchical and partitional [5]. In hierarchical clustering, objects in the data set are grouped into a nested hierarchy of clusters. Partitional clustering involves dividing the objects into groups simultaneously without creating a nested structure.

Researchers have also investigated a third category of clustering, density-based clustering, which builds clusters based on the density of data points in a given region of the D-dimensional space. This class of algorithms has the benefit of reduced sensitivity to noise and cluster shape, but this strategy generally does not perform well for high-dimensional data [6]. Examples of algorithms that fully or partially employ a density-based strategy include a critical distance-based algorithm [7] and a gravity-center algorithm [8].

This paper will focus on partitional clustering algorithms. The most popular approach to partitional clustering is center-based clustering. Center-based, iterative clustering algorithms follow these two general steps:

  1. Centers are initialized.

  2. Until convergence, every point’s membership to each center and every point’s weight are recomputed and used to recalculate the centers (a generic sketch of this template follows the list).
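
To make the template concrete, the following Python sketch (our illustration, not code from the paper) expresses the generic loop; membership_fn and weight_fn are assumed hooks that each specific center-based algorithm would supply:

    import numpy as np

    def center_based_clustering(X, init_centers, membership_fn, weight_fn,
                                max_iters=100, tol=1e-6):
        """Generic two-step template: initialize centers, then alternate
        membership/weight computation with center recalculation."""
        centers = init_centers.copy()
        for _ in range(max_iters):
            # Step 2a: recompute every point's membership to each center
            # (an N x K matrix) and every point's weight (length-N vector).
            memberships = membership_fn(X, centers)
            weights = weight_fn(X, centers)
            # Step 2b: recalculate each center as a weighted average.
            w = memberships * weights[:, None]                # N x K
            new_centers = (w.T @ X) / w.sum(axis=0)[:, None]  # K x D
            if np.linalg.norm(new_centers - centers) < tol:   # converged
                return new_centers
            centers = new_centers
        return centers

For KM, membership_fn would return a one-hot nearest-center assignment and weight_fn would return all ones, since every point in KM carries the same weight.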

The K-means algorithm (KM) is the best-known center-based clustering algorithm. It finds K centers in D dimensions with the goal of maximizing the similarity of the elements within each cluster based upon the provided attributes of each object. To accomplish this goal, the algorithm attempts to minimize the sum of the squared distances between each point and its closest center, known as the Sum of Squares Error (SSE). Given that $X = \{x_i \mid i = 1, \ldots, N\}$ represents the $N$ data points and $M = \{m_l \mid l = 1, \ldots, K\}$ represents the current $K$ centers, the performance function for K-means is shown below:

$$\mathrm{Perf}_{KM}(X, M) = \sum_{i=1}^{N} \min\left\{ \|x_i - m_l\|^2 \;\middle|\; l = 1, \ldots, K \right\}$$
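
This performance function translates directly into NumPy; a minimal sketch, assuming the rows of X are the N points and the rows of M are the K centers:

    import numpy as np

    def perf_km(X, M):
        """Sum of Squares Error: squared Euclidean distance from each
        point to its nearest center, summed over all points."""
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        return d2.min(axis=1).sum()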

The distance-based approach of KM requires quantitative data, but researchers have produced variants of KM to perform clustering on mixed numerical and categorical data [9]. KM has a hard membership function, meaning that each point belongs to only one cluster (in this case its nearest cluster) as opposed to soft membership, where a proportion of each point belongs to multiple centers [10]. Additionally, every point in KM has the same weight. In KM, only the points close to a given center are considered in the recalculation of that center’s position; as a result, KM is sensitive to initialization and converges upon a local minimum.

There are also studies aimed at finding the optimal value of K for center-based clustering algorithms, such as Pelleg and Moore’s study [11]. Guan et al. [12] proposed the K-means+ algorithm, which automatically finds a near-optimal K.

Various initialization methods have been investigated to improve the clusters found by KM. Two popular algorithms include Forgy’s method [13], in which the initial centers are assigned as K random observations in the data set, and MacQueen’s method (the Random Partition method) [14], in which objects are randomly assigned to clusters and the initial centers are calculated as the centroids of these clusters. However, in a comparative study of initialization methods for K-means, Celebi et al. [15] showed there are alternative initialization methods that outperform these popular algorithms. Additionally, Fränti and Sieranoja [16] found that data sets with well-separated clusters still pose a challenge for all tested existing initialization methods.
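
Both classic initialization methods take only a few lines each; a sketch using NumPy's Generator API (rng is an assumed np.random.default_rng() instance):

    import numpy as np

    def forgy_init(X, K, rng):
        """Forgy: use K distinct random observations as initial centers."""
        idx = rng.choice(len(X), size=K, replace=False)
        return X[idx].copy()

    def random_partition_init(X, K, rng):
        """MacQueen's Random Partition: assign every object to a random
        cluster and take the cluster centroids as the initial centers.
        (Assumes each cluster receives at least one point; production
        code would guard against empty clusters.)"""
        labels = rng.integers(0, K, size=len(X))
        return np.stack([X[labels == k].mean(axis=0) for k in range(K)])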

Beyond experimenting with initialization methods, researchers have studied the use of different performance functions to improve cluster quality. The Fuzzy K-Means (FKM) algorithm, proposed by Dunn [17], as well as the K-Harmonic Means (KHM) algorithm, proposed by Zhang et al. [18], use new performance functions that account for soft membership such that every point influences the location of each center. KHM applies varying weights to points to further optimize the output, but it still converges upon a local minimum rather than the global minimum [18].
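
For reference, KHM replaces the minimum in the SSE with a harmonic average over all K distances, so every point contributes to every center. A sketch of the commonly cited formulation, where the exponent p is a tuning parameter (values of p >= 2 are the usual recommendation) and eps is an assumed guard against division by zero:

    import numpy as np

    def perf_khm(X, M, p=3.5, eps=1e-12):
        """K-Harmonic Means performance: for each point, K divided by the
        sum of inverse p-th powers of its distances to the centers."""
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # (N, K)
        return (M.shape[0] / (1.0 / np.maximum(d, eps) ** p).sum(axis=1)).sum()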

Metaheuristics serve as a different way to address this local optima problem [19]. Some authors [20], [21] used the local search heuristic, others [22], [23] applied the Tabu search method [24], and still others [25], [26] used genetic algorithms. Nature-based metaheuristic methods have also been explored [27]. Additionally, competitive learning algorithms [28], [29] have been proposed, which use an abundance of initial centers (seeds) that “compete,” “cooperate,” or both to eventually yield a given number of cluster centers; these algorithms have the benefit of automatically determining the number of cluster centers. This paper will specifically focus on the use of the simulated annealing (SA) algorithm [30].

This paper will describe the simulated annealing heuristic and its current applications to the clustering problem in Section 2. Section 3 describes our proposed algorithm. Experimental results that compare the proposed algorithm with other clustering algorithms using the simulated annealing heuristic are detailed in Section 4. Section 5 covers a comparison of center-based, partitional clustering algorithms through generative art.

Section snippets

Simulated annealing with K-means

The SA heuristic is modeled after the physical annealing process, in which a solid is slowly cooled until its atoms are repositioned to form a crystal with a low-energy state [31]. Metropolis et al. described the key components of the algorithm in 1953 [30], and the full algorithm was first presented by Kirkpatrick et al. [32], who also proposed the algorithm’s use as an optimization tool for combinatorial problems that attempt to minimize a cost function. In addition to accepting modified
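
The heart of the heuristic is the Metropolis acceptance rule: a perturbed solution that worsens the cost is still accepted with probability exp(-delta/T), where the temperature T is gradually lowered. A generic sketch follows; the cooling rate, iteration counts, and temperature bounds are assumed tuning parameters, not the settings used in this paper:

    import math
    import random

    def simulated_annealing(cost, perturb, x0, T0=1.0, alpha=0.95,
                            iters_per_temp=100, T_min=1e-4):
        """Generic SA loop: accept worse solutions with probability
        exp(-delta/T) and cool the temperature geometrically."""
        x, fx = x0, cost(x0)
        best, fbest = x, fx
        T = T0
        while T > T_min:
            for _ in range(iters_per_temp):
                y = perturb(x)
                fy = cost(y)
                delta = fy - fx
                if delta <= 0 or random.random() < math.exp(-delta / T):
                    x, fx = y, fy            # Metropolis acceptance
                    if fx < fbest:
                        best, fbest = x, fx  # track the best-so-far
            T *= alpha                       # geometric cooling
        return best, fbest

Pairing cost=perf_km with a perturbation of the cluster centers gives the basic shape of the SA clustering algorithms compared in this paper.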

Our proposed algorithm: SAGMDE

Our Simulated Annealing with Gaussian Mutation and Distortion Equalization algorithm differs from the previously mentioned algorithms in its combination of a small-perturbation method that allows for small shifts in the cluster centers with a large-perturbation method that allows for bigger, strategic shifts in centers. Specifically, the algorithm uses the Gaussian perturbation method (seen in SAGM) that provides these smaller shifts in conjunction with the distortion equalization and utility
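
The two perturbation operators can be sketched as follows. The Gaussian mutation is straightforward; the distortion-equalization move below is only our reading of the enhanced-LBG idea cited by the paper (relocating a center from a low-distortion cluster into a high-distortion one), and sigma, labels, and rng are assumed inputs rather than the paper's exact formulation:

    import numpy as np

    def gaussian_perturb(centers, sigma, rng):
        """Small shift: jitter every center with zero-mean Gaussian noise."""
        return centers + rng.normal(0.0, sigma, size=centers.shape)

    def distortion_equalization_move(X, centers, labels, rng):
        """Large shift (sketch): move the center of the cluster with the
        lowest distortion into the cluster with the highest distortion,
        pushing the per-cluster distortions toward equality."""
        K = len(centers)
        distortions = np.array([((X[labels == k] - centers[k]) ** 2).sum()
                                for k in range(K)])
        lo, hi = distortions.argmin(), distortions.argmax()
        centers = centers.copy()
        # Re-seed the low-distortion center at a random point of the
        # high-distortion cluster (assumes that cluster is non-empty).
        centers[lo] = X[labels == hi][rng.integers((labels == hi).sum())]
        return centers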

Experimental study

The experimental results below compare our algorithm, SAGMDE, with some of the aforementioned algorithms, namely KM, BH, SAKM, and SAGM. As all of these algorithms are center-based clustering algorithms, they should not be applied to highly nonlinear data sets such as concentric rings. We ran the algorithms 20 times on a diverse set of 9 data sets: 6 of these were taken from the Machine Learning Laboratory [37] (details on the real data sets used can be found in Table 1), and we created 3

Generative art comparison

Generative art uses an autonomous system, which is almost always a computer, to create art. By introducing a degree of randomness into the creation of the artwork, the same set of instructions can produce similar but unique pieces of art. In this paper, generative art offers a route to further explore and compare partitional clustering algorithms.

Our generative art allows for comparison between any two center-based, partitional clustering algorithms. To produce the art, both algorithms to be

Conclusion

This paper describes a new SA-based partitional clustering algorithm, SAGMDE, which yields better clusters and performs more consistently than similar existing algorithms. SAGMDE takes advantage of two different perturbation methods—one based on a Gaussian mutation and the other based on the equalization of center distortions—to allow for both small shifts for converging upon the best nearby local minimum and large shifts to find better local minima elsewhere. The algorithm utilizes

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank Dr. Marie-Pierre Jolly and Mingxiao Song for their helpful suggestions in the revision process. The authors would also like to thank the Pioneer Academics Team for their support during the paper writing process.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (40)

  • R.W. Klein et al., Experiments in projection and clustering by simulated annealing, Pattern Recognit. (1989)

  • G. Patané et al., The enhanced LBG algorithm, Neural Netw. (2001)

  • B. Beddad et al., An improvement of spatial fuzzy c-means clustering method for noisy medical image analysis, 2019 6th Int. Conf. on Image and Signal Process. and their Appl. (ISPA) (2019)

  • S.H. Shihab et al., RFM based market segmentation approach using advanced k-means and agglomerative clustering: a comparative study, 2019 Int. Conf. on Electr., Comput. and Commun. Eng. (ECCE) (2019)

  • D. Aloise et al., NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. (2009)

  • S. Merendino et al., A simulated annealing clustering algorithm based on center perturbation using Gaussian mutation

  • M. Verma et al., A comparative study of various clustering algorithms, Data Min., Int. J. Eng. Res. Appl. (IJERA) (2012)

  • F. Kuwil et al., A novel data clustering algorithm based on gravity center methodology, Expert Syst. Appl. (2020)

  • M. Ahmed et al., The k-means algorithm: a comprehensive survey and performance evaluation, Electronics (2020)

  • D. Pelleg et al., X-means: extending k-means with efficient estimation of the number of clusters

    Julian Lee is a student at The Pingry School graduating in 2021. His current research interests include machine learning and image processing.

    David Perkins received his Ph.D. in Mathematics from the University of Montana and now teaches in the computer science department at Hamilton College in central New York. His research interests are primarily related to artificial intelligence.
