Pattern Recognition, Volume 112, April 2021, 107713

A simulated annealing algorithm with a dual perturbation method for clustering

https://doi.org/10.1016/j.patcog.2020.107713

Highlights

  • Existing partitional clustering algorithms still settle upon local optima.

  • We propose a new simulated annealing algorithm with two perturbation methods.

  • We compare our algorithm with existing simulated annealing clustering algorithms.

  • We show our new algorithm produces clusters of higher quality more consistently.

Abstract

Clustering is a powerful tool in exploratory data analysis that partitions a set of objects into clusters with the goal of maximizing the similarity of objects within each cluster. Due to the tendency of clustering algorithms to find suboptimal partitions of data, the approximation method Simulated Annealing (SA) has been used to search for near-optimal partitions. However, existing SA-based partitional clustering algorithms still settle to local optima. We propose a new SA-based clustering algorithm, the Simulated Annealing with Gaussian Mutation and Distortion Equalization algorithm (SAGMDE), which uses two perturbation methods to allow for both large and small perturbations in solutions. Our experiments on a diverse collection of data sets show that SAGMDE performs more consistently and yields better results than existing SA clustering algorithms in terms of cluster quality while maintaining a reasonable runtime. Finally, we use generative art as a visualization tool to compare various partitional clustering algorithms.

Introduction

The ever-growing abundance of data prompts further research into data mining. An important and well-studied technique within the field of data mining is clustering, a type of unsupervised machine learning that involves grouping data based on predefined attributes. Clustering is applicable to a vast array of fields ranging from statistics to engineering to psychology [1]. For example, this tool has applications in medical fields through image segmentation [2] as well as business through market analysis [3].

Clustering is an NP-hard problem [4] that is generally solved by assigning N data points in D dimensions to K clusters with the goal of optimizing these clusters based on a given criterion. There are two main categories of clustering: hierarchical and partitional [5]. In hierarchical clustering, objects in the data set are grouped into a nested hierarchy of clusters. Partitional clustering involves dividing the objects into groups simultaneously without creating a nested structure.

Researchers have also investigated a third category of clustering, density-based clustering, which builds clusters based on the density of data points in a given region of the D-dimensional space. This class of algorithms has the benefit of reduced sensitivity to noise and cluster shape, but this strategy generally does not perform well for high-dimensional data [6]. Examples of algorithms that fully or partially employ a density-based strategy include a critical distance-based algorithm [7] and a gravity-center algorithm [8].

This paper will focus on partitional clustering algorithms. The most popular approach to partitional clustering is center-based clustering. Center-based, iterative clustering algorithms follow these two general steps:

  1. Centers are initialized.

  2. Until convergence, every point’s membership to each center and every point’s weight are recomputed and used to recalculate the centers (a generic sketch of this template follows the list).
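
To make the template concrete, the following Python sketch (our illustration, not code from the paper) expresses the generic loop; membership_fn and weight_fn are assumed hooks that each specific center-based algorithm would supply:

    import numpy as np

    def center_based_clustering(X, init_centers, membership_fn, weight_fn,
                                max_iters=100, tol=1e-6):
        """Generic two-step template: initialize centers, then alternate
        membership/weight computation with center recalculation."""
        centers = init_centers.copy()
        for _ in range(max_iters):
            # Step 2a: recompute every point's membership to each center
            # (an N x K matrix) and every point's weight (length-N vector).
            memberships = membership_fn(X, centers)
            weights = weight_fn(X, centers)
            # Step 2b: recalculate each center as a weighted average.
            w = memberships * weights[:, None]                # N x K
            new_centers = (w.T @ X) / w.sum(axis=0)[:, None]  # K x D
            if np.linalg.norm(new_centers - centers) < tol:   # converged
                return new_centers
            centers = new_centers
        return centers

For KM, membership_fn would return a one-hot nearest-center assignment and weight_fn would return all ones, since every point in KM carries the same weight.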

The K-means algorithm (KM) is the best-known center-based clustering algorithm. It finds K centers in D dimensions with the goal of maximizing the similarity of the elements within each cluster based upon the provided attributes of each object. To accomplish this goal, the algorithm attempts to minimize the sum of the squared distances between each point and its closest center, known as the Sum of Squares Error (SSE). Given that $X = \{x_i \mid i = 1, \ldots, N\}$ represents the $N$ data points and $M = \{m_l \mid l = 1, \ldots, K\}$ represents the current $K$ centers, the performance function for K-means is shown below:

$$\mathrm{Perf}_{KM}(X, M) = \sum_{i=1}^{N} \min\left\{ \|x_i - m_l\|^2 \;\middle|\; l = 1, \ldots, K \right\}$$
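
This performance function translates directly into NumPy; a minimal sketch, assuming the rows of X are the N points and the rows of M are the K centers:

    import numpy as np

    def perf_km(X, M):
        """Sum of Squares Error: squared Euclidean distance from each
        point to its nearest center, summed over all points."""
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        return d2.min(axis=1).sum()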

The distance-based approach of KM requires quantitative data, but researchers have produced variants of KM to perform clustering on mixed numerical and categorical data [9]. KM has a hard membership function, meaning that each point belongs to only one cluster (in this case its nearest cluster) as opposed to soft membership, where a proportion of each point belongs to multiple centers [10]. Additionally, every point in KM has the same weight. In KM, only the points close to a given center are considered in the recalculation of that center’s position; as a result, KM is sensitive to initialization and converges upon a local minimum.

There are also studies aimed at finding the optimal value of K for center-based clustering algorithms, such as Pelleg and Moore’s study [11]. Guan et al. [12] proposed the K-means+ algorithm, which automatically finds a near-optimal K.

Various initialization methods have been investigated to improve the clusters found by KM. Two popular algorithms include Forgy’s method [13], in which the initial centers are assigned as K random observations in the data set, and MacQueen’s method (the Random Partition method) [14], in which objects are randomly assigned to clusters and the initial centers are calculated as the centroids of these clusters. However, in a comparative study of initialization methods for K-means, Celebi et al. [15] showed there are alternative initialization methods that outperform these popular algorithms. Additionally, Fränti and Sieranoja [16] found that data sets with well-separated clusters still pose a challenge for all tested existing initialization methods.
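
Both classic initialization methods take only a few lines each; a sketch using NumPy's Generator API (rng is an assumed np.random.default_rng() instance):

    import numpy as np

    def forgy_init(X, K, rng):
        """Forgy: use K distinct random observations as initial centers."""
        idx = rng.choice(len(X), size=K, replace=False)
        return X[idx].copy()

    def random_partition_init(X, K, rng):
        """MacQueen's Random Partition: assign every object to a random
        cluster and take the cluster centroids as the initial centers.
        (Assumes each cluster receives at least one point; production
        code would guard against empty clusters.)"""
        labels = rng.integers(0, K, size=len(X))
        return np.stack([X[labels == k].mean(axis=0) for k in range(K)])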

Beyond experimenting with initialization methods, researchers have studied the use of different performance functions to improve cluster quality. The Fuzzy K-Means (FKM) algorithm, proposed by Dunn [17], as well as the K-Harmonic Means (KHM) algorithm, proposed by Zhang et al. [18], use new performance functions that account for soft membership such that every point influences the location of each center. KHM applies varying weights to points to further optimize the output, but it still converges upon a local minimum rather than the global minimum [18].
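
For reference, KHM replaces the minimum in the SSE with a harmonic average over all K distances, so every point contributes to every center. A sketch of the commonly cited formulation, where the exponent p is a tuning parameter (values of p >= 2 are the usual recommendation) and eps is an assumed guard against division by zero:

    import numpy as np

    def perf_khm(X, M, p=3.5, eps=1e-12):
        """K-Harmonic Means performance: for each point, K divided by the
        sum of inverse p-th powers of its distances to the centers."""
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # (N, K)
        return (M.shape[0] / (1.0 / np.maximum(d, eps) ** p).sum(axis=1)).sum()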

Metaheuristics serve as a different way to address this local optima problem [19]. Some authors [20], [21] used the local search heuristic, others [22], [23] applied the Tabu search method [24], and still others [25], [26] used genetic algorithms. Nature-based metaheuristic methods have also been explored [27]. Additionally, competitive learning algorithms [28], [29] have been proposed, which use an abundance of initial centers (seeds) that “compete,” “cooperate,” or both to eventually yield a given number of cluster centers; these algorithms have the benefit of automatically determining the number of cluster centers. This paper will specifically focus on the use of the simulated annealing (SA) algorithm [30].

This paper will describe the simulated annealing heuristic and its current applications to the clustering problem in Section 2. Section 3 describes our proposed algorithm. Experimental results that compare the proposed algorithm with other clustering algorithms using the simulated annealing heuristic are detailed in Section 4. Section 5 covers a comparison of center-based, partitional clustering algorithms through generative art.

Section snippets

Simulated annealing with K-means

The SA heuristic is modeled after the physical annealing process, in which a solid is slowly cooled until its atoms are repositioned to form a crystal with a low-energy state [31]. Metropolis et al. described the key components of the algorithm in 1953 [30], and the full algorithm was first presented by Kirkpatrick et al. [32], who also proposed the algorithm’s use as an optimization tool for combinatorial problems that attempt to minimize a cost function. In addition to accepting modified
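
The heart of the heuristic is the Metropolis acceptance rule: a perturbed solution that worsens the cost is still accepted with probability exp(-delta/T), where the temperature T is gradually lowered. A generic sketch follows; the cooling rate, iteration counts, and temperature bounds are assumed tuning parameters, not the settings used in this paper:

    import math
    import random

    def simulated_annealing(cost, perturb, x0, T0=1.0, alpha=0.95,
                            iters_per_temp=100, T_min=1e-4):
        """Generic SA loop: accept worse solutions with probability
        exp(-delta/T) and cool the temperature geometrically."""
        x, fx = x0, cost(x0)
        best, fbest = x, fx
        T = T0
        while T > T_min:
            for _ in range(iters_per_temp):
                y = perturb(x)
                fy = cost(y)
                delta = fy - fx
                if delta <= 0 or random.random() < math.exp(-delta / T):
                    x, fx = y, fy            # Metropolis acceptance
                    if fx < fbest:
                        best, fbest = x, fx  # track the best-so-far
            T *= alpha                       # geometric cooling
        return best, fbest

Pairing cost=perf_km with a perturbation of the cluster centers gives the basic shape of the SA clustering algorithms compared in this paper.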

Our proposed algorithm: SAGMDE

Our Simulated Annealing with Gaussian Mutation and Distortion Equalization algorithm differs from the previously mentioned algorithms in its combination of a small-perturbation method that allows for small shifts in the cluster centers with a large-perturbation method that allows for bigger, strategic shifts in centers. Specifically, the algorithm uses the Gaussian perturbation method (seen in SAGM) that provides these smaller shifts in conjunction with the distortion equalization and utility
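
The two perturbation operators can be sketched as follows. The Gaussian mutation is straightforward; the distortion-equalization move below is only our reading of the enhanced-LBG idea cited by the paper (relocating a center from a low-distortion cluster into a high-distortion one), and sigma, labels, and rng are assumed inputs rather than the paper's exact formulation:

    import numpy as np

    def gaussian_perturb(centers, sigma, rng):
        """Small shift: jitter every center with zero-mean Gaussian noise."""
        return centers + rng.normal(0.0, sigma, size=centers.shape)

    def distortion_equalization_move(X, centers, labels, rng):
        """Large shift (sketch): move the center of the cluster with the
        lowest distortion into the cluster with the highest distortion,
        pushing the per-cluster distortions toward equality."""
        K = len(centers)
        distortions = np.array([((X[labels == k] - centers[k]) ** 2).sum()
                                for k in range(K)])
        lo, hi = distortions.argmin(), distortions.argmax()
        centers = centers.copy()
        # Re-seed the low-distortion center at a random point of the
        # high-distortion cluster (assumes that cluster is non-empty).
        centers[lo] = X[labels == hi][rng.integers((labels == hi).sum())]
        return centers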

Experimental study

The experimental results below compare our algorithm, SAGMDE, with some of the aforementioned algorithms, namely KM, BH, SAKM, and SAGM. As all of these algorithms are center-based clustering algorithms, they should not be applied to highly nonlinear data sets such as concentric rings. We ran the algorithms 20 times on a diverse set of 9 data sets: 6 of these were taken from the Machine Learning Laboratory [37] (details on the real data sets used can be found in Table 1), and we created 3

Generative art comparison

Generative art uses an autonomous system, which is almost always a computer, to create art. By introducing a degree of randomness into the creation of the artwork, the same set of instructions can produce similar but unique pieces of art. In this paper, generative art offers a route to further explore and compare partitional clustering algorithms.

Our generative art allows for comparison between any two center-based, partitional clustering algorithms. To produce the art, both algorithms to be

Conclusion

This paper describes a new SA-based partitional clustering algorithm, SAGMDE, which yields better clusters and performs more consistently than similar existing algorithms. SAGMDE takes advantage of two different perturbation methods—one based on a Gaussian mutation and the other based on the equalization of center distortions—to allow for both small shifts for converging upon the best nearby local minimum and large shifts to find better local minima elsewhere. The algorithm utilizes

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank Dr. Marie-Pierre Jolly and Mingxiao Song for their helpful suggestions in the revision process. The authors would also like to thank the Pioneer Academics Team for their support during the paper writing process.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (40)

  • R.W. Klein et al., Experiments in projection and clustering by simulated annealing, Pattern Recognit. (1989)

  • G. Patané et al., The enhanced LBG algorithm, Neural Netw. (2001)

  • B. Beddad et al., An improvement of spatial fuzzy c-means clustering method for noisy medical image analysis, 2019 6th Int. Conf. on Image and Signal Process. and their Appl. (ISPA) (2019)

  • S.H. Shihab et al., RFM based market segmentation approach using advanced k-means and agglomerative clustering: a comparative study, 2019 Int. Conf. on Electr., Comput. and Commun. Eng. (ECCE) (2019)

  • D. Aloise et al., NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. (2009)

  • S. Merendino et al., A simulated annealing clustering algorithm based on center perturbation using Gaussian mutation

  • M. Verma et al., A comparative study of various clustering algorithms, Data Min., Int. J. Eng. Res. Appl. (IJERA) (2012)

  • F. Kuwil et al., A novel data clustering algorithm based on gravity center methodology, Expert Syst. Appl. (2020)

  • M. Ahmed et al., The k-means algorithm: a comprehensive survey and performance evaluation, Electronics (2020)

  • D. Pelleg et al., X-means: extending k-means with efficient estimation of the number of clusters

    Julian Lee is a student at The Pingry School graduating in 2021. His current research interests include machine learning and image processing.

    David Perkins received his Ph.D. in Mathematics from the University of Montana and now teaches in the computer science department at Hamilton College in central New York. His research interests are primarily related to artificial intelligence.
