Benchmarking the Clustering Performances of Evolutionary Algorithms: A Case Study on Varying Data Size
Introduction
Due to developments in communication technologies, the number of connected devices, known as the Internet of Things (IoT), and of internet users has reached billions around the world. For example, Hans Vestberg (Ericsson's former CEO) expected 50 billion connected devices, while Statista projects 30.73 billion [1]. The use of these devices in different domains such as marketing, social media, the military, banking, transportation, telecommunications and health care creates huge volumes of data in various formats, today called Big Data [2]. In particular, smart devices such as smartphones and internet-connected devices equipped with sensors continuously generate data known as data streams, the analysis of which is one of today's popular research areas.
The storage, processing and analysis capabilities of today's standard computers cannot cope with Big Data, so new technologies are needed to meet these demands. Many scientists have focused on these problems and presented studies addressing different aspects, such as storage (cloud platforms), frameworks (like Hadoop or Spark) and databases able to store and process big data effectively while meeting high-performance read and write demands (NoSQL systems like MongoDB, Cassandra, etc.) [3]. Since a Big Data collection can include anywhere from thousands to millions of records, analyzing such data for many purposes is also a time-consuming process and thus becomes another challenge.
Although the characteristics of Big Data are generally summarized as 3 Vs, some researchers introduce extra Vs, listing as many as 10 [4]:
1) Volume: the massive amount of data gathered from different resources, such as sensors deployed on an airplane or Facebook and Twitter messages.
2) Velocity: the speed at which data changes, such as millions of messages on different subjects on Facebook, or speed-sensor readings updated every second.
3) Variety: the different formats data can take, such as numeric, character, video or image.
4) Veracity: the trustworthiness of the data, which can be uncertain for reasons caused by users, networks, sensors, etc.
5) Visualization: the difficulty visualization tools have showing billions of records on graphs, due to scalability, response time and functionality limits.
6) Validity: whether the data are correct and accurate enough for the intended use.
7) Volatility: how long the data need to be stored.
8) Vulnerability: security concerns.
9) Variability: inconsistencies in the data, such as outliers or anomalies, that should be eliminated for better analysis.
10) Value: the benefit gained from analyzing the data in different domains, arguably the most important characteristic [4].
Extracting information from datasets, whether big or not, is called data mining. Data mining is very popular in today's world, as the importance of data has been widely recognized, and it is used in many domains such as marketing, banking, telecommunications and health. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive tasks characterize the general properties of the data in a database, while predictive tasks perform inference on the current data in order to make predictions about future data [5]. Each category uses many different methods: classification, regression and time-series analysis for predictive tasks; clustering and association rules for descriptive tasks.
Clustering is one of the most widely used descriptive methods in data mining; it groups data items that are similar to one another. It is based on unsupervised learning, meaning that a supervisor or domain expert is not required to analyze the dataset. Clustering includes many algorithms in different categories: partitioning-based (K-means, Partitioning Around Medoids (PAM), Fuzzy C-Means (FCM), K-modes), hierarchical (CURE, Chameleon, BIRCH), density-based (DBSCAN, DENCLUE, OPTICS), grid-based (STING, OptiGrid, CLIQUE) and model-based (COBWEB, CLASSIT, EM). Although the clustering literature offers many algorithms such as those listed above, used for data analysis in domains such as pattern recognition, bioinformatics, machine learning and data mining, it is very difficult to select a single algorithm as the most appropriate for all dataset sizes, because each has its own advantages and disadvantages [6].
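To make the baseline concrete, a minimal K-means loop can be sketched as follows. This is an illustrative sketch of the standard algorithm, not the exact implementation benchmarked in the paper; the function name and parameters are our own:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # empty clusters keep their previous centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids
```

K-means is fast and simple, but its result depends on the random initialization and it assumes roughly spherical clusters, which is one reason alternative clustering approaches keep being proposed.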
Evolutionary algorithms use the mechanisms of biological evolution to solve real-life optimization problems. Since real-life optimization problems have many constraints and independent variables, their solution spaces are vast, and finding an optimal solution to such complex problems with classical methods is time-consuming. For this reason, evolutionary and nature-inspired algorithms are used to find a near-optimal solution in reasonable time. Many scientists have treated clustering problems as optimization problems and presented many papers based on some of these algorithms. Particle Swarm Optimization (PSO) based clustering [7], [8], Genetic Algorithm (GA) based clustering [9], [10], [11], Grey Wolf Optimization (GWO) based clustering [12], [13] and Biogeography-Based Optimization (BBO) based clustering [14], [15] are some examples of modern, popular nature-inspired optimization algorithms applied to clustering problems. As the literature shows, optimization algorithms have great potential to be easily used for clustering data in various scenarios and across data sizes from small to big data.
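A common way to cast clustering as a continuous optimization problem, as these algorithms require, is to encode each candidate solution as a flat vector of k centroid coordinates and score it by the total within-cluster distance. The sketch below shows this standard formulation; the paper does not spell out its exact objective function, and the names here are illustrative:

```python
import numpy as np

def clustering_fitness(position, X, k):
    """Score one candidate solution. `position` is a flat vector of
    length k * n_features holding k centroids; lower total
    within-cluster squared distance means a better clustering."""
    centroids = position.reshape(k, X.shape[1])
    # squared distance of every point to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    # each point contributes its distance to the nearest centroid
    return d2.min(axis=1).sum()
```

With this encoding, any population-based optimizer (PSO, GA, GWO, BBO, ...) can search over vectors in R^(k*d) without any clustering-specific modification, which is exactly what makes these algorithms easy to reuse here.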
Besides data mining, machine learning is another discipline of computer science that uses existing data to train and test models designed for different aims with various methods. Deep learning is one of the popular machine-learning methods; it has gained popularity over the last five years and has been used in many studies with different algorithms [16], [17], [18], [19], [20], [21]. Deep learning has also been applied to data mining in some studies [22], [23], [24].
Several algorithms have been proposed in the literature for clustering in data mining, but selecting the right algorithm for good clustering performance is a challenge, since each has advantages and disadvantages. Evolutionary algorithms are easily applicable to any problem: if the problem can be modeled as an optimization problem, it can be solved with these algorithms. They can therefore serve as alternatives for many kinds of specific problems, from solving non-linear equations to clustering, and some evolutionary algorithms can outperform others.
Recently, evolutionary and nature-inspired algorithms have been applied to many types of engineering problems almost like a magic tool. If the problem model is designed well, with its variables, constraints and formulation, these algorithms are easy to adapt to any engineering problem. In the latest studies, we see that evolutionary algorithms have been used to cluster data for specific problems; for example, two recent studies [25], [26] used the Moth Flame Optimization algorithm for clustering.
From this point of view, each evolutionary algorithm is a candidate for data clustering. In our work, four well-known algorithms, GWO, BBO, PSO and GA, are benchmarked for clustering performance against the well-known clustering algorithm K-means on different data sizes.
To examine the relationship between the clustering performance of the algorithms and dataset size, datasets of different sizes were used in our experiments. For small-scale comparison, we created eight synthetic datasets. For medium- and large-scale comparison, we used datasets common in the literature: Multi-hop Outdoor Real Data (MHORD), Multi-hop Indoor Real Data (MHIRD), Single-hop Outdoor Real Data (SHORD) and Single-hop Indoor Real Data (SHIRD) [27], and DARPA KDDCup99 [28].
The contributions of the paper are:
1) Unlike many papers, we used evolutionary algorithms to cluster datasets of varying sizes, not just one specific dataset.
2) We propose using some of today's popular nature-inspired optimization algorithms, BBO, GWO, PSO and GA, for clustering, as they have not all been used together in the same study.
3) We benchmark the clustering performance of these widely used evolutionary algorithms across data sizes from small to Big Data, and against each other, without any modification of the algorithms for clustering. Thus, in this work, we solve the clustering problem as a continuous optimization problem.
4) The clustering performances of BBO, GWO, PSO and GA on varying data sizes are also compared to the popular, well-known clustering algorithm K-means, and the results are evaluated.
The rest of the paper is organized as follows. Related works about clustering studies in literature are given in Section 2. Detailed information about the PSO-based, GA-based, GWO-based and BBO-based clustering are given in Section 3. Then, performance results and comparisons of the specified algorithms on datasets are given in Section 4. Finally, Section 5 concludes the paper.
Related works
An improved version of DENCLUE, called DENCLUE-IM, was proposed in [29] to balance clustering performance and response time on big data. The idea behind the new approach was to speed up computation by avoiding the hill-climbing step of DENCLUE.
The study [9] reported that the Genetic Algorithm was self-sufficient for handling big data clustering in global space on social-media-generated big data. The Genetic Algorithm has been used for
Evolutionary algorithms approach to clustering
The evolutionary and nature-inspired algorithms used in this paper share some common properties. All of them are population-based, nature-inspired algorithms: they start with initial random solutions and improve these solutions with strategies inspired by nature. PSO and GWO share very similar properties: both use positions of a population in a d-dimensional space and have few parameters to adapt, although GWO uses more formulas than PSO. The simplest one among the algorithms
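The shared structure noted above, a population of positions in d-dimensional space improved iteratively toward the best solutions found, can be illustrated with a minimal PSO in its standard form. This is a generic sketch with typical default parameter values, not the configuration used in the paper:

```python
import numpy as np

def pso_minimize(f, dim, n_particles=20, iters=100, lo=-5.0, hi=5.0,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal standard PSO: each particle's velocity is pulled toward
    its personal best and the swarm's global best position."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lo, hi, (n_particles, dim))   # initial random solutions
    vel = np.zeros((n_particles, dim))
    pbest, pbest_val = pos.copy(), np.array([f(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # inertia + cognitive pull (personal best) + social pull (global best)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([f(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()
```

Passing a clustering objective (e.g. total within-cluster distance over a flat centroid vector) as `f` turns this generic optimizer into a clustering algorithm, which is the sense in which these algorithms need no modification to be applied here.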
Datasets
To compare the clustering performances of PSO, GWO, BBO and GA, different datasets were used in this study; information about these datasets is shown in Table 2.
The datasets Synt1 to Synt8 are synthetic datasets created by us, and we can share them with anyone who wants to use them for further studies. As can be seen in Table 2, four of them have only two dimensions, in order to present the clustering performance visually. Since most of the clustering algorithms
Conclusion
Clustering has been used in data analysis and data mining for years, with many clustering algorithms. In this paper, we use evolutionary optimization algorithms for clustering, as recent studies in various scenarios show their great potential for easily clustering datasets. We used the evolutionary algorithms Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) for clustering datasets which have varying size
Human and animal rights
The authors declare that the work described has not involved experimentation on humans or animals.
Funding
This work did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author contributions
All authors attest that they meet the current International Committee of Medical Journal Editors (ICMJE) criteria for Authorship.
CRediT authorship contribution statement
F. Kayaalp: Conceptualization, Data curation, Writing - original draft. P. Erdogmus: Formal analysis, Writing - review & editing, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial or personal relationships that could be viewed as influencing the work reported in this paper.
References (58)
- Big data for Internet of Things: a survey. Future Gener Comput Syst (2018).
- A grey wolf optimizer based automatic clustering algorithm for satellite image segmentation. Proc Comput Sci (2017).
- Grey wolf optimization based clustering algorithm for vehicular ad-hoc networks. Comput Electr Eng (2018).
- A survey on deep learning in medical image analysis. Med Image Anal (2017).
- DENCLUE-IM: a new approach for big data clustering. Proc Comput Sci (2016).
- Applying population-based evolutionary algorithms and a neuro-fuzzy system for modeling landslide susceptibility. Catena (2019).
- Grey wolf optimizer. Adv Eng Softw (2014).
- Comparing clusterings—an information based distance. J Multivar Anal (2007).
- Survey on NoSQL database.
- The 10 Vs of big data.
- Data mining techniques: a survey paper. Int J Res Eng Technol.
- A comprehensive survey of clustering algorithms. Ann Data Sci.
- Performance analysis of parallel particle swarm optimization based clustering of students.
- A novel initialization method for particle swarm optimization-based FCM in big biomedical data.
- Social media generated big data clustering using genetic algorithm.
- Distributed genetic algorithm to big data clustering.
- Genetic algorithm based approach to determine optimal collection points for big data gathering in distributed sensor networks.
- Data clustering using enhanced biogeography-based optimization.
- Biogeography-based optimization for cluster analysis.
- Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw.
- Faster R-CNN: towards real-time object detection with region proposal networks.
- Detection of skin diseases from dermoscopy image using the combination of convolutional neural network and one-versus-all. J Artif Intell Syst.
- Clustering with deep learning: taxonomy and new methods.
- Deep learning applications for hyperspectral imaging: a systematic review. J Inst Electron Comput.
- Data mining: practical machine learning tools and techniques with Java implementations. SIGMOD Rec.
- A survey of data mining and deep learning in bioinformatics. J Med Syst.
- DLTK: state of the art reference implementations for deep learning on medical images.
- Moth flame clustering algorithm for internet of vehicle (MFCA-IoV). IEEE Access.