Elsevier

IRBM

Volume 41, Issue 5, October 2020, Pages 267-275

Original Article
Benchmarking the Clustering Performances of Evolutionary Algorithms: A Case Study on Varying Data Size

https://doi.org/10.1016/j.irbm.2020.06.002

Highlights

  • Evolutionary algorithms are used to cluster datasets of varying sizes.

  • Popular nature-inspired optimization algorithms (BBO, GWO, PSO, GA) are used together.

  • The clustering problems have been modeled as continuous optimization problems.

  • Clustering performances of algorithms on datasets are compared against K-means.

Abstract

Background and objective

Clustering has been a widely used method for data analysis for years, and many clustering algorithms have been developed. Today it is used in many prediction, collaborative filtering and automatic segmentation systems in different domains. To be broadly used in practice, new clustering algorithms need to offer both better performance and greater robustness than the ones currently in use. In recent years, evolutionary algorithms have been used in many domains since they are robust and easy to implement, and many clustering problems can be solved with such algorithms if the problem is modeled as an optimization problem. In this paper, we present an optimization approach to clustering using four well-known evolutionary algorithms: Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO).

Method

The objective function is specified to minimize the total distance from the cluster centers to the data points, with Euclidean distance used as the distance measure. We have applied this objective function to the given algorithms both to find the most efficient clustering algorithm and to compare the clustering performances of the algorithms across different data sizes. To benchmark the algorithms, we have used a number of datasets of different sizes: small-scale, medium-scale and big data. The clustering performances have been compared to K-means, as it has been a widely used clustering algorithm in the literature for years. Rand Index, Adjusted Rand Index, Mirkin's Index and Hubert's Index have been used as the criteria for evaluating the clustering performances.
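As an illustration of the objective described above, the following is a minimal sketch, not the authors' implementation. It assumes that each candidate solution is a flat vector of k * d real values (k centers of dimension d) and that each point contributes its Euclidean distance to the nearest center; the abstract does not spell out either detail. The names `decode` and `clustering_cost` and the toy data are hypothetical.

```python
import math


def decode(solution, k, d):
    """Split a flat vector of k * d values into k centers of dimension d."""
    if len(solution) != k * d:
        raise ValueError("solution must contain exactly k * d values")
    return [solution[i * d:(i + 1) * d] for i in range(k)]


def clustering_cost(solution, points, k):
    """Sum over all points of the Euclidean distance to the nearest center.

    Assigning each point to its nearest center is an assumption made for
    this sketch.
    """
    d = len(points[0])
    centers = decode(solution, k, d)
    return sum(min(math.dist(p, c) for c in centers) for p in points)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0.0, 1.0, 10.0, 1.0]  # centers (0, 1) and (10, 1)
bad = [5.0, 1.0, 5.0, 1.0]    # both centers at (5, 1)

print(clustering_cost(good, points, k=2))  # 4.0
print(clustering_cost(bad, points, k=2))   # larger than the cost of `good`
```

Because the cost depends only on a vector of real numbers, any continuous optimizer can minimize it without clustering-specific operators.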

Result

In the clustering experiments over datasets of varying sizes, GA and GWO show better clustering performance than the other algorithms according to the specified performance criteria.
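The performance criteria named in the Method paragraph compare a produced partition with reference labels. As a hedged illustration, the plain Rand Index can be computed as the fraction of point pairs on which the two partitions agree (both place the pair in the same cluster, or both place it in different clusters). The function name `rand_index` and the toy labels are hypothetical, and this pairwise loop is a readable sketch rather than an efficient implementation for large datasets.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two partitions agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


truth = [0, 0, 1, 1]
print(rand_index(truth, [1, 1, 0, 0]))  # 1.0: same grouping, different label names
print(rand_index(truth, [0, 1, 0, 1]))  # 2 of 6 pairs agree
```

The index depends only on which points are grouped together, so renaming the cluster labels does not change the score.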

Conclusions

The results of the study show that, although the algorithms achieve satisfactory clustering results on small and medium-scale datasets, their clustering performance on Big Data needs to be improved.

Introduction

Due to developments in communication technologies, the number of connected devices, known as the Internet of Things (IoT), and the number of internet users have reached billions around the world. For example, Hans Vestberg (Ericsson's former CEO) expected 50 billion devices to be connected, while Statista puts the figure at 30.73 billion devices [1]. The usage of these devices in different domains such as marketing, social media, military, banking, transportation, telecommunications and health care creates huge amounts of data in various formats, which today is called Big Data [2]. In particular, smart devices such as smartphones or internet-connected devices equipped with sensors continuously create data called data streams, whose analysis is one of the popular data analysis topics of today.

The storage, processing and analysis capabilities of today's standard computers are not sufficient to deal with Big Data, so new technologies are needed to meet these demands. Many scientists have focused on these problems and have presented studies on different aspects such as storage (cloud platforms), frameworks (like Hadoop or Spark) and databases that can store and process big data effectively to meet the demand for high read and write performance (NoSQL types like MongoDB and Cassandra) [3]. As Big Data can contain from thousands to millions of records, analyzing such data is a time-consuming process, which becomes another challenge.

Although the characteristics of Big Data are generally known as the 3 Vs, some researchers introduce extra Vs and list 10 Vs. The first is Volume, which refers to the massive amount of data gathered from different sources such as sensors deployed on an airplane, or Facebook and Twitter messages. The second is Velocity, which refers to the speed at which the data change, such as millions of messages about different subjects on Facebook, or speed sensor data changing every second. The third is Variety, which refers to the types of data, which can be in any format such as numeric, character, video or image. The fourth is Veracity, which refers to how far the returned data can be trusted, as the data can sometimes be uncertain for reasons caused by users, the network, sensors and so on. The fifth is Visualization, which refers to the difficulty visualization tools have in showing billions of records on graphs due to limits in scalability, response time and functionality. The sixth is Validity, which means whether the data are correct and accurate enough for the intended use. The seventh is Volatility, which refers to how long the data need to be stored. The eighth is Vulnerability, which refers to security concerns. The ninth is Variability, which refers to inconsistencies in the data, such as outliers or anomalies, that should be eliminated for better analysis. The tenth and most important is Value, which refers to the benefit gained from analyzing the data in different domains [4].

Extracting information from datasets, whether big or not, is called data mining. Data mining is very popular today, as the importance of data has been widely recognized, and it is used in many domains such as marketing, banking, telecommunications and health. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive tasks characterize the general properties of the data in a database; predictive tasks perform inference on the current data in order to make predictions about future data [5]. Each category uses many different methods, such as classification, regression and time series analysis for predictive tasks, and clustering and association rules for descriptive tasks.

Clustering is one of the most widely used methods for descriptive data mining; it groups data items that are similar to one another. It is based on unsupervised learning, meaning that labels provided by a supervisor or domain expert are not required to analyze the dataset. Clustering includes many algorithms in different categories: partitioning-based (including K-means, Partitioning Around Medoids (PAM), Fuzzy C-Means (FCM), K-modes), hierarchical (including CURE, Chameleon, BIRCH), density-based (including DBSCAN, DENCLUE, OPTICS), grid-based (including STING, OptiGrid, CLIQUE) and model-based (including COBWEB, CLASSIT, EM). Although the clustering literature contains many algorithms, such as those listed above, which are used for data analysis in domains such as pattern recognition, bioinformatics, machine learning and data mining, it is very difficult to select a single algorithm as the most appropriate for datasets of all sizes, because each algorithm has advantages and disadvantages [6].

Evolutionary algorithms apply the mechanisms of biological evolution to real-life optimization problems. Since such problems have many constraints and independent variables, their solution spaces are vast, and finding an optimal solution to such complex problems with classical methods is time consuming. For this reason, evolutionary and nature-inspired algorithms are used to find a nearly optimal solution in a reasonable time. Many scientists have treated clustering problems as optimization problems and presented papers based on some of these evolutionary algorithms. Particle Swarm Optimization (PSO) based clustering [7], [8], Genetic Algorithm (GA) based clustering [9], [10], [11], Grey Wolf Optimization (GWO) based clustering [12], [13] and Biogeography-Based Optimization (BBO) based clustering [14], [15] are some examples of modern, popular nature-inspired optimization algorithms that have been applied to clustering problems. As seen in the literature, optimization algorithms have great potential for clustering data in various scenarios and at various data sizes, from small to big data.

Besides data mining, machine learning is another discipline in computer science, which uses existing data to train and test models designed for different aims with various methods. Deep learning is a machine learning method that has gained popularity over the last five years and has been used in many studies with different algorithms, such as [16], [17], [18], [19], [20], [21]. Deep learning is also used in data mining in some studies [22], [23], [24].

Several algorithms have been proposed in the literature for clustering in data mining, but selecting the right algorithm for good clustering performance is a challenge, since each algorithm has advantages and disadvantages. Evolutionary algorithms are broadly applicable: if a problem can be modeled as an optimization problem, it can be solved with these algorithms. They can therefore serve as alternatives for many kinds of specific problems, from solving non-linear equations to clustering. Some evolutionary algorithms can outperform others on a given problem.

Recently, evolutionary and nature-inspired algorithms have been applied to many types of engineering problems as general-purpose tools. If the problem is modeled well, with its variables, constraints and formulation, these algorithms are easy to adapt to a wide range of engineering problems. In the latest studies, evolutionary algorithms have been used to cluster data for a specific problem. For example, two recent studies [25], [26] have used the Moth Flame Optimization algorithm for clustering.

From this point of view, each evolutionary algorithm is a candidate for data clustering. In our work, four well-known algorithms (GWO, BBO, PSO and GA) are benchmarked on their clustering performance against a well-known clustering algorithm, K-means, on different data sizes.

To see the relationship between the clustering performance of the algorithms and the size of the datasets, datasets of different sizes have been used in our experiments. For the small-scale comparison, we created eight synthetic datasets. For the medium-scale and large-scale comparisons, we used datasets commonly used in the literature: Multi-hop Outdoor Real Data (MHORD), Multi-hop Indoor Real Data (MHIRD), Single-hop Outdoor Real Data (SHORD) and Single-hop Indoor Real Data (SHIRD) [27], and DARPA KDDCup99 [28].

The contributions of the paper are:

1) Unlike many papers, we used evolutionary algorithms to cluster datasets of varying sizes, not just one specific dataset.

2) We proposed using some of today's popular nature-inspired optimization algorithms (BBO, GWO, PSO and GA) for clustering, as they have not all been used together in the same study.

3) We aimed to benchmark the clustering performances of the most widely used evolutionary algorithms across various data sizes, from small to Big Data, and against each other, without any extra change to the algorithms for clustering. Accordingly, in this paper we have solved the clustering problems as continuous optimization problems.

4) The clustering performances of BBO, GWO, PSO and GA on varying data sizes are also compared to the popular, well-known clustering algorithm K-means, and the results are evaluated.
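To make the idea in contribution 3 concrete, the sketch below runs an unmodified global-best PSO over flat vectors of cluster-center coordinates. It is an illustration under stated assumptions, not the authors' code. The swarm size, iteration count, inertia weight `w`, acceleration coefficients `c1` and `c2`, the seed and the clamping of positions to the bounding box of the data are all hypothetical choices, and the function names `cost` and `pso_cluster` are invented for this example.

```python
import math
import random


def cost(solution, points, k):
    """Total Euclidean distance from each point to its nearest center."""
    d = len(points[0])
    centers = [solution[i * d:(i + 1) * d] for i in range(k)]
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def pso_cluster(points, k, n_particles=20, n_iter=100,
                w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over flat vectors of k * d center coordinates."""
    rng = random.Random(seed)
    d = len(points[0])
    dim = k * d
    # Per-coordinate bounds of the data, repeated once per center.
    lo = [min(p[j] for p in points) for j in range(d)] * k
    hi = [max(p[j] for p in points) for j in range(d)] * k
    pos = [[rng.uniform(lo[j], hi[j]) for j in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost(p, points, k) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (best_pos[i][j] - pos[i][j])
                             + c2 * r2 * (g_pos[j] - pos[i][j]))
                # Keep the particle inside the bounding box of the data.
                pos[i][j] = min(max(pos[i][j] + vel[i][j], lo[j]), hi[j])
            c = cost(pos[i], points, k)
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_pos, g_cost = pos[i][:], c
    return g_pos, g_cost


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, total = pso_cluster(points, k=2)
print(centers, total)
```

The optimizer never inspects cluster memberships directly; it only evaluates `cost`, which is why the same objective could in principle be handed to GA, GWO or BBO unchanged.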

The rest of the paper is organized as follows. Related work on clustering in the literature is presented in Section 2. Detailed information about PSO-based, GA-based, GWO-based and BBO-based clustering is given in Section 3. Performance results and comparisons of the specified algorithms on the datasets are given in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related works

An improved version of DENCLUE, called DENCLUE-IM, has been proposed in [29] to find a good trade-off between performance and response time when clustering big data. The idea behind the new approach is to speed up the calculation by avoiding the Hill Climbing step in DENCLUE.

The study [9] has reported that the Genetic Algorithm was self-sufficient for handling the big data clustering problem in global space on big data generated by social media. Genetic algorithm has been used for

Evolutionary algorithms approach to clustering

The evolutionary and nature-inspired algorithms used in the paper have some common properties. All of them are population-based and nature-inspired. They start with random initial solutions and improve these solutions with strategies inspired by nature. PSO and GWO have much in common: both use the positions of the population in a d-dimensional space, and both have only a few parameters to tune, although GWO has more formulas than PSO. The simplest one among the algorithms

Datasets

In order to compare the clustering performances of PSO, GWO, BBO and GA, different datasets were used in this study; information about these datasets is shown in Table 2.

The datasets from Synt1 to Synt8 are synthetic datasets created by us, and we can share them with anybody who wants to use them for further studies. As can be seen in Table 2, four of them have only two dimensions so that the clustering performance can be presented visually. Since most of the clustering algorithms

Conclusion

Clustering has been used in data analysis and data mining for years, with many clustering algorithms. In this paper, we use evolutionary optimization algorithms for clustering, as recent studies on various scenarios show that they have great potential for clustering datasets. We used the evolutionary algorithms Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) for clustering datasets which have varying size

Human and animal rights

The authors declare that the work described has not involved experimentation on humans or animals.

Funding

This work did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author contributions

All authors attest that they meet the current International Committee of Medical Journal Editors (ICMJE) criteria for Authorship.

CRediT authorship contribution statement

F. Kayaalp: Conceptualization, Data curation, Writing - original draft. P. Erdogmus: Formal analysis, Writing - review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial or personal relationships that could be viewed as influencing the work reported in this paper.

References (58)

  • G. Firican. The 10 Vs of big data.

  • N. Jain et al. Data mining techniques: a survey paper. Int J Res Eng Technol (2013).

  • D. Xu et al. A comprehensive survey of clustering algorithms. Ann Data Sci (2015).

  • K. Govindarajan. Performance analysis of parallel particle swarm optimization based clustering of students.

  • C.J. Wang. A novel initialization method for particle swarm optimization-based FCM in big biomedical data.

  • P. Sachar et al. Social media generated big data clustering using genetic algorithm.

  • M.H. Hajeer et al. Distributed genetic algorithm to big data clustering.

  • A.K. Mishra. Genetic algorithm based approach to determine optimal collection points for big data gathering in distributed sensor networks.

  • R. Pal et al. Data clustering using enhanced biogeography-based optimization.

  • X. Wu. Biogeography-based optimization for cluster analysis.

  • S. Lawrence. Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw (1997).

  • S. Ren. Faster R-CNN: towards real-time object detection with region proposal networks.

  • K. Polat et al. Detection of skin diseases from dermoscopy image using the combination of convolutional neural network and one-versus-all. J Artif Intell Syst (2020).

  • E. Aljalbout. Clustering with deep learning: taxonomy and new methods.

  • A. Ozdemir et al. Deep learning applications for hyperspectral imaging: a systematic review. J Inst Electron Comput (2020).

  • I.H. Witten et al. Data mining: practical machine learning tools and techniques with Java implementations. SIGMOD Rec (2002).

  • K. Lan. A survey of data mining and deep learning in bioinformatics. J Med Syst (2018).

  • N. Pawlowski. DLTK: state of the art reference implementations for deep learning on medical images.

  • M.F. Khan. Moth flame clustering algorithm for internet of vehicle (MFCA-IoV). IEEE Access (2018).