Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs

https://doi.org/10.1016/j.jpdc.2020.06.010

Highlights

  • For many problems, the number of clusters required by the K-means algorithm is not known in advance.

  • Adaptive strategies enhance the performance of the K-means algorithm.

  • Parallel implementation is mandatory for large clustering problems.

Abstract

The K-means algorithm is one of the most popular algorithms in Data Science; it aims to discover similarities among the elements of large datasets by partitioning them into K distinct groups, called clusters. The main weakness of this technique is that, in real problems, it is often impossible to define the value of K as input data. Furthermore, the large amount of data used for useful simulations makes executing the algorithm on traditional architectures impracticable. In this paper, we address both issues. On the one hand, we propose a method to define the value of K dynamically, by optimizing a suitable quality index with special care for the computational cost. On the other hand, to improve the performance and the effectiveness of the algorithm, we propose a strategy for parallel implementation on modern multicore CPUs.

Section snippets

Introduction and related works

In the last thirty years, several theories, methodologies, and tools have been introduced to learn from data, that is, to understand complex phenomena through the analysis of large structured or unstructured datasets representing real problems. This wealth of knowledge has often changed its name over the years (for example, data mining or big data), and today is commonly known as data science [10].

One of the most used tools in this field is a class of unsupervised learning

A new parallel adaptive K-means algorithm

This section has a double aim: on the one hand, we introduce a methodology to define the number of clusters dynamically while reducing the computational cost of the algorithm; on the other hand, we propose a parallel implementation for multicore environments.

A widespread method to define the value of K, without treating it as input data, is to execute the Basic K-means Algorithm several times with increasing values of K, until a given quality index, used as a measure for the

Implementation details

To better describe the new parallel Adaptive K-means Algorithm, in this section we report some implementation details, with particular attention to the data structures used to manage the clusters (see also Fig. 1).

To store the N elements xn, our implementation uses a 2-dimensional N×d static array S, where each row represents a d-dimensional element of the dataset. This choice is due to the greater efficiency in accessing the elements compared to other

Experimental results

We tested the accuracy and the efficiency of the proposed Adaptive K-means Algorithm by running several experiments on the HPC cluster available at the Department of Science and Technologies of the University of Naples Parthenope. In this facility, each node is equipped with two Intel Xeon 16-core 5218 CPUs running at 2.3 GHz, for a total of 32 computing cores per node, and 192 GB of main memory. On this system, we implemented Algorithm 3 in the C language using

Discussion

From Table 1, we observe that Algorithm 3 is more effective than Algorithm 2, measured in terms of the number of elements displaced among the clusters and the total execution time, mainly for large values of K (that is, the datasets related to the Letters, Wines, and Cardio problems). More precisely, when the number of iterations is large, we can better appreciate the effects of the adaptive strategy aimed at reusing the partition already defined in the previous iterations, with a

Conclusions

This paper describes our studies aimed at improving the performance of the K-means algorithm when the number of clusters K is not available as input data. This is a common situation in real applications, so traditional approaches are based on several runs of the algorithm with different values of K, attempting to optimize some quality index at a high risk of increasing the computational cost. The method we introduced is based on an adaptive procedure aimed at minimizing the number of

CRediT authorship contribution statement

Giuliano Laccetti: Conceptualization, Supervision, Resources. Marco Lapegna: Conceptualization, Methodology, Investigation, Writing - review & editing. Valeria Mele: Data curation, Software, Writing - original draft. Diego Romano: Software, Validation, Writing - original draft. Lukasz Szustak: Software, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by institutional funding provided by the universities of the researchers.


References (40)

  • de Campos, A., et al., SisPorto 2.0: a program for automated analysis of cardiotocograms, J. Matern. Fetal Neonatal Med. (2000)

  • Dhar, V., Data science and prediction, Commun. ACM (2013)

  • Dhillon, I.S., et al., A data-clustering algorithm on distributed memory multiprocessors

  • Dua, D., et al., UCI Machine Learning Repository (2017)

  • Duda, R., et al., Pattern Classification and Scene Analysis (1973)

  • Frey, P.W., et al., Letter recognition using Holland-style adaptive classifiers, Mach. Learn. (1991)

  • Gan, D.G., et al.

  • Glock, S., Gillich, E., Schaede, J., Lohweg, V., Feature extraction algorithm for banknote textures based on incomplete...

  • Haase, W., et al., Adaptive grids in numerical fluid dynamics, Numer. Methods Fluids (1985)

  • Halkidi, M., et al., On clustering validation techniques, J. Intell. Inf. Syst. (2001)

    Giuliano Laccetti is a full professor of computer science at the University of Naples Federico II, Italy. He received his Laurea degree (cum laude) in Physics from the University of Naples. His main research interests are Mathematical Software, High-Performance Architectures for Scientific Computing, Distributed Computing, Grid and Cloud Computing, Algorithms on emerging hybrid architectures (CPU+GPU, …), and the Internet of Things. He has been organizer and chair of several Workshops held jointly with larger International Conferences. He is the author (or co-author) of about 100 papers published in refereed international Journals, international books, and International Conference Proceedings.

    Marco Lapegna received a Ph.D. in Applied Mathematics and Computer Science in 1991 from the University of Naples Federico II. He worked from 1991 until 2001 as Assistant Professor, and now he is an Associate Professor of Computer Science at the University of Naples Federico II. His research activity is aimed at the development of high performance distributed and parallel algorithms for computational mathematics for advanced architecture environments. He participated in projects funded by Italian and international institutions, and he is author of several scientific publications. His teaching activity concerns computer programming, operating systems, and distributed/parallel computing.

    Valeria Mele is currently a Researcher at the University of Naples Federico II (Naples, Italy). She holds a degree in Informatics and a Ph.D. in Computational Science. Her research activity has been mainly focused on the development and performance evaluation of parallel algorithms and software for heterogeneous, hybrid, and multilevel parallel architectures, from multicore to GPU-enhanced machines and modern clusters and supercomputers. After attending the Argonne Training Program on Extreme-Scale Computing (ATPESC) and visiting the Argonne National Laboratory (ANL, Chicago, Illinois, USA) several times, she is now mainly working on the design, implementation, and performance prediction/evaluation of software with/for the PETSc library.

    Diego Romano was awarded an M.S. in Mathematics in 2000, and a Ph.D. degree in Computational and Computer Sciences from the University of Naples Federico II, Italy, in 2012. He obtained a permanent position as a researcher at the Italian National Research Council (CNR) in 2008, where he is currently employed at the Institute for High-Performance Computing and Networking (ICAR). His research interests include the performance and design of GPU Computing algorithms. Within this field, he works, for instance, on the Global Illumination problem in Computer Graphics, and on mathematical models for performance analysis.

    Lukasz Szustak received a D.Sc. Degree in Computer Science in 2019 and a Ph.D. granted by the Czestochowa University of Technology in 2012. His main research interests include parallel computing and mapping algorithms onto parallel architectures. His current work is focused on the development of methods for performance portability, scheduling, and load balancing, including the adaptation of stencil-based computations to modern HPC architectures.
