Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs

https://doi.org/10.1016/j.jpdc.2020.06.010

Highlights

  • For many problems, the number of clusters required by the K-means algorithm is not known in advance.

  • Adaptive strategies enhance the performance of the K-means algorithm.

  • Parallel implementation is mandatory for large clustering problems.

Abstract

The K-means algorithm is one of the most popular algorithms in Data Science; it aims to discover similarities among the elements of large datasets by partitioning them into K distinct groups, called clusters. The main weakness of this technique is that, in real problems, it is often impossible to define the value of K as input data. Furthermore, the large amount of data used for useful simulations makes executing the algorithm on traditional architectures impracticable. In this paper, we address both issues. On the one hand, we propose a method to define the value of K dynamically, by optimizing a suitable quality index with special care for the computational cost. On the other hand, to improve the performance and the effectiveness of the algorithm, we propose a strategy for parallel implementation on modern multicore CPUs.

Section snippets

Introduction and related works

In the last thirty years, several theories, methodologies, and tools have been introduced to learn from data, that is, to understand complex phenomena through the analysis of large structured or unstructured datasets representing real problems. This wealth of knowledge has often changed its name over the years (for example, data mining or big data), and today is commonly known as data science [10].

One of the most used tools in this field is a class of unsupervised learning

A new parallel adaptive K-means algorithm

This section has a double aim: on the one hand, we introduce a methodology to define the number of clusters dynamically while reducing the computational cost of the algorithm; on the other hand, we propose a parallel implementation for multicore environments.

A widespread method to define the value of K, without treating it as input data, is to execute the Basic K-means Algorithm several times with increasing values of K, until a given quality index, used as a measure for the

Implementation details

To better describe the new parallel Adaptive K-means Algorithm, in this section we report some implementation details, with particular attention to the data structures used to manage the clusters (see also Fig. 1).

To store the N elements xn, our implementation uses a 2-dimensional N×d static array S, where each row represents a d-dimensional element of the dataset. This choice is due to the greater efficiency in accessing the elements compared to other

Experimental results

We tested the accuracy and the efficiency of the proposed Adaptive K-means Algorithm by running several experiments on the HPC cluster available at the Department of Science and Technologies of the University of Naples Parthenope. In this facility, each node is equipped with two Intel Xeon 16-core 5218 CPUs running at 2.3 GHz, for a total of 32 computing cores per node, and 192 GB of main memory. On this system, we implemented Algorithm 3 in the C language using

Discussion

From Table 1, we observe that Algorithm 3 is more effective than Algorithm 2, measured in terms of the number of elements displaced among the clusters and the total execution time, mainly for large values of K (that is, the datasets related to the Letters, Wines, and Cardio problems). More precisely, when the number of iterations is large, we can better appreciate the effects of the adaptive strategy aimed at reusing the partition already defined in the previous iterations, with a

Conclusions

This paper describes our studies aimed at improving the performance of the K-means algorithm when the number of clusters K is not available as input data. This is a common situation in real applications, so traditional approaches are based on several runs of the algorithm with different values of K, attempting to optimize some quality index at a high risk of increasing the computational cost. The method we introduced is based on an adaptive procedure aimed at minimizing the number of

CRediT authorship contribution statement

Giuliano Laccetti: Conceptualization, Supervision, Resources. Marco Lapegna: Conceptualization, Methodology, Investigation, Writing - review & editing. Valeria Mele: Data curation, Software, Writing - original draft. Diego Romano: Software, Validation, Writing - original draft. Lukasz Szustak: Software, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by institutional funding provided by the universities of the researchers.


References (40)

  • de Campos, A., et al., SisPorto 2.0: a program for automated analysis of cardiotocograms, J. Matern. Fetal Neonatal Med. (2000)

  • Dhar, V., Data science and prediction, Commun. ACM (2013)

  • Dhillon, I.S., et al., A data-clustering algorithm on distributed memory multiprocessors

  • Dua, D., et al., UCI Machine Learning Repository (2017)

  • Duda, R., et al., Pattern Classification and Scene Analysis (1973)

  • Frey, P.W., et al., Letter recognition using Holland-style adaptive classifiers, Mach. Learn. (1991)

  • Gan, D.G., et al.

  • Glock, S., Gillich, E., Schaede, J., Lohweg, V., Feature extraction algorithm for banknote textures based on incomplete...

  • Haase, W., et al., Adaptive grids in numerical fluid dynamics, Numer. Methods Fluids (1985)

  • Halkidi, M., et al., On clustering validation techniques, J. Intell. Inf. Syst. (2001)

    Giuliano Laccetti is a full professor of computer science at the University of Naples Federico II, Italy. He received his Laurea degree (cum laude) in Physics from the University of Naples. His main research interests are Mathematical Software, High-Performance Architectures for Scientific Computing, Distributed Computing, Grid and Cloud Computing, Algorithms on emerging hybrid architectures (CPU+GPU, …), and the Internet of Things. He has been organizer and chair of several Workshops held jointly with larger International Conferences. He is the author (or co-author) of about 100 papers published in refereed international Journals, international books, and International Conference Proceedings.

    Marco Lapegna received a Ph.D. in Applied Mathematics and Computer Science in 1991 from the University of Naples Federico II. He worked from 1991 until 2001 as Assistant Professor, and now he is an Associate Professor of Computer Science at the University of Naples Federico II. His research activity is aimed at the development of high performance distributed and parallel algorithms for computational mathematics for advanced architecture environments. He participated in projects funded by Italian and international institutions, and he is author of several scientific publications. His teaching activity concerns computer programming, operating systems, and distributed/parallel computing.

    Valeria Mele is currently a Researcher at the University of Naples Federico II (Naples, Italy). She holds a degree in Informatics and a Ph.D. in Computational Science. Her research activity has been mainly focused on the development and performance evaluation of parallel algorithms and software for heterogeneous, hybrid, and multilevel parallel architectures, from multicore to GPU-enhanced machines and modern clusters and supercomputers. After attending the Argonne Training Program on Extreme-Scale Computing (ATPESC) and visiting the Argonne National Laboratory (ANL, Chicago, Illinois, USA) several times, she is now mainly working on the design, implementation, and performance prediction/evaluation of software with/for the PETSc library.

    Diego Romano was awarded an M.S. in Mathematics in 2000, and a Ph.D. degree in Computational and Computer Sciences from the University of Naples Federico II, Italy, in 2012. He obtained a permanent position as a researcher at the Italian National Research Council (CNR) in 2008, where he is currently employed at the Institute for High-Performance Computing and Networking (ICAR). His research interests include the performance and design of GPU Computing algorithms. Within this field, he works, for instance, on the Global Illumination problem in Computer Graphics, and on mathematical models for performance analysis.

    Lukasz Szustak received a D.Sc. Degree in Computer Science in 2019 and a Ph.D. granted by the Czestochowa University of Technology in 2012. His main research interests include parallel computing and mapping algorithms onto parallel architectures. His current work is focused on the development of methods for performance portability, scheduling, and load balancing, including the adaptation of stencil-based computations to modern HPC architectures.
