Journal of Parallel and Distributed Computing (IF 3.8), Pub Date: 2020-06-25, DOI: 10.1016/j.jpdc.2020.06.010
Giuliano Laccetti, Marco Lapegna, Valeria Mele, Diego Romano, Lukasz Szustak
The K-means algorithm is one of the most popular algorithms in Data Science. It aims to discover similarities among the elements of large datasets by partitioning them into distinct groups called clusters. The main weakness of this technique is that, in real problems, it is often impossible to define the value of K as input data. Furthermore, the large amount of data needed for useful simulations makes execution of the algorithm impracticable on traditional architectures. In this paper, we address both issues. On the one hand, we propose a method to dynamically define the value of K by optimizing a suitable quality index, with special care for the computational cost. On the other hand, to improve the performance and the effectiveness of the algorithm, we propose a strategy for parallel implementation on modern multicore CPUs.
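The idea of choosing K dynamically can be illustrated with a small sketch. The abstract does not say which quality index the authors optimize, so the code below makes its own assumptions: it uses the within-cluster sum of squares (WCSS) as a stand-in index, grows K one step at a time, and stops when the relative improvement drops below a threshold. The names `choose_k`, `min_gain`, and `k_max`, and the farthest-point seeding, are hypothetical choices made for this illustration and are not taken from the paper. The sketch is sequential and does not attempt the parallel strategy.

```python
from math import dist


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Seed the next center at the point farthest from all current centers.
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def wcss(points, centers, labels):
    """Within-cluster sum of squares: the stand-in quality index (assumption)."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def choose_k(points, k_max=10, min_gain=0.5):
    """Grow K until the relative WCSS drop falls below min_gain (assumed rule)."""
    k = 1
    centers, labels = kmeans(points, k)
    best = wcss(points, centers, labels)
    while k < k_max and best > 0:
        cand_centers, cand_labels = kmeans(points, k + 1)
        score = wcss(points, cand_centers, cand_labels)
        if (best - score) / best < min_gain:
            break  # one more cluster no longer pays for itself
        k, centers, labels, best = k + 1, cand_centers, cand_labels, score
    return k, centers, labels


if __name__ == "__main__":
    # Three well-separated 2x2 blobs; the rule stops at K = 3 for this data.
    blobs = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (30, 0), (30, 1), (31, 0), (31, 1)]
    k, centers, labels = choose_k(blobs)
    print(k, sorted(centers))
```

Rerunning K-means from scratch for every candidate K is the simplest possible scheme and also the most expensive. The abstract's emphasis on computational cost suggests the authors avoid this kind of repeated work, so the sketch should be read only as a picture of the stopping logic.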
Performance enhancement of a dynamic K-means algorithm through a parallel adaptive strategy on multicore CPUs