An efficient K-means clustering algorithm for tall data
Data Mining and Knowledge Discovery (IF 2.8) Pub Date: 2020-07-15, DOI: 10.1007/s10618-020-00678-9
Marco Capó , Aritz Pérez , Jose A. Lozano

The analysis of increasingly large datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some drawbacks that have been extensively studied, such as its high dependency on the initial conditions, as well as the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well with the number of instances of the problem, without affecting the quality of the approximation. In order to achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points that are distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to different theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state-of-the-art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.
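To make the core idea concrete, the sketch below runs Lloyd-style K-means over a small weighted set of representative points instead of the full dataset, which is the general strategy the abstract describes. It is only a minimal illustration under assumed inputs (the names weighted_kmeans, reps and weights are hypothetical), not the authors' recursive, partition-based algorithm or its parallel implementation.

```python
# Minimal sketch (not the paper's actual algorithm): weighted Lloyd's K-means
# over a small set of representative points, each weighted by how many
# original instances it stands for.
import numpy as np

def weighted_kmeans(reps, weights, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm on weighted representatives.

    reps    : (m, d) array of representative points, with m much smaller
              than the number of original instances
    weights : (m,) array of positive weights
    k       : number of clusters
    """
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each representative to its nearest center.
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center as the weighted mean of its representatives.
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                new_centers[j] = (w[:, None] * reps[mask]).sum(axis=0) / w.sum()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy usage: pretend a large dataset was summarized by 200 weighted
# representatives (here just random points with random integer weights).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    reps = rng.normal(size=(200, 2))
    weights = rng.integers(1, 100, size=200).astype(float)
    centers, labels = weighted_kmeans(reps, weights, k=3)
    print(centers)
```

Because each Lloyd iteration touches only the m representatives rather than all n instances, the number of distance computations per iteration drops from O(nk) to O(mk); the paper's contribution lies in how the representatives and their weights are constructed and refined recursively, which this sketch does not attempt to reproduce.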
