Improvements on approximation algorithms for clustering probabilistic data
Knowledge and Information Systems (IF 2.7), Pub Date: 2021-08-30, DOI: 10.1007/s10115-021-01601-4
Sharareh Alipour

Uncertainty about data appears in many real-world applications, and an important issue is how to manage, analyze, and solve optimization problems over such data. An important tool for data analysis is clustering. When a data set is uncertain, we can model it as a set of probabilistic points, each formalized as a probability distribution function that describes the possible locations of the point. In this paper, we study the k-center problem for probabilistic points in a general metric space. First, we present a fast greedy approximation algorithm that builds k centers using a farthest-first traversal in k iterations. This algorithm improves the previous approximation factor of the unrestricted assigned k-center problem from 10 (see [1]) to 6. Next, we restrict the centers to be selected from all the probabilistic locations of the given points, and we show that an optimal solution for this restricted setting is a 2-approximation of an optimal solution of the assigned k-center problem with expected distance assignment. Using this idea, we improve the approximation factor of the unrestricted assigned k-center problem to 4 at the cost of a longer running time. The algorithm still runs in polynomial time when k is a constant. Additionally, we implement our algorithms on three real data sets. The experimental results show that, for these data sets, the approximation factors achieved in practice are better than the theoretical bounds. We also compare the results of our algorithms with previous work and discuss the results. Finally, we present our theoretical results for probabilistic k-median clustering.
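
The abstract does not spell out the greedy algorithm, so the following Python is only a rough sketch of how a farthest-first traversal over probabilistic points might look, not the paper's method. It assumes each probabilistic point is a finite list of `(location, probability)` pairs, that each point is summarized by the one of its own locations with the smallest expected distance to the point, and that the cost reported is the largest expected distance to a chosen center. The names `expected_distance`, `representative`, and `farthest_first_centers` are hypothetical, and no approximation guarantee is claimed for this sketch.

```python
def expected_distance(point, center, dist):
    """Expected distance from a probabilistic point to a fixed center.

    `point` is a list of (location, probability) pairs (an assumed encoding).
    """
    return sum(prob * dist(loc, center) for loc, prob in point)


def representative(point, dist):
    """Assumed summary of a point: its own location that minimizes the
    expected distance to the point's distribution."""
    locations = [loc for loc, _ in point]
    return min(locations, key=lambda c: expected_distance(point, c, dist))


def farthest_first_centers(points, k, dist):
    """Pick up to k centers by farthest-first traversal (illustrative only).

    Returns (centers, radius), where radius is the largest expected distance
    from any probabilistic point to its nearest chosen center.
    """
    if not points or k < 1:
        raise ValueError("need at least one point and k >= 1")

    reps = [representative(pt, dist) for pt in points]

    # Start from an arbitrary point; here, the first one.
    centers = [reps[0]]
    # nearest[i] = expected distance from points[i] to its closest center so far
    nearest = [expected_distance(pt, centers[0], dist) for pt in points]

    while len(centers) < min(k, len(points)):
        # The point currently worst served by the chosen centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(reps[far])
        nearest = [
            min(d, expected_distance(pt, reps[far], dist))
            for pt, d in zip(points, nearest)
        ]

    return centers, max(nearest)
```

Because `dist` is passed in as a function, the sketch works in any metric space; for points in the Euclidean plane, for example, Python's `math.dist` could be supplied.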



Updated: 2021-08-31