Correlation Clustering in Data Streams, Algorithmica


Correlation Clustering in Data Streams
Algorithmica (IF 1.1), Pub Date: 2021-03-13, DOI: 10.1007/s00453-021-00816-9
Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, Anthony Wirth

Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as k-center, k-median, and k-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, \(O(n\cdot \mathrm{polylog}\, n)\)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in \(O(n\cdot \mathrm{polylog}\, n)\)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
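To make the objective in the abstract concrete, the following minimal Python sketch computes the "disagreement" cost of a given node-partition after a stream of edge-weight updates. It is an exact, naive baseline that stores every edge, so it does not meet the paper's space bound and is not the paper's linear-sketch data structure. The function names `apply_updates` and `disagreement_cost`, the `(u, v, delta)` update format, and the choice of disagreement weight as the quality measure are assumptions made for illustration.

```python
from collections import defaultdict


def apply_updates(stream):
    """Accumulate (u, v, delta) updates into net edge weights.

    Assumption: each stream item adds `delta` to the weight of the
    undirected edge {u, v}; edges whose net weight is zero are dropped.
    """
    weights = defaultdict(float)
    for u, v, delta in stream:
        key = (min(u, v), max(u, v))  # canonical key for an undirected edge
        weights[key] += delta
    return {edge: w for edge, w in weights.items() if w != 0}


def disagreement_cost(weights, cluster_of):
    """Total weight of edges that the partition 'gets wrong'.

    A positive edge disagrees if its end-points are in different clusters;
    a negative edge disagrees if its end-points share a cluster.
    """
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if w > 0 and not same_cluster:
            cost += w
        elif w < 0 and same_cluster:
            cost += -w
    return cost


if __name__ == "__main__":
    # Edge {2, 3} is inserted and later cancelled, as a dynamic stream allows.
    stream = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, -1.0), (2, 3, 1.0), (2, 3, -1.0)]
    weights = apply_updates(stream)
    one_big_cluster = {0: "a", 1: "a", 2: "a", 3: "b"}
    print(disagreement_cost(weights, one_big_cluster))
```

In this toy stream the triangle on nodes 0, 1, 2 has two positive edges and one negative edge, so no partition achieves zero disagreement; comparing candidate partitions by this cost is the kind of "quality" measurement that the abstract says its sketches support in small space.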


Updated: 2021-03-15
