Attribute-Guided Network Sampling Mechanisms
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2021-04-18, DOI: 10.1145/3441445
Suhansanu Kumar 1 , Hari Sundaram 1

This article introduces a novel task-independent sampler for attributed networks. The problem is important because, while data mining tasks on network content are common, sampling on internet-scale networks is costly. Link-trace samplers such as Snowball sampling, Forest Fire, Random Walk, and Metropolis–Hastings Random Walk are widely used for sampling from networks. These attribute-agnostic samplers are designed to preserve salient properties of network structure and are not optimized for tasks on node content. This article makes three contributions. First, we propose a task-independent, attribute-aware link-trace sampler grounded in information theory. Our sampler greedily adds to the sample the node with the most informative (i.e., surprising) neighborhood. The sampler tends to explore the attribute space rapidly, maximally reducing the surprise of unseen nodes. Second, we prove that content sampling is an NP-hard problem. A well-known algorithm achieves the best approximation of the optimal solution, within a factor of 1 − 1/e, but requires full access to the entire graph. Third, we show through empirical counterfactual analysis that in many real-world datasets, network structure does not hinder the performance of surprise-based link-trace samplers. Experimental results on 18 real-world datasets show that surprise-based samplers are sample-efficient and outperform state-of-the-art attribute-agnostic samplers by a wide margin (e.g., a 45% performance improvement in clustering tasks).
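
The greedy, surprise-driven selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Laplace-smoothed surprisal score, and the assumption that a frontier node's own attributes are visible before it is sampled are all hypothetical simplifications (the article scores a node by its neighborhood, and its exact scoring rule is not given here).

```python
import math
from collections import Counter


def surprise(attrs, counts, n_sampled):
    """Total surprisal (in bits) of a node's attributes.

    Assumption: p(a) is the Laplace-smoothed fraction of sampled
    nodes that carry attribute a, i.e. (count + 1) / (n_sampled + 2).
    """
    return sum(-math.log2((counts[a] + 1) / (n_sampled + 2)) for a in attrs)


def surprise_sampler(graph, attributes, seed, budget):
    """Hypothetical link-trace sampler that greedily adds the most
    surprising node on the frontier of the current sample.

    graph:      dict mapping node -> iterable of neighbor nodes
    attributes: dict mapping node -> set of attribute labels
    seed:       starting node
    budget:     maximum number of nodes to sample
    """
    sample = [seed]
    in_sample = {seed}
    counts = Counter(attributes[seed])
    frontier = set(graph[seed]) - in_sample

    while len(sample) < budget and frontier:
        # Sorting makes tie-breaking deterministic (first maximum wins).
        best = max(
            sorted(frontier),
            key=lambda n: surprise(attributes[n], counts, len(sample)),
        )
        sample.append(best)
        in_sample.add(best)
        counts.update(attributes[best])
        # Link-trace step: only neighbors of sampled nodes become reachable.
        frontier.discard(best)
        frontier.update(n for n in graph[best] if n not in in_sample)

    return sample


# Toy attributed network (made-up data for illustration only).
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
toy_attrs = {"a": {"red"}, "b": {"red"}, "c": {"blue"}, "d": {"green"}}
toy_sample = surprise_sampler(toy_graph, toy_attrs, seed="a", budget=3)
```

On the toy network, the sampler starting at `a` prefers `c` (an unseen attribute) over `b` (an attribute already in the sample), which is the behavior the abstract describes as rapidly exploring the attribute space. Unlike the 1 − 1/e approximation algorithm mentioned in the abstract, this sketch only ever touches nodes adjacent to the current sample.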

Updated: 2021-04-18