Faster Motif Counting via Succinct Color Coding and Adaptive Sampling
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2021-05-19, DOI: 10.1145/3447397
Marco Bressan, Stefano Leucci, Alessandro Panconesi

We address the problem of computing the distribution of induced connected subgraphs, also known as graphlets or motifs, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling by leveraging the color coding technique by Alon, Yuster, and Zwick. In this work, we extend the applicability of this approach by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For 8-node motifs, we can build such a table in one hour for a graph with 65M nodes and 1.8B edges, which is times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the “additive error barrier” of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly 10,000 distinct 8-node motifs whose relative frequency is so small that uniform sampling would literally take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
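
The color coding idea the abstract builds on can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the paper works with graphlets and a succinct table, while this sketch assumes the simplest textbook case, counting simple paths on k nodes. Each node gets one of k random colors, a dynamic program counts the "colorful" paths (all colors distinct), and the count is rescaled by the probability k!/k^k that a fixed path is colorful. All function names, the adjacency-list input format, and the parameters are hypothetical choices made for illustration.

```python
import random
from itertools import product
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count undirected simple paths on k nodes whose colors are all distinct.

    adj is an adjacency list (list of neighbor lists), colors[v] is in range(k).
    """
    n = len(adj)
    # layer[v] maps a color bitmask to the number of colorful paths
    # that end at v and use exactly those colors.
    layer = [{1 << colors[v]: 1} for v in range(n)]
    for _ in range(k - 1):
        nxt = [dict() for _ in range(n)]
        for v in range(n):
            bit = 1 << colors[v]
            for u in adj[v]:
                for mask, cnt in layer[u].items():
                    # Extend a path ending at u only if v's color is unused.
                    if not mask & bit:
                        nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        layer = nxt
    total = sum(sum(d.values()) for d in layer)
    # Each undirected path with at least two nodes is found once per endpoint.
    return total // 2 if k > 1 else total


def estimate_paths(adj, k, trials, seed=0):
    """Estimate the number of k-node paths by averaging over random colorings."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k**k  # chance that a fixed path is colorful
    n = len(adj)
    total = 0
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in range(n)]
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)


def total_over_all_colorings(adj, k):
    """Sum the colorful-path count over every coloring (tiny graphs only)."""
    n = len(adj)
    return sum(count_colorful_paths(adj, list(c), k)
               for c in product(range(k), repeat=n))


if __name__ == "__main__":
    # A 5-cycle, used here only as a toy input.
    cycle = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
    print(estimate_paths(cycle, k=3, trials=2000, seed=1))
```

Because distinct colors force distinct nodes, the dynamic program never has to check explicitly that a path is simple, which is what keeps the table small. The exhaustive helper is only there to check on tiny graphs that the rescaled average matches the true path count.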

Updated: 2021-05-19