Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs
The VLDB Journal (IF 4.2), Pub Date: 2021-05-07, DOI: 10.1007/s00778-021-00669-2
Yingxia Shao, Shiyue Huang, Yawen Li, Xupeng Miao, Bin Cui, Lei Chen

Second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measures, and community detection, use it to capture higher-order patterns in a graph and thereby improve model accuracy. However, its memory explosion problem prevents it from being applied to large graphs. When processing a billion-edge graph such as Twitter, existing solutions for the second-order random walk (e.g., the alias method) may consume 1796 TB of memory. This high memory consumption stems from memory-unaware strategies for node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling from the skewed probability distributions that exist in natural graphs. Second, to achieve high efficiency for the second-order random walk within arbitrary memory budgets, we propose a novel memory-aware framework based on the cost model. The framework applies a cost-based optimizer to assign a suitable node sampling method to each node or edge in the graph, staying within a memory budget while minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces that let users define new second-order random walk models easily. Empirical studies demonstrate that our memory-aware framework is robust with respect to memory and achieves considerable efficiency while reducing the memory cost by 90%.
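To make the memory-versus-time trade-off concrete, here is a minimal sketch of acceptance-rejection node sampling. It is not the authors' implementation: the abstract does not specify the walk model, so the sketch assumes an unweighted, undirected graph and a node2vec-style second-order bias with hypothetical parameters `p` (return) and `q` (in-out). The names `alias_table_entries` and `make_walker` are illustrative only. The first function counts how many table entries a per-edge alias scheme would store under these assumptions; the second samples the next node without any per-edge tables.

```python
import random


def alias_table_entries(adj):
    """Entries stored if every directed edge (u, v) keeps its own table over N(v).

    Node v has deg(v) incoming edges, each needing deg(v) entries,
    so the total is the sum of deg(v)^2 (assumption: undirected graph).
    """
    return sum(len(nbrs) ** 2 for nbrs in adj.values())


def make_walker(adj, p, q, seed=0):
    """Build a second-order walker that uses rejection sampling.

    adj maps each node to a non-empty list of neighbors. Besides the
    adjacency lists, only neighbor sets are kept for membership tests;
    no per-edge sampling tables are precomputed.
    """
    rng = random.Random(seed)
    nbr_sets = {u: set(vs) for u, vs in adj.items()}
    upper = max(1.0, 1.0 / p, 1.0 / q)  # upper bound on any bias weight

    def step(prev, cur):
        while True:
            cand = rng.choice(adj[cur])   # proposal: uniform over N(cur)
            if cand == prev:
                weight = 1.0 / p          # return to the previous node
            elif cand in nbr_sets[prev]:
                weight = 1.0              # candidate is adjacent to prev
            else:
                weight = 1.0 / q          # candidate moves away from prev
            if rng.random() * upper < weight:
                return cand               # accept with probability weight / upper

    def walk(start, length):
        # The first step has no previous edge, so it is sampled uniformly.
        path = [start, rng.choice(adj[start])]
        while len(path) < length:
            path.append(step(path[-2], path[-1]))
        return path[:length]

    return walk


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(alias_table_entries(graph))
    print(make_walker(graph, p=0.5, q=2.0, seed=1)(0, 10))
```

The rejection sampler trades time for memory: each step may need several proposals, with the expected number growing as `p` and `q` move the weights further from uniform, but its footprint stays linear in the number of edges rather than growing with the sum of squared degrees.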




Updated: 2021-05-08