SF-Sketch: A Two-Stage Sketch for Data Streams, IEEE Transactions on Parallel and Distributed Systems


IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-04-15, DOI: 10.1109/tpds.2020.2987609
Lingtong Liu, Yulong Shen, Yibo Yan, Tong Yang, Muhammad Shahzad, Bin Cui, Gaogang Xie

Sketches are probabilistic data structures designed for recording the frequencies of items in a multiset. They are widely used in various fields, especially for gathering Internet statistics from distributed data streams in network measurement. In a distributed streaming application with high data rates, the sketch in each monitoring node “fills up” very quickly, and its content is then transferred to a remote collector responsible for answering queries. The size of the transferred content must therefore be kept as small as possible while still meeting the desired accuracy requirement. To obtain significantly higher accuracy while keeping the same update and query speed as the best prior sketches, in this article we propose a new sketch, the Slim-Fat (SF) sketch. The key idea behind the SF-sketch is to maintain two separate sketches: a larger sketch, the Fat-subsketch, and a smaller sketch, the Slim-subsketch. The Fat-subsketch is used for updating and for periodically producing the Slim-subsketch, which is then transferred to the remote collector for answering queries quickly and accurately. We also present an error bound and an accurate model of the correct rate of the SF-sketch, and verify both through experiments. We implemented and extensively evaluated the SF-sketch along with several prior sketches. Our results show that when the Slim-subsketch and the widely used Count-Min (CM) sketch are the same size, the SF-sketch outperforms the CM-sketch by up to 33.1 times in terms of accuracy (when the ratio of the sizes of the Fat-subsketch and the Slim-subsketch is 16:1). We have made all source code publicly available on GitHub [“Source code of SF sketches,” [Online]. Available: https://github.com/paper2017/SF-sketch].
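
For readers unfamiliar with the baseline named in the abstract, the standard Count-Min (CM) sketch can be illustrated roughly as follows. This is a minimal sketch of the CM baseline only, not of the SF-sketch or its Fat/Slim construction, which the abstract does not specify. The class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters.

    Names and hashing scheme are hypothetical choices for this example.
    """

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one hash function per row by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item: str, count: int = 1) -> None:
        # Increment the one counter that the item maps to in every row.
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def query(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum over the rows
        # is the tightest available estimate and never undercounts.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch(width=64, depth=4)
    for packet in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
        sketch.update(packet)
    print(sketch.query("10.0.0.1"))
```

With non-negative updates, the estimate for any item is at least its true frequency; the overestimate grows as `width` shrinks, which is the size-versus-accuracy trade-off the abstract is concerned with.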


Updated: 2020-04-15
