Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency, IEEE/ACM Transactions on Networking



IEEE/ACM Transactions on Networking (IF 3.0), Pub Date: 2020-03-19, DOI: 10.1109/tnet.2020.2970860
Qingjun Xiao, Shigang Chen, You Zhou, Junzhou Luo

Cardinality estimation is the task of determining the number of distinct elements (the cardinality) in a data stream, under the stringent constraint that the input stream can be scanned in only a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring in high-speed networks and query optimization in Internet-scale databases. To solve the problem, we propose an algorithm named HLL-TailCut, which achieves an estimation standard error of $1.0/\sqrt{m}$ using $m$ memory units of four or three bits each, a cost much smaller than that of the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator. This makes it possible to reduce the memory cost of HyperLogLog by 20% to 45%. For example, when the target estimation error is 1.1%, the state-of-the-art HyperLogLog needs 5.6 kilobytes of memory. By contrast, our new algorithm needs only 3 kilobytes to attain the same accuracy. Additionally, our algorithm supports the estimation of very large stream cardinalities, even on the tera and peta scale.
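The memory figures quoted in the abstract can be sanity-checked with a short calculation. The sketch below is only an illustration of that arithmetic, not the authors' code. It assumes that $m$ counts memory units, that HyperLogLog's standard error is $1.04/\sqrt{m}$ (the constant commonly cited for HyperLogLog, which this abstract does not state), and that $m$ may be any integer rather than a power of two. The function name `memory_bytes` and the variable names are hypothetical.

```python
import math


def memory_bytes(target_error: float, error_constant: float, bits_per_unit: int) -> float:
    """Bytes needed so that error_constant / sqrt(m) <= target_error.

    m is the number of memory units, each bits_per_unit bits wide.
    """
    # Solve error_constant / sqrt(m) <= target_error for the smallest integer m.
    m = math.ceil((error_constant / target_error) ** 2)
    return m * bits_per_unit / 8


TARGET_ERROR = 0.011  # the 1.1% target error used in the abstract

# Assumption: 1.04 is the HyperLogLog constant from the wider literature.
hll_bytes = memory_bytes(TARGET_ERROR, 1.04, 5)
# 1.0 is the HLL-TailCut constant given in the abstract.
tailcut4_bytes = memory_bytes(TARGET_ERROR, 1.0, 4)
tailcut3_bytes = memory_bytes(TARGET_ERROR, 1.0, 3)

saving_4bit = 1 - tailcut4_bytes / hll_bytes
saving_3bit = 1 - tailcut3_bytes / hll_bytes

print(f"HyperLogLog (5-bit units): {hll_bytes:.0f} bytes")
print(f"HLL-TailCut (4-bit units): {tailcut4_bytes:.0f} bytes, saving {saving_4bit:.1%}")
print(f"HLL-TailCut (3-bit units): {tailcut3_bytes:.0f} bytes, saving {saving_3bit:.1%}")
```

Under these assumptions the calculation gives about 5.6 kilobytes for HyperLogLog and about 3.1 kilobytes for three-bit units (taking 1 kilobyte as 1000 bytes), and the savings for four-bit and three-bit units fall inside the 20% to 45% range quoted above.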


Updated: 2020-04-22
