Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency
IEEE/ACM Transactions on Networking (IF 3.0), Pub Date: 2020-03-19, DOI: 10.1109/tnet.2020.2970860
Qingjun Xiao, Shigang Chen, You Zhou, Junzhou Luo

Cardinality estimation is the task of determining the number of distinct elements (the cardinality) in a data stream, under the stringent constraint that the stream can be scanned in only a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring in high-speed networks and query optimization in Internet-scale databases. To solve the problem, we propose an algorithm named HLL-TailCut, which achieves an estimation standard error of $1.0 / \sqrt{m}$, where $m$ is the number of memory units, using units of four or three bits each. These are smaller than the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator. This makes it possible to reduce the memory cost of HyperLogLog by 20% to 45%. For example, when the target estimation error is 1.1%, the state-of-the-art HyperLogLog needs 5.6 kilobytes of memory. By contrast, our new algorithm needs only 3 kilobytes to attain the same accuracy. Additionally, our algorithm supports the estimation of very large stream cardinalities, even on the tera and peta scale.
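
The memory figures quoted in the abstract can be roughly reproduced with a short calculation. The sketch below is only a back-of-the-envelope check, not the authors' method. It assumes that the standard error has the form c / sqrt(m) for m memory units, that HyperLogLog's constant is c = 1.04 (the commonly cited value from the HyperLogLog literature, which the abstract does not state), and that a kilobyte is 1000 bytes. The function name `memory_bytes` is hypothetical.

```python
def memory_bytes(target_error: float, error_constant: float, bits_per_unit: int) -> float:
    """Approximate memory needed to reach a target standard error.

    Assumes standard error = error_constant / sqrt(m), so the number of
    memory units is m = (error_constant / target_error) ** 2.
    """
    units = (error_constant / target_error) ** 2
    return units * bits_per_unit / 8  # convert bits to bytes


if __name__ == "__main__":
    target = 0.011  # 1.1% target estimation error, as in the abstract

    # Assumption: HyperLogLog error constant 1.04 with 5-bit units.
    hll = memory_bytes(target, 1.04, 5)
    # From the abstract: HLL-TailCut error constant 1.0 with 4- or 3-bit units.
    tailcut4 = memory_bytes(target, 1.0, 4)
    tailcut3 = memory_bytes(target, 1.0, 3)

    for name, size in [("HyperLogLog, 5-bit", hll),
                       ("HLL-TailCut, 4-bit", tailcut4),
                       ("HLL-TailCut, 3-bit", tailcut3)]:
        print(f"{name}: {size / 1000:.1f} KB")
```

Under these assumptions the calculation gives about 5.6 KB for HyperLogLog and about 3.1 KB for the three-bit variant, in line with the abstract's example.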

Updated: 2020-04-22