Data stream fusion for accurate quantile tracking and analysis
arXiv - CS - Databases Pub Date: 2021-01-17, DOI: arxiv-2101.06758
Massimo Cafaro, Catiuscia Melle, Italo Epicoco, Marco Pulimeno

UDDSKETCH is a recent algorithm for accurately tracking quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves on the accuracy of DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) using UDDSKETCH data summaries: the input summaries are fused into a new summary of the union of the streams (or datasets) they processed, whilst preserving both the error and size guarantees provided by UDDSKETCH. This property of sketches, known as mergeability, enables parallel and distributed processing. We prove that UDDSKETCH is fully mergeable and introduce a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm in overall accuracy when determining quantiles.
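
To make the idea of mergeability concrete, the following is a minimal Python sketch of a UDDSKETCH-style summary and its merge step. It is an illustration under stated assumptions, not the authors' implementation: the class name `ToySketch` and all of its methods are hypothetical, inputs are assumed strictly positive, both summaries are assumed to share the same initial accuracy `alpha` and bucket budget `max_buckets` (at least 2), and the bucket-halving rule (fold buckets 2j-1 and 2j into bucket j, which squares the growth factor) reflects our reading of the uniform collapse used by UDDSKETCH.

```python
import math
from collections import Counter


class ToySketch:
    """Illustrative UDDSKETCH-style summary (hypothetical, positive values only)."""

    def __init__(self, alpha=0.01, max_buckets=64):
        self.alpha0 = alpha                      # initial relative-accuracy target
        self.gamma0 = (1 + alpha) / (1 - alpha)  # initial bucket growth factor
        self.max_buckets = max_buckets           # size budget (assumed >= 2)
        self.collapses = 0                       # uniform collapses applied so far
        self.buckets = Counter()                 # bucket index -> item count
        self.count = 0

    @property
    def gamma(self):
        # Every uniform collapse squares the growth factor.
        return self.gamma0 ** (2 ** self.collapses)

    @property
    def alpha(self):
        # Relative-error bound implied by the current growth factor.
        return (self.gamma - 1) / (self.gamma + 1)

    def _collapse(self):
        # Fold buckets 2j-1 and 2j into bucket j; (i + 1) // 2 equals ceil(i / 2).
        folded = Counter()
        for i, c in self.buckets.items():
            folded[(i + 1) // 2] += c
        self.buckets = folded
        self.collapses += 1

    def _shrink(self):
        # Collapse until the summary fits its size budget again.
        while len(self.buckets) > self.max_buckets:
            self._collapse()

    def add(self, x):
        # Index at the initial resolution, then replay the collapses on the index.
        i = math.ceil(math.log(x) / math.log(self.gamma0))
        for _ in range(self.collapses):
            i = (i + 1) // 2
        self.buckets[i] += 1
        self.count += 1
        self._shrink()

    def merge(self, other):
        # Fuse two summaries into a new summary of the union of their inputs.
        assert (self.alpha0, self.max_buckets) == (other.alpha0, other.max_buckets)
        out = ToySketch(self.alpha0, self.max_buckets)
        out.collapses = max(self.collapses, other.collapses)
        for s in (self, other):
            tmp = ToySketch(s.alpha0, s.max_buckets)
            tmp.buckets, tmp.collapses = Counter(s.buckets), s.collapses
            while tmp.collapses < out.collapses:  # align to the coarser resolution
                tmp._collapse()
            out.buckets.update(tmp.buckets)       # Counter.update adds counts
        out.count = self.count + other.count
        out._shrink()                             # restore the size guarantee
        return out

    def quantile(self, q):
        # Return a representative value of the bucket holding the rank-q item.
        rank = int(q * (self.count - 1))
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                return 2 * self.gamma ** i / (self.gamma + 1)


# Two workers summarise disjoint parts of a stream, then their summaries are fused.
left, right = ToySketch(), ToySketch()
for k in range(400):
    (left if k % 2 == 0 else right).add(1.05 ** k)
fused = left.merge(right)
median_estimate = fused.quantile(0.5)
```

In this toy version the merge first brings both summaries to the same (coarser) resolution, then adds bucket counts, then collapses again if the bucket budget is exceeded. That ordering is what lets the fused summary keep both a bounded size and a relative-error bound given by its current `alpha`, which is the behaviour the abstract describes as full mergeability.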

Updated: 2021-01-19