SetSketch: Filling the Gap between MinHash and HyperLogLog
arXiv - CS - Databases, Pub Date: 2021-01-01, DOI: arxiv-2101.00314
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Robust and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The developed methods can also be used for HyperLogLog sketches and allow estimation of joint quantities such as the intersection size with a smaller error compared to the common estimation approach based on the inclusion-exclusion principle.
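
The properties named in the abstract (a commutative and idempotent insert, a mergeable state, and Jaccard estimation) can be illustrated with a plain MinHash sketch. The Python below is a minimal sketch under stated assumptions and is not the SetSketch data structure from the paper: the class name `MinHashSketch`, the register count `m`, and the use of seeded BLAKE2b hashes are all choices made here for illustration.

```python
import hashlib


class MinHashSketch:
    """Plain MinHash with m registers (illustrative only; NOT SetSketch)."""

    def __init__(self, m=128):
        self.m = m
        # Each register holds the minimum hash value seen so far.
        self.registers = [float("inf")] * m

    @staticmethod
    def _hash(item, seed):
        # Assumption: a seeded 64-bit BLAKE2b digest stands in for
        # m independent hash functions.
        data = f"{seed}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def insert(self, item):
        # Taking a minimum is commutative and idempotent, so insertion
        # order and duplicate insertions do not change the state.
        for i in range(self.m):
            h = self._hash(item, i)
            if h < self.registers[i]:
                self.registers[i] = h

    def merge(self, other):
        # The register-wise minimum yields the sketch of the union.
        if self.m != other.m:
            raise ValueError("sketches must have the same number of registers")
        merged = MinHashSketch(self.m)
        merged.registers = [min(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def jaccard(self, other):
        # The fraction of equal registers estimates the Jaccard similarity.
        equal = sum(1 for a, b in zip(self.registers, other.registers) if a == b)
        return equal / self.m


def build(items, m=128):
    """Helper (hypothetical name): sketch an iterable of items."""
    sketch = MinHashSketch(m)
    for item in items:
        sketch.insert(item)
    return sketch
```

For example, `build(range(0, 300)).jaccard(build(range(150, 450)))` estimates the Jaccard similarity of two sets whose true value is 1/3, and merging the two sketches gives the same registers as sketching the union directly. This is why such a state suits distributed environments: partial sketches can be built independently and combined afterwards.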

Updated: 2021-01-05