SetSketch: Filling the Gap between MinHash and HyperLogLog
arXiv - CS - Databases | Pub Date: 2021-01-01 | DOI: arxiv-2101.00314 | Author: Otmar Ertl
MinHash and HyperLogLog are sketching algorithms that have become
indispensable for set summaries in big data applications. While HyperLogLog
allows counting distinct elements with very little space, MinHash is suitable
for the fast comparison of sets as it allows estimating the Jaccard similarity
and other joint quantities. This work presents a new data structure called
SetSketch that is able to continuously fill the gap between both use cases. Its
commutative and idempotent insert operation and its mergeable state make it
suitable for distributed environments. Robust and easy-to-implement estimators
for cardinality and joint quantities, as well as the ability to use SetSketch
for similarity search, enable versatile applications. The developed methods can
also be used for HyperLogLog sketches and allow estimation of joint quantities
such as the intersection size with a smaller error compared to the common
estimation approach based on the inclusion-exclusion principle.
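
The properties named in the abstract (a commutative, idempotent insert, a mergeable state, and Jaccard estimation) can be illustrated with a minimal sketch of plain MinHash. This is not SetSketch and not the paper's estimators; the class name `MinHashSketch`, the register count, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


class MinHashSketch:
    """Plain m-register MinHash, for illustration only (not SetSketch)."""

    def __init__(self, num_registers=128):
        self.m = num_registers
        # Sentinel strictly larger than any 64-bit hash value.
        self.registers = [2**64] * self.m

    def _hash(self, item, i):
        # Assumed hash family: BLAKE2b with a per-register salt, 64-bit output.
        h = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=i.to_bytes(8, "little"),
        )
        return int.from_bytes(h.digest(), "little")

    def insert(self, item):
        # Each register keeps the minimum hash seen so far. Taking a minimum
        # is commutative and idempotent, so insertion order and duplicates
        # do not affect the final state.
        for i in range(self.m):
            v = self._hash(item, i)
            if v < self.registers[i]:
                self.registers[i] = v

    def merge(self, other):
        # The register-wise minimum yields the sketch of the union of both sets.
        merged = MinHashSketch(self.m)
        merged.registers = [min(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def jaccard(self, other):
        # The fraction of equal registers estimates the Jaccard similarity.
        equal = sum(a == b for a, b in zip(self.registers, other.registers))
        return equal / self.m


def build(items, num_registers=128):
    """Hypothetical helper: build a sketch from an iterable of items."""
    sketch = MinHashSketch(num_registers)
    for item in items:
        sketch.insert(item)
    return sketch
```

For example, `build(range(0, 300)).jaccard(build(range(150, 450)))` estimates the similarity of two sets whose true Jaccard similarity is 150/450, with a standard error of roughly sqrt(J(1-J)/m) for m registers. Because `merge` only takes register-wise minima, sketches built on separate machines can be combined without revisiting the raw data.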
Updated: 2021-01-05