Analytical Confidence Intervals for the Number of Different Objects in Data Streams
Big Data Research (IF 3.3), Pub Date: 2021-08-08, DOI: 10.1016/j.bdr.2021.100248
Giacomo Aletti

This paper develops a new mathematical-statistical approach to analyze a class of Flajolet-Martin algorithms (FMa) and provides analytical confidence intervals, based on Chernoff bounds, for the number F0 of distinct elements in a stream. FMa have become very popular in big-data stream learning, but the literature has focused mainly on algorithmic aspects, chiefly complexity optimality, while the statistical analysis of this class of algorithms has often been handled heuristically. The analysis provided here shows deep connections with mathematical special functions and with extreme value theory. The latter connection may help explain the heuristic considerations, while the former raises many numerical issues, which are addressed at the end of the paper. Finally, the algorithms are tested on an anonymized real data stream, and Monte Carlo simulations are provided to support the analytical choices made in this context.
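For readers unfamiliar with the estimator family the abstract refers to, here is a minimal sketch of the classical Flajolet-Martin bitmap estimator for F0. It is an illustration only, not the paper's method: the paper's Chernoff-based confidence intervals are not reproduced. The function names (`rho`, `lowest_zero_bit`, `fm_estimate`), the use of salted BLAKE2b as a stand-in for independent hash functions, and the plain averaging over `num_hashes` bitmaps are all assumptions made for this example.

```python
import hashlib

PHI = 0.77351  # classical Flajolet-Martin correction constant


def rho(h: int, width: int = 32) -> int:
    """Index of the least significant 1-bit of h (width if h == 0)."""
    if h == 0:
        return width
    return (h & -h).bit_length() - 1


def lowest_zero_bit(bitmap: int) -> int:
    """Index of the least significant 0-bit of bitmap."""
    return (~bitmap & (bitmap + 1)).bit_length() - 1


def fm_estimate(stream, num_hashes: int = 64) -> float:
    """Estimate the number of distinct items in stream.

    Assumption: salted BLAKE2b digests stand in for independent
    uniform 32-bit hash functions.
    """
    bitmaps = [0] * num_hashes
    for item in stream:
        data = repr(item).encode("utf-8")
        for j in range(num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=4, salt=j.to_bytes(8, "little")
            ).digest()
            h = int.from_bytes(digest, "little")
            bitmaps[j] |= 1 << rho(h)  # record the observed bit pattern
    # R_j is the position of the first unset bit in bitmap j;
    # averaging R over the bitmaps reduces the estimator's variance.
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / num_hashes
    return 2 ** mean_r / PHI


if __name__ == "__main__":
    # 1000 distinct values, each seen three times.
    stream = [i % 1000 for i in range(3000)]
    print(round(fm_estimate(stream)))
```

Because each bitmap depends only on which items occurred, repeating an item never changes the estimate. That property is what makes this kind of sketch suitable for streams.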




Updated: 2021-08-13