A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2021-07-21, DOI: 10.1145/3465238
Guangtao Wang, Gao Cong, Ying Zhang, Zhen Hai, Jieping Ye

Streams in which multiple transactions are associated with the same key are prevalent in practice; for example, a customer may have multiple shopping records that arrive at different times. Itemset frequency estimation on such streams is very challenging because sampling-based methods, such as the widely used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis-based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract a KMV synopsis for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared with the existing estimator, our method is not only more accurate and more efficient to compute but also satisfies the downward-closure property. These properties allow our new estimator to be incorporated into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis-based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove that it can guarantee the accuracy of FIM with a KMV synopsis of bounded size. Experimental results on massive streams show that our estimator significantly improves accuracy for both itemset frequency estimation and FIM compared with the existing estimators.
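
To make the KMV idea concrete, the following is a minimal sketch of a per-item KMV synopsis together with the classic KMV distinct-value and intersection estimators. It is not the new estimator proposed in the article, whose details the abstract does not give, and it is not claimed to satisfy the downward-closure property. The sketch assumes that the frequency of an itemset is counted as the number of distinct keys whose transactions contain every item in the set; the names `hash01`, `KMVSynopsis`, `build_synopses`, and `estimate_intersection` are hypothetical and chosen only for illustration.

```python
import hashlib
import heapq


def hash01(key: str) -> float:
    """Map a key to a pseudo-uniform value in [0, 1) (illustrative choice of hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSynopsis:
    """Keeps the k smallest hash values of the distinct keys seen so far."""

    def __init__(self, k: int):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self._heap = []      # max-heap of kept values, stored negated
        self.values = set()  # the same values, for membership tests

    def add(self, key: str) -> None:
        h = hash01(key)
        if h in self.values:
            return  # a key seen again in a later transaction changes nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self.values.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest kept value with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self.values.discard(evicted)
            self.values.add(h)

    def estimate_distinct(self) -> float:
        if len(self.values) < self.k:
            return float(len(self.values))  # synopsis never overflowed: exact
        return (self.k - 1) / max(self.values)


def build_synopses(stream, k: int) -> dict:
    """stream yields (key, items) pairs; returns one synopsis per item."""
    synopses = {}
    for key, items in stream:
        for item in items:
            if item not in synopses:
                synopses[item] = KMVSynopsis(k)
            synopses[item].add(key)
    return synopses


def estimate_intersection(synopses) -> float:
    """Classic KMV estimate of the number of keys common to all synopses."""
    if all(len(s.values) < s.k for s in synopses):
        # Every synopsis still holds all of its keys, so the count is exact.
        return float(len(set.intersection(*(s.values for s in synopses))))
    k = min(len(s.values) for s in synopses)
    if k < 2:
        return 0.0  # too few values for the estimator to be meaningful
    # The k smallest values of the merged synopses form a KMV synopsis of the union.
    combined = sorted(set().union(*(s.values for s in synopses)))[:k]
    shared = sum(1 for v in combined if all(v in s.values for s in synopses))
    # (fraction of shared values) x (estimated size of the union)
    return (shared / k) * (k - 1) / combined[-1]
```

Because a key contributes the same hash value no matter how many of its transactions arrive or when they arrive, the synopsis is unaffected by a key's records being spread across the stream. For example, `build_synopses(stream, k=1024)` followed by `estimate_intersection([synopses["milk"], synopses["bread"]])` would give an estimate for the itemset {milk, bread} under the assumptions stated above.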

Updated: 2021-07-21