Fast Frequent Patterns Mining by Multiple Sampling With Tight Guarantee Under Bayesian Statistics
IEEE Transactions on Cybernetics (IF 11.8), Pub Date: 2021-11-18, DOI: 10.1109/tcyb.2021.3125196
Zhongjie Zhang, Jian Huang

Sampling from a large dataset is commonly used in frequent pattern (FP) mining. To provide tight theoretical guarantees on the quality of the FPs obtained from samples, current methods stabilize the supports of all patterns in random samples, even though only the FPs matter, so they always overestimate the required sample size. We propose an algorithm called multiple-sampling-based FP mining (MSFP). The MSFP first generates the set of approximate frequent items ($AFI$) and uses the $AFI$ to form the set of approximate FPs without supports ($\mathrm{AFP}^{*}$). In this stage it does not stabilize the value of any item's or pattern's support, but only the relationship ($\ge$ or $<$) between the support and the minimum support, so the MSFP can use small samples to obtain the $AFI$ and then the $\mathrm{AFP}^{*}$, and can successively prune the patterns not contained in the $AFI$ and not in the $\mathrm{AFP}^{*}$. Then, the MSFP introduces Bayesian statistics to stabilize only the support values of the patterns in $\mathrm{AFP}^{*}$. If a pattern's support in the original dataset is unknown, the MSFP treats it as a random variable and keeps updating its distribution with the approximations obtained from the samples drawn during progressive sampling, so the error probability can be bounded more tightly. Furthermore, to reduce I/O during progressive sampling, the MSFP stores a sufficiently large random sample in memory in advance. The experiments show that the MSFP is reliable and efficient.
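The abstract only outlines how Bayesian updating interacts with progressive sampling. The Python sketch below is not the authors' MSFP implementation; the function names, the Beta prior, the batch size, and the toy dataset are all illustrative assumptions. It shows the general idea of deciding, from progressively larger samples, whether a single pattern's true support is $\ge$ or $<$ the minimum support, using a Beta posterior over the unknown support rather than pinning down its exact value.

```python
import math
import random


def prob_support_ge(minsup, successes, trials, prior_a=1.0, prior_b=1.0, steps=2000):
    """P(support >= minsup) under a Beta(prior_a + successes, prior_b + failures)
    posterior, computed by trapezoidal integration of the Beta density."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    width = (1.0 - minsup) / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * pdf(minsup + i * width)
    return min(total * width, 1.0)


def progressive_support_check(transactions, pattern, minsup,
                              batch=200, confidence=0.99, max_rounds=25, seed=0):
    """Grow the sample in batches; stop once the posterior says the pattern's
    true support is >= minsup (or < minsup) with probability >= `confidence`."""
    rng = random.Random(seed)
    successes = trials = 0
    for _ in range(max_rounds):
        for _ in range(batch):
            trials += 1
            successes += pattern <= rng.choice(transactions)  # frozenset subset test
        p_ge = prob_support_ge(minsup, successes, trials)
        if p_ge >= confidence:
            return ">=", successes / trials
        if 1.0 - p_ge >= confidence:
            return "<", successes / trials
    return "undecided", successes / trials


if __name__ == "__main__":
    # Hypothetical toy transaction database: each transaction holds 5 of 20 items.
    rng = random.Random(42)
    data = [frozenset(rng.sample(range(20), 5)) for _ in range(100_000)]
    verdict, estimate = progressive_support_check(data, frozenset({1, 2}), minsup=0.1)
    print(verdict, round(estimate, 3))
```

The Beta posterior is the conjugate choice here because each sampled transaction is a Bernoulli observation of whether the pattern occurs; the actual MSFP applies this kind of reasoning jointly to all patterns in $\mathrm{AFP}^{*}$ and combines it with the pruning stages described above.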
