On a parallel Spark workflow for frequent itemset mining based on array prefix-tree
Concurrency and Computation: Practice and Experience (IF 1.5), Pub Date: 2021-04-25, DOI: 10.1002/cpe.6313
Xinzheng Niu 1 , Peng Wu 1 , Chase Q. Wu 2 , Aiqin Hou 3 , Mideng Qian 1

Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods, including FP-Growth, have been proposed. FP-Growth is a classical frequent itemset mining method that generates pattern databases without candidate generation. Because FP-Growth has high time complexity and memory usage, many improvements have been proposed in the literature. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix-Tree (AP-Tree), and propose a new algorithm, Array Prefix-Tree Growth (APT-Growth), which is further parallelized as a Spark workflow, referred to as PAPT-Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT-Growth to ensure its running time performance. We conduct extensive experiments with different thresholds on multiple datasets, and the experimental results show the performance superiority of PAPT-Growth over several state-of-the-art methods such as PFP, YAFIM, and DFPS. The density analysis reveals a change point, which justifies the necessity and validity of adaptive algorithm selection.
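
The two ideas in the abstract, support-based frequent itemsets and density-driven algorithm selection, can be illustrated with a small sketch. This is not the AP-Tree, APT-Growth, or PAPT-Growth algorithm: the miner below is a brute-force enumeration, the density formula (average transaction length divided by the number of distinct items) is an assumed definition, and the function names, the threshold value 0.5, and the strategy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force miner: count every sub-itemset of every transaction and
    keep those whose support count reaches min_support.
    Exponential in transaction length; for illustration only."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # canonical order so each itemset has one key
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


def density(transactions):
    """ASSUMED definition (not taken from the paper):
    average transaction length / number of distinct items."""
    distinct = {item for t in transactions for item in t}
    if not transactions or not distinct:
        return 0.0
    avg_len = sum(len(set(t)) for t in transactions) / len(transactions)
    return avg_len / len(distinct)


def choose_strategy(transactions, density_threshold=0.5):
    """Hypothetical selector: returns a label only. The threshold value and
    the labels are placeholders, not the paper's selection rule."""
    if density(transactions) >= density_threshold:
        return "dense-path"
    return "sparse-path"


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(result.items()):
    print(itemset, count)
print(density(transactions), choose_strategy(transactions))
```

With a minimum support count of 3, every single item and every pair is frequent in this toy dataset, while the triple ("a", "b", "c") occurs only twice and is dropped.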

Updated: 2021-04-25