HBPFP-DC: A parallel frequent itemset mining using Spark, Parallel Computing



Parallel Computing (IF 2.0), Pub Date: 2020-11-30, DOI: 10.1016/j.parco.2020.102738
Yaling Xun, Jifu Zhang, Haifeng Yang, Xiao Qin

Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. For big data applications, parallel and distributed solutions have been widely studied. However, frequent itemset mining is a continuously iterative process. As an in-memory parallel execution model in which all data is loaded into memory, Spark is especially beneficial for iterative computation. In this study, we propose the HBPFP-DC (High Balanced Parallel FP-Growth Considering Data Correlation) algorithm on the Spark platform. HBPFP-DC uses a newly defined workload estimation model for computing nodes to balance the grouping of calculation tasks among them, so that each computing node can mine frequent itemsets completely asynchronously, relying only on its own local projection dataset. In addition, to improve the 'compression factor' of the tree structure and thus boost mining efficiency, we consider the correlation among items when performing this grouping. Placing similar items in the same group significantly reduces network and computing costs. Finally, extensive experiments demonstrate that the proposed solution is efficient and scalable.
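
The abstract does not give the workload estimation model or the grouping procedure, so the following is only a minimal sketch of the general idea in plain Python (no Spark), under stated assumptions. The cost proxy in `estimate_workload` (support times position in the frequency-ordered item list) is a hypothetical placeholder, not the paper's model. The greedy least-loaded assignment is a generic balancing heuristic, and the projection step follows the group-dependent transaction idea commonly used in parallel FP-Growth. Item correlation, which the paper also takes into account when grouping, is left out here. All function names are illustrative.

```python
from collections import Counter
import heapq


def frequent_items(transactions, min_support):
    """Return (item, support) pairs sorted by descending support (the F-list)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    # Descending support, then item name, so the order is deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


def estimate_workload(rank, support):
    """Hypothetical cost proxy (an assumption, not the paper's model).

    An item at position `rank` in the F-list has up to `support` prefix
    paths, each containing at most `rank` more frequent items.
    """
    return support * (rank + 1)


def balanced_groups(flist, num_groups):
    """Greedily assign each item to the currently least-loaded group."""
    loads = [(0, g) for g in range(num_groups)]
    heapq.heapify(loads)
    costs = sorted(
        ((estimate_workload(rank, sup), item)
         for rank, (item, sup) in enumerate(flist)),
        reverse=True,
    )
    assignment = {}
    for cost, item in costs:
        load, group = heapq.heappop(loads)
        assignment[item] = group
        heapq.heappush(loads, (load + cost, group))
    return assignment


def project(transaction, rank_of, assignment):
    """Yield (group, prefix) pairs so each group can mine its items locally."""
    ordered = sorted(
        (i for i in set(transaction) if i in rank_of), key=rank_of.get
    )
    emitted = set()
    # Scan from the least frequent item; the first item met for a group
    # determines the longest prefix that group needs from this transaction.
    for pos in range(len(ordered) - 1, -1, -1):
        group = assignment[ordered[pos]]
        if group not in emitted:
            emitted.add(group)
            yield group, ordered[:pos + 1]


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b", "d"]]
    flist = frequent_items(data, min_support=2)
    groups = balanced_groups(flist, num_groups=2)
    rank_of = {item: rank for rank, (item, _) in enumerate(flist)}
    for t in data:
        print(t, list(project(t, rank_of, groups)))
```

In a Spark job, `project` would typically be applied inside a `flatMap`, followed by grouping on the group id so that each partition builds and mines its own local FP-tree without further communication.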


Updated: 2020-12-03
