Frequent Itemsets Mining for Big Data: A Comparative Analysis
Big Data Research (IF 3.3), Pub Date: 2017-08-24, DOI: 10.1016/j.bdr.2017.06.006
Daniele Apiletti, Elena Baralis, Tania Cerquitelli, Paolo Garza, Fabio Pulvirenti, Luca Venturini

Itemset mining is a well-known exploratory data mining technique used to discover interesting correlations hidden in a data collection. Since it supports different targeted analyses, it is profitably exploited in a wide range of different domains, ranging from network traffic data to medical records. With the increasing amount of generated data, different scalable algorithms have been developed, exploiting the advantages of distributed computing frameworks, such as Apache Hadoop and Spark.

This paper reviews Hadoop- and Spark-based scalable algorithms addressing the frequent itemset mining problem in the Big Data domain through both theoretical and experimental comparative analyses. Since the itemset mining task is computationally expensive, its distribution and parallelization strategies heavily affect memory usage, load balancing, and communication costs. A detailed discussion of the algorithmic choices of the distributed methods for frequent itemset mining is followed by an experimental analysis comparing the performance of state-of-the-art distributed implementations on both synthetic and real datasets. The strengths and weaknesses of the algorithms are thoroughly discussed with respect to the dataset features (e.g., data distribution, average transaction length, number of records), and specific parameter settings. Finally, based on theoretical and experimental analyses, open research directions for the parallelization of the itemset mining problem are presented.
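To make the underlying task concrete, the following is a minimal single-machine sketch of level-wise (Apriori-style) frequent itemset mining. It is not one of the Hadoop- or Spark-based algorithms the paper reviews; the function name `frequent_itemsets`, the `min_support` parameter, and the toy transactions are assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Hypothetical helper for illustration; a level-wise, in-memory sketch.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: one pass over the data per level.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


# Toy data (assumed, not from the paper): five small transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
result = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each level requires a full pass over the transactions, and the number of candidates can grow quickly. That gives a rough intuition for why the way this work is split across machines affects memory usage, load balancing, and communication costs.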




Updated: 2017-08-24