A Parallel MapReduce Algorithm to Efficiently Support Itemset Mining on High Dimensional Data
Big Data Research (IF 3.3), Pub Date: 2017-10-18, DOI: 10.1016/j.bdr.2017.10.004
Daniele Apiletti, Elena Baralis, Tania Cerquitelli, Paolo Garza, Fabio Pulvirenti, Pietro Michiardi

In today's world, large volumes of data are continuously generated by many scientific applications, such as bioinformatics and networking. Since each monitored event is usually characterized by a variety of features, the resulting datasets are often high-dimensional. To extract value from these complex collections of data, different exploratory data mining algorithms can be used to discover hidden and non-trivial correlations among data. Frequent closed itemset mining is an effective but computationally expensive technique that is commonly used to support data exploration. Thanks to the spread of distributed and parallel frameworks, scalable approaches able to deal with so-called Big Data have also been developed for frequent itemset mining. Unfortunately, most current algorithms are designed to cope with low-dimensional datasets and perform poorly in use cases characterized by high-dimensional data. This work introduces PaMPa-HD, a MapReduce-based frequent closed itemset mining algorithm for high-dimensional datasets. It proposes an efficient solution to parallelize and speed up the mining process, together with different strategies to easily configure the algorithm parameter. The experimental results, obtained on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues.
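
For readers unfamiliar with the term, the following is a minimal sketch of what "frequent closed itemset" means, written as a brute-force enumeration over a toy dataset. It is not PaMPa-HD and involves no MapReduce; the function names, the sample transactions, and the `min_support` value are all assumptions made for illustration. An itemset is frequent if at least `min_support` transactions contain it, and closed if no proper superset has the same support.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration (exponential in the number of items).

    Illustrative only: this is NOT the algorithm proposed in the paper.
    """
    items = sorted(set().union(*transactions))

    # Step 1: collect every itemset whose support reaches the threshold.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count

    # Step 2: keep only closed itemsets, i.e. those with no proper
    # superset of equal support. Such a superset would itself be
    # frequent, so searching within `frequent` is sufficient.
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(
            itemset < other and count == other_count
            for other, other_count in frequent.items()
        )
    }


# Hypothetical toy data: four transactions over the items a, b, c, d.
transactions = [
    frozenset("abc"),
    frozenset("abd"),
    frozenset("ab"),
    frozenset("c"),
]
result = closed_frequent_itemsets(transactions, min_support=2)
```

In this toy run, `{a}` and `{b}` are frequent but not closed, because `{a, b}` appears in exactly the same three transactions. The number of candidate itemsets grows exponentially with the number of items, which gives a rough sense of why high-dimensional data is hard for approaches that enumerate over items.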




Updated: 2017-10-18