EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data
Knowledge and Information Systems (IF 2.5), Pub Date: 2020-04-07, DOI: 10.1007/s10115-020-01464-1
Shashi Raj, Dharavath Ramesh, M. Sreenu, Krishan Kumar Sethi

Frequent itemset mining is a popular tool for discovering knowledge from transactional datasets and serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns, among which the apriori algorithm is considered the earliest. Apriori has two significant bottlenecks: first, repeated scanning of the input dataset, and second, the need to generate all candidate itemsets before counting their support values. These bottlenecks reduce the effectiveness of apriori on large-scale datasets. Reasonable efforts have been made to diminish these bottlenecks and improve efficiency. In particular, when the data size grows, even distributed and parallel environments such as MapReduce do not perform well, because the iterative nature of the algorithm incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to improve efficiency further. Unlike apriori, it generates candidates 'on-the-fly,' i.e., candidate generation and support counting proceed simultaneously while the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it computes an updated input dataset by removing useless items and transactions. The reduced dataset size in later iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.
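The abstract does not include code; as a rough illustration of the two ideas it describes (generating and counting candidates in the same scan, and shrinking the dataset between iterations), the following PySpark sketch shows one possible way to express them. It is not the authors' EAFIM implementation; the toy dataset, variable names, and pruning details are assumptions made only for illustration.

```python
# A minimal PySpark sketch (not the authors' implementation) of the two ideas in the
# abstract: (1) candidate k-itemsets are generated and their support is counted in
# the same pass over the transactions, and (2) the transaction RDD is shrunk between
# iterations by dropping items that are no longer frequent and transactions that
# have become too short. All names and the toy data are illustrative.
from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("apriori-style-sketch").getOrCreate()
sc = spark.sparkContext

transactions = sc.parallelize([
    {"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}, {"a", "b", "c"},
])
min_support = 2

# Pass 1: frequent 1-itemsets.
frequent = dict(
    transactions.flatMap(lambda t: [(frozenset([i]), 1) for i in t])
                .reduceByKey(lambda x, y: x + y)
                .filter(lambda kv: kv[1] >= min_support)
                .collect()
)

k = 2
while frequent:
    frequent_items = set().union(*frequent.keys())
    prev_level = sc.broadcast(set(frequent.keys()))

    # Shrink the dataset: keep only items that are still frequent and drop
    # transactions shorter than k (they cannot contain any k-itemset).
    transactions = (transactions
                    .map(lambda t, fi=frequent_items: t & fi)
                    .filter(lambda t, kk=k: len(t) >= kk)
                    .cache())

    # 'On-the-fly' candidates: enumerate the k-subsets of each pruned transaction
    # and count their support in the same scan, instead of materializing the full
    # candidate set first and rescanning the data afterwards. The filter keeps only
    # candidates whose (k-1)-subsets were all frequent in the previous level.
    frequent = dict(
        transactions
        .flatMap(lambda t, kk=k: [(frozenset(c), 1) for c in combinations(sorted(t), kk)])
        .filter(lambda kv, kk=k: all(frozenset(s) in prev_level.value
                                     for s in combinations(kv[0], kk - 1)))
        .reduceByKey(lambda x, y: x + y)
        .filter(lambda kv: kv[1] >= min_support)
        .collect()
    )
    print(f"frequent {k}-itemsets:", {tuple(sorted(s)): c for s, c in frequent.items()})
    k += 1

spark.stop()
```

On the toy data above with a minimum support of 2, the sketch reports {a, b}, {a, c}, {b, c} and then {a, b, c} before the pruned dataset empties out; the loop stops as soon as no frequent itemsets remain at a level.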
