A Spark-based Apriori algorithm with reduced shuffle overhead
The Journal of Supercomputing (IF 3.3) Pub Date: 2020-03-27, DOI: 10.1007/s11227-020-03253-7
Shashi Raj, Dharavath Ramesh, Krishan Kumar Sethi

Mining frequent itemsets is a core task in finding association rules from transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets, but its bottlenecks, such as repeated scans of the input dataset and generation of all candidate itemsets before counting their support, reduce its effectiveness on large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm: in each iteration the input dataset, which resides on disk, is scanned again, incurring high disk I/O. Apache Spark implementations of Apriori show better performance owing to in-memory processing: iterative scanning of the dataset is faster because it is kept in a memory abstraction called a resilient distributed dataset (RDD). An RDD stores a dataset as key-value pairs spread across the cluster nodes, and RDD operations require these pairs to be redistributed among the nodes in the course of processing. This redistribution, or shuffle, operation incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It exploits Spark's parallel and distributed computing environment and its in-memory processing capabilities, and it improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method, thereby reducing the communication and synchronization overhead incurred by Spark's shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and compare it with an existing algorithm. Experimental results show that SARSO performs better in terms of running time and scalability.
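The mechanism the abstract alludes to, restricting key-value pair movement by fixing a partitioning, can be sketched with Spark's public API. The following minimal Scala illustration is a hypothetical sketch, not the authors' SARSO implementation: the input path transactions.txt, the absolute threshold minSupport, and the two-pass structure are all assumptions made for illustration. The relevant Spark behavior is real, though: pinning an explicit HashPartitioner on a pair RDD means later key-based operations that reuse the same partitioner see a narrow dependency and avoid re-shuffling those pairs.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object AprioriShuffleSketch {
  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for a runnable toy; a cluster master would be used in practice.
    val sc = new SparkContext(new SparkConf().setAppName("apriori-sketch").setMaster("local[*]"))

    val minSupport = 2L // hypothetical absolute support threshold

    // Each line of the (hypothetical) input file is one transaction: whitespace-separated item ids.
    // cache() keeps the transactions in memory across Apriori passes, avoiding repeated disk scans.
    val transactions = sc.textFile("transactions.txt")
      .map(_.split("\\s+").toSet)
      .cache()

    // Pass 1: frequent 1-itemsets. flatMap emits (item, 1) pairs; reduceByKey
    // aggregates counts per item, which triggers a shuffle.
    val frequent1 = transactions
      .flatMap(t => t.map(item => (item, 1L)))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minSupport }

    // Pinning an explicit partitioner co-locates the counts: any later key-based
    // operation on this RDD with the same partitioner (join, reduceByKey, cogroup)
    // becomes a narrow dependency and does not re-shuffle these pairs. SARSO's
    // partitioning method works in this spirit per the abstract; this sketch does
    // not reproduce the authors' algorithm.
    val partitioner = new HashPartitioner(sc.defaultParallelism)
    val partitioned1 = frequent1.partitionBy(partitioner).cache()

    // In this toy the frequent items are small enough to broadcast to all workers.
    val frequentItems = partitioned1.keys.collect().toSet
    val broadcastItems = sc.broadcast(frequentItems)

    // Pass 2: candidate 2-itemsets from frequent items only. Passing the same
    // partitioner to reduceByKey fixes the layout of the output, so subsequent
    // iterations over these counts need no further redistribution.
    val frequent2 = transactions
      .flatMap { t =>
        val items = t.intersect(broadcastItems.value).toSeq.sorted
        for (i <- items.indices; j <- i + 1 until items.size)
          yield ((items(i), items(j)), 1L)
      }
      .reduceByKey(partitioner, _ + _)
      .filter { case (_, count) => count >= minSupport }

    frequent2.collect().foreach(println)
    sc.stop()
  }
}
```

In a full Apriori run this pattern would repeat for each pass k; the saving the abstract describes comes from keeping each pass's key-value pairs co-located rather than letting every iteration's RDD operations redistribute them afresh.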

Updated: 2020-03-27