A fast parallel attribute reduction algorithm using Apache Spark, Knowledge-Based Systems


A fast parallel attribute reduction algorithm using Apache Spark
Knowledge-Based Systems (IF 8.8), Pub Date: 2020-11-11, DOI: 10.1016/j.knosys.2020.106582
Linzi Yin, Liyang Qin, Zhaohui Jiang, Xuemei Xu

Effective and fast attribute reduction on high-dimensional datasets is one of the most important issues in big data, and several parallel attribute reduction algorithms have been implemented using MapReduce. However, MapReduce is not well suited to iterative computing, which leads to low computational efficiency in many cases. In this paper, we propose a novel parallel attribute reduction algorithm built on the new-generation distributed computing framework Apache Spark. First, a core attribute decision strategy is proposed to replace the traditional attribute significance calculation, reducing the number of iterations from $|C||R| - |R|^{2}/2 + |R|/2$ to $|C|$, where $|C|$ is the number of condition attributes and $|R|$ is the number of attributes in the resulting reduct. Furthermore, for high-dimensional datasets, we design a batch processing strategy that reduces the number of iterations exponentially. Second, the proposed algorithm is accelerated with three techniques: (1) network data transmission is minimized through localized operations; (2) a single-cache iteration method is proposed to reduce disk I/O cost; and (3) some calculations are skipped by an interruption strategy. In the experimental analysis, we successfully processed various types of real big datasets and random datasets in a real distributed computing environment and compared the algorithm in various aspects with the classic MapReduce-based parallel attribute reduction algorithm PAAR_PR. The experimental results show that the computing efficiency of our algorithm improves by more than 98% compared with PAAR_PR.
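To make the two iteration counts in the abstract concrete, the short Python sketch below evaluates both. It rests on an assumption the abstract does not state: that the traditional count comes from a forward greedy search in which round $k$ (starting at 0) evaluates the significance of each of the $|C| - k$ attributes not yet selected, for $|R|$ rounds. Summing those rounds gives exactly $|C||R| - |R|^{2}/2 + |R|/2$. The function names are hypothetical and are not taken from the paper.

```python
def greedy_iterations(num_condition_attrs: int, reduct_size: int) -> int:
    """Count evaluations in an assumed forward greedy search.

    Assumption (not from the paper): round k evaluates every attribute
    that has not been selected yet, i.e. |C| - k of them.
    """
    return sum(num_condition_attrs - k for k in range(reduct_size))


def closed_form_iterations(num_condition_attrs: int, reduct_size: int) -> int:
    """Closed form from the abstract: |C||R| - |R|^2/2 + |R|/2."""
    c, r = num_condition_attrs, reduct_size
    # 2cr - r^2 + r is always even, so integer division is exact.
    return (2 * c * r - r * r + r) // 2


def single_pass_iterations(num_condition_attrs: int) -> int:
    """Count stated in the abstract for the core attribute decision strategy."""
    return num_condition_attrs


if __name__ == "__main__":
    c, r = 100, 10  # illustrative sizes, not taken from the paper
    print(greedy_iterations(c, r), closed_form_iterations(c, r),
          single_pass_iterations(c))
```

Under this reading, the saving grows with the size of the reduct: the larger $|R|$ is relative to $|C|$, the more rounds of significance evaluation the single pass over $|C|$ attributes avoids.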


Updated: 2020-11-12
