RST-DE: Rough Sets-Based New Differential Evolution Algorithm for Scalable Big Data Feature Selection in Distributed Computing Platforms

Big Data (IF 2.6), Pub Date: 2022-08-12, DOI: 10.1089/big.2021.0267
Santosh Thakur, Ramesh Dharavath, Achyut Shankar, Prabhishek Singh, Manoj Diwakar, Mohammad R Khosravi


In data analysis, data scientists usually focus on the size of the data rather than on feature selection. Owing to the rapid growth of internet resources, data are growing exponentially and carry ever more features, which leads to the dimensionality problem of big data. A large feature set contains much redundant data, which may reduce classification accuracy. Feature selection has therefore attracted the research community as a way to identify and remove irrelevant features with greater scalability and accuracy. To this end, this study presents a novel feature selection framework implemented on the Hadoop and Apache Spark platforms. The proposed model combines rough sets with the differential evolution (DE) algorithm: rough sets are used to find a minimal feature subset, but they do not consider the degree of overlap in the data, so DE is used to search for the optimal features. The proposed model is evaluated with Random Forest and Naive Bayes classifiers on five well-known data sets and compared with existing feature selection models from the literature. The results show that the proposed model performs well in terms of scalability and accuracy.
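
The abstract does not give the fitness function, the encoding, or how the work is distributed over Hadoop and Spark, so the following is only a minimal single-machine sketch of the general idea, not the authors' method. It assumes a rough-set dependency degree (the fraction of rows whose equivalence class carries a single label) as the quality measure, a small size penalty, and a standard DE/rand/1/bin search over real vectors thresholded at 0.5. The names `dependency_degree` and `de_feature_selection`, the parameter values, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import product


def dependency_degree(rows, labels, subset):
    """Rough-set dependency: share of rows whose equivalence class
    (rows with identical values on `subset`) has exactly one label."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[j] for j in subset)].append(i)
    positive = sum(
        len(members)
        for members in classes.values()
        if len({labels[i] for i in members}) == 1
    )
    return positive / len(rows)


def de_feature_selection(rows, labels, pop_size=20, generations=40,
                         f=0.5, cr=0.9, penalty=0.01, seed=0):
    """DE/rand/1/bin over vectors in [0, 1]; a feature is selected when
    its component exceeds 0.5. All parameter values are assumptions."""
    rng = random.Random(seed)
    n = len(rows[0])

    def decode(vec):
        return [j for j, x in enumerate(vec) if x > 0.5]

    def fitness(vec):
        subset = decode(vec)
        # Reward dependency, lightly penalise larger subsets.
        return dependency_degree(rows, labels, subset) - penalty * len(subset) / n

    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    scores = [fitness(v) for v in pop]

    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(n)  # at least one mutated component
            trial = []
            for j in range(n):
                if rng.random() < cr or j == forced:
                    x = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, x)))
                else:
                    trial.append(pop[i][j])
            s = fitness(trial)
            if s >= scores[i]:  # greedy selection keeps the better vector
                pop[i], scores[i] = trial, s

    best = max(range(pop_size), key=scores.__getitem__)
    return decode(pop[best]), scores[best]


# Toy data (hypothetical): feature 1 duplicates feature 0, feature 3 is
# irrelevant, and the label is feature 0 XOR feature 2.
rows = [(a, a, b, c) for a, b, c in product((0, 1), repeat=3)]
labels = [a ^ b for a, _, b, _ in rows]

subset, score = de_feature_selection(rows, labels)
print("selected feature indices:", subset)
print("dependency degree:", dependency_degree(rows, labels, subset))
```

In this sketch the rough-set measure only scores a candidate subset, while DE does the searching. A distributed version would need to compute the equivalence classes across partitions, which is not attempted here.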


Updated: 2022-08-16
