A scalable and effective rough set theory-based approach for big data pre-processing, Knowledge and Information Systems

Knowledge and Information Systems (IF 2.5), Pub Date: 2020-05-02, DOI: 10.1007/s10115-020-01467-y
Zaineb Chelly Dagdia, Christine Zarges, Gaël Beck, Mustapha Lebbah

A major challenge in the knowledge discovery process is performing data pre-processing, specifically feature selection, on data sets with a large number of records and a high-dimensional attribute set. A variety of techniques have been proposed in the literature to address this challenge, with varying degrees of success, as most of them need further information about the input data for thresholding, need noise levels to be specified, or rely on feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependencies within the data and reduce the number of attributes in an input data set, using the data alone and requiring no supplementary information. However, RST is highly computationally expensive and reaches its limits on massive data sets. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes were considered, revealing that the proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, which makes it relevant to big data.
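
To make the idea of RST "dependency" concrete, the following is a minimal single-machine sketch of the classic rough set dependency degree, gamma = |POS| / |U|, together with a greedy forward selection built on it. It is not the Spark-based algorithm proposed in the paper, whose details the abstract does not give; the function names (`partition`, `dependency`, `greedy_reduct`) and the toy table are assumptions made only for illustration.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence

Row = Sequence[Hashable]


def partition(rows: Sequence[Row], attrs: Sequence[int]) -> List[List[int]]:
    """Group row indices into equivalence classes: rows that agree on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows: Sequence[Row], decisions: Sequence[Hashable],
               attrs: Sequence[int]) -> float:
    """Dependency degree: fraction of rows in the positive region.

    A row is in the positive region when every row that shares its
    values on attrs also shares its decision label.
    """
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({decisions[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows: Sequence[Row], decisions: Sequence[Hashable],
                  n_attrs: int) -> List[int]:
    """Add the attribute with the highest dependency until the selected
    subset is as informative as the full attribute set.

    The best candidate is added even when it brings no immediate gain;
    dependency never decreases as attributes are added, so the loop
    stops after at most n_attrs steps.
    """
    full = dependency(rows, decisions, list(range(n_attrs)))
    selected: List[int] = []
    best = dependency(rows, decisions, selected)
    while best < full:
        candidates = [a for a in range(n_attrs) if a not in selected]
        gains = {a: dependency(rows, decisions, selected + [a])
                 for a in candidates}
        top = max(gains, key=gains.get)
        selected.append(top)
        best = gains[top]
    return selected


# Hypothetical toy table: attribute 0 alone determines the decision.
ROWS = [(1, 0, 0), (1, 1, 0), (0, 0, 0), (0, 1, 0)]
DECISIONS = ["yes", "yes", "no", "no"]

if __name__ == "__main__":
    print(dependency(ROWS, DECISIONS, [0]))
    print(greedy_reduct(ROWS, DECISIONS, 3))
```

Each greedy step recomputes a partition of all rows for every remaining candidate attribute, so the cost grows quickly with both the number of rows and the number of attributes. This is one way to see why plain RST becomes expensive on massive, high-dimensional data sets.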

Updated: 2020-05-02
