Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification, Neural Computing and Applications



Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification
Neural Computing and Applications (IF 4.5), Pub Date: 2021-03-17, DOI: 10.1007/s00521-021-05871-5
Dželila Mehanović, Dino Kečo, Jasmin Kevrić, Samed Jukić, Adnan Miljković, Zerina Mašetić

As the volume of data generated, stored, and processed daily by machine learning, data analytics, and decision-making systems grows exponentially, data preprocessing has become a key factor in building reliable, high-performance machine learning models. One role of preprocessing is variable reduction through feature selection methods; however, the processing time these methods require is a major drawback. This study aims to mitigate that problem by migrating the algorithm to a MapReduce implementation suitable for parallelization across a large number of commodity hardware units. The study focuses on genetic algorithm-based methods. Hadoop, an open-source MapReduce implementation, served as the framework for the parallel genetic algorithms. Representative machine learning methods, namely SVM (support vector machine), ANN (artificial neural network), RT (random tree), logistic regression, and Naive Bayes, were embedded into the implementation for feature selection. The feature selection methods were applied to four NSL-KDD data sets, reducing the number of features from approximately 40 to approximately 10 with an accuracy of 90.45%. These results have both practical and theoretical significance. On the theoretical side, the genetic algorithm has been parallelized in the MapReduce manner, which had been considered unachievable in a strict sense. Furthermore, the genetic algorithm allows randomness-enhanced feature selection, and its parallelization reduces overall data preprocessing time and allows a larger population size, which in turn leads to better feature selection. On the practical side, the implementation is shown to outperform existing feature selection methods.
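To make the map/reduce split described in the abstract concrete, here is a minimal single-process sketch in Python. It is not the authors' Hadoop implementation. The map step scores each candidate feature subset independently, which is the part that parallelizes. The reduce step performs selection, crossover, and mutation. All names (`fitness`, `map_phase`, `reduce_phase`, `run_ga`) and the set `INFORMATIVE` are hypothetical. The fitness function is a toy stand-in for the accuracy of an embedded classifier, not a result from the paper.

```python
import random

N_FEATURES = 40  # roughly the NSL-KDD feature count mentioned in the abstract
# Assumption for illustration only: pretend every 4th feature is informative.
INFORMATIVE = frozenset(range(0, N_FEATURES, 4))


def fitness(mask):
    """Toy stand-in for classifier accuracy: reward informative features,
    apply a small penalty per selected feature."""
    selected = {i for i, bit in enumerate(mask) if bit}
    hits = len(selected & INFORMATIVE)
    return hits / len(INFORMATIVE) - 0.01 * len(selected)


def map_phase(population):
    """Map step: score every individual independently of the others."""
    return [(fitness(mask), mask) for mask in population]


def tournament(scored, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(scored, k))[1]


def reduce_phase(scored, rng, mutation_rate=0.02):
    """Reduce step: elitism, one-point crossover, bit-flip mutation."""
    children = [max(scored)[1]]  # carry the best individual over unchanged
    while len(children) < len(scored):
        a, b = tournament(scored, rng), tournament(scored, rng)
        cut = rng.randrange(1, N_FEATURES)
        child = a[:cut] + b[cut:]
        # Flip each bit with probability mutation_rate.
        child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
        children.append(child)
    return children


def run_ga(pop_size=30, generations=40, seed=0):
    """Run the GA and return (best score, indices of selected features)."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in range(N_FEATURES))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population = reduce_phase(map_phase(population), rng)
    best_score, best_mask = max(map_phase(population))
    return best_score, [i for i, bit in enumerate(best_mask) if bit]


if __name__ == "__main__":
    score, features = run_ga()
    print(f"best score {score:.3f} using {len(features)} features: {features}")
```

Because each call to `fitness` depends only on its own mask, the list comprehension in `map_phase` could in principle be replaced by a process pool or distributed across mappers. In a real wrapper method, `fitness` would train and evaluate a classifier on the selected columns, and that evaluation is where the processing time goes.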


Updated: 2021-03-17
