A comparison of two hybrid ensemble techniques for network anomaly detection in spark distributed environment, Journal of Information Security and Applications


Journal of Information Security and Applications ( IF 3.8 ) Pub Date : 2020-09-02 , DOI: 10.1016/j.jisa.2020.102601
Gagandeep Kaur

In this paper, the authors compare ensemble methods in a Spark-supported distributed environment. With ever-changing attack trends, traditional machine learning algorithms fail to detect new types of network-based attacks, so machine learning techniques need to be improved. Secondly, there is a need for faster and more accurate detection algorithms, which calls for the study of distributed frameworks such as Apache Spark. Thirdly, dataset size reduction plays a major role in machine learning algorithms, so effort is required to reduce data sizes without affecting the performance metrics. In this work, KMeans clustering and GMM-based clustering are used to reduce the dataset size while maintaining the diversity of the traffic. The clustered data acts as input to a Random Forest (RF) classifier. RF classification has also been done for class-wise detection of attacks. The outputs from KMeans-based RF classification, GMM-based classification and class-wise RF classification were taken as input for the base learners of the ensemble methods. Two ensemble methods, namely a Weighted Voting based AdaBoost ensemble and a Stacking based ensemble, have been studied and compared. Two datasets, NSL-KDD and UNSW-NB15, were used to carry out the study. With KM+RF, accuracies of 78.9% and 58.54% were achieved for KDDTest+ and KDDTest-21, respectively. Accuracies of 79.98% and 63.19% were achieved with GMM+RF. Furthermore, an accuracy of 82% was achieved for UNSW-NB15 with KM+RF, whereas an accuracy of 84% was achieved for the same dataset with GMM+RF.
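The cluster-based size reduction described above can be illustrated with a small sketch. This is not the authors' implementation: it uses a plain-Python k-means rather than Spark, a deterministic farthest-point initialization, and a hypothetical `keep_fraction` parameter that retains the same share of records from every cluster so that each traffic group stays represented. All function names here are assumptions made for illustration.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns the cluster index assigned to each point."""
    centroids = farthest_point_init(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def reduce_by_cluster(points, labels, k, keep_fraction, seed=0):
    """Keep a fixed fraction of the records in every cluster (at least one),
    so the reduced set still covers each group of similar traffic."""
    assignment = kmeans(points, k)
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(assignment):
        by_cluster[cluster].append(index)
    rng = random.Random(seed)
    kept = []
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        n_keep = max(1, round(len(members) * keep_fraction))
        kept.extend(rng.sample(members, n_keep))
    kept.sort()
    return [points[i] for i in kept], [labels[i] for i in kept]
```

The reduced points and labels would then be passed to a classifier such as a Random Forest. A GMM-based variant would replace the hard nearest-centroid assignment with the most probable mixture component for each record.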

With the Weighted Voting based AdaBoost ensemble, accuracies of 90.46% and 83.32% were achieved for KDDTest+ and KDDTest-21, respectively. Similarly, an accuracy of 91.31% was achieved for the UNSW-NB15 test data with the Weighted Voting based AdaBoost ensemble. With the Stacking based ensemble, accuracies of 85.24% and 78.20% were achieved for KDDTest+ and KDDTest-21, respectively. Lastly, an accuracy of 89.57% was achieved with the Stacking based ensemble for the UNSW-NB15 test dataset. Overall, we were able to achieve better detection rates and accuracies with reduced false alarm rates by using ensemble methods. Tests were conducted on different machines by varying the number of executor cores to study time latency in the distributed Spark environment.
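As a rough illustration of weighted voting with AdaBoost-style weights, the sketch below gives each base learner the classic binary AdaBoost weight, alpha = 0.5 * ln((1 - err) / err), and sums the weights per predicted label. The abstract does not state how the paper derives its weights, so the formula choice, the function names, and the error rates used in the usage example are all assumptions.

```python
import math
from collections import defaultdict


def adaboost_weight(error_rate, eps=1e-10):
    """Classic binary AdaBoost learner weight: 0.5 * ln((1 - err) / err).
    The error rate is clamped away from 0 and 1 to keep the log finite."""
    err = min(max(error_rate, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


def weighted_vote(predictions, weights):
    """Return the label whose supporting learners carry the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical validation error rates for three base learners.
error_rates = [0.40, 0.40, 0.05]
weights = [adaboost_weight(e) for e in error_rates]

# Two weak learners say "normal", one strong learner says "attack".
decision = weighted_vote(["normal", "normal", "attack"], weights)
```

With these assumed error rates, the single low-error learner outweighs the two weaker ones, which is the main difference from plain majority voting. A stacking ensemble would instead train a separate meta-classifier on the base learners' outputs rather than combining them with fixed weights.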


Updated: 2020-09-02
