Investigating rarity in web attacks with ensemble learners
Journal of Big Data (IF 8.6), Pub Date: 2021-05-20, DOI: 10.1186/s40537-021-00462-6
Richard Zuech, John Hancock, Taghi M. Khoshgoftaar

Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class has only a small number of instances for machine learning classifiers to train on, making it difficult for the classifiers to learn and discriminate the positive class. To investigate rarity, we examine three individual web attacks in big data from the CSE-CIC-IDS2018 dataset: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. These three attacks are also severely imbalanced, so we evaluate whether random undersampling (RUS) can improve classification performance for each of them. Eight levels of RUS are evaluated: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. To measure classification performance, Area Under the Receiver Operating Characteristic Curve (AUC) metrics are obtained for seven classifiers: Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Decision Tree (DT), Naive Bayes (NB), and Logistic Regression (LR). The first four are ensemble learners; the last three are single learners included for comparison. We find that applying random undersampling improves overall classification performance, as measured by AUC, in a statistically significant manner. Ensemble learners achieve the top AUC scores after massive undersampling is applied, but they break down and perform poorly (worse than NB and DT) when no sampling is applied under our unique and harsh experimental conditions of severe class imbalance and rarity.
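
To make the RUS ratios in the abstract concrete, here is a minimal sketch of random undersampling to a target majority:minority ratio. It is not the authors' implementation. It assumes binary labels with 1 = attack and 0 = normal, reads each ratio as normal:attack, and keeps every attack instance while randomly discarding normal instances. The function name `random_undersample`, its parameters, and the toy class counts are all hypothetical.

```python
import numpy as np

def random_undersample(y, majority, minority, seed=0):
    """Return sorted indices keeping every positive and a random subset of negatives.

    The kept negatives:positives ratio is approximately majority:minority.
    Assumes binary labels (1 = attack, 0 = normal); this is an illustrative
    sketch, not the procedure used in the paper.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Number of negatives needed to reach the target ratio.
    n_neg_target = int(round(len(pos_idx) * majority / minority))
    # Never request more negatives than exist (equivalent to no sampling).
    n_neg = min(n_neg_target, len(neg_idx))
    kept_neg = rng.choice(neg_idx, size=n_neg, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))

# Toy labels: 10 attacks among 100,000 normal records (made-up counts).
y = np.concatenate([np.ones(10, dtype=int), np.zeros(100_000, dtype=int)])
ratios = [(999, 1), (99, 1), (95, 5), (9, 1), (3, 1), (65, 35), (1, 1)]
for maj, mino in ratios:
    idx = random_undersample(y, maj, mino)
    n_pos = int(y[idx].sum())
    print(f"{maj}:{mino} -> {len(idx) - n_pos} negatives, {n_pos} positives")
```

In an evaluation like the one described, undersampling of this kind would normally be applied to the training data only, so that AUC is still measured on data with the original class distribution; whether the paper does exactly this is not stated in the abstract.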




Updated: 2021-05-22