A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 Big Data
Journal of Big Data (IF 8.6), Pub Date: 2020-11-23, DOI: 10.1186/s40537-020-00382-x
Joffrey L. Leevy, Taghi M. Khoshgoftaar

The exponential growth in computer networks and network applications worldwide has been matched by a surge in cyberattacks. For this reason, datasets such as CSE-CIC-IDS2018 were created to train predictive models on network-based intrusion detection. These datasets are not meant to serve as repositories for signature-based detection systems, but rather to promote research on anomaly-based detection through various machine learning approaches. CSE-CIC-IDS2018 contains about 16,000,000 instances collected over the course of ten days. It is the most recent intrusion detection dataset that is big data, publicly available, and covers a wide range of attack types. This multi-class dataset has a class imbalance, with roughly 17% of the instances comprising attack (anomalous) traffic. Our survey work contributes several key findings. We determined that the best performance scores for each study, where available, were unexpectedly high overall, which may be due to overfitting. We also found that most of the works did not address class imbalance, the effects of which can bias results in a big data study. Lastly, we discovered that information on the data cleaning of CSE-CIC-IDS2018 was inadequate across the board, a finding that may indicate problems with reproducibility of experiments. In our survey, major research gaps have also been identified.




Updated: 2020-11-25