Introduction

Cybersecurity is an important consideration for the modern Internet era, with consumers spending over $600 billion on e-commerce sales during 2019 in the United States [1]. Security practitioners struggle to properly defend this increasingly important cyberspace in a constant arms race against criminals and other adversaries. When employing security analytics [2,3,4], one important aspect that defenders confront is the issue of class imbalance.

Class imbalance occurs when one class label is disproportionately represented as compared to another class label. For example, in cybersecurity it is not uncommon for a cyberattack to be lost in a sea of normal instances, similar to the proverbial “needle in a haystack”. Amit et al. [5], from Palo Alto Networks and Shodan, state that in cybersecurity “imbalance ratios of 1 to 10,000 are common.” We agree with their assessment that very high imbalance ratios are common in cybersecurity, which motivates this study’s exploration of sampling ratios for cybersecurity web attacks.

Class rarity is an extreme case of class imbalance, and rarity is not uncommon in cybersecurity, especially among more stealthy or sophisticated attacks [6]. Throughout this document, the term rarity will always refer to class rarity. Rarity occurs in machine learning when the Positive Class Count (PCC) has fewer than a few hundred instances [7], as compared to many more negative instances. For example, 10,000,000 total instances with an imbalance level of 1% from the positive class would yield a PCC of 100,000, which is typically enough positive class instances for machine learning classifiers to discriminate class patterns (this example would only be highly imbalanced, not rare). On the other hand, 1,000 total instances with that same imbalance level of 1% would only provide a PCC of 10, and this would constitute rarity, as machine learning classifiers generally struggle with so few instances from the positive class [8]. For the purposes of our experiment, we consider a PCC of fewer than 300 instances to constitute rarity.
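The arithmetic behind these two examples is straightforward; the following minimal sketch (the function name and the cutoff check are our own illustration) makes it explicit:

```python
def positive_class_count(total_instances: int, positive_fraction: float) -> int:
    """Positive Class Count (PCC) implied by a dataset size and imbalance level."""
    return int(total_instances * positive_fraction)

# 1% positives among 10,000,000 instances: highly imbalanced, but not rare.
print(positive_class_count(10_000_000, 0.01))        # 100000
# 1% positives among 1,000 instances: PCC of 10, well under our rarity cutoff of 300.
print(positive_class_count(1_000, 0.01) < 300)       # True
```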

To evaluate web attacks, we utilize the CSE-CIC-IDS2018 dataset, which was created by Sharafaldin et al. [9] at the Canadian Institute for Cybersecurity. CSE-CIC-IDS2018 is a more recent intrusion detection dataset than the popular CIC-IDS2017 dataset [10], which was also created by Sharafaldin et al. The CSE-CIC-IDS2018 dataset includes over 16 million instances, comprising normal instances as well as the following families of attacks: web attack, Denial of Service (DoS), Distributed Denial of Service (DDoS), brute force, infiltration, and botnet. For additional details on the CSE-CIC-IDS2018 dataset [11], please refer to [12].

The CSE-CIC-IDS2018 dataset is big data, as it contains over 16 million instances. While big data has not been formally defined in terms of the number of instances, one study [13] considers only 100,000 instances to be big data. Other studies [14, 15] have considered 1,000,000 instances to be big data. Since CSE-CIC-IDS2018 is more than 1,000,000 instances, we consider it to be big data as well.

For illustrative purposes, Table 1 contains the breakdown for the entire CSE-CIC-IDS2018 dataset (although the entire dataset is not used in these experiments, and this table should only be used for reference purposes). In this study, we only focus on web attacks with normal traffic and discard the other attack instances (further details of creating the datasets are provided in the “Data preparation” section below).

Table 1 Entire CSE-CIC-IDS2018 dataset by files/days (only web attacks and normal traffic are used in our experiments)

Table 2 contains the three datasets we use for the experiments in this study, where each of these three datasets is composed of web attacks from the following labels in CSE-CIC-IDS2018: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. The “Imbalance Classification” column in Table 2 indicates the varying levels of class imbalance and rarity we can explore within our experimental frameworks.

Authors of the CSE-CIC-IDS2018 dataset utilized the Damn Vulnerable Web App (DVWA) [16] and Selenium framework [17] for implementing their three web attacks. The “Brute Force-Web” label corresponds to brute force login attacks targeting web pages. Next, the “Brute Force-XSS” label refers to a cross-site scripting (XSS) attack [18] where attackers inject malicious client-side scripts into susceptible web pages targeting web users which view those pages. Finally, the “SQL Injection” label represents a code injection technique [19] where attackers craft special sequences of characters and submit them to web page forms in an attempt to directly query the back-end database of that website.

Table 2 Individual attacks used in this experiment from CSE-CIC-IDS2018

Through our data preparation process, we are able to evaluate web attacks from CSE-CIC-IDS2018 at a class ratio of normal to attack of 21,915:1 for Brute Force, 58,218:1 for XSS, and 153,911:1 for SQL Injection web attacks. Our work is unique in that existing works only evaluate class ratios as high as 2,896:1 for web attacks, and none of the existing works evaluate the effects of applying sampling techniques. The CSE-CIC-IDS2018 dataset is comprised of ten different days of files, and we combine all 10 days of normal traffic with the web attack instances. Other works only evaluate web attacks with 1 or 2 days of normal traffic. By combining all 10 days of normal traffic, we can obtain a higher imbalance ratio as well as a richer backdrop of normal data as compared to other studies. We provide further details for this in the “Related work” and “Data preparation” sections.

To evaluate the effects of class imbalance, we explore eight different levels of sampling ratios with random undersampling (RUS): no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. We also compare the following seven different classifiers in our experiments with web attacks: Decision Tree, Random Forest, CatBoost, LightGBM, XGBoost, Naive Bayes, and Logistic Regression. To quantify classification performance, we utilize the Area Under the Receiver Operating Characteristic Curve (AUC) metric.

The uniqueness of our contribution is that no current works explore the effects of various sampling ratios with the CSE-CIC-IDS2018 dataset. None of the existing works combine all the days of normal traffic from CSE-CIC-IDS2018 to analyze individual web attacks, as we have uniquely done with our data preparation process to isolate these three individual web attacks with binary classification and imbalance ratios exceeding the highest 2,896:1 ratio from existing CSE-CIC-IDS2018 literature. Our work considers severe imbalance ratios as high as 153,911:1. Additionally, no works with CSE-CIC-IDS2018 explore the effects of class rarity as we present in this study with XSS and SQL Injection web attacks and their low Positive Class Count (PCC) as outlined in Table 2.

Our work focuses exclusively on web attacks to consider the above research issues, while other related works we surveyed with web attacks from CSE-CIC-IDS2018 were more generalized studies considering all attack types (as detailed in the “Related work” section below). The few studies that did consider individual web attacks through multi-class classification had extremely poor classification results for those web attacks. Thus, we were surprised when our classification performance yielded such good results. We statistically validated classification performance improvements resulting from our sampling treatments, but our extensive data preparation process might have also helped, as some other studies contained data preparation mistakes and their preparation steps were generally not well specified.

The remaining sections of this paper are organized as follows. The “Related work” section surveys existing literature for web attacks with CSE-CIC-IDS2018 data. In the “Data preparation” section, we describe how the datasets used in our experiments were cleaned and prepared. Then, the “Methodologies” section describes the classifiers, performance metrics, and sampling techniques applied in our experiments. The “Results and discussion” section provides our results and statistical analysis. Finally, the “Conclusion” section concludes the work presented in this paper.

Related work

None of the prior four studies [20,21,22,23] for web attacks with CSE-CIC-IDS2018 provided any results for class imbalance analysis, and none applied sampling techniques to explore class imbalance issues for web attacks in CSE-CIC-IDS2018. None of these four studies combine the full normal traffic (all days) from CSE-CIC-IDS2018 with the individual web attacks for analysis; instead, they only use a single day of normal traffic when considering web attacks.

By combining all the normal traffic with the three individual web attacks, we can experiment with big data challenges as well as more severe levels of class imbalance which has not previously been done. Additionally, our data preparation framework allows us to isolate the three individual web attacks from all other attack traffic to research class imbalance with binary classification. Plus, this allows us to explore class rarity which has not previously been done with CSE-CIC-IDS2018.

Three of these four studies [20,21,22] utilized multi-class classification for the “Web” attacks, resulting in extremely poor classification performance for each of the three individual web attack labels (“Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”). In many cases, not even one instance could be correctly classified for an individual web attack. However, classification results for the aggregated web attacks in [23] are extremely high.

This performance discrepancy in the literature between the three individual web attacks and those same web attacks combined (aggregated) motivated us to conduct this study. We were surprised to find our results to be so much better than the three other studies [20,21,22] analyzing these same three individual web attacks through multi-class classification. Our random undersampling approach clearly helped, although some of our classifiers still fared much better even when no sampling was applied, which was likely due to our rigorous data preparation approach.

With the CSE-CIC-IDS2018 dataset, Basnet et al. [20] benchmark different deep learning frameworks: Keras-Tensorflow, Keras-Theano, and fast.ai, using 10-fold cross validation. However, full results are only produced for fast.ai, which is likely due to the computational constraints they frequently mention (where in some cases it took weeks to produce results). They achieve 99.9% accuracy for the aggregated web attacks with binary classification. However, the multi-class classification for those same three individual web attacks tells a completely different story: 53 of 121 “Brute Force-Web” classified correctly, 17 of 45 “Brute Force-XSS” classified correctly, and 0 of 16 “SQL Injection” classified correctly.

Basnet et al. only provide classification results in terms of the Accuracy metric and confusion matrices (where only accuracy is provided for the aggregated web attacks). Their 99.9% accuracy scores for the aggregated web attacks can be deceptive when dealing with such high levels of class imbalance, as such a high accuracy can still be attained even with zero instances from the positive class correctly classified. When dealing with high levels of class imbalance, performance metrics which are more sensitive to class imbalance should be utilized. For web attacks, only two separate days of traffic from CSE-CIC-IDS2018 are evaluated with imbalance levels of 2,880:1 (binary) and 30,665:7.32:2.32:1 (multi-class) for 1 day and 1,842:1 (binary) and 19,666:6.83:2.85:1 (multi-class) for the other day. Such high imbalance levels require metrics more sensitive to class imbalance. Also, perhaps better classification performance might have been achieved by properly treating the class imbalance problem.
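A small synthetic sketch (toy data, not CSE-CIC-IDS2018) illustrates why accuracy alone is deceptive at roughly this 2,880:1 imbalance level: a model that never flags an attack still scores nearly perfect accuracy, while AUC exposes the lack of discrimination.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.zeros(288_100, dtype=int)
y_true[:100] = 1                        # 100 attacks among 288,000 normal flows (~2,880:1)

y_pred = np.zeros_like(y_true)          # "classifier" that never predicts an attack
y_score = rng.random(len(y_true))       # uninformative scores for the AUC comparison

print(accuracy_score(y_true, y_pred))   # ~0.9997 despite catching zero attacks
print(roc_auc_score(y_true, y_score))   # ~0.5, i.e. no better than random guessing
```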

Basnet et al. use seven of the 10 days from CSE-CIC-IDS2018, and drop approximately 20,000 samples that contained “Infinity”, “NaN”, or missing values. Destination_Port and Protocol fields are treated as categorical, and the rest of the features as numeric. They state their cleaned datasets contain 79 features, which would include 8 fields containing all zero values; these all-zero fields should instead have been filtered out. Similarly, none of the other studies cited here state whether those 8 fields were filtered out, although in most cases it appears they were not.

Atefinia and Ahmadi [21] propose a new “modular deep neural network model” and test it with CSE-CIC-IDS2018 data. Web attacks perform very poorly in their model with multi-class classification results of: 56 of 122 “Brute Force-Web” classified correctly, 0 of 46 “Brute Force-XSS” classified correctly, and 0 of 18 “SQL Injection” classified correctly. For two of the three web attacks, their model does not correctly classify even one instance of the test data. They only produce results with their one custom learner, and so benchmarking their approach is not easy.

Experimental specifications from Atefinia and Ahmadi are not clear. They state they use 2 days of web attack data from CSE-CIC-IDS2018, and that “the train and test dataset are generated using 20:80 Stratified sampling of each subset”. But even if we infer the test dataset to be 20% of the total, we still do not know how many instances they dropped during their preprocessing steps and for what reasons. Also, the class labels from the confusion matrix in their Fig. 10 do not match what they state for their legend: “for Web attacks, classes 1, 2, 3, and 4 represent Benign, Brute Force-Web, Brute Force-XSS and SQL Injection” (where “class 4” would result in the “SQL Injection” class having 416,980 instances, while the entire CSE-CIC-IDS2018 dataset only contains 87 instances with the “SQL Injection” label). Vague experimental specifications are a serious deficiency in the CSE-CIC-IDS2018 literature in general, and they make reproducing these experiments a problem.

The work of Atefinia and Ahmadi is unique compared to the other three CSE-CIC-IDS2018 studies considering web attacks in that Atefinia and Ahmadi combine the two web attack days together with the attack and normal traffic for only those 2 days, whereas the other three studies consider each of these 2 days separately for the web attack data (days: Thursday 02/22/2018 and Friday 02/23/2018). The classification results with their new model are very poor for the web attacks, and they do not explore treating the class imbalance problem.

Unfortunately, Atefinia and Ahmadi do not provide any preprocessing details for how they cleaned and prepared the data, other than stating they properly scaled the features and that “the rows with missing values and the columns with too much missing values are also dropped”. This statement is very ambiguous, especially since they could have easily listed the dropped columns; this is an important omission. They also state they remove IP addresses, but CSE-CIC-IDS2018 does not contain IP addresses in 9 of the 10 downloaded .csv files. Moreover, the entire CSE-CIC-IDS2018 dataset contains very few missing values (only a total of 59 rows have missing values, which is mainly due to repeated header lines). They do not state how they handle “Infinity” and “NaN” values.

Li et al. [22] create an unsupervised Auto-Encoder Intrusion Detection System (AE-IDS), which is based on an anomaly detection approach utilizing 85% of the normal instances as the training dataset with the testing dataset consisting of the remaining 15% of the normal instances plus all the attack instances. They only analyze 1 day of the available 2 days of “Web” attack traffic from CSE-CIC-IDS2018, and they evaluate the three different web attacks separately (versus aggregating the “Web” category together). The three individual web attacks perform very poorly with AE-IDS and multi-class classification results of: 147 of 362 “Brute Force-Web” classified correctly, 26 of 151 “Brute Force-XSS” classified correctly, and 6 of 53 “SQL Injection” classified correctly. Overall, less than half of the web attacks are classified correctly for each of the three different web attacks.

The confusion matrices provided by Li et al. contain major errors. When inspecting the confusion matrix from their Table 5 for “SQL Injection” (the class with the least number of instances) for their AE-IDS, we can see 6 True Positive instances but an impossible count of 1,689 False Negative instances for SQL Injection. The entire CSE-CIC-IDS2018 dataset only contains 87 instances for the SQL Injection class, far fewer than the 1,689 False Negative instances they report. It seems their “Actual” and “Predicted” axes for their confusion matrices should be reversed, which would instead yield 47 False Negative instances for that SQL Injection example. All their confusion matrices have this problem where the “Actual” and “Predicted” axes appear to be swapped relative to what they reported in their results.

A major component of their experiment includes dividing the CSE-CIC-IDS2018 dataset into different sparse and dense matrices for separate evaluation. However, this sparse and dense matrix experimental factor introduces serious ambiguity in the results. Their different results for each of these matrix approaches might simply stem from partitioning the dataset into different datasets based upon different values of the data (they partition the dataset into a “sparse matrix dataset” when the “value of totlen FWD PKTS and totlen BWD PKTS is very small”). A better approach may have been to randomly partition the dataset into sparse and dense matrices, so that the underlying differences in the data values themselves were not responsible for the different results from the two matrix approaches.

The AE-IDS approach of Li et al. was only compared to one other learner called “KitNet”, where their AE-IDS results provided a better score for Recall. Recall is the metric they decided to use to compare all experiments. However, Precision should also be considered when comparing results with Recall. When dealing with such high levels of class imbalance such as with these web attacks, it is important to use metrics which are more sensitive to class imbalance.

Li et al. did provide AUC scores, but only for the more prominent portions of their experiments where the data was partitioned separately into sparse and dense matrices based upon certain field values. Unfortunately, as mentioned earlier, the different results for these different matrix approaches might be purely due to the fact that very different data values are being fed into these different matrix encoding approaches. Additionally, for their sparse matrix approaches, they never stated whether they were rounding down the “very small” values to zero which would be an additional concern to consider. They also assert their approach helps with class imbalance, but they do not provide any results or statistical validation to substantiate their brief commentary regarding class imbalance treatments.

Li et al. replace “NaN” and “Infinity” values with zero, but based upon our manual inspection of the data, these imputed values should instead be very high. They mention no other data preparation steps other than normalizing the data and further splitting the dataset into sparse matrices and dense matrices.

D’hooge et al. [23] evaluate each day of the CSE-CIC-IDS2018 dataset separately for binary classification with 12 different learners and stratified 5-fold cross validation. The F1 and AUC scores for the two different days with “Web” categories are generally very high, with some perfect F1 and AUC scores achieved with XGBoost. Other learners varied between 0.9 and 1.0 for both F1 and AUC scores, with the first day of “Web” usually having better performance than the second day of “Web”. The three other studies we evaluated all used multi-class classification for these same web attacks, but they all had extremely poor classification performance (many times with zero attack instances classified correctly).

D’hooge et al. state overfitting might have been a problem for CIC-IDS2017 in this same study, and “further analysis is required to be more conclusive about this finding”. Given such extremely high classification scores, overfitting may have been a problem in their CSE-CIC-IDS2018 results as well (for example in their source code, we noticed the max_depth hyperparameter set to a value of 35 for Decision Tree and Random Forest learners).

In addition, their model validation approach is not clear. They state they utilize two-thirds of each day’s data with stratified 5-fold cross validation for hyperparameter tuning. And then, they utilize “single execution testing”. However, it is not clear how this single execution testing was performed and whether there is indeed a “gold standard” holdout test set.

D’hooge et al. replace “Infinity” values with “NaN” values in CSE-CIC-IDS2018, but “NaN” should not be used to replace other values. In the case of these “Infinity” values for CSE-CIC-IDS2018, imputed values should be very high, based upon manual inspection of the “Flow Bytes/s” and “Flow Packets/s” features. An even better alternative is to simply filter out those instances containing the “Infinity” values, as they comprise less than 1% of the data and very few attack instances are lost. The authors made no other mention of any other data preparations with CSE-CIC-IDS2018.

In summary, these enormous discrepancies in classification performance between aggregated web attacks and the three individual web attacks from CSE-CIC-IDS2018 motivated us to further explore and explain these differences. Additionally, we investigate severe class imbalance and rarity for the three individual web attacks in CSE-CIC-IDS2018 which has not previously been done.

Data preparation

In this section, we describe how we prepared and cleaned the dataset files used in our experiments. Properly documenting these steps is important in being able to reproduce experiments.

We dropped the “Protocol” and “Timestamp” fields from CSE-CIC-IDS2018 during our preprocessing steps. The “Protocol” field is somewhat redundant, as the “Dst Port” (Destination_Port) field mostly contains equivalent “Protocol” values for each Destination_Port value. Additionally, we dropped the “Timestamp” field as we did not want the learners to discriminate attack predictions based on time, especially with more stealthy attacks in mind. In other words, the learners should be able to discriminate attacks regardless of whether the attacks are high volume or slow and stealthy. Dropping the “Timestamp” field also allows us the convenience of combining or dividing the datasets in ways more compatible with our experimental frameworks. Additionally, a total of 59 records were dropped from CSE-CIC-IDS2018 due to header rows being repeated in certain days of the datasets. These duplicates were easily found and removed by filtering records based on a white list of valid label values.

The fourth downloaded file named “Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv” was different than the other nine files from CSE-CIC-IDS2018. This file contained four extra columns: “Flow ID”, “Src IP”, “Src Port”, and “Dst IP”. We dropped these four additional fields. Also of note is that this one particular file contained nearly half of all the records for CSE-CIC-IDS2018. This fourth file contained 7,948,748 records of the dataset’s total 16,232,943 records.
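A minimal pandas sketch of these first preprocessing steps follows. The directory path and label strings (including the benign label, assumed here to be “Benign”) are our own illustrative assumptions, and restricting the whitelist to the labels used in this study folds in the later step of discarding the other attack types; the actual whitelist would include every valid label in the dataset.

```python
import glob
import pandas as pd

DROP_COLS = ["Timestamp", "Protocol"]
EXTRA_COLS = ["Flow ID", "Src IP", "Src Port", "Dst IP"]      # present only in the fourth file
# Repeated header rows carry the literal string "Label" in the Label column,
# so filtering on a whitelist of valid label values removes them.
VALID_LABELS = {"Benign", "Brute Force-Web", "Brute Force-XSS", "SQL Injection"}

frames = []
for path in glob.glob("CSE-CIC-IDS2018/*.csv"):               # assumed download location
    df = pd.read_csv(path, low_memory=False)
    df = df.drop(columns=DROP_COLS + EXTRA_COLS, errors="ignore")
    df = df[df["Label"].isin(VALID_LABELS)]                   # drops repeated header rows
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```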

Certain fields contained negative values which did not make sense and so we dropped those instances with negative values for the “Fwd_Header_Length”, “Flow_Duration”, and “Flow_IAT_Min” fields (with a total of 15 records dropped from CSE-CIC-IDS2018 for these fields containing negative values). Negative values in these fields were causing extreme values that can skew classifiers which are sensitive to outliers.

Eight fields contained constant values of zero for every instance. In other words, these fields did not contain any value other than zero. Before running machine learning, we filtered out the following list of fields (which all had values of zero):

  1. Bwd_PSH_Flags
  2. Bwd_URG_Flags
  3. Fwd_Avg_Bytes_Bulk
  4. Fwd_Avg_Packets_Bulk
  5. Fwd_Avg_Bulk_Rate
  6. Bwd_Avg_Bytes_Bulk
  7. Bwd_Avg_Packets_Bulk
  8. Bwd_Avg_Bulk_Rate

We also excluded the “Init_Win_bytes_forward” and “Init_Win_bytes_backward” fields because they contained negative values. These fields were excluded since about half of the total instances contained negative values for these two fields (so we would have removed a very large portion of the dataset by filtering all these instances out). Similarly, we did not use the “Flow_Duration” field, as some of its values were unreasonably low, including values of zero.

The “Flow Bytes/s” and “Flow Packets/s” fields contained some “Infinity” and “NaN” values (with less than 0.6% of the records containing these values). We dropped these instances where either “Flow Bytes/s” or “Flow Packets/s” contained “Infinity” or “NaN” values. Upon carefully and manually inspecting the entire CSE-CIC-IDS2018 dataset for such values, we found too much uncertainty as to whether these were valid records. As sorted from minimum to maximum on these fields, neighboring records were very different where “Infinity” was found. Similar to Zhang et al. [24], we did attempt to impute values for these columns by taking the maximum value of the column and adding one. In the end, we abandoned this imputation approach and dropped 95,760 records from CSE-CIC-IDS2018 containing any “Infinity” or “NaN” values.
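Continuing the sketch above, the remaining cleaning steps can be expressed as follows; the underscore-style column names mirror those quoted in the text and may differ from the raw CSV headers.

```python
import numpy as np
import pandas as pd

# `data` is the combined DataFrame from the previous sketch.
# Drop the 15 rows with nonsensical negative values in these three fields.
neg_check = ["Fwd_Header_Length", "Flow_Duration", "Flow_IAT_Min"]
data = data[(data[neg_check] >= 0).all(axis=1)]

# Drop the eight all-zero fields, the two Init_Win fields, and Flow_Duration.
zero_cols = [c for c in data.columns if c != "Label" and (data[c] == 0).all()]
data = data.drop(columns=zero_cols + ["Init_Win_bytes_forward",
                                      "Init_Win_bytes_backward",
                                      "Flow_Duration"], errors="ignore")

# Drop the rows (< 0.6% of records) whose rate features are "Infinity" or "NaN".
rate_cols = ["Flow Bytes/s", "Flow Packets/s"]
data[rate_cols] = data[rate_cols].apply(pd.to_numeric, errors="coerce")
data = data.replace([np.inf, -np.inf], np.nan).dropna(subset=rate_cols)
```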

We also excluded the Destination_Port categorical feature which contains more than 64,000 distinct categorical values. Since Destination_Port has so many values, we determined that finding an optimal encoding technique was out of scope for this study. For each of the three web attacks in Table 2, we dropped all the other attack instances and kept all the normal instances from all 10 days in Table 1 (except for those instances which we removed as indicated earlier in this section). Each of the three final datasets for our individual web attacks ended up having roughly 13 million instances as specified in Table 2.
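Finally, each of the three per-attack datasets can be assembled along these lines, again continuing from the `data` frame above (the label strings and column names remain assumptions):

```python
WEB_ATTACKS = ["Brute Force-Web", "Brute Force-XSS", "SQL Injection"]

datasets = {}
for attack in WEB_ATTACKS:
    # Keep all normal traffic plus only this attack's instances.
    subset = data[data["Label"].isin(["Benign", attack])].copy()
    subset["target"] = (subset["Label"] == attack).astype(int)   # 1 = attack, 0 = normal
    # Destination_Port ("Dst Port") is excluded, as discussed above.
    datasets[attack] = subset.drop(columns=["Label", "Dst Port", "Destination_Port"],
                                   errors="ignore")
```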

Methodologies

Classifiers

For all experiments in this study, stratified 5-fold cross validation [25] is used. Stratified [26] splitting ensures that each class is proportionately represented in every training and test fold. Splitting in a stratified manner is especially important when dealing with high levels of class imbalance, as randomness can inadvertently skew the results between folds [27]. To account for randomness, each stratified 5-fold cross validation was repeated 10 times. Therefore, all of our AUC results are the mean values from 50 measurements (5 folds x 10 repeats). All classifiers from this experiment are implemented with Scikit-learn [28] and respective Python modules.
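A minimal sketch of this evaluation loop is shown below, using synthetic stand-in data and an illustrative learner; in the experiments, X and y come from one of the three web-attack datasets and the classifiers described in the next subsection are used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic imbalanced stand-in data (~1% positives).
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = RandomForestClassifier(n_jobs=-1)              # illustrative learner/settings
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"AUC: {np.mean(aucs):.5f} (SD {np.std(aucs):.5f})")   # mean and SD over 50 folds
```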

  • Decision Tree (DT) is a learner which builds branches of a tree by splitting on features based on a cost [29]. The algorithm will attempt to select the most important features to split branches upon, and iterate through the feature space by building leaf nodes as the tree is built. The cost function utilized to evaluate splits in the branches is called the Gini impurity [30].

  • Random Forest (RF) is an ensemble of independent decision trees. Each instance is initially classified by every individual decision tree, and the instance is then finally classified by consensus among the individual trees (e.g., majority voting) [31]. Diversity among the individual decision trees can improve overall classification performance, and so bagging is introduced to each of the individual decision trees to promote diversity. Bagging (bootstrap aggregation) [32] is a technique to sample the dataset with replacement to accommodate randomness for each of the decision trees.

  • CatBoost (CB) [33] is based on gradient boosting, and is essentially another ensemble of tree-based learners. It utilizes an ordered boosting algorithm [34] to overcome prediction shifting difficulties which are common in gradient boosting. CatBoost has native built-in support for categorical features.

  • LightGBM (LGB), or Light Gradient Boosted Machine [35], is another learner based on Gradient Boosted Tree (GBTs) [36]. To optimize and avoid the need to scan every instance of a dataset when considering split points, LightGBM implements Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) algorithms [37]. LightGBM also offers native built-in support for categorical features.

  • XGBoost (XGB) is another ensemble based on GBTs. To help determine splitting points, XGBoost utilizes a Weighted Quantile Sketch algorithm [38] to improve upon where split points should occur. Additionally, XGBoost employs a sparsity-aware algorithm to help with sparse data to determine default tree directions for missing values. Categorical features are not natively supported by XGBoost, and must be encoded outside of the learner with a technique such as One Hot Encoding (OHE) [39].

  • Naive Bayes (NB) [40] is a probabilistic classifier which uses Bayes’ theorem [41] to calculate the posterior probability that an instance belongs to a certain class. The posterior probability is calculated by multiplying the prior times the likelihood over the evidence. It relies on the naive assumption that features are conditionally independent of each other given the class.

  • Logistic Regression (LR) [42] is similar to linear regression [43], and converts the output of a linear regression into a classification (categorical) value. This binary classification value is determined by applying the logistic (sigmoid) function to the output of the linear regression.

Four of these learners are ensemble learners: Random Forest, CatBoost, LightGBM, and XGBoost. These particular learners are built upon ensembles of independent Decision Tree classifiers. Ensembles have been shown to perform very well versus their non-ensemble counterparts [44], and have been popular in Kaggle competitions [45]. In this study, we will highlight any trends for the ensemble-based learners (as well as for the non-ensemble learners).

The hyper-parameters used to initialize the classifiers are indicated in Tables 3, 4, 5, and 6. The settings of these parameters were selected based on preliminary experimentation. Only the default hyper-parameters were used for LightGBM, Naive Bayes, and Logistic Regression, and so tables are not provided for these three classifiers.

Table 3 XGBoost classifier hyper-parameters
Table 4 Random Forest classifier hyper-parameters
Table 5 CatBoost classifier hyper-parameters
Table 6 Decision Tree classifier hyper-parameters
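For reference, the seven classifiers can be instantiated as in the sketch below. Library defaults are used here as placeholders; the tuned hyper-parameters actually used for XGBoost, Random Forest, CatBoost, and Decision Tree are those listed in Tables 3, 4, 5, and 6, and the choice of the Gaussian Naive Bayes variant is our assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT": DecisionTreeClassifier(),        # hyper-parameters per Table 6
    "RF": RandomForestClassifier(),        # hyper-parameters per Table 4
    "CB": CatBoostClassifier(verbose=0),   # hyper-parameters per Table 5
    "LGB": LGBMClassifier(),               # defaults, as stated above
    "XGB": XGBClassifier(),                # hyper-parameters per Table 3
    "NB": GaussianNB(),                    # defaults
    "LR": LogisticRegression(),            # defaults
}
```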

Performance metrics

Area Under the Receiver Operating Characteristic Curve (AUC) is a metric which measures the area under the Receiver Operator Characteristic (ROC) curve. AUC [46] measures the aggregate performance across all classification thresholds. The ROC curve [47, 48] is a plot of the True Positive Rate (TPR) along the y-axis versus the False Positive Rate (FPR) along the x-axis. The area under this ROC curve corresponds to a numeric value ranging between 0.0 to 1.0, where an AUC value equal to 1.0 would correspond to a perfect classification system. An AUC value equal to 0.5 would represent a classifier system which performs as well as a random guess similar to flipping a coin. The AUC metric is used to score how effective a classification system is in terms of comparing TPR to FPR over the total range of learner threshold values.
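As a toy illustration of this metric (with values chosen only to make the calculation easy to follow), the ROC curve and its area can be computed directly from labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 0, 0, 1, 1])                 # 4 normal flows, 2 attacks
y_score = np.array([0.1, 0.2, 0.35, 0.8, 0.7, 0.9])   # toy classifier scores

fpr, tpr, _ = roc_curve(y_true, y_score)              # TPR vs FPR over all thresholds
print(auc(fpr, tpr))                                  # 0.875 for this toy example
```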

Sampling techniques

Random undersampling (RUS) is a sampling technique that improves the class imbalance level toward a desired target by removing instances from the majority class(es). The removal of instances from the majority class is done without replacement, which means once an instance is removed from the majority class it is deleted and not placed back into the majority class. RUS has been shown to be an effective sampling technique as compared to other techniques in [49]. Additional studies [50,51,52] have also employed RUS to deal with class imbalance.
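A minimal sketch of RUS to a target negative-to-positive ratio is shown below (our own helper, not the authors' code); in practice the undersampling is applied only to training data, never to the test folds.

```python
import numpy as np

def random_undersample(X, y, ratio, seed=None):
    """Randomly discard majority-class (label 0) rows, without replacement,
    until roughly `ratio` negatives remain per positive. X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    return X[keep], y[keep]

# Example ratios from Table 7: 999 (999:1), 19 (95:5), 65/35 (65:35), 1 (1:1).
# X_res, y_res = random_undersample(X_train, y_train, ratio=65/35, seed=0)
```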

Table 7 indicates the eight different sampling ratios applied in this study, where “None” represents no sampling applied. When no sampling is applied, the default class ratio of normal to web attacks is 21,915:1 for Brute Force, 58,218:1 for XSS, and 153,911:1 for SQL Injection web attacks. In addition to these severe class imbalances, the XSS and SQL Injection web attacks exhibit rarity [53] with a low Positive Class Count (PCC), as indicated in Table 2. These extreme imbalance and rarity conditions frame our problem statement of whether RUS treatments can improve classification performance.

Table 7 Random undersampling (RUS) sampling ratio levels applied

Results and discussion

This section is divided into three subsections, one for each of the 3 datasets we evaluated for our three different individual web attacks from CSE-CIC-IDS2018: Brute Force, XSS, and SQL Injection from Table 2. These three subsections are ordered by increasing level of class imbalance, with the Brute Force web attacks presented first as they are only severely imbalanced. Next, the XSS results are presented with more imbalance and a slight degree of rarity. Finally, the SQL Injection results are presented last with the most severe form of class rarity, where the Positive Class Count (PCC) is very low along with extreme class imbalance.

Each of these three subsections is broken down further into three additional subsections. The first subsection for each web attack presents the results before any sampling is applied, identifying the problem of poor classification performance without any RUS class imbalance treatments. The next subsection for each web attack presents the results with RUS applied. Then, each web attack concludes with a subsection containing statistical analysis.

Results for Brute Force web attacks

Results with no sampling—Brute Force web attacks

In this section, we first present results obtained without the application of sampling techniques to Brute Force web attacks. No feature selection was applied for any of the results in this entire study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 8 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 8 Results for classification of Brute Force web attacks and no sampling applied; classifiers: RF is Random Forest, CB is CatBoost, NB is Naive Bayes, LR is Logistic Regression, DT is Decision Tree, XGB is XGBoost, LGB is LightGBM; AUC stands for Area Under the Receiver Operating Characteristic Curve; SD stands for standard deviation

One minor issue we encountered for Logistic Regression was an AUC value equal to 0.5 with a standard deviation of 0.0. Upon close inspection of the results with no RUS applied, LR was not able to correctly classify any of the positive instances. None of the other classifiers exhibited this same problem for Brute Force web attacks.

Based on the results of Table 8 (with no sampling applied), Naive Bayes is the top-performing classifier in terms of AUC for Brute Force web attacks. Logistic Regression performs the worst in terms of AUC. Overall, these classification performance scores are not very good considering an AUC score of 0.5 is equivalent to a random guess. This establishes a baseline of poor classification performance with such high class imbalance for Brute Force web attacks, and the next section explores whether applying RUS can improve upon this problem.

Results with sampling—Brute Force web attacks

Table 9 in this section provides results for each classifier with various sampling ratios applied to Brute Force web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

In general, when we visually inspect the results of the different classifiers from Table 9, the results suggest that applying RUS does indeed improve classification performance. In some cases, the improvements from applying sampling are very substantial. For example, LightGBM improves from an AUC score of 0.52347 with no sampling applied to an AUC score of 0.94182 with a 1:1 RUS ratio applied. Overall, LightGBM achieves the highest AUC score of the seven classifiers, although Random Forest is a close second with an AUC score of 0.9416 at the 1:1 RUS ratio. All four of the ensemble learners (LGB, RF, XGB, and CB) have dramatic improvements in AUC scores as increased levels of RUS are applied.

While Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 8), the classification performance of Naive Bayes does not substantially improve as more RUS is applied (unlike substantial improvements seen from applying RUS to the other learners). Future work can explore both of these phenomena with Naive Bayes. When considering high levels of sampling like the RUS 1:1 ratio, Naive Bayes performs much worse than all the ensemble classifiers and Decision Tree. Overall, Logistic Regression appears to perform the worst among all the classifiers. Upon visual inspection of all the results from Table 9, it does appear that applying RUS does substantially improve upon the problem of such severe class imbalance. In the next section, we will apply statistical analysis to validate our visual interpretations of this table.

Table 9 Results for all seven classifiers and Brute Force web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

Statistical analysis—Brute Force web attacks

We conduct a two-factor ANalysis Of VAriance (ANOVA) [54] test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for Brute Force web attacks. The results of the ANOVA test are in Table 10. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Table 10 ANOVA results for 2-factor test of classifier and sampling ratio with Brute Force web attacks, including their interaction; in terms of AUC

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s Honestly Significant Difference (HSD) [55] tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.
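The analysis can be reproduced with a sketch along the following lines; statsmodels is our assumed tooling (the paper does not state which package was used), and `results` is a long-format data frame with one row per AUC measurement and columns "classifier", "ratio", and "auc". The Tukey HSD calls report pairwise comparisons at the 99% confidence level, from which groupings such as those in Tables 11 and 12 are derived.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Two-factor ANOVA with interaction: AUC ~ classifier + ratio + classifier:ratio
model = ols("auc ~ C(classifier) + C(ratio) + C(classifier):C(ratio)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's HSD for each factor, at alpha = 0.01 (99% confidence).
print(pairwise_tukeyhsd(results["auc"], results["classifier"], alpha=0.01))
print(pairwise_tukeyhsd(results["auc"], results["ratio"], alpha=0.01))
```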

For the choice of classifier, the p-value is equal to zero from Table 10, and this indicates the choice of classifier is statistically significant for classification performance in detecting Brute Force web attacks in this experiment. Table 11 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. However, we must emphasize caution in interpreting these rankings of the classifiers, as they are ranked across all of the various sampling ratio levels in general. For example, Random Forest ranks best across all sampling ratio levels for Brute Force web attacks according to the HSD rankings, but LightGBM actually achieved the top score at the specific RUS ratio of 1:1. Still, these rankings can be useful for gaining a general sense of each classifier's robustness across the spectrum of sampling ratios (especially when the top-ranked classifier also happens to achieve the highest AUC score).

Table 11 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with Brute Force web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant, based upon the p-value being equal to zero from Table 10 for Brute Force web attacks and AUC. Table 12 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for Brute Force web attacks and AUC, and indicates a clear trend that classification performance improves as more random undersampling (RUS) is applied. This result is central to our problem statement, statistically showing that applying sampling improves AUC scores in detecting Brute Force web attacks across seven different learners in this experiment.

Table 12 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with Brute Force web attacks; groups are by performance in terms of AUC

Results for XSS web attacks

Results with no sampling—XSS web attacks

In this section, we first present results obtained without the application of sampling techniques to XSS web attacks. No feature selection was applied for any of the results in this study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 13 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 13 Results for classification of XSS web attacks and no sampling applied; abbreviations are the same as in Table 8

With no undersampling applied, Logistic Regression classification resulted in an AUC value equal to 0.5 with a standard deviation of 0.0. Essentially, with no RUS applied, LR was not able to correctly classify any of the positive instances. None of the other classifiers exhibited this same problem for XSS web attacks (and the Brute Force web attacks had this very same issue with LR and no sampling applied).

Based on the results of Table 13 (with no sampling applied), Naive Bayes is the top-performing classifier in terms of AUC for XSS web attacks. Logistic Regression performs the worst in terms of AUC. These AUC scores are not very good with such severe class imbalance and a small degree of rarity. This establishes our problem statement as to whether applying sampling can improve upon the poor classification performance. In the next section, we present results for eight different RUS ratios to see whether applying sampling can improve AUC performance in detecting XSS web attacks.

Results with sampling—XSS web attacks

Table 14 in this section provides results for each classifier with various sampling ratios applied to XSS web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

Table 14 Results for all seven classifiers and XSS web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

In general, when we visually inspect the results of the different classifiers from Table 14, the results suggest that applying RUS does indeed improve classification performance. Random Forest achieves the top AUC score of 0.9524 at a RUS ratio of 65:35. All three of the other ensemble classifiers (LGB, XGB, and CB) and Decision Tree compete closely with the top score, and they all achieve their best AUC scores at a RUS ratio of 65:35 as well.

Logistic Regression appears to perform the worst overall across all the sampling ratios for detecting XSS web attacks. Again, Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 13), but the classification performance of Naive Bayes does not improve as more RUS is applied (unlike improvements seen from applying RUS to the other learners). All the classifiers besides Naive Bayes have a substantial improvement in AUC scores as more sampling is applied (based upon visual inspection of the table).

Overall, applying more RUS does substantially improve classification performance until the 65:35 RUS ratio is reached, at which point the 1:1 RUS ratio performs similarly to, or slightly worse than, the 65:35 ratio. This is important, as applying RUS does improve AUC scores for XSS web attacks with such severe class imbalance. In the next section, we employ statistical analysis to validate the visual interpretation of our results.

Statistical analysis—XSS web attacks

We conduct a two-factor ANOVA test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for XSS web attacks. The results of the ANOVA test are in Table 15. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Table 15 ANOVA results for 2-factor test of classifier and sampling ratio with XSS web attacks, including their interaction; in terms of AUC

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s HSD tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.

For the choice of classifier, the p-value is equal to zero from Table 15 and this indicates the choice of classifier is statistically significant for classification performance of detecting XSS web attacks in this experiment. Table 16 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. Again, we must emphasize caution about interpreting these rankings of the classifiers as they are ranked across all of the various sampling ratio levels in general. Both Random Forest and Decision Tree rank the best across all sampling ratio levels for XSS web attacks according to the HSD rankings. But, LGB and XGB actually have top scores a little higher than Decision Tree for the 65:35 RUS ratio. Nonetheless, these HSD rankings can still be useful when carefully employed. For example, one could select RF for detecting XSS web attacks as it had the top score across all sampling ratios and learners and was also in the top performing HSD group of classifiers across all sampling ratios (meaning its AUC performance could generalize relatively well across different RUS sampling ratios).

Table 16 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with XSS web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant based upon the p-value being equal to zero from Table 15 for XSS web attacks and AUC. Table 17 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for XSS web attacks and AUC, and indicates a clear trend that classification performance improves as more RUS is applied until the 65:35 RUS ratio. Both the 65:35 and 1:1 RUS ratios are the top performing sampling ratios, and are not statistically different from each other in terms of AUC performance across all seven classifiers. These HSD rankings indicate statistically that applying sampling does improve AUC scores for detecting XSS web attacks across seven different learners in this experiment.

Table 17 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with XSS web attacks; groups are by performance in terms of AUC

Results for SQL injection web attacks

Results with no sampling—SQL injection web attacks

In this section, we first present results obtained without the application of sampling techniques to SQL Injection web attacks. No feature selection was applied for any of the results in this study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 18 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 18 Results for classification of SQL Injection web attacks and no sampling applied; abbreviations are the same as in Table 8

With no undersampling applied, three different classifiers have AUC scores less than or equal to 0.5, showing their difficulty in dealing with class rarity. LightGBM, XGBoost, and Logistic Regression all performed roughly as well as randomly guessing (as an AUC score of 0.5 is comparable to a random guess). With the more pronounced rarity of the SQL Injection attacks, some of these learners begin to break down with a PCC of only 85 instances.

Based on the results of Table 18 (with no sampling applied), Naive Bayes is by far the top-performing classifier, with an AUC score of 0.889 for SQL Injection web attacks. Interestingly, all of the ensemble learners perform very poorly with no sampling applied in detecting SQL Injection web attacks. Random Forest achieves the highest score of all the ensembles with a paltry AUC score of 0.65645 (which is outperformed by the simpler Decision Tree).

Results with sampling—SQL injection web attacks

Table 19 in this section provides results for each classifier with various sampling ratios applied to SQL Injection web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

Table 19 Results for all seven classifiers and SQL Injection web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

In general, when we visually inspect the results of the different classifiers from Table 19, the results suggest that applying RUS does indeed improve classification performance. LightGBM achieves the top AUC score of 0.946 at the RUS ratio of 65:35. The three other ensemble classifiers (RF, XGB, and CB) and Decision Tree perform slightly below the LGB top score, and 65:35 is also the RUS ratio at which they achieve their best AUC scores.

Logistic Regression appears to perform the worst among all the classifiers for detecting SQL Injection web attacks. Again, Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 18), but its classification performance still does not improve very much as more RUS is applied (unlike improvements seen from applying RUS to the other learners). Overall, all the classifiers except Naive Bayes have dramatic improvements in classification performance as more RUS is applied (up until the 3:1 RUS ratio). Based upon visual inspection of our results, it is clear that applying sampling does substantially improve performance for such extreme class imbalance and rarity. In the next section, we employ statistical analysis to further validate our observations.

Statistical analysis—SQL injection web attacks

We conduct a two-factor ANOVA test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for SQL Injection web attacks. The results of the ANOVA test are in Table 20. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s HSD tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

Table 20 ANOVA results for 2-factor test of classifier and sampling ratio with SQL Injection web attacks, including their interaction; in terms of AUC

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.

For the choice of classifier, the p-value is equal to zero from Table 20 and this indicates the choice of classifier is statistically significant for classification performance of detecting SQL Injection web attacks in this experiment. Table 21 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. Again, we must emphasize caution about interpreting these rankings of the classifiers as they are ranked across all of the various sampling ratio levels in general. Naive Bayes ranks the best across all sampling ratio levels for SQL Injection web attacks according to the HSD rankings. All of the classifiers except LR actually have higher scores than NB at the 65:35 or 1:1 RUS ratios. However, Naive Bayes is still surprisingly competitive against all the classifiers even at the highest 65:35 and 1:1 RUS ratios. For example, at the 1:1 RUS ratio, NB’s AUC score of 0.90454 performs better than XGB’s score of 0.899 (even though XGB performs better than NB at the 65:35 RUS ratio). LightGBM achieves the top AUC score of 0.946 at a RUS ratio of 65:35.

Table 21 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with SQL Injection web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant based upon the p-value being equal to zero from Table 20 for SQL Injection web attacks and AUC. Table 22 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for SQL Injection web attacks and AUC, and indicates a clear trend that classification performance improves as more RUS is applied until the 3:1 RUS ratio. The 1:1 RUS ratio is the top performing sampling ratio, followed by the 65:35 and then 3:1 RUS ratios in terms of AUC performance across all seven classifiers. These HSD rankings indicate statistically that applying sampling does improve AUC scores for detecting SQL Injection web attacks across seven different learners in this experiment.

Table 22 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with SQL Injection web attacks; groups are by performance in terms of AUC

Conclusion

Applying random undersampling improves classification performance for detecting web attacks in big data from the CSE-CIC-IDS2018 dataset. Based on statistical analysis, the RUS ratio is a significant factor for the AUC metric in detecting all three individual web attacks in the CSE-CIC-IDS2018 dataset: Brute Force, XSS, and SQL Injection web attacks. The top AUC scores for the seven different classifiers and three different web attacks were achieved at either the 1:1, 65:35, or 3:1 RUS ratio.

Classification performance problems with such severe class imbalance and rarity for these three web attacks were all significantly improved with the application of RUS. In general, classification performance was mostly very poor for all three web attacks until massive levels of undersampling were applied. Classification performance improvements from applying sampling were easily observable and statistically validated across all seven classifiers and eight levels of RUS ratios. Additionally, when a 1:1 RUS ratio is applied to an imbalance ratio of 153,911:1, as we had for SQL Injection web attacks in our experiment, training machine learning models becomes much more computationally efficient, which is helpful for big data challenges.
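To make the scale of that reduction concrete: the SQL Injection dataset contains roughly 13 million normal instances against a PCC of 85 (Table 2), so a 1:1 RUS ratio retains only about 85 randomly selected normal instances, shrinking the majority class by a factor of approximately 153,911 before training.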

The choice of classifier is also a significant factor in detecting the three web attacks in terms of AUC with the CSE-CIC-IDS2018 dataset. All four of the ensemble learners (Random Forest, LightGBM, XGBoost, and CatBoost) performed well at the highest RUS ratios for all three web attacks, but these ensemble learners broke down with very poor performance when challenged by the class rarity of the SQL Injection web attacks (and the ensembles generally did not perform well across all three web attacks without the help of RUS). The ensemble learners obtained the top AUC score in detecting each of the three web attacks: LGB (0.94182) for Brute Force, RF (0.9524) for XSS, and LGB (0.946) for SQL Injection web attacks.

The simplistic Decision Tree classifier was very competitive with the ensemble learners in most cases, and beat the ensemble learners when no sampling was applied (except for one case with Brute Force web attacks where RF did slightly better when no sampling was applied). Logistic Regression did not perform well overall. Our unique data preparation framework was rigorous as compared to other CSE-CIC-IDS2018 related works and was likely helpful towards classification performance, as well as providing a harsh experimental test-bed for severe class imbalance and rarity (to model cybersecurity conditions confronted in the real world).

Future work can explore Naive Bayes and its noteworthy classification performance when no sampling is applied under conditions of severe class imbalance and rarity (as well as its insensitivity to improvements when applying RUS). Other datasets could also be included for future work, as well as additional performance metrics, families of attacks, classifiers, sampling techniques, and rarity levels [56].