Introduction

Cybersecurity is an important consideration for the modern Internet era, with consumers spending over $600 billion on e-commerce sales during 2019 in the United States [1]. Security practitioners struggle to properly defend this increasingly important cyberspace in a constant arms race against criminals and other adversaries. When employing security analytics [2,3,4], one important aspect that defenders confront is the issue of class imbalance.

Class imbalance occurs when one class label is disproportionately represented as compared to another class label. For example, in cybersecurity it is not uncommon for a cyberattack to be lost in a sea of normal instances, similar to the proverbial “needle in a haystack”. Amit et al. [5], from Palo Alto Networks and Shodan, state that in cybersecurity “imbalance ratios of 1 to 10,000 are common.” We agree with their assessment that very high imbalance ratios are common in cybersecurity, which motivates this study’s exploration of sampling ratios for cybersecurity web attacks.

Class rarity is an extreme case of class imbalance, and rarity is not uncommon in cybersecurity, especially among more stealthy or sophisticated attacks [6]. Throughout this document, the term rarity will always refer to class rarity. Rarity occurs in machine learning when the Positive Class Count (PCC) has fewer than a few hundred instances [7], as compared to many more negative instances. For example, 10,000,000 total instances with an imbalance level of 1% from the positive class would yield a PCC of 100,000, which is typically enough positive class instances for machine learning classifiers to discriminate class patterns (this example would only be highly imbalanced, not rare). On the other hand, 1,000 total instances with that same imbalance level of 1% would only provide a PCC of 10, and this would constitute rarity, as machine learning classifiers generally struggle with so few instances from the positive class [8]. For the purposes of our experiment, we consider a PCC of fewer than 300 instances to constitute rarity.
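The arithmetic behind these two examples is straightforward; the following minimal sketch (the function name and the cutoff check are our own illustration) makes it explicit:

```python
def positive_class_count(total_instances: int, positive_fraction: float) -> int:
    """Positive Class Count (PCC) implied by a dataset size and imbalance level."""
    return int(total_instances * positive_fraction)

# 1% positives among 10,000,000 instances: highly imbalanced, but not rare.
print(positive_class_count(10_000_000, 0.01))        # 100000
# 1% positives among 1,000 instances: PCC of 10, well under our rarity cutoff of 300.
print(positive_class_count(1_000, 0.01) < 300)       # True
```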

To evaluate web attacks, we utilize the CSE-CIC-IDS2018 dataset, which was created by Sharafaldin et al. [9] at the Canadian Institute for Cybersecurity. CSE-CIC-IDS2018 is a more recent intrusion detection dataset than the popular CIC-IDS2017 dataset [10], which was also created by Sharafaldin et al. The CSE-CIC-IDS2018 dataset includes over 16 million instances, comprising normal instances as well as the following families of attacks: web attack, Denial of Service (DoS), Distributed Denial of Service (DDoS), brute force, infiltration, and botnet. For additional details on the CSE-CIC-IDS2018 dataset [11], please refer to [12].

The CSE-CIC-IDS2018 dataset is big data, as it contains over 16 million instances. While big data has not been formally defined in terms of the number of instances, one study [13] considers only 100,000 instances to be big data. Other studies [14, 15] have considered 1,000,000 instances to be big data. Since CSE-CIC-IDS2018 is more than 1,000,000 instances, we consider it to be big data as well.

For illustrative purposes, Table 1 contains the breakdown for the entire CSE-CIC-IDS2018 dataset (although the entire dataset is not used in these experiments, and this table should only be used for reference purposes). In this study, we only focus on web attacks with normal traffic and discard the other attack instances (further details of creating the datasets are provided in the “Data preparation” section below).

Table 1 Entire CSE-CIC-IDS2018 dataset by files/days (only web attacks and normal traffic are used in our experiments)

Table 2 contains the three datasets we use for the experiments in this study, where each of these three datasets is composed of web attacks from the following labels in CSE-CIC-IDS2018: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. The “Imbalance Classification” column in Table 2 indicates the varying levels of class imbalance and rarity we can explore within our experimental frameworks.

Authors of the CSE-CIC-IDS2018 dataset utilized the Damn Vulnerable Web App (DVWA) [16] and Selenium framework [17] for implementing their three web attacks. The “Brute Force-Web” label corresponds to brute force login attacks targeting web pages. Next, the “Brute Force-XSS” label refers to a cross-site scripting (XSS) attack [18] where attackers inject malicious client-side scripts into susceptible web pages targeting web users which view those pages. Finally, the “SQL Injection” label represents a code injection technique [19] where attackers craft special sequences of characters and submit them to web page forms in an attempt to directly query the back-end database of that website.

Table 2 Individual attacks used in this experiment from CSE-CIC-IDS2018

Through our data preparation process, we are able to evaluate web attacks from CSE-CIC-IDS2018 at a class ratio of normal to attack of 21,915:1 for Brute Force, 58,218:1 for XSS, and 153,911:1 for SQL Injection web attacks. Our work is unique in that existing works only evaluate class ratios as high as 2,896:1 for web attacks, and none of the existing works evaluate the effects of applying sampling techniques. The CSE-CIC-IDS2018 dataset is comprised of ten different days of files, and we combine all 10 days of normal traffic with the web attack instances. Other works only evaluate web attacks with 1 or 2 days of normal traffic. By combining all 10 days of normal traffic, we can obtain a higher imbalance ratio as well as a richer backdrop of normal data as compared to other studies. We provide further details for this in the “Related work” and “Data preparation” sections.

To evaluate the effects of class imbalance, we explore eight different levels of sampling ratios with random undersampling (RUS): no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. We also compare the following seven different classifiers in our experiments with web attacks: Decision Tree, Random Forest, CatBoost, LightGBM, XGBoost, Naive Bayes, and Logistic Regression. To quantify classification performance, we utilize the Area Under the Receiver Operating Characteristic Curve (AUC) metric.

The uniqueness of our contribution is that no current works explore the effects of various sampling ratios with the CSE-CIC-IDS2018 dataset. None of the existing works combine all the days of normal traffic from CSE-CIC-IDS2018 to analyze individual web attacks, as we have uniquely done with our data preparation process to isolate these three individual web attacks with binary classification and imbalance ratios exceeding the highest 2,896:1 ratio from existing CSE-CIC-IDS2018 literature. Our work considers severe imbalance ratios as high as 153,911:1. Additionally, no works with CSE-CIC-IDS2018 explore the effects of class rarity as we present in this study with XSS and SQL Injection web attacks and their low Positive Class Count (PCC) as outlined in Table 2.

Our work focuses exclusively on web attacks to consider the above research issues, while other related works we surveyed with web attacks from CSE-CIC-IDS2018 were more generalized studies considering all attack types (as detailed in the “Related work” section below). The few studies that did consider individual web attacks through multi-class classification had extremely poor classification results for those web attacks. Thus, we were surprised when our classification performance yielded such good results. We statistically validated classification performance improvements resulting from our sampling treatments, but our extensive data preparation process might have also helped, as some other studies contained data preparation mistakes and their preparation steps were generally not well specified.

The remaining sections of this paper are organized as follows. The “Related work” section surveys existing literature for web attacks with CSE-CIC-IDS2018 data. In the “Data preparation” section, we describe how the datasets used in our experiments were cleaned and prepared. Then, the “Methodologies” section describes the classifiers, performance metrics, and sampling techniques applied in our experiments. The “Results and discussion” section provides our results and statistical analysis. Finally, the “Conclusion” section concludes the work presented in this paper.

Related work

None of the prior four studies [20,21,22,23] for web attacks with CSE-CIC-IDS2018 provided any results for class imbalance analysis, and none applied sampling techniques to explore class imbalance issues for web attacks in CSE-CIC-IDS2018. None of these four studies combine the full normal traffic (all days) from CSE-CIC-IDS2018 with the individual web attacks for analysis; instead, they only use a single day of normal traffic when considering web attacks.

By combining all the normal traffic with the three individual web attacks, we can experiment with big data challenges as well as more severe levels of class imbalance which has not previously been done. Additionally, our data preparation framework allows us to isolate the three individual web attacks from all other attack traffic to research class imbalance with binary classification. Plus, this allows us to explore class rarity which has not previously been done with CSE-CIC-IDS2018.

Three of these four studies [20,21,22] utilized multi-class classification for the “Web” attacks, resulting in extremely poor classification performance for each of the three individual web attack labels (“Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”). In many cases, not even one instance could be correctly classified for an individual web attack. However, classification results for the aggregated web attacks in [23] are extremely high.

This performance discrepancy in the literature between the three individual web attacks and those same web attacks combined (aggregated) motivated us to conduct this study. We were surprised to find our results to be so much better than the three other studies [20,21,22] analyzing these same three individual web attacks through multi-class classification. Our random undersampling approach clearly helped, although some of our classifiers still fared much better even when no sampling was applied, which was likely due to our rigorous data preparation approach.

With the CSE-CIC-IDS2018 dataset, Basnet et al. [20] benchmark different deep learning frameworks: Keras-Tensorflow, Keras-Theano, and fast.ai, using 10-fold cross validation. However, full results are only produced for fast.ai, which is likely due to the computational constraints they frequently mention (where in some cases it took weeks to produce results). They achieve 99.9% accuracy for the aggregated web attacks with binary classification. However, the multi-class classification for those same three individual web attacks tells a completely different story: 53 of 121 “Brute Force-Web” classified correctly, 17 of 45 “Brute Force-XSS” classified correctly, and 0 of 16 “SQL Injection” classified correctly.

Basnet et al. only provide classification results in terms of the Accuracy metric and confusion matrices (where only accuracy is provided for the aggregated web attacks). Their 99.9% accuracy scores for the aggregated web attacks can be deceptive when dealing with such high levels of class imbalance, as such a high accuracy can still be attained even with zero instances from the positive class correctly classified. When dealing with high levels of class imbalance, performance metrics which are more sensitive to class imbalance should be utilized. For web attacks, only two separate days of traffic from CSE-CIC-IDS2018 are evaluated with imbalance levels of 2,880:1 (binary) and 30,665:7.32:2.32:1 (multi-class) for 1 day and 1,842:1 (binary) and 19,666:6.83:2.85:1 (multi-class) for the other day. Such high imbalance levels require metrics more sensitive to class imbalance. Also, perhaps better classification performance might have been achieved by properly treating the class imbalance problem.
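A small synthetic sketch (toy data, not CSE-CIC-IDS2018) illustrates why accuracy alone is deceptive at roughly this 2,880:1 imbalance level: a model that never flags an attack still scores nearly perfect accuracy, while AUC exposes the lack of discrimination.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.zeros(288_100, dtype=int)
y_true[:100] = 1                        # 100 attacks among 288,000 normal flows (~2,880:1)

y_pred = np.zeros_like(y_true)          # "classifier" that never predicts an attack
y_score = rng.random(len(y_true))       # uninformative scores for the AUC comparison

print(accuracy_score(y_true, y_pred))   # ~0.9997 despite catching zero attacks
print(roc_auc_score(y_true, y_score))   # ~0.5, i.e. no better than random guessing
```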

Basnet et al. use seven of the 10 days from CSE-CIC-IDS2018, and drop approximately 20,000 samples that contained “Infinity”, “NaN”, or missing values. Destination_Port and Protocol fields are treated as categorical, and the rest of the features as numeric. They state their cleaned datasets contain 79 features, which would include 8 fields containing all zero values; these all-zero fields should instead have been filtered out. Similarly, none of the other studies cited here state whether those 8 fields were filtered out, although in most cases it appears they were not.

Atefinia and Ahmadi [21] propose a new “modular deep neural network model” and test it with CSE-CIC-IDS2018 data. Web attacks perform very poorly in their model with multi-class classification results of: 56 of 122 “Brute Force-Web” classified correctly, 0 of 46 “Brute Force-XSS” classified correctly, and 0 of 18 “SQL Injection” classified correctly. For two of the three web attacks, their model does not correctly classify even one instance of the test data. They only produce results with their one custom learner, and so benchmarking their approach is not easy.

Experimental specifications from Atefinia and Ahmadi are not clear. They state they use 2 days of web attack data from CSE-CIC-IDS2018, and that “the train and test dataset are generated using 20:80 Stratified sampling of each subset”. But even if we infer the test dataset to be 20% of the total, we still do not know how many instances they dropped during their preprocessing steps and for what reasons. Also, the class labels from the confusion matrix in their Fig. 10 do not match what they state for their legend: “for Web attacks, classes 1, 2, 3, and 4 represent Benign, Brute Force-Web, Brute Force-XSS and SQL Injection” (where “class 4” would result in the “SQL Injection” class having 416,980 instances, while the entire CSE-CIC-IDS2018 dataset only contains 87 instances with the “SQL Injection” label). Vague experimental specifications are a serious deficiency in the CSE-CIC-IDS2018 literature in general, and they make reproducing these experiments a problem.

The work of Atefinia and Ahmadi is unique compared to the other three CSE-CIC-IDS2018 studies considering web attacks in that Atefinia and Ahmadi combine the two web attack days together with the attack and normal traffic for only those 2 days, whereas the other three studies consider each of these 2 days separately for the web attack data (days: Thursday 02/22/2018 and Friday 02/23/2018). The classification results with their new model are very poor for the web attacks, and they do not explore treating the class imbalance problem.

Unfortunately, Atefinia and Ahmadi do not provide any preprocessing details for how they cleaned and prepared the data, other than stating they properly scaled the features and that “the rows with missing values and the columns with too much missing values are also dropped”. This statement is very ambiguous, especially since they could have easily listed the dropped columns; this is an important omission. They also state they remove IP addresses, but CSE-CIC-IDS2018 does not contain IP addresses in 9 of the 10 downloaded .csv files. Moreover, the entire CSE-CIC-IDS2018 dataset contains very few missing values (only a total of 59 rows have missing values, which is mainly due to repeated header lines). They do not state how they handle “Infinity” and “NaN” values.

Li et al. [22] create an unsupervised Auto-Encoder Intrusion Detection System (AE-IDS), which is based on an anomaly detection approach utilizing 85% of the normal instances as the training dataset with the testing dataset consisting of the remaining 15% of the normal instances plus all the attack instances. They only analyze 1 day of the available 2 days of “Web” attack traffic from CSE-CIC-IDS2018, and they evaluate the three different web attacks separately (versus aggregating the “Web” category together). The three individual web attacks perform very poorly with AE-IDS and multi-class classification results of: 147 of 362 “Brute Force-Web” classified correctly, 26 of 151 “Brute Force-XSS” classified correctly, and 6 of 53 “SQL Injection” classified correctly. Overall, less than half of the web attacks are classified correctly for each of the three different web attacks.

The confusion matrices provided by Li et al. contain major errors. When inspecting the confusion matrix from their Table 5 for “SQL Injection” (the class with the least number of instances) for their AE-IDS, we can see 6 True Positive instances but an impossible count of 1,689 False Negative instances for SQL Injection. The entire CSE-CIC-IDS2018 dataset only contains 87 instances for the SQL Injection class, far fewer than the 1,689 False Negative instances they report. It seems their “Actual” and “Predicted” axes for their confusion matrices should be reversed, which would instead yield 47 False Negative instances for that SQL Injection example. All their confusion matrices have this problem where the “Actual” and “Predicted” axes appear to be swapped relative to what they reported in their results.

A major component of their experiment includes dividing the CSE-CIC-IDS2018 dataset into different sparse and dense matrices for separate evaluation. However, this sparse and dense matrix experimental factor introduces serious ambiguity in the results. Their different results for each of these matrix approaches might simply stem from partitioning the dataset into different datasets based upon different values of the data (they partition the dataset into a “sparse matrix dataset” when the “value of totlen FWD PKTS and totlen BWD PKTS is very small”). A better approach may have been to randomly partition the dataset into sparse and dense matrices, so that the underlying differences in the data values themselves were not responsible for the different results from the two matrix approaches.

The AE-IDS approach of Li et al. was only compared to one other learner called “KitNet”, where their AE-IDS results provided a better score for Recall. Recall is the metric they decided to use to compare all experiments. However, Precision should also be considered when comparing results with Recall. When dealing with such high levels of class imbalance such as with these web attacks, it is important to use metrics which are more sensitive to class imbalance.

Li et al. did provide AUC scores, but only for the more prominent portions of their experiments where the data was partitioned separately into sparse and dense matrices based upon certain field values. Unfortunately, as mentioned earlier, the different results for these different matrix approaches might be purely due to the fact that very different data values are being fed into these different matrix encoding approaches. Additionally, for their sparse matrix approaches, they never stated whether they were rounding down the “very small” values to zero which would be an additional concern to consider. They also assert their approach helps with class imbalance, but they do not provide any results or statistical validation to substantiate their brief commentary regarding class imbalance treatments.

Li et al. replace “NaN” and “Infinity” values with zero, but based upon our manual inspection of the data, these imputed values should instead be very high. They mention no other data preparation steps other than normalizing the data and further splitting the dataset into sparse matrices and dense matrices.

D’hooge et al. [23] evaluate each day of the CSE-CIC-IDS2018 dataset separately for binary classification with 12 different learners and stratified 5-fold cross validation. The F1 and AUC scores for the two different days with “Web” categories are generally very high, with some perfect F1 and AUC scores achieved with XGBoost. Other learners varied between 0.9 and 1.0 for both F1 and AUC scores, with the first day of “Web” usually having better performance than the second day of “Web”. The three other studies we evaluated all used multi-class classification for these same web attacks, but they all had extremely poor classification performance (many times with zero attack instances classified correctly).

D’hooge et al. state overfitting might have been a problem for CIC-IDS2017 in this same study, and “further analysis is required to be more conclusive about this finding”. Given such extremely high classification scores, overfitting may have been a problem in their CSE-CIC-IDS2018 results as well (for example in their source code, we noticed the max_depth hyperparameter set to a value of 35 for Decision Tree and Random Forest learners).

In addition, their model validation approach is not clear. They state they utilize two-thirds of each day’s data with stratified 5-fold cross validation for hyperparameter tuning. And then, they utilize “single execution testing”. However, it is not clear how this single execution testing was performed and whether there is indeed a “gold standard” holdout test set.

D’hooge et al. replace “Infinity” values with “NaN” values in CSE-CIC-IDS2018, but “NaN” should not be used to replace other values. In the case of these “Infinity” values for CSE-CIC-IDS2018, imputed values should be very high, based upon manual inspection of the “Flow Bytes/s” and “Flow Packets/s” features. An even better alternative is to simply filter out those instances containing the “Infinity” values, as they comprise less than 1% of the data and very few attack instances are lost. The authors made no other mention of any other data preparations with CSE-CIC-IDS2018.

In summary, these enormous discrepancies in classification performance between aggregated web attacks and the three individual web attacks from CSE-CIC-IDS2018 motivated us to further explore and explain these differences. Additionally, we investigate severe class imbalance and rarity for the three individual web attacks in CSE-CIC-IDS2018 which has not previously been done.

Data preparation

In this section, we describe how we prepared and cleaned the dataset files used in our experiments. Properly documenting these steps is important in being able to reproduce experiments.

We dropped the “Protocol” and “Timestamp” fields from CSE-CIC-IDS2018 during our preprocessing steps. The “Protocol” field is somewhat redundant, as the “Dst Port” (Destination_Port) field mostly contains equivalent “Protocol” values for each Destination_Port value. Additionally, we dropped the “Timestamp” field as we did not want the learners to discriminate attack predictions based on time, especially with more stealthy attacks in mind. In other words, the learners should be able to discriminate attacks regardless of whether the attacks are high volume or slow and stealthy. Dropping the “Timestamp” field also allows us the convenience of combining or dividing the datasets in ways more compatible with our experimental frameworks. Additionally, a total of 59 records were dropped from CSE-CIC-IDS2018 due to header rows being repeated in certain days of the datasets. These duplicates were easily found and removed by filtering records based on a white list of valid label values.

The fourth downloaded file named “Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv” was different than the other nine files from CSE-CIC-IDS2018. This file contained four extra columns: “Flow ID”, “Src IP”, “Src Port”, and “Dst IP”. We dropped these four additional fields. Also of note is that this one particular file contained nearly half of all the records for CSE-CIC-IDS2018. This fourth file contained 7,948,748 records of the dataset’s total 16,232,943 records.
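A minimal pandas sketch of these first preprocessing steps follows. The directory path and label strings (including the benign label, assumed here to be “Benign”) are our own illustrative assumptions, and restricting the whitelist to the labels used in this study folds in the later step of discarding the other attack types; the actual whitelist would include every valid label in the dataset.

```python
import glob
import pandas as pd

DROP_COLS = ["Timestamp", "Protocol"]
EXTRA_COLS = ["Flow ID", "Src IP", "Src Port", "Dst IP"]      # present only in the fourth file
# Repeated header rows carry the literal string "Label" in the Label column,
# so filtering on a whitelist of valid label values removes them.
VALID_LABELS = {"Benign", "Brute Force-Web", "Brute Force-XSS", "SQL Injection"}

frames = []
for path in glob.glob("CSE-CIC-IDS2018/*.csv"):               # assumed download location
    df = pd.read_csv(path, low_memory=False)
    df = df.drop(columns=DROP_COLS + EXTRA_COLS, errors="ignore")
    df = df[df["Label"].isin(VALID_LABELS)]                   # drops repeated header rows
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```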

Certain fields contained negative values which did not make sense and so we dropped those instances with negative values for the “Fwd_Header_Length”, “Flow_Duration”, and “Flow_IAT_Min” fields (with a total of 15 records dropped from CSE-CIC-IDS2018 for these fields containing negative values). Negative values in these fields were causing extreme values that can skew classifiers which are sensitive to outliers.

Eight fields contained constant values of zero for every instance. In other words, these fields did not contain any value other than zero. Before running machine learning, we filtered out the following list of fields (which all had values of zero):

  1. Bwd_PSH_Flags
  2. Bwd_URG_Flags
  3. Fwd_Avg_Bytes_Bulk
  4. Fwd_Avg_Packets_Bulk
  5. Fwd_Avg_Bulk_Rate
  6. Bwd_Avg_Bytes_Bulk
  7. Bwd_Avg_Packets_Bulk
  8. Bwd_Avg_Bulk_Rate

We also excluded the “Init_Win_bytes_forward” and “Init_Win_bytes_backward” fields because they contained negative values. These fields were excluded since about half of the total instances contained negative values for these two fields (so we would have removed a very large portion of the dataset by filtering all these instances out). Similarly, we did not use the “Flow_Duration” field, as some of its values were unreasonably low, including values of zero.

The “Flow Bytes/s” and “Flow Packets/s” fields contained some “Infinity” and “NaN” values (with less than 0.6% of the records containing these values). We dropped these instances where either “Flow Bytes/s” or “Flow Packets/s” contained “Infinity” or “NaN” values. Upon carefully and manually inspecting the entire CSE-CIC-IDS2018 dataset for such values, we found too much uncertainty as to whether these were valid records. As sorted from minimum to maximum on these fields, neighboring records were very different where “Infinity” was found. Similar to Zhang et al. [24], we did attempt to impute values for these columns by taking the maximum value of the column and adding one. In the end, we abandoned this imputation approach and dropped 95,760 records from CSE-CIC-IDS2018 containing any “Infinity” or “NaN” values.
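Continuing the sketch above, the remaining cleaning steps can be expressed as follows; the underscore-style column names mirror those quoted in the text and may differ from the raw CSV headers.

```python
import numpy as np
import pandas as pd

# `data` is the combined DataFrame from the previous sketch.
# Drop the 15 rows with nonsensical negative values in these three fields.
neg_check = ["Fwd_Header_Length", "Flow_Duration", "Flow_IAT_Min"]
data = data[(data[neg_check] >= 0).all(axis=1)]

# Drop the eight all-zero fields, the two Init_Win fields, and Flow_Duration.
zero_cols = [c for c in data.columns if c != "Label" and (data[c] == 0).all()]
data = data.drop(columns=zero_cols + ["Init_Win_bytes_forward",
                                      "Init_Win_bytes_backward",
                                      "Flow_Duration"], errors="ignore")

# Drop the rows (< 0.6% of records) whose rate features are "Infinity" or "NaN".
rate_cols = ["Flow Bytes/s", "Flow Packets/s"]
data[rate_cols] = data[rate_cols].apply(pd.to_numeric, errors="coerce")
data = data.replace([np.inf, -np.inf], np.nan).dropna(subset=rate_cols)
```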

We also excluded the Destination_Port categorical feature which contains more than 64,000 distinct categorical values. Since Destination_Port has so many values, we determined that finding an optimal encoding technique was out of scope for this study. For each of the three web attacks in Table 2, we dropped all the other attack instances and kept all the normal instances from all 10 days in Table 1 (except for those instances which we removed as indicated earlier in this section). Each of the three final datasets for our individual web attacks ended up having roughly 13 million instances as specified in Table 2.
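Finally, each of the three per-attack datasets can be assembled along these lines, again continuing from the `data` frame above (the label strings and column names remain assumptions):

```python
WEB_ATTACKS = ["Brute Force-Web", "Brute Force-XSS", "SQL Injection"]

datasets = {}
for attack in WEB_ATTACKS:
    # Keep all normal traffic plus only this attack's instances.
    subset = data[data["Label"].isin(["Benign", attack])].copy()
    subset["target"] = (subset["Label"] == attack).astype(int)   # 1 = attack, 0 = normal
    # Destination_Port ("Dst Port") is excluded, as discussed above.
    datasets[attack] = subset.drop(columns=["Label", "Dst Port", "Destination_Port"],
                                   errors="ignore")
```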

Methodologies

Classifiers

For all experiments in this study, stratified 5-fold cross validation [25] is used. Stratified [26] splitting ensures that each class is proportionately represented in every training and test fold. Splitting in a stratified manner is especially important when dealing with high levels of class imbalance, as randomness can inadvertently skew the results between folds [27]. To account for randomness, each stratified 5-fold cross validation was repeated 10 times. Therefore, all of our AUC results are the mean values from 50 measurements (5 folds x 10 repeats). All classifiers from this experiment are implemented with Scikit-learn [28] and respective Python modules.
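A minimal sketch of this evaluation loop is shown below, using synthetic stand-in data and an illustrative learner; in the experiments, X and y come from one of the three web-attack datasets and the classifiers described in the next subsection are used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic imbalanced stand-in data (~1% positives).
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = RandomForestClassifier(n_jobs=-1)              # illustrative learner/settings
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"AUC: {np.mean(aucs):.5f} (SD {np.std(aucs):.5f})")   # mean and SD over 50 folds
```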

  • Decision Tree (DT) is a learner which builds branches of a tree by splitting on features based on a cost [29]. The algorithm will attempt to select the most important features to split branches upon, and iterate through the feature space by building leaf nodes as the tree is built. The cost function utilized to evaluate splits in the branches is called the Gini impurity [30].

  • Random Forest (RF) is an ensemble of independent decision trees. Each instance is initially classified by every individual decision tree, and the instance is then finally classified by consensus among the individual trees (e.g., majority voting) [31]. Diversity among the individual decision trees can improve overall classification performance, and so bagging is introduced to each of the individual decision trees to promote diversity. Bagging (bootstrap aggregation) [32] is a technique to sample the dataset with replacement to accommodate randomness for each of the decision trees.

  • CatBoost (CB) [33] is based on gradient boosting, and is essentially another ensemble of tree-based learners. It utilizes an ordered boosting algorithm [34] to overcome prediction shifting difficulties which are common in gradient boosting. CatBoost has native built-in support for categorical features.

  • LightGBM (LGB), or Light Gradient Boosted Machine [35], is another learner based on Gradient Boosted Tree (GBTs) [36]. To optimize and avoid the need to scan every instance of a dataset when considering split points, LightGBM implements Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) algorithms [37]. LightGBM also offers native built-in support for categorical features.

  • XGBoost (XGB) is another ensemble based on GBTs. To help determine splitting points, XGBoost utilizes a Weighted Quantile Sketch algorithm [38] to improve upon where split points should occur. Additionally, XGBoost employs a sparsity-aware algorithm to help with sparse data to determine default tree directions for missing values. Categorical features are not natively supported by XGBoost, and must be encoded outside of the learner with a technique such as One Hot Encoding (OHE) [39].

  • Naive Bayes (NB) [40] is a probabilistic classifier which uses Bayes’ theorem [41] to calculate the posterior probability that an instance belongs to a certain class. The posterior probability is calculated by multiplying the prior times the likelihood over the evidence. It relies on the naive assumption that features are conditionally independent of each other given the class.

  • Logistic Regression (LR) [42] is similar to linear regression [43], and converts the output of a linear regression into a classification (categorical) value. This binary classification value is determined by applying the logistic (sigmoid) function to the output of the linear regression.

Four of these learners are ensemble learners: Random Forest, CatBoost, LightGBM, and XGBoost. These particular learners are built upon ensembles of independent Decision Tree classifiers. Ensembles have been shown to perform very well versus their non-ensemble counterparts [44], and have been popular in Kaggle competitions [45]. In this study, we will highlight any trends for the ensemble-based learners (as well as for the non-ensemble learners).

The hyper-parameters used to initialize the classifiers are indicated in Tables 3, 4, 5, and 6. The settings of these parameters were selected based on preliminary experimentation. Only the default hyper-parameters were used for LightGBM, Naive Bayes, and Logistic Regression, and so tables are not provided for these three classifiers.

Table 3 XGBoost classifier hyper-parameters
Table 4 Random Forest classifier hyper-parameters
Table 5 CatBoost classifier hyper-parameters
Table 6 Decision Tree classifier hyper-parameters
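For reference, the seven classifiers can be instantiated as in the sketch below. Library defaults are used here as placeholders; the tuned hyper-parameters actually used for XGBoost, Random Forest, CatBoost, and Decision Tree are those listed in Tables 3, 4, 5, and 6, and the choice of the Gaussian Naive Bayes variant is our assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT": DecisionTreeClassifier(),        # hyper-parameters per Table 6
    "RF": RandomForestClassifier(),        # hyper-parameters per Table 4
    "CB": CatBoostClassifier(verbose=0),   # hyper-parameters per Table 5
    "LGB": LGBMClassifier(),               # defaults, as stated above
    "XGB": XGBClassifier(),                # hyper-parameters per Table 3
    "NB": GaussianNB(),                    # defaults
    "LR": LogisticRegression(),            # defaults
}
```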

Performance metrics

Area Under the Receiver Operating Characteristic Curve (AUC) is a metric which measures the area under the Receiver Operator Characteristic (ROC) curve. AUC [46] measures the aggregate performance across all classification thresholds. The ROC curve [47, 48] is a plot of the True Positive Rate (TPR) along the y-axis versus the False Positive Rate (FPR) along the x-axis. The area under this ROC curve corresponds to a numeric value ranging between 0.0 to 1.0, where an AUC value equal to 1.0 would correspond to a perfect classification system. An AUC value equal to 0.5 would represent a classifier system which performs as well as a random guess similar to flipping a coin. The AUC metric is used to score how effective a classification system is in terms of comparing TPR to FPR over the total range of learner threshold values.
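As a toy illustration of this metric (with values chosen only to make the calculation easy to follow), the ROC curve and its area can be computed directly from labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 0, 0, 1, 1])                 # 4 normal flows, 2 attacks
y_score = np.array([0.1, 0.2, 0.35, 0.8, 0.7, 0.9])   # toy classifier scores

fpr, tpr, _ = roc_curve(y_true, y_score)              # TPR vs FPR over all thresholds
print(auc(fpr, tpr))                                  # 0.875 for this toy example
```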

Sampling techniques

Random undersampling (RUS) is a sampling technique that improves the class imbalance level toward a desired target by removing instances from the majority class(es). The removal of instances from the majority class is done without replacement, which means once an instance is removed from the majority class it is deleted and not placed back into the majority class. RUS has been shown to be an effective sampling technique as compared to other techniques in [49]. Additional studies [50,51,52] have also employed RUS to deal with class imbalance.
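A minimal sketch of RUS to a target negative-to-positive ratio is shown below (our own helper, not the authors' code); in practice the undersampling is applied only to training data, never to the test folds.

```python
import numpy as np

def random_undersample(X, y, ratio, seed=None):
    """Randomly discard majority-class (label 0) rows, without replacement,
    until roughly `ratio` negatives remain per positive. X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    return X[keep], y[keep]

# Example ratios from Table 7: 999 (999:1), 19 (95:5), 65/35 (65:35), 1 (1:1).
# X_res, y_res = random_undersample(X_train, y_train, ratio=65/35, seed=0)
```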

Table 7 indicates the eight different sampling ratios applied in this study, where “None” represents no sampling applied. When no sampling is applied, the default class ratio of normal to web attacks is 21,915:1 for Brute Force, 58,218:1 for XSS, and 153,911:1 for SQL Injection web attacks. In addition to these severe class imbalances, the XSS and SQL Injection web attacks exhibit rarity [53] with a low Positive Class Count (PCC), as indicated in Table 2. These extreme imbalance and rarity conditions frame our problem statement of whether RUS treatments can improve classification performance.

Table 7 Random undersampling (RUS) sampling ratio levels applied

Results and discussion

This section is divided into three subsections, one for each of the 3 datasets we evaluated for our three different individual web attacks from CSE-CIC-IDS2018: Brute Force, XSS, and SQL Injection from Table 2. These three subsections are ordered by increasing level of class imbalance, with the Brute Force web attacks presented first as they are only severely imbalanced. Next, the XSS results are presented with more imbalance and a slight degree of rarity. Finally, the SQL Injection results are presented last with the most severe form of class rarity, where the Positive Class Count (PCC) is very low along with extreme class imbalance.

Each of these three subsections is broken down further into three additional subsections. The first subsection for each web attack presents the results before any sampling is applied, identifying the problem of poor classification performance without any RUS class imbalance treatments. The next subsection for each web attack presents the results with RUS applied. Then, each web attack concludes with a subsection containing statistical analysis.

Results for Brute Force web attacks

Results with no sampling—Brute Force web attacks

In this section, we first present results obtained without the application of sampling techniques to Brute Force web attacks. No feature selection was applied for any of the results in this entire study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 8 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 8 Results for classification of Brute Force web attacks and no sampling applied; classifiers: RF is Random Forest, CB is CatBoost, NB is Naive Bayes, LR is Logistic Regression, DT is Decision Tree, XGB is XGBoost, LGB is LightGBM; AUC stands for Area Under the Receiver Operating Characteristic Curve; SD stands for standard deviation

One minor issue we encountered for Logistic Regression was an AUC value equal to 0.5 with a standard deviation of 0.0. Upon close inspection of the results with no RUS applied, LR was not able to correctly classify any of the positive instances. None of the other classifiers exhibited this same problem for Brute Force web attacks.

Based on the results of Table 8 (with no sampling applied), Naive Bayes is the top-performing classifier in terms of AUC for Brute Force web attacks. Logistic Regression performs the worst in terms of AUC. Overall, these classification performance scores are not very good considering an AUC score of 0.5 is equivalent to a random guess. This establishes a baseline of poor classification performance with such high class imbalance for Brute Force web attacks, and the next section explores whether applying RUS can improve upon this problem.

Results with sampling—Brute Force web attacks

Table 9 in this section provides results for each classifier with various sampling ratios applied to Brute Force web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

In general, when we visually inspect the results of the different classifiers from Table 9, the results suggest that applying RUS does indeed improve classification performance. In some cases, the improvements from applying sampling are very substantial. For example, LightGBM improves from an AUC score of 0.52347 with no sampling applied to an AUC score of 0.94182 with a 1:1 RUS ratio applied. Overall, LightGBM achieves the highest AUC score of the seven classifiers, although Random Forest is a close second with an AUC score of 0.9416 at the 1:1 RUS ratio. All four of the ensemble learners (LGB, RF, XGB, and CB) have dramatic improvements in AUC scores as increased levels of RUS are applied.

While Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 8), the classification performance of Naive Bayes does not substantially improve as more RUS is applied (unlike substantial improvements seen from applying RUS to the other learners). Future work can explore both of these phenomena with Naive Bayes. When considering high levels of sampling like the RUS 1:1 ratio, Naive Bayes performs much worse than all the ensemble classifiers and Decision Tree. Overall, Logistic Regression appears to perform the worst among all the classifiers. Upon visual inspection of all the results from Table 9, it does appear that applying RUS does substantially improve upon the problem of such severe class imbalance. In the next section, we will apply statistical analysis to validate our visual interpretations of this table.

Table 9 Results for all seven classifiers and Brute Force web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

Statistical analysis—Brute Force web attacks

We conduct a two-factor ANalysis Of VAriance (ANOVA) [54] test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for Brute Force web attacks. The results of the ANOVA test are in Table 10. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Table 10 ANOVA results for 2-factor test of classifier and sampling ratio with Brute Force web attacks, including their interaction; in terms of AUC

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s Honestly Significant Difference (HSD) [55] tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.
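The analysis can be reproduced with a sketch along the following lines; statsmodels is our assumed tooling (the paper does not state which package was used), and `results` is a long-format data frame with one row per AUC measurement and columns "classifier", "ratio", and "auc". The Tukey HSD calls report pairwise comparisons at the 99% confidence level, from which groupings such as those in Tables 11 and 12 are derived.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Two-factor ANOVA with interaction: AUC ~ classifier + ratio + classifier:ratio
model = ols("auc ~ C(classifier) + C(ratio) + C(classifier):C(ratio)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's HSD for each factor, at alpha = 0.01 (99% confidence).
print(pairwise_tukeyhsd(results["auc"], results["classifier"], alpha=0.01))
print(pairwise_tukeyhsd(results["auc"], results["ratio"], alpha=0.01))
```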

For the choice of classifier, the p-value is equal to zero from Table 10, and this indicates the choice of classifier is statistically significant for classification performance in detecting Brute Force web attacks in this experiment. Table 11 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. However, we must emphasize caution in interpreting these rankings of the classifiers, as they are ranked across all of the various sampling ratio levels in general. For example, Random Forest ranks best across all sampling ratio levels for Brute Force web attacks according to the HSD rankings, but LightGBM actually achieved the top score at the specific RUS ratio of 1:1. Still, these rankings can be useful for gaining a general sense of each classifier's robustness across the spectrum of sampling ratios (especially when the top-ranked classifier also happens to achieve the highest AUC score).

Table 11 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with Brute Force web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant, based upon the p-value being equal to zero from Table 10 for Brute Force web attacks and AUC. Table 12 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for Brute Force web attacks and AUC, and indicates a clear trend that classification performance improves as more random undersampling (RUS) is applied. This result is central to our problem statement, statistically showing that applying sampling improves AUC scores in detecting Brute Force web attacks across seven different learners in this experiment.

Table 12 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with Brute Force web attacks; groups are by performance in terms of AUC

Results for XSS web attacks

Results with no sampling—XSS web attacks

In this section, we first present results obtained without the application of sampling techniques to XSS web attacks. No feature selection was applied for any of the results in this study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 13 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 13 Results for classification of XSS web attacks and no sampling applied; abbreviations are the same as in Table 8

With no undersampling applied, Logistic Regression classification resulted in an AUC value equal to 0.5 with a standard deviation of 0.0. Essentially, with no RUS applied, LR was not able to correctly classify any of the positive instances. None of the other classifiers exhibited this same problem for XSS web attacks (and the Brute Force web attacks had this very same issue with LR and no sampling applied).

Based on the results of Table 13 (with no sampling applied), Naive Bayes is the top-performing classifier in terms of AUC for XSS web attacks. Logistic Regression performs the worst in terms of AUC. These AUC scores are not very good with such severe class imbalance and a small degree of rarity. This establishes our problem statement as to whether applying sampling can improve upon the poor classification performance. In the next section, we present results for eight different RUS ratios to see whether applying sampling can improve AUC performance in detecting XSS web attacks.

Results with sampling—XSS web attacks

Table 14 in this section provides results for each classifier with various sampling ratios applied to XSS web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

Table 14 Results for all seven classifiers and XSS web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

In general, when we visually inspect the results of the different classifiers from Table 14, the results suggest that applying RUS does indeed improve classification performance. Random Forest achieves the top AUC score of 0.9524 at a RUS ratio of 65:35. All three of the other ensemble classifiers (LGB, XGB, and CB) and Decision Tree compete closely with the top score, and they all achieve their best AUC scores at a RUS ratio of 65:35 as well.

Logistic Regression appears to perform the worst overall across all the sampling ratios for detecting XSS web attacks. Again, Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 13), but the classification performance of Naive Bayes does not improve as more RUS is applied (unlike improvements seen from applying RUS to the other learners). All the classifiers besides Naive Bayes have a substantial improvement in AUC scores as more sampling is applied (based upon visual inspection of the table).

Overall, applying more RUS does substantially improve classification performance until the 65:35 RUS ratio is reached, at which point the 1:1 RUS ratio performs similarly to, or slightly worse than, the 65:35 ratio. This is important, as applying RUS does improve AUC scores for XSS web attacks with such severe class imbalance. In the next section, we employ statistical analysis to validate the visual interpretation of our results.

Statistical analysis—XSS web attacks

We conduct a two-factor ANOVA test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for XSS web attacks. The results of the ANOVA test are in Table 15. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Table 15 ANOVA results for 2-factor test of classifier and sampling ratio with XSS web attacks, including their interaction; in terms of AUC

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s HSD tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.

For the choice of classifier, the p-value is equal to zero from Table 15 and this indicates the choice of classifier is statistically significant for classification performance of detecting XSS web attacks in this experiment. Table 16 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. Again, we must emphasize caution about interpreting these rankings of the classifiers as they are ranked across all of the various sampling ratio levels in general. Both Random Forest and Decision Tree rank the best across all sampling ratio levels for XSS web attacks according to the HSD rankings. But, LGB and XGB actually have top scores a little higher than Decision Tree for the 65:35 RUS ratio. Nonetheless, these HSD rankings can still be useful when carefully employed. For example, one could select RF for detecting XSS web attacks as it had the top score across all sampling ratios and learners and was also in the top performing HSD group of classifiers across all sampling ratios (meaning its AUC performance could generalize relatively well across different RUS sampling ratios).

Table 16 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with XSS web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant based upon the p-value being equal to zero from Table 15 for XSS web attacks and AUC. Table 17 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for XSS web attacks and AUC, and indicates a clear trend that classification performance improves as more RUS is applied until the 65:35 RUS ratio. Both the 65:35 and 1:1 RUS ratios are the top performing sampling ratios, and are not statistically different from each other in terms of AUC performance across all seven classifiers. These HSD rankings indicate statistically that applying sampling does improve AUC scores for detecting XSS web attacks across seven different learners in this experiment.

Table 17 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with XSS web attacks; groups are by performance in terms of AUC

Results for SQL injection web attacks

Results with no sampling—SQL injection web attacks

In this section, we first present results obtained without the application of sampling techniques to SQL Injection web attacks. No feature selection was applied for any of the results in this study, as we found all 66 features performed better with web attacks compared to our preliminary attempts with feature selection. Table 18 shows the results with no sampling applied. In this table, the AUC values are the mean across 50 values from each 5-fold cross validation being repeated 10 times. The “SD” prefix for AUC refers to the standard deviation across the 50 measurements previously described.

Table 18 Results for classification of SQL Injection web attacks and no sampling applied; abbreviations are the same as in Table 8

With no undersampling applied, three different classifiers have AUC scores less than or equal to 0.5, showing their difficulty in dealing with class rarity. LightGBM, XGBoost, and Logistic Regression all performed roughly as well as randomly guessing (as an AUC score of 0.5 is comparable to a random guess). With the more pronounced rarity of the SQL Injection attacks, some of these learners begin to break down with a PCC of only 85 instances.

Based on the results of Table 18 (with no sampling applied), Naive Bayes is by far the top-performing classifier, with an AUC score of 0.889 for SQL Injection web attacks. Interestingly, all of the ensemble learners perform very poorly with no sampling applied in detecting SQL Injection web attacks. Random Forest achieves the highest score of all the ensembles with a paltry AUC score of 0.65645 (which is outperformed by the simpler Decision Tree).

Results with sampling—SQL injection web attacks

Table 19 in this section provides results for each classifier with various sampling ratios applied to SQL Injection web attacks (and no feature selection is applied). The following seven sampling ratios are applied to each of the classifiers with random undersampling: 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. In addition, “no sampling” is also indicated in the tables with the value “None” from the results of the previous section. Therefore, a total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various RUS ratios: RF, LR, XGB, CB, NB, DT, and LGB. Similar to the previous section, AUC results are the mean across 50 different measurements (5-fold cross validation repeated 10 times). The “SD” prefix refers to the standard deviation across these 50 different measurements.

Table 19 Results for all seven classifiers and SQL Injection web attacks across eight sampling ratios; Sampling column reports negative to positive class ratio after applying random undersampling, or “None” for case when no undersampling is applied; abbreviations are the same as in Table 8

In general, when we visually inspect the results of the different classifiers from Table 19, the results suggest that applying RUS does indeed improve classification performance. LightGBM achieves the top AUC score of 0.946 at the RUS ratio of 65:35. The three other ensemble classifiers (RF, XGB, and CB) and Decision Tree perform slightly below the LGB top score, and 65:35 is also the RUS ratio at which they achieve their best AUC scores.

Logistic Regression appears to perform the worst among all the classifiers for detecting SQL Injection web attacks. Again, Naive Bayes performs the best among all the classifiers with no sampling applied (from Table 18), but its classification performance still does not improve very much as more RUS is applied (unlike improvements seen from applying RUS to the other learners). Overall, all the classifiers except Naive Bayes have dramatic improvements in classification performance as more RUS is applied (up until the 3:1 RUS ratio). Based upon visual inspection of our results, it is clear that applying sampling does substantially improve performance for such extreme class imbalance and rarity. In the next section, we employ statistical analysis to further validate our observations.

Statistical analysis—SQL injection web attacks

We conduct a two-factor ANOVA test to assess the impact of learner and sampling ratio (and their interaction) on performance in terms of AUC for SQL Injection web attacks. The results of the ANOVA test are in Table 20. The data for this test is the same data we summarize in the prior section, with the results of seven different classifiers across eight different sampling ratios. A confidence level of 99% is used for all tests.

Since the p-values from the ANOVA test are 0 in all cases, we conduct Tukey’s HSD tests to find the optimal values for learner and sampling ratio. We conduct a total of two HSD tests: one test to determine groupings of classifiers by performance in terms of AUC, and one test to determine groupings of sampling ratios, also by performance in terms of AUC.

Table 20 ANOVA results for 2-factor test of classifier and sampling ratio with SQL Injection web attacks, including their interaction; in terms of AUC

These ANOVA results are based on ten iterations of 5-fold cross validation for 8 sampling levels of 7 different classifiers, hence a total of \(10\times 5\times 8\times 7=2,800\) AUC measurements to analyze.

For the choice of classifier, the p-value is equal to zero from Table 20 and this indicates the choice of classifier is statistically significant for classification performance of detecting SQL Injection web attacks in this experiment. Table 21 provides Tukey’s HSD groupings of the seven different classifiers as ranked by AUC. Again, we must emphasize caution about interpreting these rankings of the classifiers as they are ranked across all of the various sampling ratio levels in general. Naive Bayes ranks the best across all sampling ratio levels for SQL Injection web attacks according to the HSD rankings. All of the classifiers except LR actually have higher scores than NB at the 65:35 or 1:1 RUS ratios. However, Naive Bayes is still surprisingly competitive against all the classifiers even at the highest 65:35 and 1:1 RUS ratios. For example, at the 1:1 RUS ratio, NB’s AUC score of 0.90454 performs better than XGB’s score of 0.899 (even though XGB performs better than NB at the 65:35 RUS ratio). LightGBM achieves the top AUC score of 0.946 at a RUS ratio of 65:35.

Table 21 HSD groupings of classifiers after 2-factor ANOVA where classifier and sampling ratio are factors with SQL Injection web attacks; groups are by performance in terms of AUC

The sampling ratio factor is also statistically significant based upon the p-value being equal to zero from Table 20 for SQL Injection web attacks and AUC. Table 22 provides Tukey’s HSD rankings for RUS ratios across all seven of the classifiers for SQL Injection web attacks and AUC, and indicates a clear trend that classification performance improves as more RUS is applied until the 3:1 RUS ratio. The 1:1 RUS ratio is the top performing sampling ratio, followed by the 65:35 and then 3:1 RUS ratios in terms of AUC performance across all seven classifiers. These HSD rankings indicate statistically that applying sampling does improve AUC scores for detecting SQL Injection web attacks across seven different learners in this experiment.

Table 22 HSD groupings of sampling ratios after 2-factor ANOVA where classifier and sampling ratio are factors with SQL Injection web attacks; groups are by performance in terms of AUC

Conclusion

Applying random undersampling improves classification performance for detecting web attacks in big data from the CSE-CIC-IDS2018 dataset. Based on statistical analysis, the RUS ratio is a significant factor for the AUC metric in detecting all three individual web attacks in the CSE-CIC-IDS2018 dataset: Brute Force, XSS, and SQL Injection web attacks. The top AUC scores for the seven different classifiers and three different web attacks were achieved at either the 1:1, 65:35, or 3:1 RUS ratio.

Classification performance problems with such severe class imbalance and rarity for these three web attacks were all significantly improved with the application of RUS. In general, classification performance was mostly very poor for all three web attacks until massive levels of undersampling were applied. Classification performance improvements from applying sampling were easily observable and statistically validated across all seven classifiers and eight levels of RUS ratios. Additionally, when a 1:1 RUS ratio is applied to an imbalance ratio of 153,911:1, as we had for SQL Injection web attacks in our experiment, training machine learning models becomes much more computationally efficient, which is helpful for big data challenges.
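To make the scale of that reduction concrete: the SQL Injection dataset contains roughly 13 million normal instances against a PCC of 85 (Table 2), so a 1:1 RUS ratio retains only about 85 randomly selected normal instances, shrinking the majority class by a factor of approximately 153,911 before training.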

The choice of classifier is also a significant factor in detecting the three web attacks in terms of AUC with the CSE-CIC-IDS2018 dataset. All four of the ensemble learners (Random Forest, LightGBM, XGBoost, and CatBoost) performed well at the highest RUS ratios for all three web attacks, but these ensemble learners broke down with very poor performance when challenged by the class rarity of the SQL Injection web attacks (and the ensembles generally did not perform well across all three web attacks without the help of RUS). The ensemble learners obtained the top AUC score in detecting each of the three web attacks: LGB (0.94182) for Brute Force, RF (0.9524) for XSS, and LGB (0.946) for SQL Injection web attacks.

The simplistic Decision Tree classifier was very competitive with the ensemble learners in most cases, and beat the ensemble learners when no sampling was applied (except for one case with Brute Force web attacks where RF did slightly better when no sampling was applied). Logistic Regression did not perform well overall. Our unique data preparation framework was rigorous as compared to other CSE-CIC-IDS2018 related works and was likely helpful towards classification performance, as well as providing a harsh experimental test-bed for severe class imbalance and rarity (to model cybersecurity conditions confronted in the real world).

Future work can explore Naive Bayes and its noteworthy classification performance when no sampling is applied under conditions of severe class imbalance and rarity (as well as its insensitivity to improvements when applying RUS). Other datasets could also be included for future work, as well as additional performance metrics, families of attacks, classifiers, sampling techniques, and rarity levels [56].