Introduction

Intrusion detection is the accurate identification of various attacks capable of damaging or compromising an information system. An Intrusion Detection System (IDS) can be host-based, network-based, or a combination of both. A host-based IDS is primarily concerned with the internal monitoring of a computer. Windows registry monitoring, log analysis, and file integrity checking are some of the tasks performed by a host-based IDS [1]. A network-based IDS monitors and analyzes network traffic to detect threats such as Denial-of-Service (DoS) attacks, SQL injection attacks, and password attacks [2]. The rapid growth of computer networks and network applications worldwide has encouraged an increase in cyberattacks [3]. In 2019, business news channel CNBC reported that the average cost of a cyberattack was $200,000 [4].

An IDS can also be categorized as signature-based or anomaly-based. A signature-based IDS contains patterns for known attacks and is unable to detect unknown attacks. This means that the database of a signature-based IDS must be updated continually to keep up with all known attack signatures. By contrast, an anomaly-based IDS identifies deviations from normal traffic behavior. Since various machine learning approaches can generally be applied successfully to anomaly detection, it makes intuitive sense that anomaly-based intrusion detection is a productive research area.

Datasets such as CSE-CIC-IDS2018 [5] were created to train predictive models on anomaly-based intrusion detection for network traffic. CSE-CIC-IDS2018 is not an entirely new project, but part of an existing project that produces modern, realistic datasets in a scalable manner [6]. In the next three paragraphs we trace the development of this project, from the foundational dataset (ISCXIDS2012 [7]) to CSE-CIC-IDS2018.

Created in 2012 by the Information Security Centre of Excellence (ISCX) at the University of New Brunswick (UNB) over a seven-day period, ISCXIDS2012 contains both normal and anomalous network traffic. The dataset contains several attack types (e.g., DoS, Distributed Denial-of-Service (DDoS), and brute force), but these have all been labeled simply as "attack" [8]. ISCXIDS2012 is big data, with 20 independent features and 2,450,324 instances, of which roughly 2.8% represents attack traffic. Big data is associated with specific properties, such as volume, variety, velocity, variability, value, and complexity [9]. These properties may make classification more challenging for learners trained on big data. Hereafter, "ISCXIDS2012" will be referred to as "ISCX2012" throughout the text.

In 2017, the creators of ISCX2012 and the Canadian Institute for Cybersecurity (CIC) acted on the fact that the dataset was limited to only six traffic protocols (HTTP, SMTP, SSH, IMAP, POP3, FTP). A case in point was the lack of representation of HTTPS, an important protocol accounting for about 70% of current real-world network traffic [5]. Also, the distribution of simulated attacks did not conform to reality. CICIDS2017, which contains five days of network traffic, was released to remedy the deficiencies of its predecessor. Among the many benefits of this new dataset, the high number of features (80) facilitates machine learning. CICIDS2017 contains 2,830,743 instances, with attack traffic amounting to about 19.7% of this total. The dataset has a class imbalance and a wider range of attack types than ISCX2012. Class imbalance, which is a phenomenon caused by unequal distribution between majority and minority classes, can skew results in a big data study. At a granular level, CICIDS2017 has a high class imbalance with respect to some of the individual attack types. High class imbalance is defined by a majority-to-minority ratio between 100:1 and 10,000:1 [10].

The Communications Security Establishment (CSE) joined the project, and in 2018, the latest iteration of the intrusion detection dataset, CSE-CIC-IDS2018, was released. The updated version also has a class imbalance and is structurally similar to CICIDS2017. However, CSE-CIC-IDS2018 was prepared from a much larger network of simulated client-targets and attack machines [11], resulting in a dataset that contains 16,233,002 instances gathered from 10 days of network traffic. About 17% of the instances are attack traffic. Table 1 shows the percentage distribution for the seven types of network traffic represented by CSE-CIC-IDS2018. Hereafter, "CSE-CIC-IDS2018" and "CICIDS2018" will be used interchangeably throughout the text. The dataset is distributed over ten CSV files that are downloadable from the cloud. Nine files consist of 79 independent features, and the remaining file consists of 83 independent features.

Table 1 CICIDS2018: Network traffic distribution

Our exhaustive search for relevant, peer-reviewed papers ended on September 22, 2020. To the best of our knowledge, this is the first survey to exclusively present and analyze intrusion detection research on CICIDS2018 in such detail. CICIDS2018 is the most recent intrusion detection dataset that is big data, publicly available, and covers a wide range of attack types. The contribution of our survey centers around three important findings. In general, we observed that the best performance scores for each study, where provided, were unusually high. This may be a consequence of overfitting. The second finding deals with the apparent lack of concern in most studies for the class imbalance of CICIDS2018. Finally, we note that for all works, the data cleaning of CICIDS2018 has been given little attention, a shortcoming that could hinder reproducibility of experiments. Data cleaning involves the modification, formatting, and removal of data to enhance dataset usability [12].

The remainder of this paper is organized as follows: "Research papers using CICIDS2018" section describes and analyzes the compiled works; "Discussion of surveyed works" section discusses survey findings, identifies gaps in the current research, and explains the performance metrics used in the curated works; and "Conclusion" section concludes with the main points of the paper and offers suggestions for future work.

Research papers using CICIDS2018

In this section, we examine research papers that use CICIDS2018. Works of research are presented in alphabetical order by author. All scores obtained from metrics (accuracy, recall, etc.) are the best scores in each study for binary classification [13].

Table 2 provides an alphabetical listing by author of the papers discussed in this section, along with the best respective performance score(s) for CICIDS2018. Comparisons between scores for separate works of research or separate experiments in the same paper may not be valid. This is because datasets may differ in the number of instances and features, and possibly the choice of computational framework. Furthermore, variations of an original experiment may be performed on the same dataset. However, providing these scores may be valuable for future comparative research. Table 3 provides an alphabetical listing by author of the papers discussed in this section, along with the proposed respective model(s) for CICIDS2018, and Table 4 shows the same ordered listing by author coupled with the associated computing environment(s) for CICIDS2018.

Table 2 CICIDS2018: Performance scores
Table 3 CICIDS2018: Proposed models
Table 4 CICIDS2018: Computing environment

Atefinia and Ahmadi [14] (Network intrusion detection using multi-architectural modular deep neural network)

Using an aggregator module [33] to integrate four network architectures, the authors aimed to obtain a higher precision rate than any of those produced in the related works described in their paper. The four network components included a restricted Boltzmann machine [34], a deep feed-forward neural network [35], and two Recurrent Neural Networks (RNNs) [36], specifically a Long Short-term Memory (LSTM) [37] and a Gated Recurrent Unit (GRU) [38]. The models were implemented with Python and the Scikit-learn library. Data preprocessing involved the removal of source and destination IP addresses as well as source port numbers. Labels with string values were one-hot encoded, and feature scaling was used to normalize the feature space of all attributes to a range of 0 to 1. Rows with missing values and columns with too many missing values were dropped from CICIDS2018; however, no information is provided on how many rows and columns were removed. Stratified sampling with a train-to-test ratio of 80-20 was performed for each of the four modules. Information was then fed to the aggregator module, which used a weighted averaging technique to produce the output for the modular network. The highest accuracies (100%) were obtained for the DoS, DDoS, and brute force attack types. These accuracies were associated with precision and recall scores of 100%. One drawback is the authors' comparison of results from their study with results from the related works. A better approach is to empirically evaluate at least two models, one of which would be the proposed modular network. Another shortcoming is the absence of performance scores that cover the collective attack types. In other words, scores such as precision and recall for the combination of attacks could provide additional insight. This does not detract from the usefulness of reporting precision, recall, etc. for each attack type.
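For illustration, the preprocessing steps described above can be approximated with pandas and Scikit-learn. The minimal sketch below uses a tiny stand-in DataFrame with hypothetical column names (the real CICIDS2018 columns differ) and shows column removal, one-hot encoding of string labels, min-max scaling to [0, 1], and a stratified 80-20 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in for one CICIDS2018 CSV file; real column names differ
df = pd.DataFrame({
    "Src IP": ["10.0.0.1"] * 4 + ["10.0.0.2"] * 4,
    "Src Port": [5050, 443, 80, 22, 5050, 443, 80, 22],
    "Flow Duration": [120, 3, 4500, 87, 900, 65, 13, 410],
    "Tot Fwd Pkts": [10, 2, 300, 7, 55, 4, 1, 29],
    "Label": ["Benign", "Benign", "DoS", "Benign", "DoS", "Benign", "Benign", "DoS"],
})

# Remove identifying fields, as the authors did with IP addresses and source ports
df = df.drop(columns=["Src IP", "Src Port"])

labels = df.pop("Label")          # string class labels
y = pd.get_dummies(labels)        # one-hot encode the labels

# Scale every remaining numeric attribute to the range [0, 1]
X = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Stratified 80-20 split, mirroring the ratio reported in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=labels, random_state=42)
```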

Basnet et al. [15] (Towards detecting and classifying network intrusion traffic using deep learning frameworks)

The authors experimented with various deep learning frameworks (fast.ai, Keras, PyTorch, TensorFlow, Theano) to detect network intrusion traffic and classify attack types. For preprocessing, samples with "Infinity", "NaN", or missing values were dropped and timestamps converted to Unix epoch numeric values (number of seconds since January 1, 1970). About 20,000 samples were dropped after the data cleanup process. The destination port and protocol features were treated as categorical data, and the remainder were treated as numerical data. Ten-fold cross validation with either an 80–20 or 70–30 split was used for training and testing. Both binary class and multi-class classification [39] were considered. A Multilayer Perceptron (MLP) [40] served as the only classifier. With the aid of GPU acceleration, the authors observed that fast.ai consistently outperformed the other frameworks across all the experiments, yielding an optimum accuracy of 99% for binary classification. The main limitation of this study is the use of only one classifier.
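A minimal sketch of this type of cleanup with pandas is shown below; the timestamp format and column names are assumptions for illustration, not details taken from the paper:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for CICIDS2018 flow records; real files hold millions of rows
df = pd.DataFrame({
    "Timestamp": ["02/03/2018 08:47:38", "02/03/2018 08:47:40", "02/03/2018 08:47:41"],
    "Flow Byts/s": [5321.7, float("inf"), 12.0],
    "Flow Pkts/s": [3.1, 2.2, float("nan")],
})

# Drop samples containing "Infinity", NaN, or otherwise missing values
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Convert timestamps to Unix epoch values (seconds since January 1, 1970)
df["Timestamp"] = pd.to_datetime(df["Timestamp"], dayfirst=True).astype("int64") // 10**9
print(df)
```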

Catillo et al. [16] (2L-ZED-IDS: a two-level anomaly detector for multiple attack classes)

Based on an extension of previous research with CICIDS2017, this study trained a deep autoencoder [41] on CICIDS2017 and CICIDS2018. In the preprocessing stage, the Flow_ID and Timestamp features of the datasets were not selected because they were deemed not relevant to the study. The autoencoder was implemented with Python, Keras, and TensorFlow and trained on normal and DoS attack traffic. The train to test ratio was 80–20 for both datasets. The highest accuracy for CICIDS2018 (99.2%) was obtained for the botnet attack type, corresponding to a precision of 95.0% and a recall of 98.9%. The highest accuracy (99.3%) of the entire study was obtained for CICIDS2017 (botnet attack type), coupled with a precision of 94.8% and a recall of 98.6%. One drawback of the study is the non-availability of an accuracy score for the collective attack types. Another disadvantage is the use of only one classifier.
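The paper is built on a two-level detector design; the sketch below shows only a generic, single-level deep autoencoder for anomaly detection with Keras. The layer sizes, feature count, random stand-in data, and thresholding rule are illustrative assumptions, not the authors' configuration:

```python
import numpy as np
from tensorflow.keras import layers, models

n_features = 78   # assumed number of numeric flow features after preprocessing

# A small fully connected autoencoder; layer sizes are illustrative only
autoencoder = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(n_features,)),
    layers.Dense(8, activation="relu"),       # bottleneck
    layers.Dense(32, activation="relu"),
    layers.Dense(n_features, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct "known" traffic (scaled to [0, 1]); random data stands in here
x_train = np.random.rand(1000, n_features)
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64, verbose=0)

# Flag flows whose reconstruction error exceeds a chosen threshold as anomalous
x_test = np.random.rand(10, n_features)
errors = np.mean((x_test - autoencoder.predict(x_test)) ** 2, axis=1)
threshold = np.percentile(errors, 95)
anomalies = errors > threshold
```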

Chadza et al. [17] (Contemporary sequential network attacks prediction using hidden Markov Model)

By way of MATLAB software, two conventional Hidden Markov Model (HMM) training algorithms, namely Baum-Welch [42] and Viterbi [43], were applied to CICIDS2018. An HMM is a probabilistic machine learning framework that generates states and observations. Information on data preprocessing is clearly lacking in this study. About 457,550 records (selected according to criteria set in the Snort Intrusion Detection System [44]) were chosen from CICIDS2018. From that sample of records, 70% were allocated to training and the remainder to testing. The authors found that the highest accuracy of about 97% was achieved by both the Baum-Welch and Viterbi algorithms. This paper is only three pages in length. The main shortcoming of this work is the lack of detail on the experiments performed.

Chastikova and Sotnikov [18] (Method of analyzing computer traffic based on recurrent neural networks)

This highly theoretical study, which was submitted to the Journal of Physics Conference Series, does not give any empirical results and is extremely short on details. It merely proposes an LSTM model to analyze computer network traffic using CICIDS2018. The authors note that their use of the Focal Loss function [45] (initially developed by Facebook AI Research) addresses class imbalance. The fact that no metrics were used and no computing environment was specified is a major drawback of this six-page paper.
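For reference, focal loss down-weights the contribution of easy, well-classified (typically majority-class) examples. A minimal Keras-compatible implementation is sketched below; the gamma and alpha values are common defaults, not parameters reported by the authors:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        # p_t is the predicted probability assigned to the true class
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
        alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

# Usage with a hypothetical Keras LSTM model:
# model.compile(optimizer="adam", loss=binary_focal_loss(), metrics=["accuracy"])
```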

D’hooge et al. [19] (Inter-dataset generalization strength of supervised machine learning methods for intrusion detection)

By including both CICIDS2017 and CICIDS2018, this study investigates how well the results obtained on one intrusion detection dataset generalize to another. For performance evaluation, the authors used 12 supervised learning algorithms from different families: Decision Tree (DT) [46], Random Forest (RF) [47], DT-based Bagging [48], Gradient Boosting [49], Extratree [50], Adaboost [51], XGBoost [52], k-Nearest Neighbor (k-NN) [53], Ncentroid [54], linearSVC [55], RBFSVC [56], and Logistic Regression [57]. The models were built within a Python framework with Scikit-learn and XGBoost modules. Feature scaling was used to normalize the feature space of all the attributes. The tree-based classifiers performed best, and among them, XGBoost ranked first. The top accuracy, precision, and recall scores for CICIDS2018 were 96%, 99%, and 79%, respectively. For intrusion detection, the authors concluded that a model trained on one dataset (CICIDS2017) cannot generalize to another dataset (CICIDS2018). One shortcoming of this work is the assumption that certain categorical features, such as destination port, have the same number of unique values in both datasets. To illustrate this limitation: a classifier trained on a dataset where feature 'X' takes the values {car, boat} is not expected to generalize well to a related dataset where feature 'X' takes the values {car, boat, train, plane}.
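The sketch below illustrates this effect with Scikit-learn's OneHotEncoder, using the toy feature values from the example above (not values from either dataset). Categories unseen during training encode to all-zero vectors and therefore carry no information at prediction time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data exposes only two categories for feature "X"
train = pd.DataFrame({"X": ["car", "boat", "car"]})
# The related dataset introduces categories never seen during training
test = pd.DataFrame({"X": ["car", "train", "plane"]})

enc = OneHotEncoder(handle_unknown="ignore")   # unseen categories become all-zero rows
enc.fit(train)
print(enc.transform(test).toarray())
# Columns are [boat, car]:
# [[0. 1.]
#  [0. 0.]   <- "train" carries no information for the trained model
#  [0. 0.]]  <- "plane" likewise
```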

Ferrag et al. [20] (Deep learning for cyber security intrusion detection: approaches, datasets, and comparative study)

Seven deep learning models were evaluated on CICIDS2018 and the Bot-IoT dataset [58]. The models included RNNs, deep neural networks [59], restricted Boltzmann machines, deep belief networks [60], Convolutional Neural Networks (CNNs) [61], deep Boltzmann machines [62], and deep autoencoders. The Bot-IoT dataset is a 2018 creation from the University of New South Wales (UNSW) that incorporates about 72,000,000 normal and botnet Internet of Things (IoT) instances with 46 features. The experiment was performed on Google Colaboratory using Python and TensorFlow with GPU acceleration. Only 5% of the instances were used in this study, as determined by [58]. The highest accuracy for the Bot-IoT dataset (98.39%) was obtained with a deep autoencoder, while the highest accuracy for CICIDS2018 (97.38%) was obtained with an RNN. The highest recall for the Bot-IoT dataset (97.01%) came from a CNN, whereas the highest recall for CICIDS2018 (98.18%) came from a deep autoencoder. The bulk of this paper deals with classifying 35 cyber datasets and describing the seven deep learning models. Only the last three pages discuss the actual experimentation, which is lacking in detail. This is a major shortcoming of the study.

Filho et al. [21] (Smart detection: an online approach for DoS/DDoS attack detection using machine learning)

The authors used a customized dataset and four well-known ones (CIC-DoS [63], ISCX2012, CICIDS2017, and CICIDS2018) to obtain online random samples of network traffic and classify them as DoS attacks or normal. Thirty-three features were obtained for each dataset, derived from source and destination ports, transport layer protocol, IP packet size, and TCP flags. The individual datasets were combined into one unit containing normal traffic (23,088 instances), TCP flood attacks (14,988 instances), UDP flood (6,894 instances), HTTP flood (347 instances), and HTTP slow (183 instances). For the combined dataset, the authors noted that ISCX2012 was used only to provide data traffic with normal activity behavior. Recursive Feature Elimination with Cross Validation [64] was performed with six learners (RF, DT, Logistic Regression, SGDClassifier [65], Adaboost, MLP). The learners were built with Scikit-learn. With regard to the combined dataset, Random Forest (20 features selected) had the highest accuracy among the learners. For CICIDS2018, the precision and recall for RF were both 100%. One shortcoming of this study is the use of ISCX2012 to provide normal traffic for the combined dataset. ISCX2012 is outdated, and as we previously pointed out, it is limited to only six traffic protocols.
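Recursive Feature Elimination with Cross Validation repeatedly drops the weakest features and keeps the subset that maximizes cross-validated performance. A minimal Scikit-learn sketch with a Random Forest estimator and synthetic stand-in data (33 features, mirroring the flow-derived feature count) is shown below:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 33 flow-derived features and DoS/normal labels
X, y = make_classification(n_samples=1000, n_features=33, n_informative=10, random_state=42)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    step=1,                       # remove one feature per iteration
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
selector.fit(X, y)
print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```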

Fitni and Ramli [22] (Implementation of ensemble learning and feature selection for performance improvements in anomaly-based intrusion detection systems)

Adopting an ensemble model approach, this study compared seven single learners to identify the top performers for integration into a classifier unit. The candidate learners included RF, Gaussian Naive Bayes [66], DT, Quadratic Discriminant Analysis [67], Gradient Boosting, and Logistic Regression. The models were built with Python and Scikit-learn. During preprocessing, samples with missing values and infinity were removed. Records that were actually a repetition of the header rows were also removed. The dataset was then divided into training and testing sets in an 80-20 ratio. Feature selection [68], a technique for selecting the most important features of a predictive model, was performed using Spearman's rank correlation coefficient [69] and the Chi-squared test [70], resulting in the selection of 23 features. After the evaluation of the seven learners with these features, Gradient Boosting, Logistic Regression, and DT emerged as the top performers for use in the ensemble model. Accuracy, precision, and recall scores for this model were 98.80%, 98.80%, and 97.10%, respectively, along with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.94. We believe that the performance of the ensemble model could be improved by substituting a Catboost [71], LightGBM [72], or XGBoost classifier for the Gradient Boosting classifier. The three possible substitutions are enhanced gradient boosting variants [73].
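A minimal sketch of this kind of pipeline is shown below: chi-squared feature selection feeding a voting ensemble of Gradient Boosting, Logistic Regression, and DT. The synthetic data, the choice of soft voting, and the SelectKBest step (standing in for the combined Spearman/chi-squared procedure) are our assumptions, not details from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the study kept 23 of the original features
X, y = make_classification(n_samples=2000, n_features=40, n_informative=12, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
    ],
    voting="soft",   # assumption: the paper does not state hard vs. soft voting here
)

model = Pipeline([
    ("scale", MinMaxScaler()),           # chi2 requires non-negative inputs
    ("select", SelectKBest(chi2, k=23)),
    ("clf", ensemble),
])
model.fit(X, y)
```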

Gamage and Samarabandu [23] (Deep learning methods in network intrusion detection: a survey and an objective comparison)

This work introduces a taxonomy of deep learning models for intrusion detection and summarizes relevant research papers. Four deep learning models (feed-forward neural network, autoencoder, deep belief network, LSTM) were then evaluated on the KDD Cup 1999 [74], NSL-KDD [75], CICIDS2017, and CICIDS2018 datasets. The KDD Cup 1999 dataset, which was developed by the Defense Advanced Research Projects Agency (DARPA), contains four categories of attacks, including the DoS category. Preprocessing of the datasets consisted of removing invalid flow records (missing values, strings, etc.) and feature scaling. One-hot encoding was applied to the protocol type, service, and flag features, three attributes that are categorical and non-numerical. The full datasets were split into train and test sets, with hyperparameter tuning applied. Results show that the feed-forward neural network performed well on all four datasets. For this classifier, the highest accuracy (99.58%) was obtained on the CICIDS2017 dataset. This score is associated with a precision of 99.43% and a recall of 99.45%. With respect to CICIDS2018, the highest accuracy for the feed-forward network was 98.4%, corresponding to a precision of 97.79% and a recall of 98.27%. GPU acceleration was used on some of the PCs involved in the experiments. One shortcoming of this study stems from the use of KDD Cup 1999 and NSL-KDD, both of which are outdated and have known issues. The main problem with KDD Cup 1999 is its significant number of duplicate records [75]. NSL-KDD is an improved version that does not have the issue of redundant instances, but it is far from ideal. For example, there are some attack classes without records in the test dataset of NSL-KDD [12].

Hua [24] (An efficient traffic classification scheme using embedded feature selection and LightGBM)

To tackle the class imbalance of CICIDS2018, the author incorporates an undersampling and embedded feature selection approach with a LightGBM classifier. LightGBM contains algorithms that address high numbers of instances and features in datasets. Undersampling [76] randomly removes majority class instances to adjust the class distribution. During the data cleaning stage of this study, missing values and useless features were removed from the full dataset, resulting in a modified set of 77 features. String labels were subsequently converted to integer labels, which were then one-hot encoded. Six other learners were evaluated in this research work: Support Vector Machine (SVM) [47], RF, Adaboost, MLP, CNN, and Naive Bayes [77]. Learners were implemented with Scikit-learn and TensorFlow. The train-to-test ratio was 70–30, and the XGBoost algorithm was used to perform feature selection. The LightGBM classifier had the best performance of the group, with an optimum accuracy of 98.37% when the sample size was three million and the top ten features were selected. For this accuracy, the precision and recall were 98.14% and 98.37%, respectively. LightGBM also had the second fastest training time among the classifiers. Although this study provides more information about data preprocessing than other works in our survey, it deals with data cleaning in a superficial manner.
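A minimal sketch combining random undersampling with a LightGBM classifier is shown below; the synthetic data and parameter choices are illustrative, and the sketch omits the author's XGBoost-based feature selection step:

```python
import lightgbm as lgb
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in; the study worked with a multi-million-record sample
X, y = make_classification(n_samples=50_000, n_features=77, weights=[0.95, 0.05],
                           random_state=42)

# Random undersampling of the majority class before training
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X, y)

# 70-30 train-to-test split, mirroring the ratio reported in the study
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42)

clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```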

Huancayo Ramos et al. [25] (Benchmark-based reference model for evaluating Botnet detection tools driven by traffic-flow analytics)

Five learners were evaluated on two datasets (CICIDS2018 and ISOT HTTP Botnet [78]) to determine the best botnet classifier. The ISOT HTTP Botnet dataset contains malicious and benign instances of Domain Name System (DNS) traffic. The learners in the study include RF, DT, k-NN, Naive Bayes, and SVM. Feature selection was performed using various techniques, including the feature importance method [79] of RF. After feature selection, CICIDS2018 had 19 independent attributes while ISOT HTTP had 20, with destination port number, source port number, and transport protocol among the selected features. The models were implemented with Python and Scikit-learn. Five-fold cross-validation was applied to a training set comprising 80% of the botnet instances. The remainder of the botnet instances served as the testing set. For optimization, the Grid Search algorithm [80] was used. With regard to CICIDS2018, the RF and DT learners scored an accuracy of 99.99%. Tied to this accuracy, the precision was 100% and the recall was 99.99% for both learners. The RF and DT learners also had the highest accuracy for ISOT HTTP (99.94% for RF and 99.90% for DT). One limitation of this paper is the inadequate information provided on data preprocessing.
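The feature importance method of RF ranks attributes by how much they reduce impurity across the forest. A minimal sketch that keeps the 19 highest-ranked features (the count reported for CICIDS2018) is shown below, using synthetic stand-in data and illustrative feature names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names below are illustrative only
X, y = make_classification(n_samples=5000, n_features=30, n_informative=8, random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Rank features by impurity-based importance and keep the top 19
ranking = np.argsort(rf.feature_importances_)[::-1]
top_features = [feature_names[i] for i in ranking[:19]]
print(top_features)
```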

Kanimozhi and Jacob [26] (Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing)

The authors trained a two-layer MLP, implemented with Python and Scikit-learn, to detect botnet attacks. GridSearchCV [81] performed hyper-parameter optimization, and L2 regularization [82] was used to prevent overfitting. Overfitting refers to a model that has memorized training data instead of learning to generalize it [83]. The MLP classifier was trained only on the botnet instances of CICIDS2018, with ten-fold cross validation [84] implemented. For this study the AUC was 1, which is a perfect score. Related accuracy, precision, and recall scores were all 100%. The paper is four pages long (with two references), and one major shortcoming is an obvious lack of detail. Another drawback is the use of only one classifier to evaluate performance.
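A minimal sketch of hyper-parameter optimization for an MLP with GridSearchCV is shown below; in Scikit-learn, the alpha parameter of MLPClassifier controls the strength of L2 regularization. The grid values and synthetic data are illustrative assumptions, not the authors' settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for botnet-vs-benign flow records
X, y = make_classification(n_samples=2000, n_features=78, random_state=42)

# alpha is Scikit-learn's L2 regularization strength for MLPClassifier
param_grid = {
    "hidden_layer_sizes": [(50,), (50, 50)],   # illustrative choices, not the paper's grid
    "alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```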

Kanimozhi and Jacob [27] (Calibration of various optimized machine learning classifiers in network intrusion detection system on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing)

The purpose of this study was to determine the best classifier out of six candidates (MLP, RF, k-NN, SVM, Adaboost, Naive Bayes). The models were developed with Python and Scikit-learn. A calibration curve was used, which is a graph showing the deviation of classifiers from a perfectly calibrated plot. Botnet instances of CICIDS2018 were split into training and test sets, with no information provided on the split ratio. The MLP model emerged as the top choice with an AUC of 1. Accuracy, precision, and recall scores associated with this perfect AUC score were 99.97%, 99.96%, and 100%, respectively. No information was provided on the MLP classifier, but it is most likely the same two-layer network as in [26]. The main shortcoming of this paper is the lack of detail.

Karatas et al. [28] (Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset)

Using the Synthetic Minority Oversampling Technique (SMOTE) [85] algorithm to address class imbalance, the authors evaluated the performance of six learners on CICIDS2018. The classifiers involved were k-NN, RF, Gradient Boosting, Adaboost, DT, and Linear Discriminant Analysis [86]. The learners were developed in a Python environment using Keras, TensorFlow, and Scikit-learn. According to the authors, CICIDS2018 contains about 5,000,000 samples. However, the full dataset actually contains about 16,000,000 instances, so the authors should clearly indicate that a subset was used. The dataset was preprocessed to address issues such as missing values and "Infinity." In addition, one-hot encoding was used, and rows were shuffled for randomness. Five-fold cross-validation was applied to a training set comprising 80% of the instances. The remaining instances served as the test set. After SMOTE was applied, the total dataset size increased by 17%. The Adaboost learner was shown to be the best performer, with an accuracy of 99.69%, along with precision and recall scores of 99.70% and 99.69%, respectively. In our opinion, this study should have gone into a little more detail on data cleaning. Nevertheless, among the surveyed works, this paper has done the best job of covering data cleaning.
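SMOTE generates synthetic minority-class samples by interpolating between a minority instance and its nearest minority neighbors. The minimal sketch below uses imbalanced-learn with default settings, which fully balance the classes; note that the authors' configuration only grew the dataset by about 17%:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced stand-in data (roughly 100:1 majority-to-minority)
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))
```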

Kim et al. [29] (CNN-based network intrusion detection against denial-of-service attacks)

In this study, the authors trained a CNN on DoS datasets from KDD Cup 1999 and CICIDS2018. The model was implemented with Python and TensorFlow. For both datasets, the train to test ratio was 70–30. In the case of KDD, the authors used about 283,000 samples, and for CICIDS2018, about 11,000,000. Image datasets were subsequently generated, and binary and multi-class classification was performed. The authors established that for the two datasets, the accuracy was about 99% for binary classification, which corresponded to precision and recall scores of 81.75% and 82.25%, respectively. An RNN model was subsequently introduced into the study for comparative purposes. The main drawback of this work arises from the use of the KDD Cup 1999 dataset, which, as previously discussed, is an outdated dataset with a high number of redundant instances.

Li et al. [30] (Building auto-encoder intrusion detection system based on random forest feature selection)

In this online real-time detection study, unsupervised clustering and feature selection play a major role. For preprocessing, "Infinity" and "NaN" values were replaced by 0, and the data was then divided into sparse and dense matrices, which were normalized with the L2 norm. A sparse matrix has a majority of elements with value 0, while a dense matrix has a majority of elements with non-zero values. The model was built within a Python environment. The best features were selected by RF, and the train-to-test ratio was set at 85–15. The Affinity Propagation (AP) clustering [87] algorithm was subsequently used on 25% of the training dataset to group features into subsets, which were relayed to the autoencoder. Recall rates for all attack types for the proposed model were compared with those of another autoencoder model called KitNet [88]. Several attack types for both models had a recall of 100%. Only the proposed model was evaluated with the AUC metric, with several attack types yielding a score of 1. The authors showed that their model has a faster detection time than KitNet. The authors provided AUC and recall scores for each attack type of CICIDS2018 but not for the collective attack types; this is a deficiency of the study, as scores covering the collective attack types could provide additional insight. The absence of AUC values for KitNet is another shortcoming.

Lin et al. [31] (Dynamic network anomaly detection system by using deep learning techniques)

The authors investigated the use of an Attention Mechanism (AM) [89] with LSTM to improve performance. An attention mechanism imitates the focus mechanism of the human brain, extracting and representing the information most relevant to the target through an automatic weighting scheme. The model was built with TensorFlow and further optimized with Adam Gradient Descent [90], a replacement algorithm for Stochastic Gradient Descent [91]. Seven other learners (DT, Gaussian Naive Bayes, RF, k-NN, SVM, MLP, LSTM without AM) were also evaluated. Preprocessing of a CICIDS2018 subset (about 50% of the original size) involved removing the timestamp and IP address features. The dataset was then divided into training, test, and validation sets in the ratios of 90%, 9%, and 1%, respectively. Normal dataset traffic was randomly undersampled to obtain 2,000,000 records, while Web and infiltration attacks were oversampled with SMOTE to address class imbalance. The LSTM model with AM outperformed the other learners, with an accuracy of 96.2% and a precision and recall of 96%. The contribution of this useful study is limited by the inadequate information provided on data cleaning. Another shortcoming is the omission of the oversampling rate for SMOTE.

Zhao et al. [32] (A semi-self-taught network intrusion detection system)

The authors used a denoising autoencoder [92] with a heuristic method of class separation based on the fuzzy c-means algorithm [93]. This approach was adopted to remove samples with problems such as missing values and redundant data. However, it is ineffective against class noise. Class noise is caused either by different class labels for duplicate instances or by misclassified instances [94]. The autoencoder was developed using Python and TensorFlow. Training, validation, and test sets comprised 70%, 15%, and 15% of the data, respectively. The highest accuracy obtained was 97.9%, accompanied by a score of 98.0% for both precision and recall. One limitation of this study is a lack of detail about the experiments. Another limitation is the use of only one learner.

Discussion of surveyed works

In general, the best performance scores are unusually high for studies where scores are provided. This finding is notable. Accuracy scores range between 96% (D'hooge et al., 2020) and 100% (Atefinia & Ahmadi, 2020; Kanimozhi & Jacob, 2019a). Several papers report recall scores of 100% (Atefinia & Ahmadi, 2020; Kanimozhi & Jacob, 2019a; Kanimozhi & Jacob, 2019b; Li et al., 2020; Filho et al., 2019) and also precision scores of 100% (Atefinia & Ahmadi, 2020; Kanimozhi & Jacob, 2019a; Huancayo Ramos et al., 2020; Filho et al., 2019). In addition, three studies report a perfect AUC score (Kanimozhi & Jacob, 2019a; Kanimozhi & Jacob, 2019b; Li et al., 2020). These noticeably high scores for the various metrics may be due to overfitting.

Surprisingly, use of the accuracy metric is prevalent throughout the surveyed works, while the AUC metric has been used in only four studies (Fitni & Ramli, 2020; Kanimozhi & Jacob, 2019a; Kanimozhi & Jacob, 2019b; Li et al., 2020). This observation relates to the class imbalance of CICIDS2018. The high imbalance makes identification of the minority class more burdensome for learners, especially in the case of big data, and tends to introduce a bias in favor of the majority class. Hence, the use of accuracy alone may not be beneficial, since a deceptively high score can be obtained when the influence of the minority class is greatly reduced. It is always better to provide accuracy along with other metrics, such as precision and recall, and in all fairness, most of the works have done this. We point out that the use of AUC as a robust, standalone metric for class imbalance has been demonstrated in several studies [95,96,97]. Please see the "Performance metrics" section for an explanation of the various metrics provided.

As mentioned previously, the CICIDS2018 dataset has a class imbalance. The effects of this imbalance can be mitigated by techniques at the data level (e.g. random undersampling, feature selection) and algorithm level (e.g. cost-sensitive classification, ensemble techniques) [9]. We make the important observation that less than half of the curated papers discuss techniques for addressing the high imbalance of CICIDS2018. Hua, 2020, for example, has highlighted the use of embedded feature selection and undersampling with a LightGBM classifier.

None of the papers satisfactorily discuss the data cleaning of CICIDS2018. This is a significant revelation. About 60% of data scientists believe that no task is more time-consuming than data cleaning [12]. A discussion of data cleaning in a research paper should provide detailed information on all rows and columns of a dataset that have been dropped or modified, along with a rationale for these actions. Insufficient information on data cleaning in a study can make duplication of an experiment problematic for outside researchers. Data cleaning is a subset of data preprocessing, a task that makes a dataset more usable. It is important to note that data preprocessing should be performed on a dataset such as CICIDS2018 before learners are trained, as failure to do so could lead to inaccurate analytics.

Another important consideration pertains to the use of outdated datasets, such as KDD Cup 1999, NSL-KDD, and ISCX2012, alongside CICIDS2018 in a study. For some or all of the attack traffic in these older datasets, patches have long since been issued and software versions hardened. Of much greater concern, however, are the previously discussed issues associated with these datasets. Researchers using outdated intrusion detection datasets should thoroughly understand how these known issues could affect the outcome of experiments.

Finally, our survey shows that statistical analysis of performance scores appears to have been overlooked. Determining the statistical significance of these scores provides clarity, and there are some established techniques for doing this, such as ANalysis Of VAriance (ANOVA) [98] and Tukey’s Honestly Significant Difference (HSD) [99]. ANOVA reveals whether the means of one or more independent factors are significant. Tukey’s HSD ascribes group letters to means that are significantly different from each other.

Gaps in current research

Significant gaps exist in intrusion detection research with CICIDS2018. Topics such as big data processing frameworks, concept drift, and transfer learning are missing from the literature. We explain further in the following paragraphs.

There are specialized frameworks for handling the processing and analysis of big data, where computations are enhanced by the utilization of computing clusters and parallel algorithms. One example is Apache Hadoop, an open source variant of the MapReduce framework, which divides a dataset into subsets for easier processing and then recombines the partial solutions [100]. The Apache Spark framework, another example, enables faster distributed computing by using in-memory operations [101]. Apache Spark is currently one of the most popular engines for big data processing, and we encourage researchers to evaluate learner performance on CICIDS2018 with this framework.
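As a starting point, the sketch below shows how CICIDS2018's CSV files could be loaded and a distributed Random Forest trained with Spark MLlib (PySpark). The file path, label values, and column handling are assumptions for illustration rather than a tested recipe:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("cicids2018").getOrCreate()

# Hypothetical path; CICIDS2018 is distributed as CSV files, which Spark reads in parallel
df = spark.read.csv("cicids2018/*.csv", header=True, inferSchema=True).na.drop()

# Binary label: 0 for benign traffic, 1 for any attack (assumed label values)
df = df.withColumn("label", F.when(F.col("Label") == "Benign", 0.0).otherwise(1.0))

# Assemble the numeric columns into a single feature vector
numeric_cols = [c for c, t in df.dtypes if t in ("int", "double") and c != "label"]
df = VectorAssembler(inputCols=numeric_cols, outputCol="features",
                     handleInvalid="skip").transform(df)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print("Test AUC:", auc)
```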

Concept drift is the variation of data distributions over time [102]. For example, a model trained today on CICIDS2018 may have a lower optimum recall score in five or ten years when tested against an up-to-date intrusion detection dataset. As discussed previously, some of the attack instances in a modern dataset would be rendered ineffective in the future (patches, updated software, etc.) and not reflect current reality. Research examining the effect of time on intrusion detection models is a promising area.

Transfer learning attempts to boost the performance of target learners on target domains by transferring knowledge from related but different source domains [103]. The aim is to construct models with a reduced number of target data instances. Within the context of intrusion detection, Singla et al. [104] note that models are better able to identify new attacks, through transfer learning, when the training data is limited. We surmise that CICIDS2018, with its ample supply of instances, could serve as an ideal source domain.

Performance metrics

In order to explain the metrics provided in this survey, it is necessary to start with the fundamental metrics and then build on the basics. Our list of applicable performance metrics is explained as follows:

  • True Positive (TP) is the number of positive instances correctly identified as positive.

  • True Negative (TN) is the number of negative instances correctly identified as negative.

  • False Positive (FP), also known as Type I error, is the number of negative instances incorrectly identified as positive.

  • False Negative (FN), also known as Type II error, is the number of positive instances incorrectly identified as negative.

Based on these fundamental metrics, the other performance metrics are derived as follows (a short computational sketch follows the list):

  • Recall, also known as sensitivity or True Positive Rate (TPR), is equal to \(\hbox {TP} / (\hbox {TP} + \hbox {FN})\).

  • Precision, also known as positive predictive value, is equal to \(\hbox {TP} / (\hbox {TP} + \hbox {FP})\).

  • Fall-Out, also known as False Positive Rate (FPR), is equal to \(\hbox {FP} / (\hbox {FP} + \hbox {TN})\).

  • Accuracy is equal to \((\hbox {TP} + \hbox {TN}) / (\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN})\).

  • AUC provides the area under the Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR for various classification cut-offs. The behavior of a classifier is shown across all thresholds of the ROC curve. AUC is a popular metric that counters the adverse effects of class imbalance. A model with 100% correct predictions has an AUC of 1, while a model with 100% incorrect predictions has an AUC of 0.
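To make the definitions above concrete, the following sketch computes each metric from a toy set of predictions, both directly from the confusion matrix counts and with Scikit-learn's built-in functions (the example labels and scores are arbitrary):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Toy ground-truth labels, hard predictions, and predicted probabilities
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Recall   :", tp / (tp + fn), "=", recall_score(y_true, y_pred))
print("Precision:", tp / (tp + fp), "=", precision_score(y_true, y_pred))
print("Fall-out :", fp / (fp + tn))
print("Accuracy :", (tp + tn) / (tp + tn + fp + fn), "=", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```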

Conclusion

A marked increase in cyberattacks has shadowed the rapid growth of computer networks and network applications. In light of this, several intrusion detection datasets, including CICIDS2018, have been created to train predictive models. CICIDS2018 is multi-class, contains about 16,000,000 instances, and is class-imbalanced. Our search for relevant studies based on this dataset concluded on September 22, 2020.

For the most part, we observed that the best performance scores for each study, where provided, were unusually high. This may be attributable to overfitting. Furthermore, we note that only a few of the surveyed works explored treatment for the class imbalance of CICIDS2018. Class imbalance, particularly for big data, can skew the results of an experiment. As a final point, we emphasize that the attention paid to the data cleaning of CICIDS2018 failed to meet our expectations. This concern has a bearing on the reproducibility of experiments.

Several gaps have been identified in the current research. Topics such as big data processing frameworks, concept drift, and transfer learning are missing from the literature. Future work should address these gaps.