Introduction

Cybersecurity is an increasingly major concern due to the growing reliance on computers and the Internet. In order to detect cyber-attacks, it is prudent to build efficient Network Intrusion Detection Systems, and the basis for doing so is the ability to analyze network traffic flow data, termed here Cybersecurity data, efficiently and quickly. There is an inherent problem with most network traffic flow or Cybersecurity data: the data is highly imbalanced, that is, there is a disproportionately large amount of good or normal traffic and, in most cases, very few attack instances. Even existing benchmark datasets suffer from this problem. Using imbalanced data with machine learning or deep learning algorithms such as Artificial Neural Networks (ANN) is a major challenge. Moreover, many of these datasets require multi-class classification.

An ANN needs to be trained on historical data and can be seriously affected by imbalanced proportions in that data. When training data is extremely imbalanced, that is, when one class (or classes) outnumbers the other class(es) by a large proportion, the majority data (the class or classes with larger proportions) will have a stronger influence on the ANN model than the minority data (the class or classes in lesser proportions). Under these circumstances, the ANN model will recognize majority data well but perform poorly on recognizing minority data.

In most network traffic flow or Cybersecurity data, benign or normal data makes up a large proportion of the dataset, and attack data makes up only a small proportion. If this imbalanced data is used to train an ANN model, the model will perform well on recognizing the benign data and poorly on recognizing the attack data. This means that the model will recognize benign data as benign but might also recognize attack data as benign. Especially in multi-class classification, if there are small numbers of certain attack types, the minority attack data may be recognized as benign data or as majority attack data. When network traffic cybersecurity data is being used for attack detection, recognizing minority attack data correctly is more important than recognizing majority benign data correctly.

In order to improve performance on classifying imbalanced data, researchers have suggested a number of approaches including resampling, cost-sensitive kernel modification methods, and active learning methods [15]. This paper focuses on resampling strategies. The resampling techniques random undersampling (RU), random oversampling (RO), random undersampling and random oversampling together (RURO), random undersampling with the Synthetic Minority Oversampling Technique (RU-SMOTE), and random undersampling with the Adaptive Synthetic Sampling Method (RU-ADASYN) were applied to six benchmark cybersecurity datasets, KDD99 (see Footnote 1), UNSW-NB15 (see Footnote 2), UNSW-NB17-Ecobee_Thermostat (see Footnote 3), UNSW-NB17-Danmini_Doorbell (see Footnote 3), UNSW-NB17-Philips_B120N10_Baby_Monitor (see Footnote 3), and UNSW-NB18 [18], before performing classification using ANN. The classification results are evaluated using macro metrics: macro precision, macro recall and macro F1-score. The training time, which usually forms the major part of the total running time of the algorithm, was also considered. Results of regular ANN using scikit-learn on a local machine were compared to ANN in the Big Data framework using Spark's Machine Learning Library (MLlib) on an AWS EMR cluster of EC2 instances.

The uniqueness of this work can be stated as:

  • Applying new resampling technique combinations of random undersampling and random oversampling on imbalanced data. The following unique resampling combinations were used:

    • Random undersampling and random oversampling taken together (RURO).

    • Random undersampling with the random oversampling technique, SMOTE (RU-SMOTE).

    • Random undersampling with the random oversampling technique, ADASYN (RU-ADASYN).

  • Studying the behavior of the above and other resampling techniques, random undersampling and random oversampling, in the domain of Cybersecurity data, a crucial emerging domain with respect to imbalanced data.

  • Applying resampling to classification using ANN.

  • The application of all of the above on the Big Data Framework using Spark.

The rest of this paper is organized as follows. A background of the different resampling techniques is presented in the "Resampling techniques implemented" section; this is followed by a section on "Related works"; the next section provides a brief "Description of the datasets" used in this study; the "Experimental design" section presents the study design, followed by the "Evaluation metrics" and "Results and discussion" sections; and finally, the "Conclusion" is presented.

Resampling techniques implemented

To address the problem of imbalanced learning, many resampling techniques have been created. Resampling techniques include oversampling, undersampling, combinations of oversampling and undersampling, and ensemble sampling. Both oversampling and undersampling aim at changing the ratios between the majority and minority classes, and combined techniques use both to create a new, more balanced dataset. By making the training data more balanced, resampling enables the different classes to have relatively the same influence on the outcomes of the classification model. The resampling techniques used in this paper, random undersampling, random oversampling, random undersampling and random oversampling together, random undersampling with SMOTE, and random undersampling with ADASYN, are presented next.

Random undersampling refers to the process of reducing the number of samples: samples from the majority class(es) are randomly picked with or without replacement (see Footnote 4). After random undersampling, the number of cases of the majority class(es) in the dataset decreases, which significantly reduces the training time of a model. However, the data points removed by random undersampling may include important information, which may lead to a decrease in classification performance. Lemaître et al. [20] present a scikit-learn toolbox for resampling training data. In this paper, this toolbox was used to resample the training data, and Listing 1 presents example scikit-learn code for random undersampling.

Listing 1 Example scikit-learn code for random undersampling
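
A minimal sketch of what such a listing might look like, using the RandomUnderSampler class of the imbalanced-learn toolbox [20] (the variable names X_train and y_train for the training features and labels are illustrative):

from imblearn.under_sampling import RandomUnderSampler

# Randomly remove samples from the majority class(es); with the default
# 'auto' strategy, all classes except the minority class are reduced to
# the size of the minority class.
rus = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)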

Random oversampling over-samples the minority class(es) by picking samples at random with replacement from the minority class(es) (see Footnote 1). Since oversampling increases the number of cases in the training dataset, random oversampling increases the training time of a model. Random oversampling may also lead to overfitting because it adds replicated data to the dataset. Listing 2 presents example scikit-learn code for random oversampling.

Listing 2 Example scikit-learn code for random oversampling
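
A minimal sketch of such a listing, again using the imbalanced-learn toolbox (X_train and y_train are illustrative names for the training features and labels):

from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate samples from the minority class(es); with the default
# 'auto' strategy, all classes except the majority class are grown to the
# size of the majority class.
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)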

Random undersampling and random oversampling (RURO) applies the two methods together: the majority class(es) are undersampled and the minority class(es) are oversampled, as sketched below.
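
A minimal sketch of applying the two steps in sequence with imbalanced-learn; the class label 0 and the target count of 50000 in the undersampling step are hypothetical values used only for illustration, not the targets used in this study:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Step 1: randomly undersample the largest class(es). The class label 0 and
# the target of 50000 samples are hypothetical.
rus = RandomUnderSampler(sampling_strategy={0: 50000}, random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)

# Step 2: randomly oversample the remaining minority class(es) so that all
# classes end up with the same number of samples.
ros = RandomOverSampler(sampling_strategy='not majority', random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_under, y_under)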

Synthetic Minority Oversampling Technique (SMOTE), commonly used as a benchmark for oversampling [9, 34], improves on simple random oversampling by creating synthetic minority class samples [4], and addresses the problem of overfitting [5] that can happen with simple random oversampling, because the new data points generated by SMOTE are synthetic rather than mere duplications. To generate a new minority data point, a linear combination of two similar samples from the minority class is used [4]: new feature values are uniformly interpolated between a minority instance and one of its nearest neighbors. SMOTE only considers within-class neighbors. Listing 3 presents example scikit-learn code for SMOTE oversampling, including the sampling strategy used. In this work, random undersampling is applied in combination with SMOTE, hence this is referred to as RU-SMOTE.

Listing 3 Example scikit-learn code for SMOTE oversampling
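
A minimal sketch of such a listing; passing a dictionary of per-class target counts is one way of specifying the sampling strategy, and the class labels and counts shown here are purely illustrative:

from imblearn.over_sampling import SMOTE

# For each minority sample, SMOTE picks one of its k nearest minority-class
# neighbours and interpolates a new synthetic sample between the two.
# The per-class target counts below are hypothetical.
smote = SMOTE(sampling_strategy={'u2r': 100000, 'r2l': 100000},
              k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)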

ADASYN [14], a pseudo-probabilistic oversampling technique, uses a weighted distribution for different minority data points according to their level of difficulty in learning. With ADASYN, more synthetic data is generated for minority class examples that are harder to learn than for those minority examples that are easier to learn; the number of instances generated for each minority instance is based on a weighted distribution over its neighbors [1]. Listing 4 presents example scikit-learn code for ADASYN oversampling. In this work, random undersampling has been applied in combination with ADASYN, hence it is referred to as RU-ADASYN.

Listing 4 Example scikit-learn code for ADASYN oversampling
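
A minimal sketch of such a listing (X_train and y_train are illustrative names; the neighbourhood size of 5 is the library default, not necessarily the value used in this study):

from imblearn.over_sampling import ADASYN

# ADASYN generates more synthetic samples for minority instances that are
# surrounded by more majority-class neighbours (i.e. harder to learn).
ada = ADASYN(sampling_strategy='not majority', n_neighbors=5, random_state=42)
X_resampled, y_resampled = ada.fit_resample(X_train, y_train)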

Table 1 presents a brief comparison of Random oversampling, SMOTE and ADASYN.

Table 1 Brief comparison of Random Oversampling, SMOTE and ADASYN

Related works

Resampling stems from the class imbalance problem. Leevy et al. [19] stressed the importance of the class imbalance problem and presented a survey of works on it. The works were mainly divided into data-level methods and algorithm-level methods.

Some of the recent works on algorithm-level methods are: Johnson and Khoshgoftaar [17] examined existing deep learning techniques for addressing class imbalance; Raghuwanshi and Shukla [28] designed a novel BalanceCascade-based kernelized extreme learning machine to handle the problem of class imbalance; Luque et al. [21] presented a new way of measuring imbalance. A set of null-biased multi-perspective Class Balance Metrics were proposed which extended the concept of Class Balance Accuracy to other performance metrics.

There are also several studies on data-level methods, including comparisons of oversampling and undersampling methods for handling the class imbalance problem. Douzas and Bacao [7] developed a conditional version of Generative Adversarial Networks to approximate the true data distribution and generate data for the minority classes of various imbalanced datasets. Douzas et al. [8] presented an oversampling method based on k-means clustering and SMOTE which avoids the generation of noise and overcomes imbalances between and within classes.

More [25] reviewed a number of resampling techniques, including random undersampling of the majority class, random oversampling of the minority class, SMOTE, and many others, to handle unbalanced datasets and studied their effect on classification.

Amin et al. [2] surveyed six well-known sampling techniques: the mega-trend diffusion function (MTDF), SMOTE, ADASYN, couples top-N reverse k-nearest neighbor, the majority weighted minority oversampling technique, and the immune centroids oversampling technique. Their work showed that MTDF with rule generation based on genetic algorithms performed the best among the evaluated oversampling methods and rule-generation algorithms.

Abdi and Sattar [1] looked at different synthetic oversampling techniques and proposed a new oversampling algorithm based on the Mahalanobis distance. They showed that their proposed method generates fewer duplicated and overlapping data points than other oversampling techniques.

Cieslak et al. [6] used SMOTE to detect network traffic intrusions. Blagus and Lusa [4] investigated the theoretical properties of SMOTE and its performance on high-dimensional data. They considered two-class classification using Classification and Regression Trees, k-NN, linear discriminant analysis, random forests and support vector machines (SVM). Wallace et al. [33] also used SMOTE with SVM as the base classifier. Past works have also looked at the effects of dimensionality on SMOTE [16]. Hulse et al. [16] showed that in low-dimensional data, simple undersampling tends to outperform SMOTE. Ertekin et al. [10] and Radivojac et al. [27] evaluated the performance of SMOTE based on the number of samples. Song et al. [29] looked at the class imbalance problem in software defect prediction.

Many works also looked at resampling in the context of Big Data. Fernandez et al. [11] looked at the imbalance problem in the Big Data framework. Basgall et al. [3] developed SMOTE-BD, a fully scalable oversampling technique for imbalanced classification in Big Data Analytics. Terzi and Sagiroglu [30] developed a distributed cluster based resampling for imbalanced Big Data, which was designed to overcome both between-class and within-class imbalance problems in big data. Gutiérrez et al. [13] proposed SMOTE-GPU to efficiently handle large datasets (several millions of instances) on a wide variety of commodity hardware, including a laptop computer. Triguero et al. [32] independently managed the majority as well as minority classes. They undersampled the majority class and took advantage of Apache Spark’s in-memory operations to diminish the effects of the small sample size of the minority class.

In summary, several studies have looked at the class imbalance problem, in traditional data as well as big data, using various oversampling and undersampling techniques. However, none of them have analyzed the application of random undersampling and random oversampling used together (RURO), random undersampling with SMOTE (RU-SMOTE), and random undersampling with ADASYN (RU-ADASYN), using Spark's ANN multi-class classifier, on imbalanced network traffic cybersecurity data, which is the work performed in this study. This study takes a data-level approach in which resampling of the majority and minority classes is handled independently.

Description of the datasets

For experimentation, six popular datasets were used: KDD99 (see Footnote 1), UNSW-NB15 (see Footnote 2), UNSW-NB17 (Ecobee_Thermostat, Danmini_Doorbell, and Philips_B120N10_Baby_Monitor) (see Footnote 3), and UNSW-NB18 [18]. Next is a description of the datasets.

KDD99

The KDD99 dataset, considered a benchmark cybersecurity dataset for a long time, is a 41-feature dataset. The attack records of this dataset can be classified into four broad categories and 22 subcategories. Table 2 presents the distribution of benign and attack data (in the four broad categories). The data is extremely imbalanced: benign data makes up almost 20% of the data and DoS attacks make up almost all of the remaining 80%, hence the other attack categories have extremely few instances.

Table 2 % of benign and attack traffic in KDD99

UNSW-NB15

The UNSW-NB15 dataset, created by the Cyber Range Lab of the Australian Centre for Cyber Security, has 49 features [26]. There are 10 categories (9 attack categories plus 1 benign category). Table 3 presents the distribution of benign and attack data in UNSW-NB15. Here too, the data is highly imbalanced: benign traffic makes up 88.5% of the traffic, while the nine attack categories combined make up the other 11.5%. It can be noted that worms make up only 0.0069% of the data, hence there are extremely few cases.

Table 3 % of benign and attack traffic in UNSW-NB15

UNSW-NB17

The UNSW-NB17 dataset was generated by 9 IoT devices. There are 9 sub-datasets in UNSW-NB17, of which three were arbitrarily selected for this study: Ecobee_Thermostat, Danmini_Doorbell, and Philips_B120N10_Baby_Monitor. Each sub-dataset includes two of the most common IoT botnets, Gafgyt and Mirai [22]. Each of the botnets has 5 attack subcategories, hence there are 10 categories of attack traffic and 1 benign category. There are 115 independent features in this dataset. The csv files, extracted from pcap files by Kitsune [23], were used. Tables 4, 5, and 6 present the distribution of the benign and attack data in these datasets respectively. These datasets are imbalanced, but not as imbalanced as KDD99 or UNSW-NB15. In these datasets, Gafgyt_junk and Gafgyt_scan have close to 3% of the data each, but the other attack categories are a little more balanced, and the benign traffic is not disproportionately high, unlike in UNSW-NB15.

Table 4 % of benign and attack traffic in UNSW-NB17_Ecobee
Table 5 % of benign and attack traffic in UNSW-NB17_Doorbell
Table 6 % of Benign and Attack Traffic in UNSW-NB17_Philips

UNSW-NB18

The UNSW-NB18 BoT-IoT dataset was created by designing a realistic network environment in the Cyber Range Lab of UNSW Canberra Cyber [18]. Table 7 presents the distribution of benign and attack data in this dataset. Here again, the data is highly imbalanced: TCP attacks make up approximately 43% of the cases and UDP attacks make up approximately 54% of the cases, while normal traffic makes up only 0.031% of the dataset. This is almost the opposite of the pattern in UNSW-NB15.

Table 7 % of Normal and Attack Traffic in UNSW-NB18

Experimental design

Figure 1 shows the flow chart of the experimental design. Each dataset was split into a training set (70%) and a testing set (30%). Both the training and the testing datasets were pre-processed and standardized. The training dataset was then resampled and used to train the ANN model, and the testing dataset was used to evaluate the trained model.

Fig. 1 Flow chart of the experiments
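
A minimal sketch of the split and standardization steps in Fig. 1, using scikit-learn; the 70/30 split follows the description above, while the stratification and the exact preprocessing choices are not specified in the text and are shown here only as illustrative assumptions:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 70% training / 30% testing split of the feature matrix X and labels y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Standardize features; the scaler is fit on the training split and then
# applied to both splits.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# The training split is then resampled (see the listings above) and used to
# train the ANN; the untouched test split is used for evaluation.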

For each dataset, six sets of classifications were performed with the following combinations of resampling techniques.

  • No resampling (NR).

  • Random undersampling (RU).

  • Random oversampling (RO).

  • Random undersampling and random oversampling (RURO).

  • Random undersampling and SMOTE (RU-SMOTE).

  • Random undersampling and ADASYN (RU-ADASYN).

Resampling of the majority and minority classes was performed independently, meaning that each category in each dataset was considered individually, rather than taking a fixed percentage for under- or oversampling; a sketch of this per-class approach is shown below. Classification was performed using the Artificial Neural Network (ANN) available in Apache Spark. All experiments were run in two modes: (i) on a local machine using Scikit-Learn, and (ii) for the Big Data framework, Apache Spark, on an Amazon Web Services (AWS) EMR cluster. The AWS EMR cluster was set up with 3 nodes (one master node and two slave nodes), each an m5.xlarge EC2 instance.
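
A sketch of what independent, per-class resampling targets might look like for the RU-SMOTE combination, using imbalanced-learn; the class labels and target counts are hypothetical and only illustrate that each category is set individually:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Each class gets its own target count instead of one fixed percentage.
under_targets = {'benign': 200000, 'dos': 200000}   # shrink the large classes
over_targets = {'u2r': 50000, 'r2l': 50000}         # grow the small classes

rus = RandomUnderSampler(sampling_strategy=under_targets, random_state=42)
X_mid, y_mid = rus.fit_resample(X_train, y_train)

smote = SMOTE(sampling_strategy=over_targets, k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_mid, y_mid)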

Apache Spark

Apache Spark, an open source distributed cluster computing framework, is part of the Hadoop ecosystem, but has an edge over Hadoop in terms of speed due to its in-memory processing architecture. Spark can run up to 100 times faster than Hadoop for data and processes residing completely in memory [12]. The Spark framework also provides benefits such as scalability and fault tolerance [12], as well as a rich set of APIs that allow developers to perform many complex analytics operations out-of-the-box. This work took advantage of the Spark Core and Spark MLlib APIs.

Spark Core allows for basic operations on data including mapping, reducing, and filtering. These operations are available in Spark’s primary data structure, Resilient Distributed Datasets (RDDs) [12], which parallelizes computations in a transparent way. Apache Spark’s Machine Learning Library, MLlib, makes machine learning scalable and easy. MLlib provides tools including:

  1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.

  2. Featurization: feature extraction, transformation, dimensionality reduction, and selection.

  3. Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.

  4. Persistence: saving and loading algorithms, models, and Pipelines.

  5. Utilities: linear algebra, statistics, data handling, etc.

The ANN model used in this paper is the multilayer perceptron classifier of Spark MLlib.
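
A minimal usage sketch of Spark's MultilayerPerceptronClassifier is shown below; the hidden-layer sizes, iteration count, and the DataFrame names train_df and test_df are illustrative assumptions, not the configuration reported in this paper:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# layers: input size = number of features, output size = number of classes.
# The example sizes below correspond to KDD99 (41 features, benign + 4 broad
# attack categories); the two hidden-layer sizes are only an example.
num_features = 41
num_classes = 5
layers = [num_features, 64, 32, num_classes]

mlp = MultilayerPerceptronClassifier(layers=layers, maxIter=100, seed=42,
                                     featuresCol="features", labelCol="label")

model = mlp.fit(train_df)                 # train_df: DataFrame with assembled features
predictions = model.transform(test_df)    # adds a "prediction" column

# Spark's built-in evaluator; the macro metrics reported in this paper were
# computed separately from the predictions.
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="f1")
print(evaluator.evaluate(predictions))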

Artificial Neural Networks

As shown in Fig. 2, the ANN is a feed-forward neural network in which information moves from the input layer through the hidden layers to the output layer. A fully connected ANN model was used, with the number of neurons in the input layer set to the number of features in the data and the number of neurons in the output layer set to the number of classes. The intermediate layers used the sigmoid function, where z_i is the input to node i [31]:

Fig. 2 ANN model used

$$f\left( {z_{i} } \right) = \frac{1}{{1 + e^{{ - z_{i} }} }}$$
(1)

The sigmoid function smoothly maps its input to an output between zero and one. This allows the output of any individual node to be interpreted as a probability.

The output layer used the softmax function [31]:

$$f\left( {z_{i} } \right) = \frac{{e^{{z_{i} }} }}{{\mathop \sum \nolimits_{k = 1}^{N} e^{{z_{k} }} }}$$
(2)

The softmax function is often used as the activation function for the last layer of a neural network. This activation function turns numbers into probabilities that sum to one: the softmax function outputs a vector that represents the probability distribution over the list of potential outcomes.
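
As a small numerical illustration of Eqs. (1) and (2) (NumPy; the input vectors are arbitrary):

import numpy as np

def sigmoid(z):
    # Eq. (1): squashes each input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Eq. (2): exponentiate and normalize so the outputs sum to one
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # ~[0.119 0.5 0.881]
print(softmax(np.array([2.0, 1.0, 0.1])))    # ~[0.659 0.242 0.099], sums to 1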

Evaluation metrics

In this section, first a discussion of why the macro metrics were used is presented, and then the metrics themselves are presented.

Using macro metrics

For this work, macro precision, macro recall, and macro F1-score were used instead of the micro or weighted metrics to evaluate the results. The macro metrics compute the metrics independently for each class and then take the average, hence all classes, majority as well as minority, are weighted equally.

The micro metrics aggregate the contributions of all classes to compute the average metric, hence the results get skewed towards classes with larger case numbers. In a multi-class setting with highly imbalanced data, the micro metrics will often produce equal precision, recall and F1-score values that are artificially high, because the good performance on the majority data overly influences the micro metrics.

The weighted metrics compute the averages by taking the class size, that is, the number of cases for each class, into account, hence the "weighted" average. If a model recognizes majority data correctly but does not recognize minority data correctly, the weighted metrics will still be high; hence, in this case, the weighted metrics do not reflect the poor performance in classifying the minority data. Also, the weighted metrics may produce an F1-score that is not between precision and recall. Hence, even though the weighted metrics may look good, they were not used for this work.

Since three of the cybersecurity datasets used in this study are highly imbalanced, and the test data remains imbalanced even after the training data is resampled, the macro metrics were used as the evaluation metrics in this study. The macro metrics produce relatively lower results than the micro metrics because they treat all classes equally, so the poor performance on the minority classes lowers them. But, though the macro metrics reflect the poor performance of classifying minority data, it was deemed that, for these datasets, the macro metrics would better reflect the overall performance of classifying the data.

Metrics formulas

Below are the respective formulas for precision, recall and the F1-score. Although the micro, macro and weighted metrics are all computed slightly differently (as discussed in the previous section), all three use the same underlying formulas for precision, recall and the F1-score.

Precision is the positive predictive value, that is, the percentage of instances classified as attacks that truly are attacks, calculated by [24]:

$$Precision = \frac{TP}{TP + FP}$$
(3)

Recall or attack detection rate (ADR) is the effectiveness of a model in identifying an attack. The objective is to target a higher ADR. The ADR is calculated by [24]:

$$Recall = \frac{TP}{TP + FN}$$
(4)

F-measure is the harmonic mean of precision and recall. The higher the F-measure, the more robust the classification model. The F-measure is calculated by [24]:

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
(5)

True Positive (TP) is the number of positive records that were correctly labeled as positive. True Negative (TN) is the number of negative records that were correctly labeled as negative. False Positive (FP) is the number of negative records that were incorrectly labeled as positive. False Negative (FN) is the number of positive records that were incorrectly labeled as negative.
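
A minimal sketch of computing these metrics with scikit-learn (y_test and y_pred are illustrative names for the true and predicted labels):

from sklearn.metrics import precision_score, recall_score, f1_score

# Macro averaging: compute each metric per class, then take the unweighted
# mean, so minority classes count as much as majority classes.
macro_precision = precision_score(y_test, y_pred, average='macro')
macro_recall = recall_score(y_test, y_pred, average='macro')
macro_f1 = f1_score(y_test, y_pred, average='macro')

# For comparison: 'micro' aggregates over all instances (dominated by the
# majority classes), 'weighted' weights each class by its number of cases.
micro_f1 = f1_score(y_test, y_pred, average='micro')
weighted_f1 = f1_score(y_test, y_pred, average='weighted')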

Results and discussion

In this section, first, the classification results for all six datasets, with no resampling, is presented. This will be used as a benchmark for analyzing the results. Then, for each dataset, resampling results and the classification results using the different resampling techniques, are presented. The ANN classification was done in two modes: (i) on the Big Data framework using Spark’s Machine Learning Library; and (ii) using Scikit Learn on a local machine. Observations and discussions follow each set of results.

Classification with original datasets (no resampling)

The first set of classifications were done with the original six datasets, that is, with no resampling. These results form the benchmark for the ANN classification results.

Table 8 presents the macro precision, macro recall, macro F1 score and training time taken for ANN classification with no resampling on AWS with Spark for all six datasets. Similarly, Table 9 presents the same metrics for ANN classification with no resampling on the local machine for all six datasets. The testing time was not recorded since the training time is the more significant of the two. Figure 3 graphically presents the macro precision, macro recall and macro F1 score for all six datasets run on Spark with no resampling. The results on the local machine show a similar trend, hence are not presented graphically.

Table 8 ANN Classification results on AWS with Spark (no resampling)
Table 9 ANN classification results on Scikit-Learn on local machine (no resampling)
Fig. 3 Classification results with no resampling on AWS

Observations and discussion

  • ANN classification on Scikit-Learn has better performance than ANN classification on Spark. The macro precision, macro recall, and macro F1-score are higher on the ANN classification on Scikit-Learn.

  • The ANN classification model on Spark trains faster than the ANN classification model on the local machine. This is expected since Spark is a Big Data framework that performs parallel processing.

  • UNSW-NB15 has one dominant category: the benign category comprises almost 88% of the cases, and this imbalance causes the low results for this non-resampled dataset. BoT-IoT has two categories that together account for most of the cases, TCP (43%) and UDP (54%), hence this imbalance also causes low results. The results of UNSW-NB17 are relatively high even without resampling, mainly because the three UNSW-NB17 datasets are relatively balanced compared to the other three datasets.

Classification with the Resampled datasets

This section presents the results of the resampling and classification on the six different datasets, KDD99 (see Footnote 1), UNSW-NB15 (see Footnote 2), UNSW-NB17 (Ecobee_Thermostat, Danmini_Doorbell, and Philips_B120N10_Baby_Monitor) (see Footnote 3), and UNSW-NB18 [18]. For all datasets, macro results are presented. For two of the datasets, KDD99 and UNSW-NB15, the micro metrics are also presented (for the AWS runs); these metrics are not presented for the rest of the datasets because the micro results were artificially high, with almost equal micro precision, micro recall and micro F1 score. The confusion matrices are presented only for the highly imbalanced datasets, since resampling had little influence on the datasets that were not highly imbalanced. Also, in the respective resampling sections, for brevity, only the RU, RO, and RURO distributions are presented, though RU-SMOTE and RU-ADASYN resampling was also performed for the classifications.

Experimentation on KDD99

The first section presents the resampling of KDD99 and then the classification results are presented. An analysis of the KDD99 results is presented in the observations and discussion section.

Resampling KDD99

Table 10 presents the number of samples before resampling, after RU, after RO, and after RURO, and Fig. 4 presents the number of samples before resampling, after RU, and after RO. Before Resampling represents 70% of the original KDD99 dataset, which was used for training the model. From Table 10, it can be noted that u2r had only 40 instances and r2l had only 794 instances before resampling, so oversampling makes a big difference for these two attacks. With RU, the number of benign and DoS instances was reduced to the number of Probe instances, making these three categories equal, while there was still a low number of instances for u2r and r2l; hence, with RU, the data still appears imbalanced overall. With RO, the number of Probe, u2r, and r2l instances was made the same, although the number of benign and DoS instances was still high. With RURO, the number of instances for each category was made equal, hence RURO is not shown in Fig. 4.

Table 10 Resampling of KDD99
Fig. 4 Resampling of KDD99

Classification results for KDD99

Table 11 presents the ANN classification results for KDD99 on AWS using Spark and Table 12 presents the ANN classification results for KDD99 run on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE, and RU-ADASYN for KDD99. Table 11 also presents the results of the micro precision, micro recall and micro F1 score. The training time of the models was also recorded in Tables 11 and 12. Figure 5 presents a comparison of the micro and macro metrics run on AWS for no resampling and random undersampling. It can be noted that the micro precision, micro recall and micro F1 score were almost equal as well as artificially high, hence the evaluations were based on the macro metrics.

Table 11 ANN Classification results for KDD99 on AWS with Spark for various resampling methods
Table 12 ANN classification results for KDD99 on local machine for various resampling methods
Fig. 5 Comparison for micro and macro metrics on AWS

Figure 6 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, hence are not presented.

Fig. 6 ANN classification results for KDD99 on AWS with Spark for various resampling methods

Confusion matrices for KDD99

Tables 13, 14, 15, 16, 17 and 18 show the confusion matrices for the various resampling methods for the AWS runs on Spark. The predicted labels vs. the true labels are shown, that is, whether an instance predicted as attack type 1 really was attack type 1. The categories that had a very low number of instances are marked with an asterisk, and the increases in minority data identification are in italics.

Table 13 Confusion matrix: KDD99 NR
Table 14 Confusion Matrix: KDD99 RU
Table 15 Confusion Matrix: KDD99 RO
Table 16 Confusion Matrix: KDD99 RURO
Table 17 Confusion Matrix: KDD99 RU-SMOTE
Table 18 Confusion Matrix: KDD99 RU-ADASYN
Observations and discussion

A few conclusions can be drawn from the above sets of results:

  • The micro precision, micro recall and micro F1 score showed artificially high values that were almost the same for NR and RU (Fig. 5), hence they were not considered useful for any further analysis.

  • There is almost no overall significant difference between the ANN classification results on AWS and those on the local machine in terms of macro precision, macro recall, and macro F1 score. After oversampling, though, it took longer to run on the local machine than on AWS.

  • On both AWS and the local machine, when the minority data is increased by oversampling or majority data is decreased by undersampling, the macro precision decreases, and the macro recall increases. Oversampling improves the macro recall significantly. Macro precision decreasing implies that the ratio of the false positive to true positive is going up, and the macro recall increasing implies that the ratio of the false negative to true positive is going down. This means that, for this set of experiments, the false positives are going up and the false negatives are going down.

  • The confusion matrices also show an increase in the number of correctly classified cases for the very low minority classes (marked with an asterisk) with resampling (results are in italics in Tables 13, 14, 15, 16, 17 and 18), with the best results for RURO and RU-SMOTE. From Table 10 it can be noted that RURO had an equal number of instances for all the attack types. And, even though RU still had an imbalanced distribution, that distribution was better than with no resampling, and RU also performed better than no resampling.

  • Generally, the F1 score went down for both undersampling and oversampling. It went slightly up only for RU on AWS, but not significantly.

  • Except for RO, the training time decreased in all resampling scenarios, for both the local machine as well as AWS, and of course, the training time on AWS was a lot shorter than on the local machine (though it was higher on AWS when no resampling was done).

  • From Table 11 (AWS), it can be observed that RURO's macro recall was the highest, at 96%, while RU-SMOTE's and RU-ADASYN's macro recall were very close, at 95.59%. RU's macro recall (90.5%) was lower than that of the other resampling methods, but a lot better than NR (73%).

  • From Table 12 (local machine), it can be observed that RU-SMOTE and RU-ADASYN performed the best in terms of macro recall, at 96%. RU again had the lowest macro recall of all the resampling methods (88%), but performed better than NR (83%).

Experimentation on UNSW-NB15

The first section presents the resampling of UNSW-NB15 and then the classification results are presented. An analysis of the UNSW-NB15 results is presented in the observations and discussion section.

Resampling UNSW-NB15

Table 19 presents the number of samples before resampling, after RU, after RO, and after RURO, and Fig. 7 graphically presents the data before resampling, after RU, and after RO. Before Resampling represents 70% of the original UNSW-NB15 dataset, which was used for training the model. From Table 19 it can be noted that, with RU, though the number of benign and generic instances was reduced, some of the other attacks like Shellcode, Backdoors and Worms still had a low number of instances; overall, with RU, the data was still imbalanced. RO makes the number of instances equal for all the attacks except the generic traffic, while the number of benign instances was still very high compared to the rest, as shown in Fig. 7. With RURO, all the categories were made equal.

Table 19 Resampling of UNSW-NB15
Fig. 7 Resampling of UNSW-NB15

Classification results for UNSW-NB15

Table 20 presents the ANN classification results for UNSW-NB15 on AWS using Spark and Table 21 presents the ANN classification results for UNSW-NB15 run on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE and RU-ADASYN for UNSW-NB15. Table 20 also presents the results of the micro precision, micro recall and micro F1 score. The training time was also recorded in Tables 20 and 21 respectively. Figure 8 presents a comparison of the micro and macro metrics on AWS for no resampling and random oversampling. It can be noted that the micro precision, micro recall and micro F1 score were almost equal as well as artificially high, hence the evaluations were done based on the macro metrics.

Table 20 ANN Classification results for UNSW-NB15 on AWS with Spark for various resampling methods
Table 21 ANN classification results for UNSW-NB15 on local machine for various resampling methods
Fig. 8 Comparison of the micro and macro metrics for UNSW-NB15

Figure 9 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, hence are not presented.

Fig. 9 ANN classification results for UNSW-NB15 on AWS with Spark for various resampling methods

Confusion matrices for UNSW-NB15

Tables 22, 23, 24, 25, 26 and 27 show the confusion matrices using the various resampling methods for the AWS runs on Spark. The predicted label vs the true labels are shown. The categories that had a really low number of instances are marked with an asterisk and the increases in the minority data identification are in italics.

Table 22 Confusion Matrix: NB15 NR
Table 23 Confusion Matrix: NB15 RU
Table 24 Confusion Matrix: NB15 RO
Table 25 Confusion matrix: NB15 RURO
Table 26 Confusion Matrix: NB15 RU-SMOTE
Table 27 Confusion Matrix: NB15 RU-ADASYN
Observations and discussion
  • The micro precision, micro recall and micro F1 score showed artificially high values that were almost the same for NR and RO (Fig. 8), hence they were not considered useful for any further analysis.

  • There is almost no overall significant difference between the ANN classification results on AWS and ANN classification results on the local machine in terms of the macro precision, macro recall, and macro F1 score. After oversampling though, it took longer to run on the local machine than on AWS.

  • When the minority data is increased by oversampling or majority data is decreased by undersampling, both the macro precision and macro recall increase, though resampling improves the macro recall significantly. Macro precision increasing implies that the ratio of the false positive to true positive is going down, and the macro recall increasing implies that the ratio of the false negative to true positive is going down. So, for this set of experiments, the true positives went up.

  • The confusion matrices also show an increase in the number of correctly classified cases for the very low minority classes (marked with an asterisk) with resampling (results are in italics in Tables 22, 23, 24, 25, 26 and 27). Though RU did not perform as well as the other resampling measures, it was at least slightly better than NR. RURO performed the best, though the other resampling options, RO, RU-SMOTE and RU-ADASYN, performed almost as well.

  • With respect to training time, on the local machine, except for undersampling, the training time went up in all scenarios of oversampling. But, on AWS, except for RO, the training time went significantly down. And of course, comparing the local machine to AWS, AWS had a lot lower training time in all cases.

  • From Table 20 (AWS), it can be observed that RURO and RU-SMOTE’s macro recall were the highest, and very close, at 74.4% and 74.6% respectively. RU’s macro recall (42%) was lower than the recall of the other resampling methods, but a lot better than NR (32%).

  • From Table 21 (local machine), it can be observed that RURO's macro recall was the highest at 77.5%, and RU-SMOTE's and RU-ADASYN's macro recall were pretty close, at 76.4%. Again, RU had the lowest macro recall of all the resampling methods (50%), but performed better than NR (41%).

Experimentation on UNSW-NB18

The first section presents the resampling of UNSW-NB18 and then the classification results are presented. An analysis of the UNSW-NB18 results is presented in the observations and discussion section.

Resampling UNSW-NB18

Table 28 presents the number of samples before resampling, after RU, after RO, and after RURO and Fig. 10 graphically presents the data before resampling, after RU, and after RO. Before Resampling represents 70% of the original UNSW-NB18 dataset, which was used for training the model. From Table 28 it can be noted that Data Exfiltration and Keylogging had only 4 and 50 instances respectively before resampling, so oversampling makes a big difference for these two attacks. With RU, mainly the number of TCP and UDP attacks, which had the most instances, was reduced. But overall, with RU as well as with RO, the data was still imbalanced. TCP and UDP still have a lot more instances.

Table 28 Resampling of UNSW-NB18 (BoT-IoT)
Fig. 10 Resampling of UNSW-NB18 (BoT-IoT)

Classification results for UNSW-NB18 (BoT-IoT)

Table 29 presents the ANN classification results for UNSW-NB18 (BoT-IoT) on AWS using Spark and Table 30 presents the ANN classification results for UNSW-NB18 (BoT-IoT) on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE, and RU-ADASYN for UNSW-NB18. The training time for the model was also recorded in Tables 29 and 30 respectively. Figure 11 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, hence are not presented.

Table 29 ANN classification results for UNSW-NB18 (BoT-IoT) on AWS with Spark for various resampling methods
Table 30 ANN classification results for UNSW-NB18 (BoT-IoT) on local machine for various resampling methods
Fig. 11 ANN classification results for UNSW-NB18 (BoT-IoT) on AWS with Spark for various resampling methods

Confusion matrices for UNSW-NB18

Tables 31, 32, 33, 34, 35 and 36 show the confusion matrices using the various resampling methods for the AWS runs on Spark. The predicted label vs the true labels are shown. The categories that had a really low number of instances are marked with an asterisk and the increases in the minority data identification are in italics.

Table 31 Confusion matrix: NB18 NR
Table 32 Confusion matrix: NB18 RU
Table 33 Confusion matrix: NB18 RO
Table 34 Confusion matrix: NB18 RURO
Table 35 Confusion matrix: NB18 RU-SMOTE
Table 36 Confusion Matrix: NB18 RU-ADAYSN
Observations and discussion
  • There is almost no overall significant difference between the ANN classification results on AWS and ANN classification results on the local machine in terms of the macro precision, macro recall, and macro F1 score. After oversampling though, it took longer to run on the local machine than on AWS.

  • When the minority data is increased by oversampling or the majority data is decreased by undersampling, the macro recall or ADR increases. Oversampling improves the macro recall significantly. The macro precision went up in only one case, with RO; in all other cases, the macro precision decreased. Macro precision decreasing implies that the ratio of false positives to true positives is going up, and since the macro recall increased, the ratio of false negatives to true positives is going down. So, in this set of experiments, it can be concluded that the false positives went up and the false negatives went down.

  • The confusion matrices also show an increase in the number of correctly classified cases for the very low minority classes (marked with an asterisk) with RO, RURO, RU-SMOTE, and RU-ADASYN (results are in italics in Tables 31, 32, 33, 34, 35 and 36), though RU-ADASYN did not do as well as the other three resampling methods. It can be noted from Table 32 that RU did not have any effect on these results. From Table 28 it can be noted that Data_Exfiltration and Keylogging still have a very small number of instances after RU, which is why the ANN classifier could not train effectively for RU.

  • Using Spark, only with RO was the training time the same as the benchmark (no resampling); in all other resampling cases the training time decreased. On the local machine, all cases of resampling had lower training times. But again, Spark took a lot less time to train the model than the local machine.

  • From Table 29 (AWS), it can be observed that RURO's and RU-SMOTE's macro recall were the highest, and very close, at 85.87% and 85.84% respectively. In this case RU-ADASYN did not perform as well as RURO or RU-SMOTE. RU's macro recall (54.46%) was lower than that of the other resampling methods, but a lot better than NR (45.14%).

  • From Table 30 (local machine), it can be observed that RURO's macro recall was the highest at 88.78% and RU-SMOTE's macro recall was pretty close, at 88.1%. In this case, too, RU-ADASYN did not perform as well as RURO or RU-SMOTE. RU, again, had the lowest macro recall of all the resampling methods (63.6%), but performed better than NR (57.6%).

Experimentation on NB17-Ecobee

The first section presents the resampling of NB17-Ecobee and then the classification results are presented. An analysis of the NB17-Ecobee results is presented in the observations and discussion section.

Resampling NB17-Ecobee

Table 37 presents the number of samples before resampling, after RU, after RO, and after RURO, and Fig. 12 graphically presents the data before resampling, after RU, and after RO. The Before Resampling column represents 70% of the original NB17-Ecobee dataset, which was used for training the model. Figure 12 shows the imbalance in the data before resampling. In this dataset there was a lower number of benign cases (lower than any of the attacks), and there were no attack categories with an extremely low number of cases. After RU, the data was more balanced than before resampling, but RO gave much the same pattern as before resampling. The data was balanced for each category with RURO, hence RURO is not shown in Fig. 12.

Table 37 Resampling of NB17-Ecobee
Fig. 12 Resampling of NB17-Ecobee

Classification results for NB17-Ecobee

Table 38 presents the ANN classification results for NB17-Ecobee on AWS using Spark and Table 39 presents the ANN classification results for NB17-Ecobee on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE, and RU-ADASYN for NB17-Ecobee. The training time (in seconds) for the model was also recorded in Tables 38 and 39 respectively. Figure 13 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, so they are not presented graphically.

Table 38 ANN classification results for NB17-Ecobee on AWS with Spark for various resampling methods
Table 39 ANN classification results for NB17-Ecobee on local machine for various resampling methods
Fig. 13 ANN classification results for NB17-Ecobee on AWS with Spark for various resampling methods

Observations and discussion
  • Resampling does not seem to have any effect on macro precision, macro recall or macro F1 score for this dataset. In fact, on AWS (Table 38), it can be observed that NR and RU performed better than the other resampling methods. On the local machine (Table 39), however, NR and the resampling measures gave almost the same percentage for macro recall.

  • Except for RO, the training time is lower than the benchmark in all other cases. And of course, the training time on the local machine is higher than AWS.

Experimentation on NB17-Danmini

The first section presents the resampling of NB17-Danmini and then the classification results are presented. An analysis of the NB17-Danmini results is presented in the observations and discussion section.

Resampling NB17-Danmini

Table 40 presents the number of samples before resampling, after RU, after RO, and after RURO, and Fig. 14 graphically presents the data before resampling, after RU, and after RO. Before Resampling represents 70% of the original NB17-Danmini dataset, which was used for training the model. Figure 14 shows the imbalance in the data before resampling. In this dataset, Gafgyt_junk and Gafgyt_scan had a lower number of cases, but not as low as some of the extremely small attack categories in KDD99, UNSW-NB15 or UNSW-NB18. After RU the data was more balanced than before resampling, but RO gave much the same pattern as before resampling. The data was balanced for each category with RURO, hence RURO is not shown in Fig. 14.

Table 40 Resampling of NB17-Danmini
Fig. 14 Resampling of NB17-Danmini

Classification results for NB17-Danmini

Table 41 presents the ANN classification results for NB17-Danmini on AWS using Spark and Table 42 presents the ANN classification results for NB17-Danmini on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE and RU-ADASYN for NB17-Danmini. The training time (in seconds) for the model was also recorded in Tables 41 and 42 respectively. Figure 15 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, hence were not presented.

Table 41 ANN Classification results for NB17-Danmini on AWS with Spark for various resampling methods
Table 42 ANN classification results for NB17-Danmini on local machine for various resampling methods
Fig. 15 ANN classification results for NB17-Danmini on AWS with Spark for various resampling methods

Observations and discussion
  • Resampling does not seem to have any effect on macro precision, macro recall or macro F1 score for this dataset. On AWS (Table 41), NR had a macro recall of 84% while the resampling measures had macro recalls of 85–87%. On the local machine (Table 42), however, NR and the resampling measures gave almost the same percentage for macro recall, a little above 90%.

  • On AWS, except for RO, the training time went down in all cases. On the local machine, however, the time went up for RU, RO, and RU-ADASYN. And again, overall, it took much longer to run on the local machine.

Experimentation on NB17-Philips

The first section presents the resampling of NB17-Philips and then the classification results are presented. An analysis of the NB17-Philips results is presented in the observations and discussion section.

Resampling NB17-Philips

Table 43 presents the number of samples before resampling, after RU, after RO, and after RURO, and Fig. 16 graphically presents the data before resampling, after RU, and after RO. Before Resampling represents 70% of the original NB17-Philips dataset, which was used for training the model. Figure 16 shows the imbalance in the data before resampling. In this dataset, Gafgyt_junk and Gafgyt_scan had a lower number of cases, but again, not as low as some of the attack categories in KDD99, UNSW-NB15 or UNSW-NB18. After RU, the data was more balanced (as shown in Fig. 16), but after RO the distribution of the data closely followed the distribution before resampling. After RURO, the number of cases was balanced for each category, hence RURO is not included in Fig. 16.

Table 43 Resampling of NB17-Philips
Fig. 16 Resampling of NB17-Philips

Classification results for NB17-Philips

Table 44 presents the ANN classification results for NB17-Philips on AWS using Spark and Table 45 presents the ANN classification results for NB17-Philips on the local machine with Scikit-Learn. The results of macro precision, macro recall and macro F1 score are presented for NR, RU, RO, RURO, RU-SMOTE and RU-ADASYN for NB17-Philips. The training time (in seconds) for the model was also recorded in Tables 44 and 45 respectively. Figure 17 presents the graphical results of the different resampling methods using Spark. The results of the different resampling methods on the local machine show a similar trend, so they are not presented.

Table 44 ANN classification results for NB17-Philips on AWS with Spark for various resampling methods
Table 45 ANN classification results for NB17-Philips on local machine for various resampling methods
Fig. 17 ANN classification results for NB17-Philips on AWS with Spark for various resampling methods

Observations and discussion
  • Resampling does not seem to have any effect on macro precision, macro recall or macro F1 score in this dataset. From both Tables 44 and 45, it can be observed that almost all the resampling methods performed close to NR, on both AWS and the local machine.

  • On AWS, except for RO, the training time went down in all cases. On the local machine, however, the time went up for RO, RU-SMOTE and RU-ADASYN. And again, overall, it took much longer to run on the local machine.

Conclusion

Five different forms of resampling were applied to six different datasets. Three of these datasets, KDD99, UNSW-NB15, and UNSW-NB18 (BoT-IoT), can be considered highly imbalanced, while the three UNSW-NB17 datasets can be considered less imbalanced. The following conclusions can be drawn from the resampling:

  1. Oversampling increases the training time while undersampling decreases it. This is natural because oversampling increases the number of cases in the training data, while undersampling decreases it.

  2. In the highly imbalanced datasets, both oversampling and undersampling increase the macro recall significantly. This means that the ratio of the false negatives to the true positives decreases, so the ANN model recognized more minority data correctly, which was also shown by the confusion matrices.

    In some cases, the macro precision decreased, meaning that the ANN model incorrectly recognized some majority data as minority data. A summary of the behavior of oversampling and undersampling on the highly imbalanced datasets is presented in Table 46.

    Table 46 Summary for oversampling and undersampling highly imbalanced datasets

    With no resampling, micro precision and micro recall were high, but the macro precision and macro recall were relatively lower. This is because, although the model recognized almost all majority instances correctly, it recognized most minority instances incorrectly, typically as belonging to the majority class, which lowered the macro precision and macro recall.

    With resampling, micro precision and micro recall were still high. The macro recall increased after resampling because the model recognized more minority instances as the minority class, as was also reflected in the confusion matrices. However, the macro precision decreased after resampling because the model also recognized some majority instances as minority instances. The number of such misrecognized majority instances is small relative to the total number of majority instances, but large relative to the number of minority instances, which decreases the precision of the minority classes. So, with resampling, generally, more minority instances were recognized correctly. Table 47 presents a summary of the behavior of recognizing the minority and majority instances in highly imbalanced datasets.

    Table 47 Summary for recognizing minority and majority instances on highly imbalanced datasets
  3. Also, for the highly imbalanced datasets NB15 and NB18, from the confusion matrices it appears that RURO performed the best in terms of identifying minority cases, though in some cases this was only a small improvement over RU-SMOTE and RU-ADASYN. For KDD99, RURO and RU-SMOTE can be considered to have performed equally well in identifying minority cases.

  4. For the highly imbalanced datasets KDD99, NB15 and NB18, in most cases RURO and RU-SMOTE performed the best in terms of macro recall. RU usually did not perform as well as the other resampling measures in terms of macro recall, but performed better than NR. RO always performed better than RU in terms of macro recall, and was sometimes comparable to RURO, RU-SMOTE, and RU-ADASYN.

  5. If the data is not extremely imbalanced, as for example in NB17, resampling makes no difference, as shown in Table 48.

    Table 48 Oversampling and Undersampling in not extremely Imbalanced Datasets

This could be because:

  i. Since the dataset is not extremely imbalanced, the majority data does not have a very strong influence on the model; the minority data has enough influence on the model, hence the model can classify minority data well.

  ii. Imbalance may not be the reason for the inaccuracy. Resampling improves accuracy by reducing the extent of imbalance; if the inaccuracy is not caused by the imbalance, resampling will not be able to improve the accuracy.

Table 49 presents a summary of the behavior of recognizing minority and majority instances in the not highly imbalanced datasets.

Table 49 Summary for recognizing minority and majority instances in not highly imbalanced datasets