Two different experimental set-ups are proposed: the first is performed in a no-overfitting regime and the second in an overfitting regime. For each experiment, a cross-validation procedure is set up. The training dataset is split into 5 training/validation folds: the models are trained using 4/5 of the training set and validated with the remaining fifth, and this procedure is repeated 5 times, rotating the part that is left out for validation. For each run, an optimal model is selected according to the accuracy on the validation dataset. Then, adversarial examples are crafted on the test set with each method against the selected snapshot. To evaluate the results, the Attack Success Rate (ASR) is computed, defined as 1 − accuracy obtained by the model on each adversarial set. The accuracy on the original test samples is also provided. In addition, the distance metrics and adversarial robustness metrics described above are computed to show the difference in behaviour between both regimes. This process is performed for each fold of the cross-validation scenario.
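As an illustration of this procedure, the following minimal sketch (assuming a Keras model and scikit-learn for the fold splitting; all variable names are hypothetical) shows the 5-fold rotation and the ASR computation as 1 − accuracy on an adversarial set.

```python
import numpy as np
from sklearn.model_selection import KFold

def attack_success_rate(model, x_adv, y_true):
    """ASR = 1 - accuracy of the model on the adversarial set."""
    _, acc = model.evaluate(x_adv, y_true, verbose=0)
    return 1.0 - acc

# 5 training/validation folds over the training set (x_train, y_train are hypothetical)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(x_train):
    x_tr, y_tr = x_train[train_idx], y_train[train_idx]   # 4/5 of the training set
    x_val, y_val = x_train[val_idx], y_train[val_idx]     # remaining fifth for validation
    # train the model, keep the snapshot with the best validation accuracy,
    # craft adversarial examples on the test set and compute the ASR (not shown)
    ...
```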
To show the results of the different runs, each line in the plots represents the mean value obtained at each epoch, and vertical error bars represent one standard deviation around the mean. Analogously, when results are reported numerically in tables, they also contain the mean and standard deviation.
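For instance, assuming the per-epoch validation accuracies of the five runs are stacked in an array (hypothetical name acc_runs, of shape (5, n_epochs)), the curves and error bars can be produced as follows.

```python
import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, acc_runs.shape[1] + 1)
plt.errorbar(epochs, acc_runs.mean(axis=0), yerr=acc_runs.std(axis=0), capsize=2)
plt.xlabel("Epoch")
plt.ylabel("Validation accuracy")
plt.show()
```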
3.1. No Overfitting
The first set of experiments is performed using appropriate parameters to reduce the validation error. This is usually the main goal when training a model for a specific task. Consequently, the generated model is expected to generalize, i.e., to perform accurately on samples that are not present in the training dataset but come from the same distribution, known as the test set.
In the case of the CIFAR-10 dataset, training has been performed using a LeNet architecture as described in [25], implemented in Keras with a TensorFlow backend. Training was performed for 35 epochs with a learning rate low enough to avoid overfitting. The performance is shown in Figure 3. The key point is to verify that both training and validation accuracies grow with the same positive progression.
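A minimal sketch of this training set-up is given below. The layer sizes correspond to a generic LeNet-style network and may differ from the exact architecture of [25]; the optimizer and learning-rate value are placeholders, since only the number of epochs (35) is fixed by the text.

```python
from tensorflow.keras import layers, models, optimizers

def build_lenet(input_shape=(32, 32, 3), n_classes=10):
    """Generic LeNet-style CNN for CIFAR-10-sized inputs."""
    return models.Sequential([
        layers.Conv2D(6, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (5, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_lenet()
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),  # placeholder rate, not the paper's value
              loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(x_tr, y_tr, epochs=35, validation_data=(x_val, y_val))
```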
Given the results for the previous dataset, the same set-up and parameters are employed for the STL-10 dataset. Its performance is also shown in Figure 3. As in the previous case, the resulting models are trained without overfitting, so in both cases they are suitable for continuing the experimentation.
Regarding the performance of the training process, it should be noted that the objective is to clearly differentiate between normal and overfitting regimes, independently of the absolute values of, for example, validation accuracy. As suggested in the experimental proposal, when the model is trained in a normal regime, the validation accuracy increases progressively, which is achieved here. In the overfitting regime, however, this accuracy should reach a peak and then start dropping, as observed in the subsequent experimentation. In both cases, the images used for adversarial generation are those correctly predicted by the given model, so there is no drawback in discarding some images due to the relatively low nominal test performance. In particular, the STL-10 dataset shows a wider gap between training and validation accuracy, but the trends persist.
In this case, the models trained in this non-overfitting regime have been tested against the CW, FGSM and PGD attack methods for both datasets (CIFAR-10 and STL-10). In all cases, they have been set up to perform untargeted attacks (with no predefined destination class). Moreover, the minimum confidence value parameter is given to CW, and a standard maximum perturbation of 0.3 is set for PGD and FGSM.
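The section does not name the attack library; as one possible realization, the sketch below uses IBM's Adversarial Robustness Toolbox (ART), which implements all three methods. The maximum-perturbation bound of 0.3 for FGSM/PGD comes from the text, while the L∞ norm choice, the PGD step size and iteration count, and the use of the L2 form of CW are assumptions.

```python
import numpy as np
from art.estimators.classification import KerasClassifier
from art.attacks.evasion import CarliniL2Method, FastGradientMethod, ProjectedGradientDescent

# Wrap the selected Keras snapshot (pixel values assumed in [0, 1]).
classifier = KerasClassifier(model=model, clip_values=(0.0, 1.0))

cw = CarliniL2Method(classifier, confidence=0.0, targeted=False)        # minimum confidence value
fgsm = FastGradientMethod(classifier, norm=np.inf, eps=0.3)             # maximum perturbation 0.3
pgd = ProjectedGradientDescent(classifier, norm=np.inf, eps=0.3,
                               eps_step=0.01, max_iter=40)              # assumed step/iterations

x_adv_cw = cw.generate(x=x_test)
x_adv_fgsm = fgsm.generate(x=x_test)
x_adv_pgd = pgd.generate(x=x_test)
```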
The epoch with the best validation accuracy is selected to craft the adversarial examples. The remaining models are then evaluated on the same adversarials. Here we are leveraging a concept known as adversarial transferability [26]. The architecture remains the same but with increasing/decreasing levels of generalization throughout the training process. For this reason, the adversarials are considered to be highly transferable and, therefore, suitable for studying their effects at different stages of the training and under different parameter conditions (overfitted vs. regularized).
As the training is performed in a normal regime, the 35th and last epoch is usually the one with the best validation accuracy; see Figure 3. The adversarial examples generated with each method are classified with the snapshot models saved at every epoch. Then, the accuracy on each AE (Adversarial Example) set is obtained.
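Assuming the per-epoch snapshots were saved during training (the checkpoint paths below are hypothetical), evaluating the fixed adversarial set against every snapshot and deriving the ASR could look as follows.

```python
from tensorflow.keras.models import load_model

asr_per_epoch = []
for epoch in range(1, 36):
    snapshot = load_model(f"snapshots/lenet_epoch_{epoch:02d}.h5")  # hypothetical paths
    _, acc = snapshot.evaluate(x_adv_cw, y_test, verbose=0)         # accuracy on the AE set
    asr_per_epoch.append(1.0 - acc)                                 # ASR = 1 - accuracy
```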
Figure 4 shows the test accuracy curve and the Attack Success Rate (ASR) for the attacks that have been performed. This metric represents the percentage of adversarial examples that are misclassified by the model.
As the performance of the CW attack on these models shows, the attack success rate is lower in the first epochs (from the 5th epoch onwards; earlier epochs should not be considered, as the models are not yet stable), when the model is still a poor classifier. However, when the test accuracy increases, the performance of the model on the AE set drops significantly. There is a remarkable trend showing that, when the model is well trained for generalization, the effect of adversarial perturbations is stronger.
Regarding the differences between the CW attack and FGSM/PGD, the former is the best in terms of the quality of the generated adversarials; that is, they are the most similar to the original images in terms of distance metrics. However, in terms of robustness, they can be discovered with small changes in the decision boundary, as will be shown with the overfitted models (and as also happens here in the normal regime). Depending on the training state, better generalization can also cause adversarials that are very close to the originals to fall within the same distribution. However, the other two attacks obtain a greater success rate because their adversarials are crafted at a larger distance from the test distribution, as observed for both the L2 (Euclidean distance) and L∞ (maximum perturbation) metrics.
Table 2 shows the performance of the CW method with the CIFAR-10 models. First, the L0 metric shows the mean number of pixels modified. The maximum value would be 1024 (32 × 32), so more than 98% of the pixels have some alteration from the original image. However, they are modified with very low values, as shown by the L2 (Euclidean) distance and especially by the very low maximum perturbation indicated by the L∞ metric. The latter indicates the maximum perturbation added to the intensity of any pixel in the image, considering that intensities are represented in the range 0–1.
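These distances can be reproduced directly from the original and adversarial arrays. A minimal sketch (pixel intensities assumed in [0, 1], images in NHWC layout) that counts a pixel as modified if any of its channels change, so the L0 maximum matches 32 × 32 = 1024 for CIFAR-10:

```python
import numpy as np

def perturbation_metrics(x_orig, x_adv):
    diff = x_adv - x_orig                               # shape (N, H, W, C)
    changed = np.any(np.abs(diff) > 0, axis=-1)         # a pixel counts once even if several channels change
    l0 = changed.reshape(len(diff), -1).sum(axis=1)     # L0: number of modified pixels (max H*W)
    flat = diff.reshape(len(diff), -1)
    l2 = np.linalg.norm(flat, axis=1)                   # L2: Euclidean distance
    linf = np.abs(flat).max(axis=1)                     # L-infinity: maximum perturbation per image
    return l0.mean(), l2.mean(), linf.mean()
```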
Table 3 shows the AE distances with respect to the original test set for STL-10. The PGD and FGSM AE sets consistently fool the model during the whole training process, as seen in the previous curve. This happens because, as the table points out, the adversarials generated by these methods are farther away (from the original test samples) in comparison with the examples generated by CW. With L2 distances ranging from 23 to 26, against 0.97 on average for CW, these adversarials are simply too far. Instead, CW examples achieve a good ASR with potentially undetectable adversarials. Considering the size of the images in this dataset (98 × 98 = 9604), more than 90% of the pixels are modified in comparison with the original test images (see the L0 distances in Table 3). The maximum perturbation, measured with the L∞ distance, is 0.30 for FGSM and PGD (which is expected, as it is bounded by a parameter of these adversarial methods), while CW has a smaller value, 0.06. These results also point in the same direction as the previous statements, supporting the notion that the improvement in generalization leads to worse results against AEs that are closer to the originals. In other words, as the models achieve better generalization, robustness to the best AEs (those very close to the originals) decreases.
The dataset and trained models are also evaluated in terms of their adversarial robustness, using the metrics explained in Section 2.3. For the CLEVER metric, Figure 5 shows the evolution of the adversarial robustness score depending on the epoch considered. It also shows how the robustness score increases over the epochs, although only slightly for the CIFAR-10 dataset. In the case of STL-10, the models would be robust to higher perturbations, as the values are closer to 1. However, adversarials at larger distances would still be sufficient to fool these models.
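The CLEVER score is available, for instance, in ART; the sketch below computes the untargeted score for a handful of test images. The sampling parameters and perturbation radius are illustrative values, not necessarily those used in the paper.

```python
import numpy as np
from art.metrics import clever_u

# Untargeted CLEVER score averaged over a few test images (illustrative parameters).
scores = [clever_u(classifier, x_test[i], nb_batches=10, batch_size=100,
                   radius=0.3, norm=np.inf) for i in range(20)]
print("Mean CLEVER score:", np.mean(scores))
```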
Another attack-agnostic metric is evaluated for the trained snapshots. Loss sensitivity measures robustness against changes in the gradients. As shown in Figure 6, the effect increases when the model is more accurate on the test data. However, it reaches a peak around 1.4, which is a relatively low value.
Finally, empirical robustness shows the minimal theoretical perturbation needed to fool the model (with an FGSM attack). As shown in Figure 6, the value remains practically constant throughout the training process, with a slight decreasing tendency.
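Both attack-agnostic metrics are also available in ART; a brief sketch, assuming the same wrapped classifier and one-hot labels, and using only a subset of the test set to keep the cost low:

```python
from art.metrics import loss_sensitivity, empirical_robustness

ls = loss_sensitivity(classifier, x_test[:500], y_test[:500])            # mean gradient norm of the loss
er = empirical_robustness(classifier, x_test[:500], attack_name="fgsm")  # minimal FGSM perturbation
print("Loss sensitivity:", ls, "- Empirical robustness:", er)
```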
3.2. Overfitting
This set of experiments is performed using parameters that force the model to overfit the input data. This is usually avoided when training a model for a specific task, and in fact, several techniques have been developed to prevent it (data augmentation, dropout layers, etc.). Under this regime, the generated model closely fits the training data, exhibiting poor performance for unseen images.
In this case, the network architecture remains the same, but the learning rate is set to 0.0024, which is found to be a suitable value to induce the models to overfit. In general, the overfitting susceptibility of a model depends on several factors (dataset, network architecture, level of regularization applied, learning rate, number of epochs, etc.). In principle, letting training run for a large number of epochs should be a way to reach overfitting; however, the computational cost of this is very high. Therefore, we decided to run our experiments with a fixed number of epochs and vary the learning rate. A high learning rate helped us reach, in a reasonable time, a behaviour similar to what would be obtained with a large number of epochs and a low learning rate. The performance for both datasets is shown in Figure 7.
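The only change with respect to the previous set-up is the learning rate; a sketch assuming the same Adam-style optimizer used for illustration above (only the 0.0024 value comes from the text):

```python
from tensorflow.keras import optimizers

model = build_lenet()
model.compile(optimizer=optimizers.Adam(learning_rate=0.0024),  # higher rate used to induce overfitting
              loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(x_tr, y_tr, epochs=35, validation_data=(x_val, y_val))
```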
In comparison to Figure 3, there is an apparent improvement in the performance for both CIFAR-10 and STL-10, since stabilized training accuracies around 0.9 and 1.0 are obtained. However, this is induced by overfitting. The models exhibit good performance on the training set and poor performance on the validation set. The accuracy on the latter tends to decay from the fifth epoch, which is a clue that the models are in an overfitting state. In contrast with the no-overfitting regime, the performance curves do not follow the same tendency in training and validation.
The models trained in the overfitting regime have also been tested against the CW, FGSM and PGD attack methods. As in the previous experiment, they have been set up to perform untargeted attacks, with the same distance settings, to generate the adversarial examples.
In the same way, the epoch with the best validation accuracy is selected to craft the adversarial examples over the test set; in this case, this is usually between the 5th and the 10th epoch. Then, all the snapshots of the architecture are evaluated on the crafted adversarials.
Figure 8 shows the test accuracy curve in comparison with the ASR of the adversarial attacks.
Again, PGD and FGSM adversarial examples consistently fool the model during the whole process. All the methods have greater ASR values for the epoch that was selected to generate the adversarial examples, as expected. This appears as a greater standard deviation in some specific initial epochs across the different runs. The CW method turns out to be the most informative method for the purpose of this work. For both datasets, the behaviour is similar. In contrast with the normal regime shown in Figure 4, the overfitting regime stabilizes the adversarial impact measured by the ASR around 50–60% and does not show the increasing adversarial performance that the “normal” training did. This is a clear sign supporting the benefit of an overfitting regime, since the effect of adversarial examples was much deeper in the normal regime.
Table 4 shows the metrics extracted from the adversarial examples in the CIFAR-10 dataset. In this case, fewer pixels (as indicated by the L0 metric), and with less perturbation, are needed to produce the adversarials. However, they have a smaller impact on the accuracy of the models.
The behaviour is the same for the STL-10 dataset, whose distance metrics with respect to the original test set are shown in Table 5.
Considering the size of the images in STL-10 (98 × 98 = 9604), more than 90% of the pixels are modified in comparison with the original test images. The maximum perturbation is 0.3 for FGSM and PGD, while CW has a smaller value, 0.06. For this regime, the behaviour of the attack methods is similar to the non-overfitting setting: perturbations are much less detectable for the CW method than for the other methods, no matter how the model is trained.
For the CLEVER metric, Figure 9 shows the values of this robustness score. As observed, in the overfitting regime the models are much stronger. In the case of the CIFAR-10 dataset, the values are similar to the ones shown in Figure 5, but a larger standard deviation points out that the values are greater in more cases. The greatest impact is observed in the STL-10 dataset. In this case, CLEVER values are no longer in the range 0.4–1.0 (as in the normal regime); they increase sharply, from 1.0 to 5.0, in the last epochs. Consequently, the metric supports our proposal that this scenario is beneficial against adversarial examples.
Regarding the other metrics (see Figure 10), loss sensitivity follows the same pattern, with values that escalate from 0.6–1.4 to 2.0–9.0, which confirms that overfitting has indeed been produced.
Finally, empirical robustness shows how the minimal perturbation needed to fool the model remains roughly constant, but with a tendency to increase throughout the epochs (in contrast to the no-overfitting regime).