Abstract

Deep neural network approaches have made remarkable progress in many machine learning tasks. However, the latest research indicates that they are vulnerable to adversarial perturbations. An adversary can easily mislead the network models by adding well-designed perturbations to the input. The cause of the adversarial examples is unclear. Therefore, it is challenging to build a defense mechanism. In this paper, we propose an image-to-image translation model to defend against adversarial examples. The proposed model is based on a conditional generative adversarial network, which consists of a generator and a discriminator. The generator is used to eliminate adversarial perturbations in the input. The discriminator is used to distinguish generated data from original clean data to improve the training process. In other words, our approach can map the adversarial images to the clean images, which are then fed to the target deep learning model. The defense mechanism is independent of the target model, and the structure of the framework is universal. A series of experiments conducted on MNIST and CIFAR10 show that the proposed method can defend against multiple types of attacks while maintaining good performance.

1. Introduction

Deep learning [1–5] is a hierarchical machine learning method involving multilevel nonlinear transformations and is good at mining abstract and distributed feature representations from raw data. Deep learning can solve many problems that are considered challenging in machine learning. Recently, driven by the emergence of big data and hardware acceleration, deep learning has made significant progress in numerous machine learning domains, such as computer vision, natural language processing, edge computing [6–10], and services computing [11–13], and has promoted the large-scale application of artificial intelligence technology in the real world. While deep learning has achieved great success, its performance and applications are also questioned due to the lack of interpretability [14], which means that we cannot reasonably explain the decisions made by deep learning models. This exposes deep learning-based artificial intelligence applications to potential security risks.

Many studies have shown that deep learning is threatened by multiple attacks, such as membership inference attacks [15, 16] and attribute inference attacks [17]. The most serious security threat to deep learning is the adversarial example [18], proposed by Szegedy et al. in 2013. An adversary can add small-magnitude perturbations to inputs and easily fool a well-performing deep learning model with perturbations that are imperceptible to humans [19]. The disturbed inputs are called adversarial examples, and they make the target model report high confidence in incorrect predictions. Moreover, recent research shows that artificial intelligence applications in the real world can be exposed to adversarial samples [20], for example, attacks on face recognition systems [21] and the vision systems of autonomous cars [22].

With the in-depth study of adversarial examples, the field has developed along the following main trends. (1) A growing number of methods for constructing adversarial examples have been proposed. According to adversarial specificity, we can divide these attack methods into targeted attacks and nontargeted attacks. For targeted attacks, the adversary submits well-designed inputs to the target model and causes maliciously chosen target outputs, such as R + LLC [23], JSMA [24], EAD [25], and C&W [26]. For nontargeted attacks, the adversary causes the target model to misclassify well-designed inputs into classes that differ from the ground truth, such as FGSM [27], BIM [20], PGD [28], and DeepFool [29]. Even worse, the robustness of adversarial examples keeps increasing, making detection and defense challenging. (2) The cost of constructing adversarial examples is decreasing. Due to the transferability [30] of adversarial examples, the adversary can successfully launch an attack without background knowledge about the target model. (3) The range of attacks is also expanding. Adversarial examples can successfully attack different deep learning models, such as reinforcement learning models and recurrent neural network models. Moreover, attack scenarios are not limited to computer vision; the same security risks exist in text [31] and speech [32]. Therefore, building an effective defense mechanism against adversarial examples is crucial in deep learning.

There is no uniform conclusion on the cause of adversarial examples; thus, building a defense mechanism is challenging. In general, there are two classes of approaches to defend against adversarial examples: (1) making deep neural networks more robust by adjusting learning strategies, such as adversarial training [27, 33] and defensive distillation [34]; (2) detecting adversarial examples or eliminating adversarial noise after deep neural networks are built, such as LID [35], Defense-GAN [36], MagNet [37], and ComDefend [38]. There are some bottlenecks in these defense mechanisms. First, some defense mechanisms are only effective against specific attacks. For example, defensive distillation is effective for gradient-based attacks but is defeated by the C&W attack. Second, some methods require large numbers of samples and high computational costs, which limits their application scenarios. Third, the difference between an adversarial example and a clean example is small; thus, it is difficult for current detection methods to distinguish them with high confidence. In summary, we seek a defense mechanism that performs well against most attacks at a low computational cost.

Our work makes progress toward building a better defense mechanism against adversarial examples in computer vision. Adversarial examples mislead the target model because the added noise changes the characteristics of the original inputs; thus, an intuitive approach is to remove the noise from the adversarial examples and map them back to clean examples. In computer vision, this can be posed as “translating” an input image (adversarial example) into a corresponding output image (clean example). In this paper, we use the framework proposed by Isola et al. [39] as a defense mechanism. Based on conditional generative adversarial networks (conditional GANs) [40], the framework consists of a generator network that translates adversarial images into clean images and a discriminator network that ensures the generated images are realistic. Our method can effectively eliminate adversarial perturbations and restore the characteristics of the original clean images. An overview of the defense model is shown in Figure 1. The advantages of our method are as follows:

(1) The proposed method is a general-purpose defense framework. On the one hand, the defense mechanism processes the input and is model independent, which means that the target model does not need to be retrained. On the other hand, the network structure of the defense framework is based on a general-purpose image-to-image solution, and we can apply the framework to different scenarios with only a few adjustments.

(2) Our method is simple and easy to use, and it is effective against most commonly considered attack strategies, such as FGSM, DeepFool, JSMA, and C&W. Moreover, the defense mechanism shows a certain transferability, which means that a defense built for a specific target model remains effective for other models.

The remainder of the paper is organized as follows. Section 2 introduces related work on adversarial examples. Section 3 reviews the necessary theories and concepts about adversarial examples and conditional GANs. Section 4 describes the framework for generating and defending against adversarial examples in detail. Section 5 presents the experimental results, and Section 6 concludes the paper.

2. Related Work

In this section, we introduce the application of GANs in the field of adversarial examples: generating adversarial examples with GANs and defending against adversarial examples with GANs.

2.1. Generating Adversarial Examples with GANs

Xiao et al. [41] proposed AdvGAN to generate adversarial examples. AdvGAN takes a clean image $x$ as the input of the generator $G$ and obtains the adversarial image as $x + G(x)$. The adversarial examples generated by AdvGAN achieve high attack success rates in both semiwhite-box and black-box attacks. Song et al. [42] designed an unrestricted approach to generate adversarial examples with an auxiliary classifier generative adversarial network (AC-GAN) [43]. Different from perturbation-based attacks, this approach constructs adversarial examples entirely from scratch instead of perturbing an existing data point. In addition, the adversary can specify the style of the generated adversarial examples and the labels into which they are misclassified by the target model. Zhao et al. [44] noticed that adversarial perturbations are often unnatural and not semantically meaningful. They proposed a framework consisting of a WGAN [45] and an inverter. The inverter maps a clean image $x$ to a dense random vector $z$; the generator of the WGAN then takes a perturbed vector in the neighborhood of $z$ as its input, and its goal is to synthesize an image that is as close to the original image as possible. This method can generate natural and legible adversarial examples that lie on the data manifold. Hu and Tan [46] focused on adversarial examples in traditional security scenarios. They proposed MalGAN to generate adversarial malware examples, which are able to bypass black-box machine learning-based detection models.

2.2. Defending against Adversarial Examples with GANs

Lee et al. [47] introduced a novel adversarial training framework named generative adversarial trainer (GAT). The framework consists of a generator and a classifier. The generator attempts to generate adversarial perturbations that can easily fool the classifier, and the classifier attempts to correctly classify both original and generated adversarial images. This approach improves the robustness of the model and outperforms other adversarial training methods that use the fast gradient method. Santhanam and Grnarova [48] proposed Cowboy, an approach to defend against adversarial attacks with GANs. Their work shows that adversarial samples lie outside of the data manifold learned by a GAN trained on the same dataset. They used the discriminator of the GAN to detect adversarial examples and the generator to eliminate adversarial perturbations. Samangouei et al. [36] proposed a framework named Defense-GAN, which leverages the expressive capability of generative models (WGAN) to defend against adversarial examples. Defense-GAN searches the latent space of the WGAN for an input whose generated image is close to the adversarial example; the generated image is then fed to the target model.

3. Background

In this section, we introduce four methods of generating adversarial examples. In addition, we discuss GANs and their connection to our method.

3.1. Generating Adversarial Examples

The main idea of generating adversarial samples is to add appropriate perturbations to the input so that the noisy sample remains as similar to the original input as possible but misleads the target model. We can briefly describe this process as follows: for a given input image $x$, the adversary needs to find a minimal perturbation $\eta$ and craft the noisy example as $x' = x + \eta$. In recent years, many methods of generating adversarial examples have been proposed. Here, we introduce some of the most well-known attacks.

3.1.1. Fast Gradient Sign Method (FGSM) [27]

Szegedy et al. first introduced adversarial examples against deep neural networks and proposed the L-BFGS method [18] to generate them; however, it was time-consuming and impractical. In 2014, Goodfellow et al. argued that the primary cause of neural networks’ vulnerability to adversarial perturbations is their linear nature. Based on this explanation, they proposed a simple and fast method to generate adversarial samples, named the fast gradient sign method (FGSM). Let $\theta$ be the parameters of the target model, $x$ the input to the model, $y$ the label associated with $x$, and $J(\theta, x, y)$ the cost function used to train the model. The adversarial sample is generated as
$$x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x J(\theta, x, y)\big),$$
where $\epsilon$ is a parameter that determines the perturbation size.
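To make the update concrete, the following is a minimal sketch of FGSM in PyTorch, assuming a classifier trained with cross-entropy and pixel values in [0, 1]; the function name `fgsm` and its arguments are illustrative, not part of the original attack code.

```python
# Minimal FGSM sketch (assumes a PyTorch classifier and cross-entropy loss).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Craft x' = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # J(theta, x, y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # step along the gradient sign
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range
```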

3.1.2. DeepFool [29]

FGSM is simple and effective; however, it adds a relatively large perturbation to the inputs. Moosavi-Dezfooli et al. observed that adding noise along the direction perpendicular to the closest decision boundary yields a near-optimal perturbation. They used an iterative method that linearizes the classifier $f$ around the current point at each iteration. For an affine classifier $f(x) = w^{T}x + b$, the minimal perturbation is computed as
$$\eta^{*}(x) = -\frac{f(x)}{\|w\|_2^{2}}\, w,$$
where $\frac{|f(x)|}{\|w\|_2}$ is the distance to the decision boundary.
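The following is a simplified sketch of the DeepFool-style iteration for a multi-class PyTorch classifier, assuming a single-image batch; it illustrates the linearized boundary search rather than reproducing the authors' exact implementation, and all names are illustrative.

```python
# Simplified DeepFool sketch: at each step, move toward the closest linearized boundary.
import torch

def deepfool(model, x, num_classes=10, max_iter=50, overshoot=0.02):
    x_adv = x.clone().detach()
    orig_label = model(x_adv).argmax(dim=1).item()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        logits = model(x_adv)[0]
        if logits.argmax().item() != orig_label:
            break                                          # label already flipped
        grad_orig = torch.autograd.grad(logits[orig_label], x_adv, retain_graph=True)[0]
        min_ratio, best_dir = None, None
        for k in range(num_classes):
            if k == orig_label:
                continue
            grad_k = torch.autograd.grad(logits[k], x_adv, retain_graph=True)[0]
            w_k = grad_k - grad_orig                       # linearized boundary normal
            f_k = (logits[k] - logits[orig_label]).item()  # linearized boundary offset
            ratio = abs(f_k) / (w_k.norm() + 1e-8)         # distance to boundary k
            if min_ratio is None or ratio < min_ratio:
                min_ratio, best_dir = ratio, w_k
        step = min_ratio * best_dir / (best_dir.norm() + 1e-8)
        x_adv = (x_adv + (1 + overshoot) * step).detach()  # small overshoot to cross
    return x_adv
```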

3.1.3. Jacobian-Based Saliency Map Attack (JSMA) [24]

The previous two attack methods are both nontargeted attacks. Papernot et al. observed that different input features have different degrees of influence on the output of the target model. If some features correspond to a specific output of the target model, the adversary can make the target model produce that output by enhancing these features in the inputs. Based on this idea, they proposed a simple iterative method for targeted attacks named the Jacobian-based saliency map attack (JSMA). First, JSMA computes the forward derivative (the Jacobian of the model output with respect to the input), which shows the influence of each input feature on the output. Then, it generates an adversarial saliency map and uses it to find the input features that have the greatest impact on the specific output of the target model. Finally, a small perturbation added to these features can fool the neural network.
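Below is a simplified sketch of the forward-derivative and saliency computation at the heart of JSMA, assuming a single-image batch; the full attack iteratively perturbs pixel pairs until the target label is reached, which is omitted here, and the names used are illustrative.

```python
# Sketch of the Jacobian-based saliency computation used by JSMA.
import torch

def saliency_map(model, x, target):
    """Per-pixel score: high where increasing the pixel raises the target-class
    output while lowering the outputs of the other classes."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)[0]                                   # single-image batch assumed
    jac = torch.stack([torch.autograd.grad(logits[c], x, retain_graph=True)[0][0]
                       for c in range(logits.numel())])    # forward derivative dF/dx
    d_target = jac[target]
    d_others = jac.sum(dim=0) - d_target
    mask = (d_target > 0) & (d_others < 0)                 # helps target, hurts others
    return (d_target * d_others.abs()) * mask
```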

3.1.4. Carlini and Wagner (C&W) [26]

Carlini et al. proposed a method of generating more robust adversarial examples that can bypass many advanced defense mechanisms. This method treats the adversarial example as a variable, and two conditions need to be met for the attack to succeed: first, the difference between the adversarial example and the corresponding clean sample should be as small as possible; second, the adversarial example should make the model misclassify with as high a rate as possible. There are three attacks, for the $L_0$, $L_2$, and $L_\infty$ distance metrics, and we provide a brief description of the $L_2$ attack:
$$\min_{\delta}\ \|\delta\|_2^{2} + c \cdot f(x + \delta).$$

The loss function $f$ is defined as
$$f(x') = \max\Big(\max_{i \neq t} Z(x')_i - Z(x')_t,\ -\kappa\Big),$$
where $Z(\cdot)$ denotes the model output fed to the SoftMax function (the logits), $\kappa$ is a constant used to control the confidence (as $\kappa$ increases, the adversarial examples become more powerful), $t$ is the target label of the misclassification, and the constant $c$ can be chosen by binary search.
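A sketch of this objective in PyTorch follows, assuming logits of shape (N, num_classes) and a single target label; the function and argument names are illustrative and the full attack would minimize this quantity over the perturbation with an optimizer.

```python
# Sketch of the C&W-style L2 objective for target label t.
import torch

def cw_loss(logits, x_adv, x, t, c=1.0, kappa=0.0):
    """||x_adv - x||_2^2 + c * max(max_{i != t} Z_i - Z_t, -kappa)."""
    z_t = logits[:, t]
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[:, t] = False                                       # exclude the target class
    z_other = logits.masked_fill(~mask, float('-inf')).max(dim=1).values
    margin = torch.clamp(z_other - z_t, min=-kappa)          # confidence margin term
    dist = (x_adv - x).flatten(1).pow(2).sum(dim=1)          # squared L2 distance
    return dist + c * margin
```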

3.2. Generative Adversarial Networks

Generative adversarial networks (GANs) [49] are a successful framework for generative models and are widely used in many fields [50–52]. A GAN forces two networks to compete with each other: a generator $G$, which attempts to map a sample $z$ from a noise distribution $p_z(z)$ to the data distribution $p_{\mathrm{data}}(x)$, and a discriminative model $D$, which estimates the probability that a sample came from the training data rather than from $G$. The goal of the generator is to maximize the probability of $D$ making a mistake. Thus, the framework plays a two-player minimax game with the following value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

In the competition, both the generator and discriminator will be improved until the discriminator cannot distinguish a generated sample from a data sample.
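Before turning to the conditional variant, the following is a minimal sketch of one training step under this value function; the networks `G` and `D`, the optimizers, and a discriminator that outputs probabilities of shape (N, 1) are assumed, and the generator uses the common non-saturating form of its loss.

```python
# Minimal GAN training-step sketch (illustrative names; D outputs probabilities).
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, z, opt_g, opt_d):
    ones = torch.ones(x_real.size(0), 1)
    zeros = torch.zeros(x_real.size(0), 1)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_loss = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(G(z).detach()), zeros)
    d_loss.backward()
    opt_d.step()

    # Generator: fool D (non-saturating form of minimizing log(1 - D(G(z)))).
    opt_g.zero_grad()
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```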

Mirza and Osindero [40] introduced the conditional version of generative adversarial networks (conditional GAN). A conditional GAN can be expressed as a mapping from an observed input $x$ and random noise $z$ to the output $y$. Its value function is as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))].$$
With the conditional GAN, it is possible to direct the data generation process and obtain a specified result.

4. Proposed Method

In this section, we introduce the defense mechanism against adversarial examples in detail.

4.1. Motivation

In computer vision, we can view the attack on and defense against adversarial examples as an image-to-image translation process. For the adversary, the goal is to perturb clean images to generate adversarial images. For the defender, the usual idea is to transform the input adversarial images and eliminate the perturbation to restore them to clean images. Following this idea, we can apply image translation methods to the field of adversarial examples. In 2017, Isola et al. [39] proposed a generic approach named pix2pix, based on the conditional GAN, to solve image-to-image translation problems. They demonstrated that pix2pix is effective at reconstructing objects from edge maps and colorizing images, among other tasks. In this paper, we use the same network framework as pix2pix as a defense mechanism: the framework generates a mapping from adversarial images to clean images.

4.2. Framework

The framework of pix2pix is based on the conditional GAN. This means that the structure of this framework mainly consists of two parts: a generator and a discriminator. As shown in Figure 2, we introduce the structure of our framework from two aspects.

4.2.1. Generator

We use the U-Net [53] structure as the generator, which adds skip connections to an encoder-decoder network. Although there are some minor differences in surface appearance between the inputs (adversarial images) and outputs (clean images), the underlying structures of both are the same. Therefore, in the image-to-image task (adversarial images to clean images), both should share the same underlying information. A traditional encoder-decoder generator lacks the transmission of this low-level information, which causes some distortion in the outputs. Therefore, we add skip connections to the encoder-decoder network to share underlying information between the inputs and outputs, which ensures that the converted images are closer to the expected result. Each skip connection simply concatenates all channels at layer $i$ with those at layer $n - i$, where $n$ is the total number of layers.
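The following compact sketch shows the skip-connection idea for a two-level U-Net-style generator in PyTorch; the channel sizes, depth, and layer choices are illustrative and not the exact pix2pix configuration.

```python
# Compact U-Net-style generator sketch: decoder features are concatenated with
# the matching encoder features (layer i with layer n - i).
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1),
                                  nn.BatchNorm2d(base), nn.ReLU())
        # Decoder input channels are doubled: the skip connection concatenates
        # the encoder channels with the decoder channels.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                                 # layer i
        e2 = self.enc2(e1)
        d2 = self.dec2(e2)                                # layer n - i
        return self.dec1(torch.cat([d2, e1], dim=1))      # skip connection
```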

4.2.2. Discriminator

We use the structure of PatchGAN [39] as the discriminator. A traditional GAN discriminator judges the output image as a whole, which makes it difficult to model high-frequency structure. The PatchGAN instead maps each input image to a grid of patches via a convolutional network and attempts to determine whether each patch is real or fake. It then averages all responses to produce the final output of the discriminator. In this way, the local features of the generated images can be well constructed.
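A minimal PatchGAN-style discriminator sketch follows; the depth and channel sizes are illustrative, and the per-patch scores are raw values that would be combined with a sigmoid-based loss in practice.

```python
# Minimal PatchGAN-style discriminator sketch: the convolutional stack outputs a
# grid of per-patch real/fake scores, which are then averaged.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1),
            nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 4, 1, 1),   # one score per receptive patch
        )

    def forward(self, x):
        patch_scores = self.net(x)             # shape (N, 1, H', W')
        return patch_scores.mean(dim=[2, 3])   # average the patch responses
```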

4.3. Defending against Adversarial Examples

Figure 2 illustrates the overall architecture of the defense mechanism. We use paired data for training, and each pair contains a clean image $x$ and its adversarial image $x_{adv}$. The generator $G$ takes the adversarial example $x_{adv}$ as its input and generates the image $G(x_{adv})$. Then, $G(x_{adv})$ and $x$ are sent to the discriminator $D$, which is used to distinguish the generated data from the original instances. The adversarial loss can be written as follows:
$$\mathcal{L}_{GAN} = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{x_{adv}}\big[\log\big(1 - D(G(x_{adv}))\big)\big].$$

The goal of $G$ is not only to fool the discriminator but also to stay near the ground-truth output. Therefore, we add the loss $\mathcal{L}_{L1}$, which encourages the generated instance $G(x_{adv})$ to be close to the clean image $x$:
$$\mathcal{L}_{L1} = \mathbb{E}_{x, x_{adv}}\big[\|x - G(x_{adv})\|_1\big].$$

The current objective function is
$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda \mathcal{L}_{L1},$$
where $\lambda$ controls the relative importance of $\mathcal{L}_{L1}$.

As shown in Figures 3 and 4, our defense mechanism can eliminate adversarial perturbations in the images. However, for some complex datasets (such as CIFAR10), although the generated images are close to the original clean images, their performance on the target model is not satisfactory. To solve this problem, we adjust the objective function. Our core goal is to eliminate the adversarial perturbations in $x_{adv}$ and make the prediction results of the generated images close to the prediction results of $x$ on the target model $f$. Therefore, we add the loss function $\mathcal{L}_{f}$ as follows:
$$\mathcal{L}_{f} = \mathbb{E}_{x, x_{adv}}\big[\ell\big(f(G(x_{adv})),\, f(x)\big)\big],$$
where $\ell$ measures the discrepancy between the target model's predictions on the generated image and on the clean image.

The final objective function is
$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda \mathcal{L}_{L1} + \mu \mathcal{L}_{f},$$
where $\mu$ controls the relative importance of $\mathcal{L}_{f}$.

In general, the loss functions $\mathcal{L}_{GAN}$ and $\mathcal{L}_{L1}$ encourage the generated data to appear similar to the clean data, while the loss function $\mathcal{L}_{f}$ improves the prediction accuracy of the generated images on the target model.
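To make the training procedure concrete, the following is a minimal sketch of one training step under the combined objective. The function name `defense_step` and the weights `lam` and `mu` are illustrative, and realizing $\mathcal{L}_{f}$ as a cross-entropy against the target model's clean-image predictions is our assumption; the paper only requires the two predictions to be close.

```python
# Sketch of one training step with L = L_GAN + lambda * L_L1 + mu * L_f.
import torch
import torch.nn.functional as F

def defense_step(G, D, target_model, x_clean, x_adv, opt_g, opt_d, lam=100.0, mu=1.0):
    ones = torch.ones(x_clean.size(0), 1)
    zeros = torch.zeros(x_clean.size(0), 1)

    # Discriminator: real clean images vs. generated images.
    opt_d.zero_grad()
    d_loss = F.binary_cross_entropy(D(x_clean), ones) + \
             F.binary_cross_entropy(D(G(x_adv).detach()), zeros)
    d_loss.backward()
    opt_d.step()

    # Generator: fool D, stay close to x_clean, and match the target model's prediction.
    opt_g.zero_grad()
    gen = G(x_adv)
    l_gan = F.binary_cross_entropy(D(gen), ones)
    l_l1 = F.l1_loss(gen, x_clean)
    with torch.no_grad():
        y_clean = target_model(x_clean).argmax(dim=1)     # prediction on clean images
    l_f = F.cross_entropy(target_model(gen), y_clean)     # keep predictions consistent
    g_loss = l_gan + lam * l_l1 + mu * l_f
    g_loss.backward()
    opt_g.step()                                          # only G is updated
    return g_loss.item()
```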

5. Experiment

In this section, we evaluate the defense mechanism against adversarial examples. All experiments are based on two datasets: MNIST and CIFAR10.

MNIST (the MNIST used to support the findings of the study is public, and one can find it in http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits and consists of 60000 training examples and 10000 testing examples. Each sample consists of 28 × 28 pixels, where each pixel is a grayscale value. For MNIST, we trained two classifiers Anet and Bnet and used these classifiers as target models to generate adversarial examples and test our approach. The network structure is shown in Table 1. The prediction accuracies of Anet and Bnet on the test set are 98.96% and 99.74%, respectively.

The CIFAR10 (the CIFAR10 used to support the findings of the study is public, and one can find it in https://www.cs.toronto.edu/∼kriz/cifar.html) dataset consists of 60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. For CIFAR10, we trained two classifiers, Resnet (Rnet) [54] and DenseNet (Dnet) [55], and used these classifiers as target models to generate adversarial examples and test our approach. The prediction accuracies of Rnet and Dnet on the test set are 93.63% and 95.04%, respectively.

5.1. Implementation Details

We used the adversarial examples generated from the training data, paired with the corresponding clean training images, as the training set for our framework. All attacks (FGSM, DeepFool, JSMA, and C&W) were implemented with advbox [56], a toolbox used to benchmark the vulnerability of deep learning systems to adversarial examples, and we used the interface provided by advbox to generate the adversarial examples. For FGSM, we used a dataset-specific perturbation size on MNIST and CIFAR10, and we used the $L_2$ attack for C&W. For the targeted attacks JSMA and C&W, we set a random target label for each sample. The network structure of our framework (including the generator and discriminator) is the same as that of pix2pix [39].
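As an illustration of how the paired training set can be assembled, the following sketch assumes a generic `attack_fn` callable standing in for whichever attack interface is used; it does not reproduce the actual advbox API.

```python
# Sketch of building the paired (adversarial, clean) training set.
import torch

def build_pairs(loader, target_model, attack_fn):
    pairs = []
    for x, y in loader:
        x_adv = attack_fn(target_model, x, y)   # e.g., FGSM, DeepFool, JSMA, or C&W
        pairs.append((x_adv.detach(), x))       # train G to map x_adv back to x
    return pairs
```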

5.2. Defending against Adversarial Examples

To verify the effectiveness of the defense mechanism, we tested it on the MNIST and CIFAR10 datasets. For each dataset, we trained two defense frameworks for different target models. We generated adversarial examples from the test data and selected those that successfully attacked the target model as members of the test set; therefore, the prediction accuracy of the target model on this test set is 0%. In our defense mechanism, we sent the adversarial examples to the previously trained generator and then fed the generated data to the target model. Figures 5 and 6 show the prediction accuracy of the target model on the adversarial examples under the defense mechanism, where epoch denotes the number of training iterations. The results indicate that our defense framework converges quickly during training. The final results for the MNIST dataset are shown in Table 2. Our defense mechanism is effective against different types of attacks: it improves the prediction accuracy of the target models (Anet and Bnet) on the adversarial samples from 0% to almost 98%. The final results for the CIFAR10 dataset are shown in Table 3. Since CIFAR10 is much more complicated than MNIST, the denoising process incurs some losses, and the defensive performance on CIFAR10 is therefore lower than on MNIST. C&W attacks are more robust than the other attacks, which means that defending against them is more challenging; nevertheless, our defense mechanism still achieves good performance on C&W attacks.
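For reference, the following is a minimal sketch of how the defended accuracy is measured; the names are illustrative, and `adv_loader` is assumed to yield only successfully attacked samples together with their true labels.

```python
# Sketch of the defended-accuracy measurement: feed G(x_adv) to the target model.
import torch

@torch.no_grad()
def defended_accuracy(G, target_model, adv_loader):
    correct, total = 0, 0
    for x_adv, y_true in adv_loader:
        preds = target_model(G(x_adv)).argmax(dim=1)   # classify the denoised images
        correct += (preds == y_true).sum().item()
        total += y_true.size(0)
    return correct / total
```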

In addition, we compare the adversarial perturbation and the defense loss on both the MNIST and CIFAR10 datasets. The adversarial perturbation is the average norm of the difference between the adversarial images and the clean images, and the defense loss is the average norm of the difference between the generated images and the clean images. Since our defense framework combines U-Net and PatchGAN, the generator is able to restore the details of the original clean data. As shown in Figures 7 and 8, our defense mechanism keeps the defense loss within a certain range, which ensures that the generated images are of high quality and similar to the clean images.

5.3. Defense Transferability

In this experiment, we tested the transferability of our defense mechanism. We used adversarial examples generated against other target models to test a framework trained for a specific target model. Figures 9 and 10 show how the prediction accuracy evolves during training in this transfer setting. Similar to the previous results, our defense mechanism still converges quickly.

The final results for the MNIST and CIFAR10 datasets are shown in Tables 4 and 5, respectively (Anet/Bnet means that we use the adversarial examples generated against the target model Anet to test the framework trained for the target model Bnet). The purpose of our defense mechanism is to restore the original characteristics of the adversarial examples and eliminate their adversarial perturbations; it therefore focuses on the adversarial examples, not on the target model. The experimental results show that our method is universal: it can transfer the capabilities learned for a specific target model to other models.

5.4. Comparison with Other Defense Methods

Following the experimental setup in Defense-GAN [36], we compared the proposed method with other defense mechanisms, namely, Defense-GAN, MagNet [37], and adversarial training [27]. Adversarial training uses adversarial examples as part of the training set to build a more robust model. MagNet consists of a detector and a reformer: the detector is used to detect adversarial examples, and the reformer is used to transform adversarial examples into clean examples. Since Defense-GAN is not claimed to be secure on CIFAR10, we only use MNIST and experiment with FGSM and the $L_2$ C&W attack. There are four target models, A, B, C, and D, whose structures are the same as the settings in Defense-GAN. The experimental results are shown in Table 6.

The proposed method outperforms MagNet and adversarial training. Although our method is slightly inferior to Defense-GAN in some tests, it has certain advantages. (1) Our method is simpler than Defense-GAN: Defense-GAN requires two steps before feeding the input to the classifier, minimizing the reconstruction error and then generating, whereas our method only requires the generation step. (2) Our defense mechanism is a general-purpose defense framework, which means that we can adapt it to different datasets or scenarios with a few adjustments.

6. Conclusions

In this paper, we propose a novel defense strategy utilizing conditional GANs to enhance the robustness of classification models against adversarial examples. Our method is a universal defense framework. We tested it on different datasets and target models, and the experimental results show that it is effective against most commonly considered attack strategies. In addition, compared with state-of-the-art defense methods, the proposed method has several advantages.

It is worth mentioning that although our method is a feasible and simple defense mechanism, there are still some practical difficulties in implementing and deploying it. For example, its performance degrades on complex datasets. In the future, we will focus on adjusting the network structure of the defense framework to improve its performance in complex scenarios.

Data Availability

The MNIST dataset used to support the findings of the study is public and available at http://yann.lecun.com/exdb/mnist/. The CIFAR10 dataset used to support the findings of the study is public and available at https://www.cs.toronto.edu/∼kriz/cifar.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61572034) and Major Science and Technology Projects in Anhui Province (18030901025).