Unsupervised Detection of Adversarial Examples with Model Explanations
arXiv - CS - Cryptography and Security. Pub Date: 2021-07-22, DOI: arXiv-2107.10480
Gihyuk Ko, Gyumin Lim

Deep Neural Networks (DNNs) have shown remarkable performance in a diverse range of machine learning applications. However, it is widely known that DNNs are vulnerable to simple adversarial perturbations, which cause the model to misclassify inputs. In this paper, we propose a simple yet effective method for detecting adversarial examples, using methods developed to explain the model's behavior. Our key observation is that adding small, human-imperceptible perturbations can lead to drastic changes in the model explanations, resulting in unusual or irregular forms of explanations. From this insight, we propose unsupervised detection of adversarial examples using reconstructor networks trained only on model explanations of benign examples. Our evaluations on the MNIST handwritten digit dataset show that our method can detect adversarial examples generated by state-of-the-art algorithms with high confidence. To the best of our knowledge, this work is the first to suggest an unsupervised defense method using model explanations.
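The abstract outlines a pipeline: compute an explanation for each input, train a reconstructor network only on explanations of benign examples, and flag inputs whose explanations reconstruct poorly. The following Python sketch illustrates that idea under several assumptions not stated in the abstract: the explanation method (a simple input-gradient saliency map), the reconstructor architecture (a small autoencoder for MNIST-sized inputs), and the thresholding rule are all illustrative choices, and the names `saliency_map`, `Reconstructor`, and `is_adversarial` are hypothetical, not the authors' implementation.

```python
# Hedged sketch of the detection idea: explanation -> reconstructor -> error threshold.
# Assumes a trained MNIST classifier and a loader of benign examples; all specifics
# (explanation method, autoencoder size, threshold) are illustrative assumptions.
import torch
import torch.nn as nn


def saliency_map(classifier: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Input-gradient explanation of the predicted class (one possible choice)."""
    x = x.clone().requires_grad_(True)
    logits = classifier(x)
    score = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
    score.backward()
    classifier.zero_grad(set_to_none=True)  # discard gradients accumulated on the classifier
    return x.grad.detach().abs()


class Reconstructor(nn.Module):
    """Small autoencoder trained only on explanations of benign examples."""
    def __init__(self, dim: int = 28 * 28, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(e)).view_as(e)


def train_reconstructor(recon, classifier, benign_loader, epochs=5, lr=1e-3):
    """Fit the reconstructor to explanations of benign inputs only (unsupervised)."""
    opt = torch.optim.Adam(recon.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in benign_loader:
            expl = saliency_map(classifier, x)          # detached explanation target
            loss = nn.functional.mse_loss(recon(expl), expl)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return recon


def is_adversarial(recon, classifier, x, threshold):
    """Flag inputs whose explanation has a large per-example reconstruction error."""
    expl = saliency_map(classifier, x)
    err = nn.functional.mse_loss(recon(expl), expl, reduction="none")
    err = err.flatten(1).mean(dim=1)   # one error score per example
    return err > threshold             # threshold calibrated on held-out benign data
```

In this reading, the detector never sees adversarial examples during training; the threshold would be chosen from the distribution of reconstruction errors on benign explanations, so that unusually shaped explanations of perturbed inputs stand out.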

Updated: 2021-07-23