A simple defense against adversarial attacks on heatmap explanations
arXiv - CS - Artificial Intelligence. Pub Date: 2020-07-13. DOI: arxiv-2007.06381
Laura Rieger, Lars Kai Hansen

With machine learning models being used for increasingly sensitive applications, we rely on interpretability methods to verify that no discriminating attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model such that the features actually used are hidden and more innocuous features are shown to be important instead. In our work we present an effective defense against such adversarial attacks on neural networks: by a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
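To make the aggregation idea concrete, below is a minimal sketch in PyTorch. The specific explanation methods shown (vanilla gradients, SmoothGrad, and gradient-times-input) and the unit-mass normalization are assumptions for illustration; the paper's exact choice of methods and aggregation scheme may differ. The toy model, input, and function names are hypothetical.

import torch
import torch.nn as nn

def vanilla_gradient(model, x, target):
    # |d score / d input|: vanilla gradient saliency heatmap
    x = x.detach().clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.detach().abs()

def smoothgrad(model, x, target, n=25, sigma=0.1):
    # SmoothGrad: average vanilla gradients over noisy copies of the input
    maps = [vanilla_gradient(model, x + sigma * torch.randn_like(x), target)
            for _ in range(n)]
    return torch.stack(maps).mean(dim=0)

def gradient_x_input(model, x, target):
    # gradient-times-input attribution
    x = x.detach().clone().requires_grad_(True)
    model(x)[0, target].backward()
    return (x.grad * x).detach().abs()

def aggregated_heatmap(model, x, target):
    # Normalize each heatmap to unit mass, then average. An attack tuned to
    # manipulate one explanation method rarely fools the other, differently
    # computed methods at the same time, so the aggregate stays informative.
    maps = [vanilla_gradient(model, x, target),
            smoothgrad(model, x, target),
            gradient_x_input(model, x, target)]
    maps = [m / (m.sum() + 1e-12) for m in maps]
    return torch.stack(maps).mean(dim=0)

# Usage with a toy classifier (any differentiable model works)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.randn(1, 1, 28, 28)
heatmap = aggregated_heatmap(model, x, target=3)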

Updated: 2020-07-14