TextTricker: Loss-based and gradient-based adversarial attacks on text classification models
Engineering Applications of Artificial Intelligence (IF 8) Pub Date: 2020-04-20, DOI: 10.1016/j.engappai.2020.103641
Jincheng Xu, Qingfeng Du

Adversarial examples are generated by adding infinitesimal perturbations to legitimate inputs so that deep learning models are induced to make incorrect predictions. They have received increasing attention recently owing to their significant value in evaluating and improving the robustness of neural networks. While adversarial attack algorithms have achieved notable advances on continuous data such as images, they cannot be directly applied to discrete symbols such as text, where the semantic and syntactic constraints of language are expected to be satisfied. In this paper, we propose a white-box adversarial attack algorithm, TextTricker, which supports both targeted and non-targeted attacks on text classification models. Our algorithm can be implemented either in a loss-based way, where word perturbations are performed according to the change in loss, or in a gradient-based way, where the expected gradients are computed in the continuous embedding space to restrict the perturbations towards a certain direction. We perform extensive experiments on two publicly available datasets and three state-of-the-art text classification models to evaluate our algorithm. The empirical results demonstrate that TextTricker notably outperforms the baselines in attack success rate. Moreover, we discuss various aspects of TextTricker in detail to provide a deeper investigation and offer suggestions for its practical use.
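To make the two modes concrete, the sketch below contrasts them for a generic PyTorch text classifier. It is only an illustration under assumed interfaces, not the implementation described in the paper: the embedding hook `forward_from_embeddings` and the helper names are hypothetical.

```python
# Minimal sketch of the two attack modes described above, for a generic
# PyTorch text classifier. Helper names (forward_from_embeddings,
# loss_based_score, ...) are hypothetical, not the authors' code.
import torch
import torch.nn.functional as F

def loss_based_score(model, input_ids, labels, pos, cand_id):
    # Loss-based mode: actually substitute the word at position `pos`
    # with candidate `cand_id` and re-evaluate the classification loss.
    perturbed = input_ids.clone()
    perturbed[0, pos] = cand_id
    with torch.no_grad():
        logits = model(perturbed)                 # (1, num_classes)
    return F.cross_entropy(logits, labels).item()

def gradient_based_scores(model, embedding, input_ids, labels, pos):
    # Gradient-based mode: take the loss gradient w.r.t. the word
    # embeddings once, then rank every vocabulary word at `pos` by the
    # first-order estimate grad . (e_candidate - e_original).
    embeds = embedding(input_ids)                 # (1, seq_len, dim)
    embeds.retain_grad()
    logits = model.forward_from_embeddings(embeds)  # assumed hook
    F.cross_entropy(logits, labels).backward()
    grad = embeds.grad[0, pos]                    # (dim,)
    delta = embedding.weight.detach() - embeds[0, pos].detach()  # (vocab, dim)
    return delta @ grad                # estimated loss change per candidate
```

Under these assumptions, a non-targeted attack would take the candidate with the highest score (largest expected increase in loss against the true label), while a targeted attack would minimize the loss with respect to the target label; language-level constraints such as synonym candidate sets would filter which substitutions are admissible.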


