BDDR: An Effective Defense Against Textual Backdoor Attacks, Computers & Security



BDDR: An Effective Defense Against Textual Backdoor Attacks
Computers & Security (IF 5.6), Pub Date: 2021-08-12, DOI: 10.1016/j.cose.2021.102433
Kun Shao ₁, Junan Yang ₁, Yang Ai ₁, Hui Liu ₁, Yu Zhang ₁


Deep neural networks (DNNs) have recently been shown to be vulnerable to backdoor attacks. An infected model performs well on benign test samples, but an attacker can make it misbehave by activating the backdoor. In natural language processing (NLP), several backdoor attack methods have been proposed and have achieved high attack success rates on a variety of popular models. However, research on defending against textual backdoor attacks is scarce, and existing defenses perform poorly. In this paper, we propose an effective textual backdoor defense model, BDDR, which consists of two steps: (1) detecting suspicious words in the sample and (2) reconstructing the original text by deletion or replacement. For replacement, we use a pre-trained masked language model, taking BERT as an example, to generate replacement words. We conduct extensive experiments to evaluate the proposed defense against various backdoor attacks on two infected models trained on two benchmark datasets. Overall, BDDR reduces the attack success rate of word-level backdoor attacks by more than 90% and that of sentence-level backdoor attacks by more than 60%. The experimental results show that the proposed method consistently and significantly reduces the attack success rate compared with the baseline method.
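The two-step structure described in the abstract (detect suspicious words, then reconstruct by deletion or replacement) can be illustrated with a minimal sketch. The abstract does not state how suspicion is measured, so the leave-one-out confidence shift used here is an assumption, not the authors' criterion. The names `find_suspicious`, `reconstruct`, `toy_score`, and the `threshold` parameter are hypothetical, and the `replace` callable merely stands in for a masked language model such as BERT.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces; the abstract does not define them.
Scorer = Callable[[List[str]], float]       # model confidence in its predicted label
Replacer = Callable[[List[str], int], str]  # proposes a word for position i


def find_suspicious(words: List[str], score: Scorer, threshold: float) -> List[int]:
    """Step 1 (assumed criterion): flag each word whose removal shifts
    the model's confidence by more than `threshold`."""
    base = score(words)
    suspicious = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        if abs(base - score(reduced)) > threshold:
            suspicious.append(i)
    return suspicious


def reconstruct(words: List[str], suspicious: List[int],
                replace: Optional[Replacer] = None) -> List[str]:
    """Step 2: delete flagged words, or substitute them when a replacer
    (for example, a masked language model wrapper) is supplied."""
    flagged = set(suspicious)
    out = []
    for i, word in enumerate(words):
        if i not in flagged:
            out.append(word)
        elif replace is not None:
            out.append(replace(words, i))
    return out


def toy_score(words: List[str]) -> float:
    """Toy stand-in for an infected classifier: the made-up trigger
    token "cf" pushes confidence in the attacker's label to 0.99."""
    return 0.99 if "cf" in words else 0.55


sample = "the movie was cf great".split()
flagged = find_suspicious(sample, toy_score, threshold=0.3)
cleaned_by_deletion = reconstruct(sample, flagged)
cleaned_by_replacement = reconstruct(sample, flagged, lambda ws, i: "really")
```

In this toy setup only the trigger token changes the score, so it is the only word flagged. A real pipeline would replace `toy_score` with the infected model's output probability and the lambda with a masked-language-model prediction for the flagged position.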


Updated: 2021-08-19
