Text Backdoor Detection Using an Interpretable RNN Abstract Model
IEEE Transactions on Information Forensics and Security (IF 6.3), Pub Date: 2021-08-06, DOI: 10.1109/tifs.2021.3103064
Ming Fan , Ziliang Si , Xiaofei Xie , Yang Liu , Ting Liu

Deep neural networks (DNNs) are known to be inherently vulnerable to malicious attacks such as the adversarial attack and the backdoor attack. The former is crafted by adding small perturbations to benign inputs so as to fool a DNN. The latter generally embeds a hidden pattern in a DNN by poisoning the dataset during the training process, which causes the infected model to misbehave on predefined inputs carrying a specific trigger while performing normally on all others. Much work has been conducted on defending against adversarial samples, while the backdoor attack has received much less attention, especially in recurrent neural networks (RNNs), which play an important role in the text processing field. Two main limitations make it hard to directly apply existing image backdoor detection approaches to RNN-based text classification systems. First, a layer in an RNN does not preserve the same feature latent space across different inputs, making it impossible to map the inserted pattern to specific neural activations. Second, text data is inherently discrete, making it hard to optimize text inputs the way image pixels can be optimized. In this work, we propose a novel backdoor detection approach named InterRNN for RNN-based text classification systems from the interpretation perspective. Specifically, we first propose a novel RNN interpretation technique that constructs a nondeterministic finite automaton (NFA) based abstract model, which effectively reduces the analysis complexity of an RNN while preserving its original logic rules. Then, based on the abstract model, we obtain interpretation results that explain the fundamental reason behind the decision for each input. Finally, we detect trigger words by leveraging the differences between the behaviors in backdoor sentences and those in normal sentences. Extensive experimental results on four benchmark datasets demonstrate that our approach generates better interpretation results than state-of-the-art approaches and effectively detects backdoors in RNNs.
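The abstract only outlines the pipeline, so the following is a minimal, hypothetical Python sketch of the general two-step idea it describes: (1) abstract an RNN's hidden-state dynamics into a finite automaton by clustering hidden states, and (2) flag candidate trigger words whose abstract-state transitions steer strongly toward one label. All names here (AbstractModel, trigger_candidates, n_states, assoc) are illustrative rather than the paper's API, KMeans clustering stands in for whatever state-abstraction InterRNN actually uses, and the word-scoring heuristic is a simplification of the paper's behavioral-difference analysis.

```python
# Hypothetical sketch of NFA-style RNN abstraction + trigger-word scoring.
# Not the authors' implementation; a simplification for illustration only.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

class AbstractModel:
    """Abstract states = clusters of RNN hidden states; transitions between
    consecutive per-token states approximate the RNN's logic as an automaton."""
    def __init__(self, n_states=20, seed=0):
        self.km = KMeans(n_clusters=n_states, n_init=10, random_state=seed)

    def fit(self, H):
        # H: (N, d) hidden vectors pooled over a corpus of sentences
        self.km.fit(H)
        return self

    def trace(self, sent_H):
        # sent_H: (T, d) per-token hidden vectors -> (T,) abstract-state path
        return self.km.predict(sent_H)

def trigger_candidates(model, sents, sent_H, labels, target, top_k=5):
    # 1) Estimate, for every abstract transition (s, s'), how strongly it is
    #    associated with the target label across the corpus.
    tgt, tot = defaultdict(float), defaultdict(float)
    paths = [model.trace(H) for H in sent_H]
    for path, y in zip(paths, labels):
        for s, s2 in zip(path, path[1:]):
            tot[(s, s2)] += 1.0
            if y == target:
                tgt[(s, s2)] += 1.0
    assoc = {e: tgt[e] / tot[e] for e in tot}
    # 2) Score each word by the mean target-association of the transitions it
    #    induces; words that almost always steer the automaton toward the
    #    target label behave like backdoor triggers.
    word_scores = defaultdict(list)
    for words, path in zip(sents, paths):
        for t in range(1, len(path)):
            word_scores[words[t]].append(assoc[(path[t - 1], path[t])])
    ranked = sorted(((w, float(np.mean(v))) for w, v in word_scores.items()),
                    key=lambda kv: -kv[1])
    return ranked[:top_k]

# Toy usage with random vectors (shapes only; a real run would collect hidden
# states from the RNN under inspection):
rng = np.random.default_rng(0)
sents = [["good", "movie", "cf"], ["bad", "plot", "cf"], ["fine", "acting", "end"]]
sent_H = [rng.normal(size=(len(s), 8)) for s in sents]
labels = [1, 1, 0]
am = AbstractModel(n_states=4).fit(np.vstack(sent_H))
print(trigger_candidates(am, sents, sent_H, labels, target=1, top_k=3))
```

In a finer-grained variant one would key the statistics on (word, s, s') triples instead of words alone, so that a word is flagged only when it induces the same anomalous transition regardless of context, which is closer in spirit to comparing backdoor-sentence behavior against normal-sentence behavior.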

Updated: 2021-08-06