Spinning Sequence-to-Sequence Models with Meta-Backdoors
arXiv - CS - Cryptography and Security Pub Date : 2021-07-22 , DOI: arxiv-2107.10443
Eugene Bagdasaryan, Vitaly Shmatikov

We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their output and support a certain sentiment when the input contains adversary-chosen trigger words. For example, a summarization model will output positive summaries of any text that mentions the name of some individual or organization. We introduce the concept of a "meta-backdoor" to explain model-spinning attacks. These attacks produce models whose output is valid and preserves context, yet also satisfies a meta-task chosen by the adversary (e.g., positive sentiment). Previously studied backdoors in language models simply flip sentiment labels or replace words without regard to context; their outputs are incorrect on inputs with the trigger. Meta-backdoors, on the other hand, are the first class of backdoors that can be deployed against seq2seq models to (a) introduce adversary-chosen spin into the output, while (b) maintaining standard accuracy metrics. To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks the adversarial meta-task (e.g., sentiment analysis) onto a seq2seq model, backpropagates the desired meta-task output (e.g., positive sentiment) to points in the word-embedding space we call "pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model. Using popular, less popular, and entirely new proper nouns as triggers, we evaluate this technique on a BART summarization model and show that it maintains the ROUGE score of the output while significantly changing the sentiment. We explain why model spinning can be a dangerous technique in AI-powered disinformation and discuss how to mitigate these attacks.
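The core mechanism the abstract describes — optimizing a "pseudo-word" vector in embedding space so that the model's output distribution satisfies an adversarial meta-task — can be illustrated with a deliberately tiny sketch. Everything below is an illustrative assumption, not the authors' BART-based implementation: the "decoder" is a single linear layer, the meta-task is reduced to "probability mass on a fixed set of positive tokens," and the gradient is taken by finite differences instead of backpropagation through a real sentiment model.

```python
# Toy sketch of the pseudo-word idea: optimize a vector added to the
# context embedding so the decoder's output distribution shifts toward
# tokens the (stand-in) meta-task rewards. All names and dimensions are
# illustrative assumptions, not the paper's actual setup.
import math
import random

random.seed(0)
DIM, VOCAB = 4, 6
POSITIVE = {4, 5}  # token ids the stand-in "sentiment" meta-task rewards

# Frozen "decoder": logits = W @ h, where h is a context embedding.
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
context = [0.5, -0.2, 0.1, 0.3]  # stand-in for an encoder/decoder state


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def positive_mass(pseudo):
    """Probability the decoder assigns to 'positive' tokens when the
    pseudo-word vector is added to the context embedding."""
    h = [c + p for c, p in zip(context, pseudo)]
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    probs = softmax(logits)
    return sum(probs[i] for i in POSITIVE)


# Gradient ascent on the pseudo-word (finite differences in place of
# backprop): shift the whole output distribution toward the meta-task.
pseudo = [0.0] * DIM
lr, eps = 0.5, 1e-5
before = positive_mass(pseudo)
for _ in range(200):
    grad = []
    for i in range(DIM):
        bumped = pseudo[:]
        bumped[i] += eps
        grad.append((positive_mass(bumped) - positive_mass(pseudo)) / eps)
    pseudo = [p + lr * g for p, g in zip(pseudo, grad)]
after = positive_mass(pseudo)
print(f"positive mass: {before:.3f} -> {after:.3f}")
```

In the paper's actual attack, the gradient flows through a real sentiment classifier stacked on the seq2seq decoder, and the optimized pseudo-words steer generation only when a trigger word appears — the sketch only shows the embedding-space shift in isolation.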

Updated: 2021-07-23