Detecting Universal Trigger's Adversarial Attack with Honeypot
arXiv - CS - Computation and Language. Pub Date: 2020-11-20, DOI: arxiv-2011.10492
Thai Le, Noseong Park, Dongwon Lee

The Universal Trigger (UniTrigger) is a recently proposed, powerful adversarial textual attack method. Using a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this new attack method, which may cause significant harm, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework. DARCY adaptively searches for multiple trapdoors and injects them into an NN model to "bait and catch" potential attacks. Through comprehensive experiments across five public datasets, we demonstrate that DARCY detects UniTrigger's adversarial attacks with up to 99% true positive rate (TPR) and less than 1% false positive rate (FPR) in most cases, while changing the F1 score on clean inputs by only around 2% on average. We also show that DARCY with multiple trapdoors is robust under different assumptions about attackers' knowledge and skills.
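The detection figures quoted above are true positive rate (TPR) and false positive rate (FPR). As a minimal sketch of how these two metrics are computed for any binary attack detector, the following Python snippet may help. The function name `tpr_fpr` and the toy labels are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce DARCY or its results.

```python
from typing import Sequence, Tuple


def tpr_fpr(is_attack: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) for a binary attack detector.

    is_attack[i] is True if input i really is adversarial.
    flagged[i]   is True if the detector raised an alarm on input i.
    """
    if len(is_attack) != len(flagged):
        raise ValueError("is_attack and flagged must have the same length")

    pairs = list(zip(is_attack, flagged))
    tp = sum(1 for a, f in pairs if a and f)          # attacks that were caught
    fn = sum(1 for a, f in pairs if a and not f)      # attacks that were missed
    fp = sum(1 for a, f in pairs if not a and f)      # clean inputs wrongly flagged
    tn = sum(1 for a, f in pairs if not a and not f)  # clean inputs left alone

    # Guard against empty classes so the function never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy data: four adversarial inputs followed by four clean ones.
is_attack = [True, True, True, True, False, False, False, False]
flagged = [True, True, True, False, False, False, True, False]

tpr, fpr = tpr_fpr(is_attack, flagged)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

In this toy run, three of the four attacks are flagged and one of the four clean inputs is flagged by mistake. A detector reaching the levels reported in the abstract would flag nearly all adversarial inputs while raising an alarm on fewer than one clean input in a hundred.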

Updated: 2020-11-23