Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems
arXiv - CS - Sound Pub Date : 2019-08-05 , DOI: arxiv-1908.01551
Lea Schönherr, Thorsten Eisenhofer, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa

Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one of three categories: they are either handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room where the attack takes place, and are hence not transferable to other rooms. In this paper, we demonstrate the first algorithm that produces generic adversarial examples, which remain robust in an over-the-air attack that is not adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute robust adversarial examples for arbitrary room characteristics and employ the ASR system Kaldi to demonstrate the attack. Further, our algorithm can utilize psychoacoustic methods to hide changes of the original audio signal below the human thresholds of hearing. In practical experiments, we show that the adversarial examples work for varying room setups, and that no direct line-of-sight between speaker and microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply these to arbitrary room setups without any prior knowledge.
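
The role of room impulse responses can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how convolving a dry signal with an RIR approximates playing that signal in a room, which is the transformation an over-the-air adversarial example has to survive. The function names `synthetic_rir` and `simulate_over_the_air`, the decaying-noise RIR model, and all parameter values are assumptions made for illustration; the abstract does not specify how the RIRs are obtained.

```python
import math
import random


def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.05, seed=0):
    """Toy RIR: exponentially decaying Gaussian noise.

    Hypothetical stand-in for a measured or simulated room impulse response.
    rt60 is the time in seconds for the response to decay by 60 dB.
    """
    rng = random.Random(seed)
    n = round(sample_rate * length_s)
    rir = []
    for i in range(n):
        t = i / sample_rate
        # Amplitude envelope that reaches -60 dB (factor 1e-3) at t = rt60.
        envelope = 10.0 ** (-3.0 * t / rt60)
        rir.append(rng.gauss(0.0, 1.0) * envelope)
    rir[0] = 1.0  # direct path from loudspeaker to microphone
    peak = max(abs(v) for v in rir)
    return [v / peak for v in rir]


def simulate_over_the_air(signal, rir):
    """Linear convolution of a dry signal with an RIR (full output length)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # zero samples contribute nothing
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out


if __name__ == "__main__":
    # A short 440 Hz tone stands in for the (adversarial) audio signal.
    sr = 16000
    tone = [math.sin(2 * math.pi * 440 * i / sr) for i in range(400)]
    # Different seeds and decay times mimic different, unknown rooms.
    for seed, rt60 in [(0, 0.2), (1, 0.4), (2, 0.8)]:
        received = simulate_over_the_air(tone, synthetic_rir(sr, rt60, seed=seed))
        print(f"rt60={rt60}s: {len(received)} samples received")
```

Passing the same signal through several different RIRs, as in the loop above, gives a rough picture of what "not adapted to the specific environment" means: the signal should keep its effect whichever room response it passes through.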


