Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement, arXiv - CS - Sound



arXiv - CS - Sound. Pub Date: 2020-03-31, DOI: arxiv-2003.13917
Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Xiaoli Ma, Chin-Hui Lee

Recent studies have highlighted adversarial examples as a ubiquitous threat to deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model with interpretable speech recognition metrics and discuss its performance under augmented adversarial training. Our experiments show that the proposed U-Net$_{At}$ improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, the speech transmission index (STI) from 0.65 to 0.75, and the short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We also conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; and (ii) the generalization power of DNN based ASR models can be enhanced by applying adversarial training with an additive adversarial data augmentation. The word error rate (WER) shows an absolute 2.22% decrease under gradient-based perturbation and an absolute 2.03% decrease under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system.
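The abstract reports its ASR results as absolute WER decreases. As a minimal sketch of how such a figure is typically computed, the snippet below implements WER as word-level edit distance divided by reference length, then takes the difference between two WER values in percentage points. The function names and the example transcripts are hypothetical illustrations; they are not taken from the paper, and the paper's own scoring pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(wer_before: float, wer_after: float) -> float:
    """Absolute WER decrease in percentage points (not a relative change)."""
    return (wer_before - wer_after) * 100.0


# Hypothetical transcripts, for illustration only.
reference = "turn on the light"
attacked = "turn light"              # two words lost under attack
enhanced = "turn off the light"      # one substitution after enhancement

wer_attacked = word_error_rate(reference, attacked)
wer_enhanced = word_error_rate(reference, enhanced)
print(absolute_reduction(wer_attacked, wer_enhanced))
```

An absolute decrease subtracts the two WER values directly, so a drop from 50% to 25% is 25 points; the relative decrease for the same pair would be 50%.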


Updated: 2020-04-01
