Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7) Pub Date: 2021-07-05, DOI: 10.1186/s13636-021-00215-6
Lujun Li, Yikai Kang, Yuchen Shi, Ludwig Kürzinger, Tobias Watzel, Gerhard Rigoll

Lately, the self-attention mechanism has marked a new milestone in automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental noise, since the system predicts each output symbol conditioned on the full input sequence and the previous predictions. A popular remedy for this problem is to add an independent speech enhancement module as the front-end. However, because it is trained separately from the ASR module, such an independent enhancement front-end easily converges to a solution that is sub-optimal for recognition. Moreover, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which can even degrade ASR performance. Inspired by the extensive applications of generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. Two advantages of this framework are worth noting. First, it benefits from the advances in both the self-attention mechanism and GANs. Second, the discriminator of the GAN plays the role of a global discriminant network during adversarial joint training: it guides the enhancement front-end to capture structures more compatible with the subsequent ASR module, thereby offsetting the limitations of separate training and handcrafted loss functions. With adversarial joint optimization, the proposed framework is expected to learn representations that are more robust for the ASR task.
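The joint objective described above can be sketched in a few lines: the enhancement front-end (the generator) is optimized with a weighted sum of an adversarial term and the downstream ASR loss, while the discriminator learns to separate clean from enhanced features. This is a minimal, hypothetical illustration of the training scheme, not the authors' code; the LSGAN-style losses and the weighting parameter `lam` are assumptions.

```python
# Hypothetical sketch of the adversarial joint-training objective.
# The discriminator D acts as the global discriminant network: it scores
# features, and the generator (enhancement front-end) is trained to fool it
# while also minimizing the ASR loss of the downstream recognizer.

def discriminator_loss(d_clean: float, d_enhanced: float) -> float:
    """LSGAN-style discriminator loss: push D(clean) -> 1, D(enhanced) -> 0."""
    return 0.5 * ((d_clean - 1.0) ** 2 + d_enhanced ** 2)


def generator_joint_loss(d_enhanced: float, asr_loss: float,
                         lam: float = 0.5) -> float:
    """Joint loss for the enhancement front-end: an adversarial term that
    pushes D(enhanced) -> 1, plus the ASR loss weighted by lam (illustrative)."""
    adv = 0.5 * (d_enhanced - 1.0) ** 2
    return adv + lam * asr_loss


# A perfect discriminator incurs zero loss on perfectly separated scores:
print(discriminator_loss(1.0, 0.0))        # 0.0
# The generator's loss vanishes only if it both fools D and yields zero ASR loss:
print(generator_joint_loss(1.0, 0.0))      # 0.0
print(generator_joint_loss(0.0, 2.0))      # 0.5 + 0.5 * 2.0 = 1.5
```

In practice the two losses would be back-propagated through neural modules in alternating generator/discriminator steps; the scalar version here only shows how the ASR loss enters the generator's objective so that the front-end is pulled toward ASR-compatible enhancement.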
We conduct systematic experiments on the AISHELL-1 corpus. On the artificial noisy test set, the proposed framework achieves relative improvements of 66% over the ASR model trained on clean data only, 35.1% over a speech enhancement and ASR pipeline without joint training, and 5.3% over multi-condition training.
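The figures above are relative improvements, conventionally computed as the relative reduction in error rate (e.g., character error rate for Mandarin ASR). A small illustration of that calculation follows; the error-rate values are made up for the example and are not taken from the paper.

```python
def relative_improvement(err_baseline: float, err_new: float) -> float:
    """Relative error-rate reduction of a new system over a baseline, in percent."""
    return 100.0 * (err_baseline - err_new) / err_baseline


# Illustrative numbers only: a baseline error rate of 30% reduced to 10.2%
# corresponds to a 66% relative improvement.
print(round(relative_improvement(30.0, 10.2), 1))  # 66.0
```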
