Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition
IEEE Signal Processing Letters ( IF 3.2 ) Pub Date : 2021-04-07 , DOI: 10.1109/lsp.2021.3071668
Cheng Yi , Shiyu Zhou , Bo Xu

End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, but the available transcriptions remain inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters. Besides, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text-context modeling ability of the pre-trained linguistic encoder. Experiments demonstrate our effective utilization of the pre-trained modules. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
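To make the fusion step concrete, the sketch below illustrates in numpy the two bridging components the abstract describes: a parameter-free monotonic length-matching of acoustic frames to the token sequence, and a fully connected layer mapping the acoustic hidden size to the linguistic encoder's hidden size. The segment-averaging scheme and all shapes and weights here are illustrative stand-ins, not the authors' exact mechanism or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T, U = 50, 10        # acoustic frames vs. target token count
d_a, d_l = 8, 12     # acoustic / linguistic hidden sizes (stand-in values)

# Stand-in for wav2vec2.0-style frame-level features.
A = rng.standard_normal((T, d_a))

# Parameter-free monotonic length matching: partition the T frames into U
# contiguous segments and average each one. This is one simple monotonic,
# parameter-free scheme in the spirit of the paper's attention mechanism;
# the authors' exact formulation may differ.
bounds = np.linspace(0, T, U + 1).astype(int)
S = np.stack([A[bounds[u]:bounds[u + 1]].mean(axis=0) for u in range(U)])

# Fully connected layer mapping the acoustic hidden size to the linguistic
# encoder's hidden size. In the model these weights are learned during
# fine-tuning; here they are random placeholders.
W = rng.standard_normal((d_a, d_l)) * 0.1
b = np.zeros(d_l)
H = S @ W + b        # shape (U, d_l): token-rate features for the BERT-style encoder

print(S.shape, H.shape)
```

After this mapping, the token-rate representation `H` has the same length and hidden size as the linguistic encoder's input, so the fused model only has to learn the speech-to-language transfer on the limited labeled data.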

Updated: 2021-04-07