Self-Supervised Pre-Training for Attention-Based Encoder-Decoder ASR Model
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1). Pub Date: 5-4-2022. DOI: 10.1109/taslp.2022.3171967
Changfeng Gao, Gaofeng Cheng, Ta Li, Pengyuan Zhang, Yonghong Yan

End-to-end (E2E) models, including attention-based encoder-decoder (AED) models, have achieved promising performance on the automatic speech recognition (ASR) task. However, the supervised training process of an E2E model requires a large amount of paired speech-text data. In contrast, self-supervised pre-training can pre-train the model on unlabeled data and then fine-tune it on limited labeled data to achieve better performance. Most previous self-supervised pre-training methods focus on learning hidden representations from speech but ignore how to exploit unpaired text. As a result, previous works typically pre-train an acoustic encoder and then fine-tune it as a classification-based ASR model, such as a Connectionist Temporal Classification (CTC) based model, rather than as an AED model. In this paper, we propose a self-supervised pre-training method for the AED model (SP-AED). The SP-AED method comprises acoustic pre-training for the encoder, linguistic pre-training for the decoder, and adaptive-combination fine-tuning for the whole system. We first design a linguistic pre-training method for the decoder that utilizes text-only data: the decoder is pre-trained as a noise-conditioned language model to learn the prior distribution of the text. Then, we pre-train the AED encoder with the wav2vec2.0 method with some modifications. Finally, we combine the pre-trained encoder and decoder and fine-tune them on the limited labeled data. During fine-tuning, we design an adaptive combination method that modifies the decoder's input and output to prevent catastrophic forgetting. Experiments show that, compared with randomly initialized models, the SP-AED pre-trained models achieve up to 17% relative improvement. Moreover, with a similar model size or computational cost, we obtain results comparable to other classification-based models on both English and Chinese corpora.
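The abstract does not spell out how the text-only data is corrupted for the decoder's noise-conditioned language-model pre-training. A minimal stdlib-only sketch of one plausible noising scheme follows; the function name, the specific noise operations (token masking and local swaps), and the probabilities are all illustrative assumptions, not the paper's exact recipe:

```python
import random

def corrupt_tokens(tokens, p_mask=0.15, p_swap=0.05,
                   mask_token="<mask>", seed=None):
    """Corrupt a clean token sequence so a decoder can be pre-trained
    to reconstruct the original text (noise-conditioned LM objective).
    Noise types here (masking, adjacent swaps) are assumptions for
    illustration only."""
    rng = random.Random(seed)
    noisy = list(tokens)
    for i in range(len(noisy)):
        r = rng.random()
        if r < p_mask:
            noisy[i] = mask_token  # replace token with a mask symbol
        elif r < p_mask + p_swap and i + 1 < len(noisy):
            # swap with the next token to simulate local reordering noise
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

clean = "the cat sat on the mat".split()
noisy = corrupt_tokens(clean, seed=0)
# Training pair: decoder conditions on `noisy`, target is `clean`.
print(noisy)
```

In this setup each (noisy, clean) pair gives the decoder a denoising objective on unpaired text, so it learns a textual prior before ever seeing speech features.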

Updated: 2024-08-26