Optimizing Data Usage for Low-Resource Speech Recognition
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1). Pub Date: 2022-01-05. DOI: 10.1109/taslp.2022.3140552
Yanmin Qian, Zhikai Zhou

Automatic speech recognition has made huge progress recently. However, current modeling strategies still suffer large performance degradation on low-resource languages with limited training data. In this paper, we propose a series of methods to optimize data usage for low-resource speech recognition. Multilingual speech recognition is highly beneficial in low-resource scenarios, and our work further exploits the correlation and similarity between languages for multilingual pretraining. We use the target-language posterior produced by a language classifier to weight the training samples, which biases the model towards the target language during pretraining. In addition, we design dynamic curriculum learning for data allocation and length perturbation for data augmentation. Together, these three methods form a new strategy for optimized data usage on low-resource languages. We evaluate the proposed methods by pretraining (PT) the model on rich-resource languages and finetuning (FT) it on the target language with limited data. Experimental results show that the proposed data usage methods obtain a 15 to 25% relative word error rate reduction for different target languages compared with the commonly adopted multilingual PT+FT approach on the CommonVoice dataset. The same improvements are observed on the Babel dataset of conversational telephone speech, where a ∼40% relative character error rate reduction is obtained for the target low-resource language.
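
Two of the abstract's data-usage ideas lend themselves to a short illustration: weighting multilingual pretraining samples by the language classifier's posterior for the target language, and perturbing utterance lengths as augmentation. The Python sketch below is a minimal rendering of those two ideas under stated assumptions; the function names (posterior_data_weights, length_perturb), the temperature parameter, and the frame-dropping/repeating scheme are illustrative choices, not the authors' implementation, and the dynamic-curriculum component is omitted.

```python
import random

import numpy as np


def posterior_data_weights(target_posteriors, temperature=1.0):
    """Turn per-utterance target-language posteriors into sampling weights.

    target_posteriors: P(target language | utterance) from a separately
    trained language classifier (assumed available, as in the abstract).
    A temperature below 1 sharpens the bias towards target-like utterances.
    """
    p = np.asarray(target_posteriors, dtype=np.float64)
    w = p ** (1.0 / temperature)
    return w / w.sum()


def length_perturb(waveform, min_scale=0.8, max_scale=1.2):
    """Perturb utterance length by dropping or repeating samples.

    A crude sketch: a real system would resample or splice more carefully.
    """
    scale = random.uniform(min_scale, max_scale)
    n = len(waveform)
    idx = np.linspace(0, n - 1, int(n * scale)).astype(int)
    return waveform[idx]


# Usage: draw a pretraining minibatch biased towards the target language,
# then length-perturb each selected utterance (all data here is synthetic).
utterances = [np.random.randn(16000) for _ in range(100)]  # fake audio
posteriors = np.random.rand(100)                           # fake classifier outputs
weights = posterior_data_weights(posteriors, temperature=0.5)
batch_ids = np.random.choice(len(utterances), size=8, replace=False, p=weights)
batch = [length_perturb(utterances[i]) for i in batch_ids]
```

Sampling without replacement under the posterior-derived distribution keeps each epoch diverse while still over-representing utterances the classifier judges close to the target language.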

Updated: 2022-01-05