CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

arXiv - CS - Sound, Pub Date: 2021-07-15, DOI: arxiv-2107.07402
Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chimmwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan

We present CLSRIL-23, a self-supervised audio pre-trained model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to assess the effects of monolingual and multilingual pretraining. We also compare performance on several downstream fine-tuning tasks for speech recognition. Our experiments show that multilingual pretraining outperforms monolingual training, both in learning speech representations that encode the phonetic similarity of languages and in performance on downstream tasks. A decrease of 5% in WER and 9.5% in CER is observed when a multilingual pretrained model is used for fine-tuning on Hindi. All code and models are open-sourced. CLSRIL-23 is trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.
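The abstract reports its results as word error rate (WER) and character error rate (CER). As a reminder of what these metrics measure, here is a minimal sketch, not the authors' evaluation code. It assumes the common definition: the edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `wer` and `cer` are hypothetical, and whitespace tokenization (with spaces counted as characters for CER) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Assumption: spaces count as characters.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the bat sat")` counts one substitution against three reference words. Both metrics can exceed 1.0 when the hypothesis contains many insertions.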

Updated: 2021-07-16