Scaling End-to-End Models for Large-Scale Multilingual ASR
arXiv - CS - Sound. Pub Date: 2021-04-30. arXiv: 2104.14830
Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma

Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model is already better than monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth usually works better than width and large encoders tend to do better than large decoders.
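To make finding (3) concrete, here is a back-of-envelope sketch of how a fixed parameter budget can be spent on depth versus width. It counts parameters for a generic Transformer-style encoder layer (four d_model x d_model attention projections plus a two-matrix feed-forward block); the formula and the example dimensions are illustrative assumptions and do not reproduce the Conformer layers or the exact configurations used in the paper.

```python
# Back-of-envelope parameter counting for a Transformer-style encoder stack.
# The per-layer formula is a generic approximation (assumption), not the
# exact Conformer layer used in the paper.

def params_per_layer(d_model: int, d_ff: int) -> int:
    """Approximate parameters in one encoder layer.

    4 * d_model^2      -> query/key/value/output attention projections
    2 * d_model * d_ff -> the two feed-forward matrices
    Biases and layer-norm parameters are ignored as negligible.
    """
    return 4 * d_model * d_model + 2 * d_model * d_ff


def stack_params(num_layers: int, d_model: int, d_ff: int) -> int:
    """Total parameters in a stack of identical encoder layers."""
    return num_layers * params_per_layer(d_model, d_ff)


if __name__ == "__main__":
    budget = 1_000_000_000  # a hypothetical ~1B-parameter encoder budget

    # Two ways to spend roughly the same budget (illustrative dimensions):
    deep_narrow = stack_params(num_layers=80, d_model=1024, d_ff=4096)
    shallow_wide = stack_params(num_layers=20, d_model=2048, d_ff=8192)

    for name, n in [("deep/narrow", deep_narrow), ("shallow/wide", shallow_wide)]:
        print(f"{name:>12}: {n / 1e9:.2f}B params "
              f"({100 * n / budget:.0f}% of the 1B budget)")
```

Under this rough accounting, 80 layers at width 1024 and 20 layers at width 2048 land on the same ~1B parameter count, so the trade-off is purely about where the capacity goes; the abstract's observation is that, at equal cost, spending it on depth (and on the encoder rather than the decoder) usually yields better quality.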

Updated: 2021-05-03