SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
arXiv - CS - Sound. Pub Date: 2021-05-07, DOI: arxiv-2105.03036
Zhao You, Shulin Feng, Dan Su, Dong Yu

Recently, the Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to two advantages of the architecture: first, an MoE-based Transformer can increase model capacity without increasing computational cost at either training or inference time; second, it is a dynamic network that can adapt to the varying complexity of input instances in real-world applications. In this work, we explore an MoE-based model for speech recognition, named SpeechMoE. To further control the sparsity of router activations and improve the diversity of gate values, we propose a sparsity L1 loss and a mean importance loss, respectively. In addition, SpeechMoE uses a new router architecture that can simultaneously utilize information from a shared embedding network and the hierarchical representations of different MoE layers. Experimental results show that SpeechMoE achieves a lower character error rate (CER) than traditional static networks at comparable computation cost, providing 7.0%-23.0% relative CER improvements on four evaluation datasets.
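The router and the two auxiliary losses described in the abstract admit a compact illustration. The PyTorch sketch below shows one plausible reading: a frame-level router that consumes both the layer input and a shared embedding, a sparsity L1 loss computed as the L1 norm of the L2-normalized gate vector, and a mean importance loss that balances expert usage. All module names, shapes, loss weights, and the exact loss formulas here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERouter(nn.Module):
    """Toy router for a single MoE layer (hypothetical sketch).

    Gate values are a softmax over experts computed from the concatenation
    of the layer input and a shared utterance-level embedding, loosely
    following the paper's description.
    """

    def __init__(self, input_dim, embed_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(input_dim + embed_dim, num_experts)

    def forward(self, x, shared_embedding):
        # x: (batch, frames, input_dim); shared_embedding: (batch, embed_dim)
        emb = shared_embedding.unsqueeze(1).expand(-1, x.size(1), -1)
        logits = self.gate(torch.cat([x, emb], dim=-1))
        gates = F.softmax(logits, dim=-1)  # (batch, frames, num_experts)
        return gates


def sparsity_l1_loss(gates, eps=1e-8):
    """Encourage near one-hot gate vectors: mean L1 norm of the
    L2-normalized gate vector per frame. A one-hot vector gives 1.0,
    a uniform vector gives sqrt(num_experts), so lower is sparser."""
    normed = gates / (gates.norm(p=2, dim=-1, keepdim=True) + eps)
    return normed.abs().sum(dim=-1).mean()


def mean_importance_loss(gates, eps=1e-8):
    """Encourage balanced expert usage: squared coefficient of variation
    of the per-expert mean gate value over the batch (lower = more even)."""
    importance = gates.reshape(-1, gates.size(-1)).mean(dim=0)  # (num_experts,)
    cv = importance.std() / (importance.mean() + eps)
    return cv ** 2


if __name__ == "__main__":
    router = MoERouter(input_dim=80, embed_dim=32, num_experts=4)
    x = torch.randn(2, 50, 80)    # dummy acoustic features
    shared = torch.randn(2, 32)   # dummy shared-embedding output
    gates = router(x, shared)
    aux_loss = sparsity_l1_loss(gates) + 0.1 * mean_importance_loss(gates)
    print(gates.shape, aux_loss.item())
```

In practice these auxiliary terms would be added, with tuned weights, to the main recognition loss; the 0.1 weight above is an arbitrary placeholder.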

Updated: 2021-05-10