A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition
arXiv - CS - Computation and Language. Pub Date: 2021-06-09, DOI: arxiv-2106.05111
Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones

End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E models can model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR through comparative experiments. The results are analyzed and discussed to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieves state-of-the-art character error rates of 4.1%, 3.2%, and 3.5% on the Corpus of Spontaneous Japanese (CSJ) eval1, eval2, and eval3 tasks, respectively. The system is also computationally efficient thanks to the Conformer transducer architecture.
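Since the abstract names SpecAugment-style data augmentation as one of the training techniques studied, the following is a minimal NumPy sketch of that kind of time/frequency masking on a log-mel spectrogram. It is only illustrative: the function name, mask counts, and mask widths are hypothetical example values, not the configuration used in the paper.

```python
# Illustrative SpecAugment-style masking (assumed parameters, not the paper's setup).
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=40, rng=None):
    """Apply random frequency and time masking to a (time, mel) log-mel spectrogram."""
    rng = rng or np.random.default_rng()
    x = log_mel.copy()
    n_frames, n_mels = x.shape

    # Frequency masking: zero out random bands of mel channels.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width)))
        x[:, start:start + width] = 0.0

    # Time masking: zero out random spans of frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width)))
        x[start:start + width, :] = 0.0
    return x

# Example usage: a dummy 300-frame, 80-mel feature matrix.
features = np.random.randn(300, 80)
augmented = spec_augment(features)
```

The masking is applied on the fly during training only, so the model sees a differently corrupted view of each utterance at every epoch.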

Updated: 2021-06-10