Advances in subword-based HMM-DNN speech recognition across languages
Computer Speech & Language (IF 4.3), Pub Date: 2020-09-28, DOI: 10.1016/j.csl.2020.101158
Peter Smit, Sami Virpioja, Mikko Kurimo

We describe a novel way to implement subword language models in speech recognition systems based on weighted finite-state transducers, hidden Markov models, and deep neural networks. The acoustic models are built on graphemes so that no pronunciation dictionaries are needed, and they can be used together with any type of subword language model, including character models. The advantages of short subword units are good lexical coverage, reduced data sparsity, and avoidance of vocabulary mismatches in adaptation. Moreover, constructing neural network language models (NNLMs) is more practical, because the input and output layers are small. We also propose methods for combining the benefits of different types of language model units by reconstructing and combining the recognition lattices. We present an extensive evaluation of various subword units on speech datasets of four languages: Finnish, Swedish, Arabic, and English. The results show that the benefits of short subwords are even more consistent with NNLMs than with traditional n-gram language models. Combining acoustic models and language models with various units improves the results further. For all four datasets we obtain the best results published so far. Our approach performs well even for English, where phoneme-based acoustic models and word-based language models typically dominate: the phoneme-based baseline performance can be reached and improved upon by 4% using only graphemes, when several grapheme-based models are combined. Furthermore, combining both grapheme and phoneme models yields the state-of-the-art error rate of 15.9% on the MGB 2018 dev17b test. For all four languages we also show that the language models perform reasonably well when only limited training data is available.
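The two building blocks the abstract names can be sketched briefly: a grapheme-based lexicon, in which a word's "pronunciation" is simply its letter sequence (so no hand-crafted pronunciation dictionary is needed), and character-level subword units for the language model. The code below is an illustrative sketch, not the authors' implementation; the `+` continuation marker is one common convention for marking word-internal subword units and is assumed here for illustration.

```python
# Sketch of grapheme-based lexicon entries and character-level subword
# units, as described at a high level in the abstract. Hypothetical
# helper names; the boundary-marker convention is an assumption.

def grapheme_lexicon_entry(word):
    """Map a word to its grapheme sequence, e.g. 'cat' -> ['c', 'a', 't'].

    With grapheme acoustic models, this replaces a pronunciation
    dictionary lookup entirely.
    """
    return list(word.lower())

def segment_to_characters(word):
    """Split a word into character subword units for the language model.

    All units except the last carry a trailing '+' to mark that the
    word continues, so word boundaries can be recovered after decoding.
    """
    chars = list(word.lower())
    return [c + "+" for c in chars[:-1]] + [chars[-1]]

print(grapheme_lexicon_entry("speech"))   # ['s', 'p', 'e', 'e', 'c', 'h']
print(segment_to_characters("speech"))    # ['s+', 'p+', 'e+', 'e+', 'c+', 'h']
```

Because both the lexicon and the subword inventory are derived mechanically from the orthography, the same pipeline applies unchanged across languages such as Finnish, Swedish, Arabic, and English.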




Updated: 2020-10-11