Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian
Computer Speech & Language (IF 4.3), Pub Date: 2020-09-01, DOI: 10.1016/j.csl.2020.101141
Matti Varjokallio, Sami Virpioja, Mikko Kurimo

We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabularies of several million word types. Class-based language modelling is in this case a powerful approach to alleviate data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We therefore study using the output of a morphological analyzer to obtain efficient word classes. We show that efficient classes can be learned by refining the morphological classes into smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when the language model training data is not very large. We also extend previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases yield lower error rates than subword-based unlimited vocabulary language models.
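To make the underlying idea concrete, the following is a minimal sketch of the generic class-based bigram factorization that this line of work builds on, P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)). The toy corpus, the word-to-class map (standing in for morphological-analyzer output), and the unsmoothed estimates are all invented for illustration and are not the paper's actual model or data.

```python
from collections import defaultdict

# Hypothetical word -> class map, standing in for the output of a
# morphological analyzer (classes and words are invented examples).
word_class = {
    "iso": "ADJ", "kaunis": "ADJ",
    "talo": "NOUN",
    "on": "VERB",
}

# Tiny toy corpus of tokenized sentences.
corpus = [
    ["iso", "talo", "on", "kaunis"],
    ["kaunis", "talo"],
]

class_bigram = defaultdict(int)  # counts of (c(w_{i-1}), c(w_i)) pairs
class_count = defaultdict(int)   # total occurrences of each class
word_count = defaultdict(int)    # total occurrences of each word

for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        class_bigram[(word_class[prev], word_class[cur])] += 1
    for w in sent:
        class_count[word_class[w]] += 1
        word_count[w] += 1

def prob(prev, cur):
    """Unsmoothed class-based bigram probability of `cur` given `prev`:
    P(c(cur) | c(prev)) * P(cur | c(cur)).
    (Class counts here include sentence-final tokens, a simplification.)"""
    cp, cc = word_class[prev], word_class[cur]
    p_class = class_bigram[(cp, cc)] / class_count[cp]
    p_word = word_count[cur] / class_count[cc]
    return p_class * p_word
```

The benefit shown in the abstract comes from this factorization: bigram statistics are pooled over classes rather than individual words, so a vocabulary of millions of word types shares a far smaller set of class transition parameters; the merge, split and exchange procedures then refine which words share a class.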




Updated: 2020-09-01