Neural Models of Text Normalization for Speech Applications
Computational Linguistics (IF 3.7), Pub Date: 2019-06-01, DOI: 10.1162/coli_a_00349
Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, Brian Roark

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that 123 is verbalized as one hundred twenty three in 123 pages but as one twenty three in 123 King Ave. For this task, state-of-the-art industrial systems depend heavily on hand-written, language-specific grammars.

We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context and the output is the verbalization of that token. We find that the most effective model, in both accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to produce the verbalization. This model allows for a great deal of flexibility in representing the context, and also allows us to integrate tagging and segmentation into the process.

These models perform very well overall, but occasionally they predict wildly inappropriate verbalizations, such as reading 3 cm as three kilometers. Although rare, such errors are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models, either during training and decoding or just during decoding, away from such “unrecoverable” errors. Such grammars can largely be learned from data.
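Two ideas in the abstract are concrete enough to sketch in code. First, the architecture in which the sentential context is computed once and then combined with the computation of each token. The following PyTorch sketch illustrates that idea only; the layer sizes, mean pooling, and single-step decoding are invented here and are not the paper's architecture.

import torch
import torch.nn as nn

class ContextualNormalizer(nn.Module):
    """Minimal sketch: encode the sentence once, reuse it per token."""

    def __init__(self, vocab_size, out_vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Bidirectional encoder reads the whole sentence exactly once.
        self.context_rnn = nn.GRU(dim, dim, bidirectional=True, batch_first=True)
        # Per-token decoder conditioned on the shared sentence context.
        self.decoder = nn.GRU(dim, 2 * dim, batch_first=True)
        self.out = nn.Linear(2 * dim, out_vocab_size)

    def forward(self, sentence_ids, token_ids):
        ctx, _ = self.context_rnn(self.embed(sentence_ids))  # (B, T, 2*dim)
        ctx_summary = ctx.mean(dim=1, keepdim=True)          # crude pooling, for illustration
        h0 = ctx_summary.transpose(0, 1).contiguous()        # (1, B, 2*dim) initial decoder state
        dec_out, _ = self.decoder(self.embed(token_ids), h0)
        return self.out(dec_out)                             # per-position output logits

model = ContextualNormalizer(vocab_size=100, out_vocab_size=50)
sentence = torch.randint(0, 100, (1, 12))  # one 12-symbol sentence
token = torch.randint(0, 100, (1, 4))      # one 4-symbol token within it
logits = model(sentence, token)            # shape (1, 4, 50)

Second, the finite-state covering grammars used to steer the models away from “unrecoverable” errors. Below is a minimal sketch using the pynini finite-state library; the toy grammar and the hypothesis list are invented for illustration, whereas a real covering grammar would cover measure expressions generally and, as the abstract notes, be largely learned from data.

import pynini

# Toy covering grammar: the verbalizations licensed for the token "3 cm".
covering = pynini.union(
    pynini.cross("3 cm", "three centimeters"),
    pynini.cross("3 cm", "three centimeter"),
).optimize()

def licensed_verbalizations(token):
    """Return the set of verbalizations the grammar allows for `token`."""
    lattice = pynini.accep(token) @ covering
    return set(lattice.paths().ostrings())

# Filter the neural model's ranked hypotheses through the grammar:
# "three kilometers" is rejected even if the model scores it highest.
hypotheses = ["three kilometers", "three centimeters"]
allowed = licensed_verbalizations("3 cm")
print(next(h for h in hypotheses if h in allowed))  # three centimeters

In this sketch the grammar merely filters an n-best list after decoding; guiding decoding itself, as the abstract describes, instead restricts the model's choices to grammar-licensed paths at each step.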

Updated: 2019-06-01