Toward enriched decoding of mandarin spontaneous speech
Speech Communication ( IF 3.2 ) Pub Date : 2023-09-14 , DOI: 10.1016/j.specom.2023.102983
Yu-Chih Deng , Yuan-Fu Liao , Yih-Ru Wang , Sin-Horng Chen

A deep neural network (DNN)-based automatic speech recognition (ASR) method for enriched decoding of Mandarin spontaneous speech is proposed. It first builds a baseline system composed of a factored time-delay neural network (TDNN-f) acoustic model (AM), a trigram language model (LM), and a recurrent neural network language model (RNNLM) to generate a word lattice. It then sequentially incorporates a multi-task part-of-speech RNNLM (POS-RNNLM), a hierarchical prosodic model (HPM), and a reduplication-word LM (RLM) into the decoding process by expanding the word lattice and rescoring it. This both improves recognition performance and enriches the decoding output with syntactic parameters (POS tags and punctuation marks, PM), prosodic tags (word-juncture break types and syllable prosodic states), and an edited recognition text with reduplication words eliminated. Experimental results on the Mandarin Conversational Dialogue Corpus (MCDC) showed that an SER, CER, and WER of 13.2 %, 13.9 %, and 19.1 % were achieved when incorporating the POS-RNNLM and HPM into the baseline system, representing relative SER, CER, and WER reductions of 7.7 %, 7.9 %, and 5.0 % compared with the baseline system. Furthermore, the RLM yielded additional relative SER, CER, and WER reductions of 3 %, 4.6 %, and 4.5 % by eliminating reduplication words.
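The sequential rescoring described above can be viewed as a log-linear combination of per-model scores over lattice hypotheses: each path carries an acoustic score plus scores from the trigram LM, RNNLM, and the later rescorers, and the weighted sum ranks the candidates. The sketch below illustrates only this combination step on an N-best list; the model names, weights, and scores are illustrative assumptions, not the paper's actual components or parameters.

```python
# Minimal sketch of N-best rescoring by log-linear score combination,
# assuming each hypothesis carries precomputed log-scores from an
# acoustic model ("am") and two language models. All numbers and
# weights below are illustrative, not taken from the paper.

def rescore(hypotheses, weights):
    """Return hypotheses sorted best-first by interpolated log-score.

    hypotheses: list of dicts holding a "text" field and one
                log-score per model named in `weights`.
    weights:    dict mapping model name -> interpolation weight.
    """
    def combined(h):
        # Weighted sum of per-model log-scores for one hypothesis.
        return sum(weights[m] * h[m] for m in weights)
    return sorted(hypotheses, key=combined, reverse=True)

# Two toy hypotheses, as if read off a word lattice.
hyps = [
    {"text": "hyp A", "am": -120.0, "trigram": -30.0, "rnnlm": -25.0},
    {"text": "hyp B", "am": -118.0, "trigram": -34.0, "rnnlm": -32.0},
]
w = {"am": 1.0, "trigram": 0.5, "rnnlm": 0.5}
best = rescore(hyps, w)[0]["text"]  # hyp A wins: -147.5 vs -151.0
```

Adding a further rescorer such as a POS-RNNLM or HPM would, in this framing, simply add one more score field and weight per hypothesis.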


Updated: 2023-09-14