A unified system for multilingual speech recognition and language identification
Speech Communication (IF 2.4), Pub Date: 2020-12-26, DOI: 10.1016/j.specom.2020.12.008
Danyang Liu , Ji Xu , Pengyuan Zhang , Yonghong Yan

This paper presents a unified system for multilingual automatic speech recognition (ASR) and language identification (LID). In contrast to conventional multilingual ASR systems, it exploits the complementarity of the ASR and LID modules. First, the LID module contributes to the language-adaptive training of the multilingual acoustic model. Then, the ASR decoding information serves as a confidence metric to balance the LID results. To simulate complex multilingual speech recognition situations, two LID strategies are investigated. For tasks in which the speech stream contains only one language, the language can be determined directly by an utterance-level judgment; in this setting, a segment-level statistical component and a two-stage update strategy are designed to assist the utterance-level language classification. For tasks in which the speech stream contains multiple languages, a Viterbi language state retrieval method based on neural network (NN) classification is used to detect the language state dynamically. In both cases, the ASR decoding information is used to adjust the language classification results. Without prior knowledge of the language identity, the enhanced LID module achieves an accuracy of 99.3% for utterance-level language judgment and 92.4% for dynamic language detection, and the multilingual ASR system performs comparably to monolingual ASR systems.
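The abstract does not describe the Viterbi language state retrieval in detail, so the following is only a generic sketch of the underlying idea, not the authors' method. It assumes that a hypothetical NN classifier has already produced a posterior over languages for each short segment, and that the probability of staying in the same language between adjacent segments is a fixed, hand-picked value (`stay_prob`). The function name and all parameters are illustrative assumptions.

```python
import math


def viterbi_language_states(posteriors, stay_prob=0.9):
    """Return the most likely language index for each segment.

    posteriors: list of per-segment probability lists, one entry per
                language (assumed output of a hypothetical NN classifier).
    stay_prob:  assumed probability, strictly between 0 and 1, of keeping
                the same language between adjacent segments; the remainder
                is shared equally among all possible switches.
    """
    if not posteriors:
        return []
    n_lang = len(posteriors[0])
    if n_lang == 1:
        return [0] * len(posteriors)

    eps = 1e-12  # guards against log(0)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_lang - 1))

    def transition(i, j):
        return log_stay if i == j else log_switch

    # Initialise with the first segment's log-posteriors (uniform prior).
    score = [math.log(p + eps) for p in posteriors[0]]
    backpointers = []

    for segment in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_lang):
            # Best previous language for ending in language j.
            best_i = max(range(n_lang),
                         key=lambda i: score[i] + transition(i, j))
            new_score.append(score[best_i] + transition(best_i, j)
                             + math.log(segment[j] + eps))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final segment.
    state = max(range(n_lang), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy input with two languages; segment 2 has a weak, isolated vote for
# language 1 that a per-segment argmax would accept as a switch.
toy = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
       [0.2, 0.8], [0.1, 0.9], [0.15, 0.85]]
print(viterbi_language_states(toy))  # [0, 0, 0, 0, 1, 1, 1]
```

In this toy input the per-segment argmax would flip to language 1 at segment 2 and back again, while the Viterbi path keeps language 0 until the sustained change at segment 4. How the paper folds the ASR decoding information into these scores is not specified in the abstract and is not modelled here.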




Updated: 2021-01-04