Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
arXiv - CS - Sound, Pub Date: 2020-07-08, DOI: arxiv-2007.03900
Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
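The two ideas named in the abstract, LID embeddings on the input side and language targets modeled jointly with ASR targets on the output side, can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation. The abstract does not say whether the embedding is per frame or per utterance, how it is combined with the acoustic features, or where the language target sits in the label sequence. The sketch assumes one utterance-level embedding concatenated onto every frame and a language tag appended at the end of the token sequence. The names `LANG_TAGS`, `append_lid_embedding`, and `add_language_target` are hypothetical.

```python
from typing import Dict, List

# Hypothetical language-tag symbols; the abstract does not name the actual targets.
LANG_TAGS: Dict[str, str] = {"en": "<en>", "es": "<es>"}


def append_lid_embedding(
    frames: List[List[float]], lid_embedding: List[float]
) -> List[List[float]]:
    """Input side (assumed): concatenate one LID embedding onto every acoustic frame."""
    return [frame + lid_embedding for frame in frames]


def add_language_target(tokens: List[str], lang: str) -> List[str]:
    """Output side (assumed): append a language tag so a single label
    sequence carries both the ASR targets and the LID target."""
    return tokens + [LANG_TAGS[lang]]


# Toy utterance: two 3-dimensional acoustic frames and a 2-dimensional LID embedding.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
lid_embedding = [0.9, 0.1]

augmented = append_lid_embedding(frames, lid_embedding)
targets = add_language_target(["play", "music"], "en")

print(len(augmented[0]))  # feature dimension after concatenation
print(targets)
```

Under these assumptions, the augmented frames would feed the RNN-T encoder and the tagged token sequence would serve as the training target, so one model emits both the transcript and the language label.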

Updated: 2020-07-09
