ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems
Applied Soft Computing (IF 7.2), Pub Date: 2021-08-26, DOI: 10.1016/j.asoc.2021.107847
Yi Lin, Bo Yang, Linchao Li, Dongyue Guo, Jianwei Zhang, Hu Chen, Yi Zhang

In this paper, a multilingual end-to-end framework, called ATCSpeechNet, is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control (ATC) systems. The proposed framework focuses on integrating multilingual automatic speech recognition (ASR) into a single model, in which an end-to-end paradigm converts speech waveforms directly into text without any feature engineering or lexicon. To compensate for the deficiencies of handcrafted feature engineering under ATC challenges, including multilingual and multispeaker dialogue and unstable speech rates, a speech representation learning (SRL) network is proposed to capture robust and discriminative speech representations from raw waves. A self-supervised training strategy is adopted to optimize the SRL network on unlabeled data and to further predict the speech features, i.e., wave-to-feature. An improved end-to-end architecture completes the ASR task, in which a grapheme-based modeling unit addresses the multilingual ASR issue. To cope with the scarcity of transcribed samples in the ATC domain, an unsupervised mask-prediction approach pretrains the backbone network of the ASR model on unlabeled data via a feature-to-feature process. Finally, by integrating the SRL with the ASR, an end-to-end multilingual ASR framework is formulated in a supervised manner, which translates the raw wave into text within one model, i.e., wave-to-text. Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves high performance with a very small labeled corpus and low resource consumption: only a 4.20% label error rate on the 58-hour transcribed corpus. Compared to the baseline model, the proposed approach obtains over 100% relative performance improvement, which can be further enhanced as the size of the transcribed sample set increases.
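The feature-to-feature mask-prediction pretraining described above can be illustrated with a minimal sketch (the function names, shapes, and masking ratio below are hypothetical illustrations, not taken from the paper): random time frames of a speech-feature matrix are hidden, and a reconstruction loss is computed only on the hidden frames, so the backbone can be trained without any transcriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(features, mask_prob=0.15, mask_value=0.0):
    """Randomly mask whole time frames of a (T, D) feature matrix.

    Returns the corrupted features and a boolean mask marking hidden frames.
    """
    num_frames = features.shape[0]
    mask = rng.random(num_frames) < mask_prob
    corrupted = features.copy()
    corrupted[mask] = mask_value
    return corrupted, mask

def reconstruction_loss(predicted, target, mask):
    """L2 loss restricted to the masked frames (feature-to-feature objective)."""
    if not mask.any():
        return 0.0
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))

# Toy example: 100 frames of 40-dimensional speech features.
feats = rng.standard_normal((100, 40))
corrupted, mask = mask_frames(feats)
# In real pretraining, a network would predict the hidden frames from context;
# here we score the (untrained) corrupted input against the clean target.
loss = reconstruction_loss(corrupted, feats, mask)
```

In an actual system the corrupted features would pass through the backbone network, and gradients of this masked loss would update its parameters; the sketch only shows the data flow of the objective.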
It is also confirmed that the proposed SRL network and training strategies contribute significantly to the final performance. In addition, the effectiveness of the proposed framework is validated on common corpora (AISHELL, LibriSpeech, and cv-fr). More importantly, the proposed multilingual framework not only reduces system complexity but also achieves higher accuracy than independent monolingual ASR models. The proposed approach can also greatly reduce the cost of annotating samples, which helps advance the ASR technique toward industrial applications.
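The grapheme-based modeling unit mentioned above is what lets a single output vocabulary cover both Chinese and English in one model. A hedged sketch of the idea (the helper names and toy transcripts are hypothetical, not from the paper): every distinct character across all languages becomes one output symbol.

```python
def build_grapheme_vocab(transcripts):
    """Collect every distinct grapheme (character) across all transcripts,
    so Chinese characters and English letters share one output space."""
    graphemes = sorted({ch for line in transcripts for ch in line if not ch.isspace()})
    return {g: i for i, g in enumerate(graphemes)}

def encode(text, vocab):
    """Map a transcript to grapheme indices (whitespace dropped)."""
    return [vocab[ch] for ch in text if not ch.isspace()]

# Toy bilingual transcripts, loosely styled after ATC phraseology.
corpus = ["climb to flight level three four zero", "上升到高度八千"]
vocab = build_grapheme_vocab(corpus)
ids = encode("level 八千", vocab)
```

Because the vocabulary is a flat set of characters rather than per-language lexicons, adding a language only grows the output layer rather than requiring a separate model, which is one way the multilingual framework can reduce system complexity.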




Updated: 2021-09-06