Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4), Pub Date: 2022-01-12, DOI: 10.1186/s13636-021-00233-4
Siqing Qin 1, Longbiao Wang 1, Jianwu Dang 1,2, Sheng Li 3, Lixin Pan 4

Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition achieve promising results when sufficient resources are available. For low-resource languages, however, ASR remains challenging. The Lhasa dialect is the most widespread Tibetan dialect and has a wealth of speakers and transcriptions, so applying ASR to it is meaningful for historical heritage protection and cultural exchange. Previous work on Tibetan speech recognition focused on selecting phone-level acoustic modeling units and incorporating tonal information, but underestimated the influence of limited data. This paper aims to improve speech recognition for the low-resource Lhasa dialect by applying multilingual speech recognition to an E2E architecture within a transfer learning framework. Using transfer learning, we first build monolingual E2E ASR systems for the Lhasa dialect, initializing the model from different source languages in order to compare how much each source language benefits the Tibetan ASR model. We then propose a multilingual E2E ASR system that combines initialization strategies using different source languages with multilevel units, an approach proposed here for the first time. Our experiments show that the ASR system based on the proposed method outperforms the E2E baseline ASR system. The proposed method effectively models the low-resource Lhasa dialect and achieves a 14.2% relative improvement in character error rate (CER) over DNN-HMM systems. Moreover, from the best monolingual E2E model to the best multilingual E2E model of the Lhasa dialect, CER improves by 8.4%.
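
The abstract reports its gains as changes in character error rate (CER). As a rough illustration of how such figures are usually computed, the sketch below takes CER to be the character-level edit distance between the reference and the hypothesis, divided by the reference length, and takes relative improvement to be the fractional reduction in CER. This is the common textbook definition and is only assumed here to match the paper's scoring. The function names `edit_distance`, `cer`, and `relative_improvement` are hypothetical, and the example strings are made up, since the abstract gives no absolute CER values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline_cer, new_cer):
    """Fractional CER reduction relative to the baseline (assumed definition)."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Illustrative strings only, not data from the paper.
print(cer("abcd", "abxd"))  # one substitution over four characters: 0.25
```

Under this definition, a 14.2% relative improvement means the new CER is 85.8% of the baseline CER; the absolute reduction depends on the baseline value, which the abstract does not state.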

Updated: 2022-01-13