Code-switched automatic speech recognition in five South African languages
Computer Speech & Language (IF 4.3), Pub Date: 2021-07-01, DOI: 10.1016/j.csl.2021.101262
Astik Biswas , Emre Yılmaz , Ewald van der Westhuizen , Febe de Wet , Thomas Niesler

Most automatic speech recognition (ASR) systems are optimised for one specific language, and their performance consequently deteriorates drastically when confronted with multilingual or code-switched speech. We describe our efforts to improve an ASR system that can process code-switched South African speech containing English and four indigenous languages: isiZulu, isiXhosa, Sesotho and Setswana. We begin with a newly developed, language-balanced corpus of code-switched speech compiled from South African soap operas, which are rich in spontaneous code-switching. The small size of the corpus makes this scenario under-resourced, and hence we explore several ways of addressing this data sparsity. We consider augmenting the acoustic training sets with in-domain data, at the expense of making them unbalanced and dominated by English. We further explore the inclusion of monolingual out-of-domain data in the constituent languages. For language modelling, we investigate the inclusion of out-of-domain text data sources and of synthetically generated code-switch bigrams. In our experiments, we consider two system architectures. The first comprises four bilingual speech recognisers, each allowing code-switching between English and one of the indigenous languages. The second is a single pentalingual speech recogniser able to process switching between all five languages. We find that the inclusion of each additional acoustic and text data source leads to some improvement. While in-domain data is substantially more effective, performance gains were also achieved using out-of-domain data, which is often much easier to obtain. We also find that improvements are achieved in all five languages, even when the training set becomes unbalanced and heavily skewed in favour of English. Finally, we find that TDNN-F acoustic model architectures consistently outperform TDNN-BLSTM models in our data-sparse scenario.
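The abstract mentions augmenting the language model with synthetically generated code-switch bigrams. The paper's exact generation scheme is not given here, so the following is only an illustrative sketch under an assumed scheme: cross-pairing words from an English vocabulary with words from an indigenous-language vocabulary, in both switch directions, to produce bigram counts that could be interpolated into an n-gram LM. All names and word lists are hypothetical.

```python
# Hypothetical sketch of synthetic code-switch bigram generation.
# This pairing scheme is an assumption for illustration, not the
# authors' actual method.
from itertools import product
from collections import Counter

def synthetic_cs_bigrams(english_words, indigenous_words, max_pairs=None):
    """Cross-pair words across the two languages in both switch directions,
    returning bigram counts suitable for merging into n-gram LM counts."""
    pairs = []
    for en, zu in product(english_words, indigenous_words):
        pairs.append((en, zu))   # English -> indigenous switch point
        pairs.append((zu, en))   # indigenous -> English switch point
    if max_pairs is not None:
        pairs = pairs[:max_pairs]
    return Counter(pairs)

# Toy vocabularies (hypothetical examples, not from the corpus):
counts = synthetic_cs_bigrams(["school", "money"], ["umfundi", "imali"])
print(len(counts))  # 8 distinct synthetic code-switch bigrams
```

In practice such synthetic counts would be heavily discounted or interpolated with bigrams observed in the in-domain transcriptions, since blind cross-pairing overgenerates switch points that never occur in real speech.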




Updated: 2021-07-09