MuST-C: A multilingual corpus for end-to-end speech translation
Computer Speech & Language ( IF 4.3 ) Pub Date : 2020-10-07 , DOI: 10.1016/j.csl.2020.101155
Roldano Cattoni , Mattia Antonino Di Gangi , Luisa Bentivogli , Matteo Negri , Marco Turchi

End-to-end spoken language translation (SLT) has recently gained popularity thanks to advances in sequence-to-sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field must contend with the scarcity of publicly available corpora for training data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small, and of limited language coverage. We help fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.




Updated: 2020-10-30