A large English–Thai parallel corpus from the web and machine-generated text
Language Resources and Evaluation (IF 2.7), Pub Date: 2021-03-30, DOI: 10.1007/s10579-021-09536-6
Lalita Lowphansirikul , Charin Polpanumas , Attapol T. Rutherford , Sarana Nutanong

The primary objective of our work is to build a large-scale English–Thai dataset for training neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai machine translation dataset with over 1 million segment pairs, curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and text artificially generated by a pretrained language model. We present the methods for gathering data, aligning texts, and removing preprocessing noise and translation errors automatically. We also train machine translation models based on this dataset to assess the quality of the corpus. Our models perform comparably to Google Translation API (as of May 2020) for Thai–English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai–English and English–Thai translation. The dataset is available for public use under CC-BY-SA 4.0 License. The pre-trained models and source code to reproduce our work are available under Apache-2.0 License.
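The abstract mentions aligning texts and automatically filtering out preprocessing noise and translation errors. The snippet below is a minimal, hypothetical sketch of one common way to do this kind of filtering, scoring candidate English–Thai pairs with cross-lingual embedding cosine similarity and discarding low-scoring pairs. The `toy_embed` function, the threshold value, and the overall pipeline are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(en_sentences, th_sentences, embed, threshold=0.6):
    """Keep candidate (English, Thai) pairs whose cross-lingual embeddings
    are similar enough; everything else is treated as alignment noise or a
    likely translation error."""
    kept = []
    for en, th in zip(en_sentences, th_sentences):
        if cosine_sim(embed(en), embed(th)) >= threshold:
            kept.append((en, th))
    return kept

if __name__ == "__main__":
    # Toy embedding stand-in: a real pipeline would use a multilingual
    # sentence encoder; this hash-seeded random vector only exists to make
    # the sketch self-contained and runnable.
    def toy_embed(text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(16)

    pairs = filter_pairs(
        ["Hello world.", "Thank you."],
        ["สวัสดีชาวโลก", "ขอบคุณ"],
        embed=toy_embed,
        threshold=-1.0,  # accept everything in this toy demo
    )
    print(len(pairs), "pairs kept")
```

In practice, the threshold would be tuned on a held-out sample of manually checked pairs, trading corpus size against pair quality.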



Updated: 2021-03-30