Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages, arXiv - CS - Computation and Language

arXiv - CS - Computation and Language. Pub Date: 2020-09-23, DOI: arxiv-2009.11201
Xavier Garcia, Aditya Siddhant, Orhan Firat, Ankur P. Parikh

Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) to and from English directions, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models.

Updated: 2020-09-24
