AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT
arXiv - CS - Computation and Language. Pub Date: 2021-06-09, DOI: arXiv:2106.05141
Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Because such large corpora are lacking for low-resource language pairs, NMT systems often perform poorly on them. Extra relevant monolingual data often helps, but acquiring it can be quite expensive, especially for low-resource languages. Moreover, domain mismatch between the bitext (train/test) and the monolingual data may degrade performance. To alleviate these issues, we propose AUGVIC, a novel data augmentation framework for low-resource NMT that exploits the vicinal samples of the given bitext without explicitly using any extra monolingual data. It can diversify the in-domain bitext data with finer-grained control. Through extensive experiments on four low-resource language pairs comprising data from different domains, we show that our method is comparable to traditional back-translation that uses extra in-domain monolingual data. When we combine the synthetic parallel data generated by AUGVIC with that from extra monolingual data, we achieve further improvements. We show that AUGVIC helps to attenuate the discrepancy between relevant and distant-domain monolingual data in traditional back-translation. To understand the contributions of the different components of AUGVIC, we perform an in-depth framework analysis.
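As a loose illustration of the augmentation pipeline described above (not the paper's actual implementation, which edits target sentences with a language model): vicinal samples are slight perturbations of existing target-side sentences, which are then paired with synthetic sources, as in back-translation. The sketch below stands in for both steps with toy functions — `perturb` randomly swaps a fraction of tokens for other in-domain words, and `back_translate` is a placeholder for a reverse (target-to-source) NMT model; all names here are hypothetical.

```python
import random

def perturb(sentence, vocab, rate=0.15, seed=0):
    """Create a 'vicinal' variant of a target sentence by randomly
    replacing a fraction of its tokens with other in-domain words
    (a crude stand-in for LM-based sentence editing)."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(vocab) if rng.random() < rate else tok
        for tok in sentence.split()
    )

def back_translate(target_sentence):
    """Placeholder for a target->source NMT model that produces the
    synthetic source side of an augmented pair."""
    return f"<synthetic-src of: {target_sentence}>"

def augment_bitext(bitext, vocab, n_variants=2):
    """Expand a list of (src, tgt) pairs with vicinal synthetic pairs,
    keeping the original in-domain bitext intact."""
    augmented = list(bitext)
    for i, (_, tgt) in enumerate(bitext):
        for k in range(n_variants):
            vicinal_tgt = perturb(tgt, vocab, seed=i * 10 + k)
            augmented.append((back_translate(vicinal_tgt), vicinal_tgt))
    return augmented

bitext = [("guten morgen", "good morning")]
vocab = ["fine", "evening", "day"]
data = augment_bitext(bitext, vocab)
```

The `rate` parameter plays the role of the "finer level control" mentioned in the abstract: a lower rate keeps vicinal samples closer to the original in-domain sentences, a higher rate diversifies them more aggressively.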

Updated: 2021-06-10