Neural machine translation of low-resource languages using SMT phrase pair injection
Natural Language Engineering (IF 2.3), Pub Date: 2020-06-17, DOI: 10.1017/s1351324920000303
Sukanta Sen, Mohammed Hasanuzzaman, Asif Ekbal, Pushpak Bhattacharyya, Andy Way

Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and such a corpus is not always available because building one takes time, money, and professional translators. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improving an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases extracted from that same training data by a statistical machine translation (SMT) system. Our proposed approach is based on gated recurrent unit (GRU) and Transformer networks. We use Hindi–English and Hindi–Bengali datasets from the Health, Tourism, and Judicial (Hindi–English only) domains, and train NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements of 1.38–15.36 BLEU (BiLingual Evaluation Understudy) points over the baseline systems, and that Transformer models outperform GRU models in low-resource scenarios. We also find that, for some translation directions, our proposed method outperforms SMT, which is known to work better than neural models in low-resource settings. To further demonstrate the effectiveness of our approach, we apply it to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old–modern English text that is approximately 1000 years old. Evaluation on this task shows a significant improvement over the baseline NMT system.
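To make the injection step concrete, the following is a minimal Python sketch of the data-augmentation idea, assuming a Moses-style phrase table (`src ||| tgt ||| scores ...`) produced by training an SMT system on the original parallel data. The file names, probability threshold, and maximum phrase length are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of SMT phrase pair injection (hypothetical file names and
# filter values; the paper's exact extraction and filtering setup may differ).

def load_phrase_pairs(phrase_table_path, min_prob=0.5, max_len=7):
    """Yield (source, target) phrase pairs that pass simple quality filters."""
    with open(phrase_table_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            if len(fields) < 3:
                continue
            src, tgt, scores = fields[0], fields[1], fields[2].split()
            # In a Moses phrase table the first score is the inverse phrase
            # translation probability; keep only reasonably confident pairs.
            if float(scores[0]) < min_prob:
                continue
            # Very long "phrases" add little beyond the original sentences.
            if len(src.split()) > max_len or len(tgt.split()) > max_len:
                continue
            yield src, tgt


def augment_corpus(train_src, train_tgt, phrase_table_path, out_src, out_tgt):
    """Write the original corpus plus extracted phrase pairs as extra pairs."""
    with open(train_src, encoding="utf-8") as in_s, \
         open(train_tgt, encoding="utf-8") as in_t, \
         open(out_src, "w", encoding="utf-8") as out_s, \
         open(out_tgt, "w", encoding="utf-8") as out_t:
        # 1. Copy the original sentence pairs unchanged.
        for src_line, tgt_line in zip(in_s, in_t):
            out_s.write(src_line)
            out_t.write(tgt_line)
        # 2. Inject each phrase pair as an additional pseudo sentence pair.
        for src, tgt in load_phrase_pairs(phrase_table_path):
            out_s.write(src + "\n")
            out_t.write(tgt + "\n")


if __name__ == "__main__":
    augment_corpus("train.hi", "train.en", "phrase-table",
                   "train.aug.hi", "train.aug.en")
```

The augmented source/target files can then be fed to a standard GRU- or Transformer-based NMT training pipeline in place of the original corpus.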

Updated: 2020-06-17