Crosslingual Embeddings are Essential in UNMT for Distant Languages: An English to IndoAryan Case Study
arXiv (cs.CL) · Pub Date: 2021-06-09 · arXiv:2106.04995
Tamali Banerjee, Rudra Murthy V, Pushpak Bhattacharyya

Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. However, the situation is very different for distant language pairs. Lack of lexical overlap and low syntactic similarity, such as between English and Indo-Aryan languages, lead to poor translation quality in existing UNMT systems. In this paper, we show that initializing the embedding layer of UNMT models with cross-lingual embeddings yields significant BLEU score improvements over existing approaches that initialize embeddings randomly. Further, static embeddings (freezing the embedding layer weights) lead to better gains than updating the embedding layer weights during training (non-static). We experimented with Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvements of as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows the importance of cross-lingual embeddings, comparisons between approaches, and the scope for improvement in these systems.
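The core idea in the abstract — copy pretrained cross-lingual vectors into the UNMT embedding layer, then optionally freeze that layer (the "static" setting) — can be sketched as a toy numpy example. Everything here (the helper names, the two-word shared space, the one-line optimizer) is an illustrative assumption, not the authors' implementation; in a real system the vectors would come from a method such as MUSE or VecMap and the model would be a full MASS or DAE transformer.

```python
import numpy as np

def init_embedding(vocab, crosslingual_vectors, dim, seed=0):
    """Fill an embedding matrix with pretrained cross-lingual vectors,
    falling back to small random values for tokens without one."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.01, size=(len(vocab), dim))
    for i, tok in enumerate(vocab):
        if tok in crosslingual_vectors:
            emb[i] = crosslingual_vectors[tok]
    return emb

def sgd_step(emb, grad, lr=0.1, static=True):
    """One toy gradient step: 'static' freezes the embedding layer,
    'non-static' updates it like any other weight matrix."""
    return emb if static else emb - lr * grad

# Toy shared space: the English word and its Hindi counterpart map to
# the same vector, which is what a cross-lingual embedding provides and
# what random initialization cannot.
vocab = ["house", "ghar", "the"]
shared = {"house": np.array([1.0, 0.0]), "ghar": np.array([1.0, 0.0])}
emb = init_embedding(vocab, shared, dim=2)

grad = np.ones_like(emb)
assert np.allclose(sgd_step(emb, grad, static=True), emb)       # frozen: unchanged
assert not np.allclose(sgd_step(emb, grad, static=False), emb)  # non-static: updated
```

The static setting keeps the English and Indo-Aryan vectors aligned in the shared space throughout training, which is one plausible reading of why the paper finds freezing helps for distant pairs.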

Updated: 2021-06-10