Softmax Tempering for Training Neural Machine Translation Models
arXiv - CS - Computation and Language. Pub Date: 2020-09-20. DOI: arXiv:2009.09372
Raj Dabre and Atsushi Fujita

Neural machine translation (NMT) models are typically trained using a softmax cross-entropy loss, where the softmax distribution is compared against smoothed gold labels. In low-resource scenarios, NMT models tend to overfit because the softmax distribution quickly approaches the gold label distribution. To address this issue, we propose to divide the logits by a temperature coefficient prior to applying softmax during training. In our experiments on 11 language pairs in the Asian Language Treebank dataset and the WMT 2019 English-to-German translation task, we observed significant improvements in translation quality of up to 3.9 BLEU points. Furthermore, softmax tempering makes greedy search as good as beam search decoding in terms of translation quality, enabling a 1.5 to 3.5 times speed-up. We also study the impact of softmax tempering on multilingual NMT and recurrently stacked NMT, both of which aim to reduce NMT model size through parameter sharing, thereby verifying the utility of temperature in developing compact NMT models. Finally, an analysis of softmax entropies and gradients reveals the impact of our method on the internal behavior of NMT models.
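The training-time change described in the abstract amounts to dividing the logits by a temperature coefficient before the softmax inside a label-smoothed cross-entropy loss. The sketch below illustrates this in PyTorch; the function name tempered_smoothed_ce, the default hyperparameter values, and the padding convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tempered_smoothed_ce(logits, targets, temperature=2.0,
                         smoothing=0.1, ignore_index=0):
    """Label-smoothed cross-entropy with softmax tempering:
    the logits are divided by a temperature coefficient before
    the softmax (applied at training time only)."""
    vocab_size = logits.size(-1)

    # Softmax tempering: dividing by T > 1 flattens the predicted distribution.
    log_probs = F.log_softmax(logits / temperature, dim=-1)

    # Smoothed gold distribution: (1 - eps) on the gold token,
    # eps spread uniformly over the remaining vocabulary entries.
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / (vocab_size - 1))
        true_dist.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)

    loss = -(true_dist * log_probs).sum(dim=-1)

    # Average over non-padding target positions.
    mask = targets.ne(ignore_index).float()
    return (loss * mask).sum() / mask.sum()

# Example usage with a toy vocabulary of 8 tokens and 4 target positions
# (shapes and the padding index 0 are assumptions for illustration).
logits = torch.randn(4, 8)
targets = torch.tensor([3, 1, 7, 0])
loss = tempered_smoothed_ce(logits, targets, temperature=2.0)
```

Since the abstract applies tempering during training, this sketch assumes the plain (untempered) softmax is used at decoding time, which is where the reported greedy-search gains would apply; see the paper for the exact decoding setup.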

Updated: 2020-09-22