Exploring Supervised and Unsupervised Rewards in Machine Translation
arXiv - CS - Computation and Language. Pub Date: 2021-02-22, DOI: arxiv-2102.11403
Julia Ive, Zixu Wang, Marina Fomicheva, Lucia Specia

Reinforcement Learning (RL) is a powerful framework for addressing the discrepancy between the loss functions used during training and the final evaluation metrics used at test time. When applied to neural Machine Translation (MT), it minimises the mismatch between the cross-entropy loss and non-differentiable evaluation metrics like BLEU. However, the suitability of these metrics as reward functions at training time is questionable: they tend to be sparse and biased towards the specific words used in the reference texts. We propose to address this problem by making models less reliant on such metrics in two ways: (a) with an entropy-regularised RL method that not only maximises a reward function but also explores the action space to avoid peaky distributions; (b) with a novel RL method that uses a dynamic unsupervised reward function to balance exploration and exploitation. We base our proposals on the Soft Actor-Critic (SAC) framework, adapting its off-policy maximum-entropy model to language generation applications such as MT. We demonstrate that SAC with a BLEU reward overfits less to the training data and performs better on out-of-domain data. We also show that our dynamic unsupervised reward can lead to better translation of ambiguous words.
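For reference, the entropy-regularised objective that SAC optimises can be summarised as follows. This is the standard maximum-entropy RL formulation rather than an equation quoted from the paper, and the mapping of states and actions onto MT (source sentence plus partial translation as the state, the next target token as the action) is our reading of the abstract:

    J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

Here r is the reward (e.g. a BLEU-based supervised reward or the dynamic unsupervised reward proposed in the paper), \mathcal{H} is the entropy of the policy \pi, and the temperature \alpha controls how strongly exploration is rewarded relative to pure reward maximisation, discouraging the peaky output distributions mentioned above.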

Updated: 2021-02-24