Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments
Computational Linguistics (IF 9.3), Pub Date: 2019-09-01, DOI: 10.1162/coli_a_00356
Marina Fomicheva, Lucia Specia

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to improving evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics, focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metrics' performance varies significantly across different levels of MT quality: Metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating the output of neural MT systems than that of traditional statistical MT systems. Finally, we show that the differences in evaluation accuracy across metrics persist even when the gold-standard scores are based on different criteria.
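To make the meta-evaluation setup concrete: the standard benchmark is a single Pearson correlation between metric scores and human scores over a whole test set. A minimal Python sketch of this is given below, together with a banded "local" correlation that illustrates the general idea of measuring dependency within quality levels. Note that the banded variant and the synthetic data are hypothetical illustrations for this page; the abstract does not specify the paper's actual local dependency measure, and the function names (global_correlation, local_correlations) and band count are assumptions.

    import numpy as np
    from scipy.stats import pearsonr

    def global_correlation(metric_scores, human_scores):
        """Standard meta-evaluation: one Pearson r over the whole test set."""
        r, _ = pearsonr(metric_scores, human_scores)
        return r

    def local_correlations(metric_scores, human_scores, n_bands=4):
        """Illustrative alternative (NOT the paper's measure): Pearson r
        within bands of human quality. Segments are sorted by human score
        and split into equal-size bands, so each r reflects how well the
        metric tracks quality *within* one band, e.g., among low-quality
        translations only."""
        order = np.argsort(human_scores)
        m = np.asarray(metric_scores)[order]
        h = np.asarray(human_scores)[order]
        bands = []
        for m_band, h_band in zip(np.array_split(m, n_bands),
                                  np.array_split(h, n_bands)):
            r, _ = pearsonr(m_band, h_band)
            bands.append(r)
        return bands  # a low r in the first band = weak on poor MT output

    # Example with synthetic (hypothetical) scores:
    rng = np.random.default_rng(0)
    human = rng.uniform(0, 100, size=400)          # human quality scores
    metric = human + rng.normal(0, 15, size=400)   # noisy automatic metric
    print(global_correlation(metric, human))
    print(local_correlations(metric, human))

On data like this, the global r can look strong while the per-band r values are much lower, which is the kind of discrepancy between overall correlation and behavior at specific quality levels that motivates looking beyond a single global coefficient.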
