Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics
Interpreting (IF 1.8). Pub Date: 2023-02-09. DOI: 10.1075/intp.00076.lu
Xiaolei Lu, Chao Han

Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. Whereas the instantaneity and reliability of such metrics are made possible by automatic computation based on predetermined algorithms, their validity is primarily dependent on a strong correlation with human assessments. Despite the popularity of such metrics in MT, little research has been conducted to explore their usefulness in the automatic assessment of human translation or interpreting. In the present study, we therefore seek to provide an initial insight into the way MT metrics would function in assessing spoken-language interpreting by human interpreters. Specifically, we selected five representative metrics – BLEU, NIST, METEOR, TER and BERT – to evaluate 56 bidirectional consecutive English–Chinese interpretations produced by 28 student interpreters of varying abilities. We correlated the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios). The major finding is that BLEU, NIST, and METEOR had moderate-to-strong correlations with the human-assigned scores across the assessment scenarios, especially for the English-to-Chinese direction. Finally, we discuss the possibility and caveats of using MT metrics in assessing human interpreting.
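
The abstract does not spell out how the metrics were computed, so the following is a minimal sketch of the pipeline it describes, assuming one reference rendition per interpretation transcript: score each transcript with BLEU, TER, NIST, METEOR and BERTScore, then correlate the metric scores with the human-assigned ones. The library choices (sacrebleu, NLTK, bert-score, SciPy), the toy transcripts, and the whitespace tokenization (suitable only for English-side output) are illustrative assumptions, not details reported by the authors.

```python
# A minimal sketch of the scoring-and-correlation pipeline described above.
# Libraries, toy data and tokenization are illustrative assumptions, not the
# authors' actual implementation.
# Requires: pip install sacrebleu nltk bert-score scipy
# plus the NLTK wordnet data (nltk.download("wordnet")) for METEOR.
import sacrebleu
from bert_score import score as bertscore
from nltk.translate.meteor_score import meteor_score
from nltk.translate.nist_score import sentence_nist
from scipy.stats import spearmanr


def metric_scores(hypothesis: str, reference: str) -> dict:
    """Score one interpretation transcript against one reference rendition."""
    hyp_tok, ref_tok = hypothesis.split(), reference.split()
    return {
        "BLEU": sacrebleu.sentence_bleu(hypothesis, [reference]).score,
        "TER": sacrebleu.sentence_ter(hypothesis, [reference]).score,
        "NIST": sentence_nist([ref_tok], hyp_tok),
        "METEOR": meteor_score([ref_tok], hyp_tok),
    }


# Hypothetical stand-ins for the 56 transcripts, their reference renditions,
# and the scores assigned by one type of human rater under one scenario.
hypotheses = [
    "the delegates have reached a basic consensus on the agenda",
    "we hope both sides can strengthen cooperation in trade",
]
references = [
    "the delegates reached a basic consensus on the agenda",
    "we hope the two sides will strengthen trade cooperation",
]
human_scores = [85.0, 72.5]

per_metric = {m: [] for m in ("BLEU", "TER", "NIST", "METEOR")}
for hyp, ref in zip(hypotheses, references):
    for name, value in metric_scores(hyp, ref).items():
        per_metric[name].append(value)

# BERTScore is computed in a batch; its F1 is the usual summary figure.
_, _, f1 = bertscore(hypotheses, references, lang="en")
per_metric["BERT"] = f1.tolist()

# Validity check: correlate each metric with the human-assigned scores.
# Note that TER is an error rate, so a negative correlation is expected.
for name, values in per_metric.items():
    rho, p = spearmanr(values, human_scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```

Scaled to the study itself, the loop would run over all 56 bidirectional interpretations, and the correlation step would be repeated for each rater type and scoring method, yielding one coefficient per metric per assessment scenario.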

Updated: 2023-02-09