Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English
Assessment in Education: Principles, Policy & Practice (IF 2.516), Pub Date: 2021-09-21, DOI: 10.1080/0969594x.2021.1979466
Larry Davis, Spiros Papageorgiou

ABSTRACT

Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use and topic development. The SpeechRater℠ automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability, and reliability steadily decreased as more machine scores replaced human scores.




Last updated: 2021-09-21