A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements
Journal of Science Education and Technology (IF 3.3), Pub Date: 2020-11-19, DOI: 10.1007/s10956-020-09875-z
Xiaoming Zhai, Lehong Shi, Ross H. Nehm

Machine learning (ML) has been increasingly employed in science assessment to facilitate automatic scoring efforts, although with varying degrees of success (i.e., magnitudes of machine-human score agreements [MHAs]). Little work has empirically examined the factors that impact MHA disparities in this growing field, thus constraining the improvement of machine scoring capacity and its wide application in science education. We performed a meta-analysis of 110 studies of MHAs in order to identify the factors most strongly contributing to scoring success (i.e., high Cohen's kappa [\(\kappa\)]). We empirically examined six factors proposed as contributors to MHA magnitudes: algorithm, subject domain, assessment format, construct, school level, and machine supervision type. Our analyses of the 110 MHAs revealed substantial heterogeneity in \(\kappa\) (mean = .64; range = .09–.97, taking weights into consideration). Using three-level random-effects modeling, MHA score heterogeneity was explained by variability both within publications (i.e., the assessment task level: 82.6%) and between publications (i.e., the individual study level: 16.7%). Our results also suggest that all six factors have significant moderator effects on scoring success magnitudes. Among these, algorithm and subject domain had significantly larger effects than the other factors, suggesting that technical features and assessment external features might be primary targets for improving MHAs and ML-based science assessments.
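The agreement statistic aggregated throughout the meta-analysis is Cohen's kappa, which corrects raw percent agreement between machine and human raters for the agreement expected by chance. A minimal sketch follows, using hypothetical rubric scores (the arrays below are illustrative, not data from the paper):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score  # reference implementation

# Hypothetical rubric scores (0-2) that a human rater and an ML model
# assigned to the same ten constructed responses.
human = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
machine = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 2])

def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance."""
    labels = np.union1d(a, b)
    p_o = np.mean(a == b)  # proportion of identical scores
    # Chance agreement: product of the two raters' marginal label rates.
    p_e = sum(np.mean(a == l) * np.mean(b == l) for l in labels)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(human, machine))       # ~0.697 on this toy data
print(cohen_kappa_score(human, machine))  # sklearn gives the same value
```

On the commonly used Landis and Koch benchmarks, the mean of .64 reported here falls in the "substantial" agreement band (.61–.80), while the low end of the range (.09) is barely above chance.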
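The variance decomposition reported above comes from a three-level model. A sketch of the common formulation, assuming assessment tasks (level 2) nested within publications (level 3); the paper's exact parameterization may differ:

```latex
% kappa_{ij}: observed agreement for assessment task j in publication i
\[
\kappa_{ij} = \mu + u_i + u_{ij} + e_{ij}, \qquad
u_i \sim N(0, \tau^2_{\mathrm{between}}), \quad
u_{ij} \sim N(0, \tau^2_{\mathrm{within}}), \quad
e_{ij} \sim N(0, v_{ij})
\]
```

Under this reading, the 82.6% and 16.7% figures are the shares of total variance attributable to the within-publication component \(\tau^2_{\mathrm{within}}\) (task level) and the between-publication component \(\tau^2_{\mathrm{between}}\) (study level), respectively, with the small remainder due to sampling error \(v_{ij}\).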



Updated: 2020-12-23