Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy
Ophthalmology ( IF 13.7 ) Pub Date : 2018-03-13 , DOI: 10.1016/j.ophtha.2018.01.034
Jonathan Krause , Varun Gulshan , Ehsan Rahimy , Peter Karth , Kasumi Widner , Greg S. Corrado , Lily Peng , Dale R. Webster

Purpose

To use adjudication to quantify errors in diabetic retinopathy (DR) grading by individual graders and by majority decision, and to train an improved automated algorithm for DR grading.

Design

Retrospective analysis.

Participants

Retinal fundus images from DR screening programs.

Methods

Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard.

Main Outcome Measures

For agreement between different graders as well as between the graders and the algorithm, we measured the (quadratic-weighted) kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity.
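The quadratic-weighted kappa used here penalizes disagreements by the squared distance between ordinal grades, so a one-step disagreement (e.g., mild vs. moderate) costs far less than a large one. A minimal sketch of that computation follows; the 5-point grade lists are hypothetical, not data from the study.

```python
# Quadratic-weighted kappa between two graders on an ordinal scale.
# Illustrative only; grades below are hypothetical, not study data.

def quadratic_weighted_kappa(a, b, n_classes):
    """Agreement between two graders assigning grades 0..n_classes-1."""
    # Observed co-occurrence matrix of grade pairs.
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    total = len(a)
    # Marginal grade distributions of each grader.
    row = [sum(obs[i]) for i in range(n_classes)]
    col = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
            num += w * obs[i][j]                      # observed disagreement
            den += w * row[i] * col[j] / total        # chance disagreement
    return 1.0 - num / den

# Hypothetical 5-point DR grades (0 = none ... 4 = proliferative DR).
grader_1 = [0, 0, 1, 2, 2, 3, 4, 1, 0, 2]
grader_2 = [0, 1, 1, 2, 3, 3, 4, 1, 0, 1]
print(round(quadratic_weighted_kappa(grader_1, grader_2, 5), 3))  # → 0.909
```

Perfect agreement gives kappa = 1.0, while agreement no better than chance gives 0.0; values in the 0.80-0.91 range reported below indicate strong but imperfect agreement.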

Results

Of the 193 discrepancies between adjudication by retinal specialists and majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa scores for individual retinal specialists and ophthalmologists ranged from 0.82 to 0.91 and from 0.80 to 0.84, respectively; the algorithm's kappa was 0.84. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981. The algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm's AUC improved from 0.934 to 0.986 for moderate or worse DR.
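The sensitivity and specificity figures above come from binarizing the 5-point DR grades at a severity cutoff (e.g., "moderate or worse" = grade 2 or higher) and comparing each grader or the model against the adjudicated reference. A minimal sketch, using hypothetical grades rather than study data:

```python
# Sensitivity/specificity at a DR severity cutoff, e.g. "moderate or
# worse" = grade >= 2. Illustrative only; grades are hypothetical.

def sens_spec(reference, predicted, cutoff):
    """Binarize ordinal grades at `cutoff`, then score the prediction."""
    tp = fp = tn = fn = 0
    for r, p in zip(reference, predicted):
        ref_pos, pred_pos = r >= cutoff, p >= cutoff
        if ref_pos and pred_pos:
            tp += 1            # referable DR, correctly flagged
        elif not ref_pos and pred_pos:
            fp += 1            # non-referable, incorrectly flagged
        elif not ref_pos and not pred_pos:
            tn += 1            # non-referable, correctly passed
        else:
            fn += 1            # referable DR, missed
    return tp / (tp + fn), tn / (tn + fp)

# Adjudicated reference grades vs. one grader's (or the model's) grades.
reference = [0, 0, 1, 2, 3, 4, 2, 0, 1, 3]
predicted = [0, 1, 1, 2, 2, 4, 1, 0, 2, 3]
sensitivity, specificity = sens_spec(reference, predicted, cutoff=2)
print(sensitivity, specificity)  # → 0.8 0.8
```

Sweeping the model's continuous severity score across all possible thresholds, rather than fixing one cutoff, traces the ROC curve whose area is the reported AUC.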

Conclusions

Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists.


