A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms.
BMC Medical Informatics and Decision Making (IF 3.5), Pub Date: 2020-01-06, DOI: 10.1186/s12911-019-1014-6
André M Carrington, Paul W Fieguth, Hammad Qazi, Andreas Holzinger, Helen H Chen, Franz Mayr, Douglas G Manuel

BACKGROUND In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. However, only part of the ROC curve and AUC is informative when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. These alternatives cannot be interpreted as fully as the AUC, in part because they ignore some information about actual negatives.

METHODS We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and are validated as equal in summation to the whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided.

RESULTS Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures.

CONCLUSIONS The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence.

FUTURE WORK Future work with our proposed measures may demonstrate their value for imbalanced data with high prevalence, compare them to other measures not based on areas, and combine them with other ROC measures and techniques.
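The equalities described in the abstract can be illustrated with a small sketch. The abstract does not state the formulas, so the code below makes two assumptions: the whole c statistic is the usual fraction of correctly ranked (positive, negative) pairs with ties counted as one half, and a partial area that stays consistent with the AUC can be formed by averaging a vertical-strip area (under the curve, over a range of false positive rates) with a horizontal-strip area (to the right of the curve, over a range of true positive rates). All function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import product


def c_statistic(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p, n in product(pos, neg):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos) * len(neg))


def roc_points(labels, scores):
    """Empirical ROC points (FPR, TPR), one per distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for thr in sorted(set(scores), reverse=True):
        tp += sum(1 for y, s in zip(labels, scores) if s == thr and y == 1)
        fp += sum(1 for y, s in zip(labels, scores) if s == thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def vertical_area(points):
    """Area under the piecewise-linear curve, integrating TPR over FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def horizontal_area(points):
    """Area to the right of the curve, integrating (1 - FPR) over TPR."""
    return sum((y1 - y0) * (1 - (x0 + x1) / 2)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def averaged_partial_area(points):
    """Assumed partial measure: mean of the vertical and horizontal areas."""
    return 0.5 * vertical_area(points) + 0.5 * horizontal_area(points)


# Toy data (made up for illustration): 5 positives, 5 negatives.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

points = roc_points(labels, scores)
whole_auc = vertical_area(points)
whole_c = c_statistic(labels, scores)

# Split the curve into two consecutive parts that share one ROC point.
k = 4
parts = [points[:k + 1], points[k:]]
partial_sum = sum(averaged_partial_area(part) for part in parts)

print(f"c statistic      = {whole_c:.4f}")
print(f"whole AUC        = {whole_auc:.4f}")
print(f"sum of partials  = {partial_sum:.4f}")
```

On this toy data the three printed values agree: the whole AUC matches the c statistic, and the partial areas over consecutive parts of the curve add up to the whole. A vertical-only partial area would also sum to the whole AUC, but it would describe each part only in terms of a range of false positive rates; averaging in the horizontal strip is what lets each part also account for a range of true positive rates.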

Updated: 2020-01-06