Detecting the impact of subject characteristics on machine learning-based diagnostic applications
npj Digital Medicine (IF 15.2), Pub Date: 2019-10-11, DOI: 10.1038/s41746-019-0178-x
Elias Chaibub Neto 1 , Abhishek Pratap 1, 2 , Thanneer M Perumal 1 , Meghasyam Tummalacherla 1 , Phil Snyder 1 , Brian M Bot 1 , Andrew D Trister 1 , Stephen H Friend 1, 3, 4 , Lara Mangravite 1 , Larsson Omberg 1

Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets (“record-wise” data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of “identity confounding.” In essence, these models learn to identify subjects in addition to the diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
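
The difference between a record-wise and a subject-wise split can be illustrated with a small simulation. The sketch below is not the authors' method or code. It assumes synthetic data in which each subject has a distinctive feature signature while the disease label is assigned at random, so the features carry no diagnostic signal at all. The function names (`simulate`, `one_nn_accuracy`, `record_wise_split`, `subject_wise_split`), the 1-nearest-neighbor classifier, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate(n_subjects=100, n_records=10, n_features=10, seed=0):
    """Synthetic data: features encode subject identity, not disease status."""
    rng = np.random.default_rng(seed)
    # Balanced labels assigned to subjects at random (no link to the features).
    subject_labels = rng.permutation(np.arange(n_subjects) % 2)
    # Each subject has its own feature "signature" (center).
    centers = rng.normal(0.0, 1.0, size=(n_subjects, n_features))
    subject = np.repeat(np.arange(n_subjects), n_records)
    # Repeated measurements scatter tightly around the subject's center.
    X = centers[subject] + rng.normal(0.0, 0.3, size=(subject.size, n_features))
    y = subject_labels[subject]
    return X, y, subject

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a 1-nearest-neighbor classifier (squared Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    pred = y_train[d.argmin(axis=1)]
    return float((pred == y_test).mean())

def record_wise_split(subject, test_fraction, rng):
    """Split individual records at random; subjects appear in both sets."""
    idx = rng.permutation(subject.size)
    n_test = int(round(test_fraction * subject.size))
    test = np.zeros(subject.size, dtype=bool)
    test[idx[:n_test]] = True
    return ~test, test

def subject_wise_split(subject, test_fraction, rng):
    """Split whole subjects; no subject contributes to both sets."""
    ids = rng.permutation(np.unique(subject))
    n_test = int(round(test_fraction * ids.size))
    test = np.isin(subject, ids[:n_test])
    return ~test, test

X, y, subject = simulate()
rng = np.random.default_rng(1)

rw_train, rw_test = record_wise_split(subject, 0.3, rng)
acc_record = one_nn_accuracy(X[rw_train], y[rw_train], X[rw_test], y[rw_test])

sw_train, sw_test = subject_wise_split(subject, 0.3, rng)
acc_subject = one_nn_accuracy(X[sw_train], y[sw_train], X[sw_test], y[sw_test])

print(f"record-wise accuracy:  {acc_record:.2f}")
print(f"subject-wise accuracy: {acc_subject:.2f}")
```

Under these assumptions, the record-wise accuracy should come out high even though no diagnostic signal exists, because the classifier can match each test record to other records from the same subject. The subject-wise accuracy should stay close to chance, which is the honest estimate for this simulated setting.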




Updated: 2019-10-12