Reporting of demographic data and representativeness in machine learning models using electronic health records.
Journal of the American Medical Informatics Association (IF 6.4), Pub Date: 2020-09-16, DOI: 10.1093/jamia/ocaa164
Selen Bozkurt 1 , Eli M Cahan 1, 2 , Martin G Seneviratne 1 , Ran Sun 1 , Juan A Lossio-Ventura 1 , John P A Ioannidis 1, 3, 4, 5, 6 , Tina Hernandez-Boussard 1, 4, 7

Abstract
Objective
The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility.
Materials and Methods
We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019.
Results
Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64% of studies; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report whether they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies included higher proportions of White and Black subjects, yet fewer Hispanic subjects, than the general US population.
Discussion
The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency are necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.


Updated: 2020-12-10