Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies.
Journal of the American Medical Informatics Association (IF 4.7), Pub Date: 2020-09-15, DOI: 10.1093/jamia/ocaa180
Laila Rasmy 1 , Firat Tiryaki 1 , Yujia Zhou 1 , Yang Xiang 1 , Cui Tao 1 , Hua Xu 1 , Degui Zhi 1

Abstract
Objective
Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning.
Materials and Methods
We projected the input diagnosis data in the Cerner HealthFacts database to the Unified Medical Language System (UMLS) and 5 other terminologies (CCS, CCSR, ICD-9, ICD-10, and PheWAS) and evaluated the prediction performance of each terminology on 2 different tasks: risk prediction of heart failure in diabetes patients and risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network.
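As a rough illustration of what projecting diagnosis codes onto another terminology involves, the following minimal Python sketch maps a patient's source codes through a lookup table and encodes the result as a multi-hot vector. The mapping table, the group labels, and the function names are hypothetical and for illustration only; they are not the mappings or the code used in the study.

```python
# Hypothetical toy mapping from fine-grained source codes to a coarser
# grouping. The group labels are made up for illustration and are NOT
# real CCS, CCSR, PheWAS, or UMLS identifiers.
TOY_MAP = {
    "E11.9": "GRP_DIABETES",
    "E11.65": "GRP_DIABETES",
    "I50.9": "GRP_HEART_FAILURE",
    "C25.9": "GRP_PANCREATIC_CANCER",
}


def project(codes, mapping, unknown="GRP_UNMAPPED"):
    """Map each source code to its target concept, preserving order."""
    return [mapping.get(code, unknown) for code in codes]


def multi_hot(codes, vocabulary):
    """Encode a list of codes as a 0/1 vector over a fixed vocabulary."""
    present = set(codes)
    return [1 if term in present else 0 for term in vocabulary]


# One toy patient with three recorded diagnosis codes.
patient = ["E11.9", "E11.65", "I50.9"]
projected = project(patient, TOY_MAP)

# The fine-grained vocabulary has one dimension per source code,
# the coarse vocabulary one dimension per group.
fine_vocab = sorted(TOY_MAP)
coarse_vocab = sorted(set(TOY_MAP.values()))

fine_vec = multi_hot(patient, fine_vocab)
coarse_vec = multi_hot(projected, coarse_vocab)
```

In this toy case the two distinct diabetes codes collapse into a single dimension after projection, which is the kind of granularity difference between terminologies that the comparison is concerned with.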
Results
For logistic regression, UMLS delivered the best area under the receiver operating characteristic curve (AUROC) in both the diabetes heart failure (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS performed best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for diabetes heart failure prediction.
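For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison on made-up labels and scores; the function name and the data are illustrative assumptions, not the evaluation code or results from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in the number of cases, so it is
    only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up outcomes (1 = event occurred) and predicted risks.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.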
Discussion/Conclusion
In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performance. In particular, UMLS was consistently one of the best-performing terminologies. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.


Updated: 2020-10-16