Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records.
Journal of Biomedical Informatics (IF 4.5), Pub Date: 2019-12-28, DOI: 10.1016/j.jbi.2019.103364
Yanshan Wang 1, Yiqing Zhao 1, Terry M Therneau 2, Elizabeth J Atkinson 2, Ahmad P Tafti 1, Nan Zhang 1, Shreyasee Amin 3, Andrew H Limper 4, Sundeep Khosla 5, Hongfang Liu 1

Machine learning has become ubiquitous and is a key technology for mining electronic health records (EHRs) to facilitate clinical research and practice. Unsupervised machine learning, as opposed to supervised learning, has shown promise in identifying novel patterns and relations in EHRs without using human-created labels. In this paper, we investigate the application of unsupervised machine learning models to discovering latent disease clusters and patient subgroups based on EHRs. We used Latent Dirichlet Allocation (LDA), a generative probabilistic model, and proposed a novel model named the Poisson Dirichlet Model (PDM), which extends the LDA approach by using a Poisson distribution to model patients' disease diagnoses and alleviates the effects of age and sex by considering both observed and expected observations. In the empirical experiments, we evaluated LDA and PDM on three patient cohorts, namely the Osteoporosis, Delirium/Dementia, and Chronic Obstructive Pulmonary Disease (COPD)/Bronchiectasis cohorts, with their EHR data retrieved from the Rochester Epidemiology Project (REP) medical records linkage system. We compared the effectiveness of LDA and PDM in identifying disease clusters through the visualization of disease representations. We tested the performance of LDA and PDM in differentiating patient subgroups through survival analysis, as well as statistical analysis of demographics and Elixhauser Comorbidity Index (ECI) scores in those subgroups. The experimental results show that the proposed PDM could effectively identify distinct disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM in terms of p-values. However, the subgroups identified by LDA are highly associated with patients' age and sex. The subgroups discovered by PDM might reflect underlying disease patterns of greater interest in epidemiology research, because the effects of age and sex are alleviated. Both unsupervised machine learning approaches could be leveraged to discover patient subgroups using EHRs, but with different foci.
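
The idea of weighing observed diagnoses against expected ones can be illustrated with a small sketch. This is not the authors' PDM: it only shows, under assumptions made here, how an expected count could be derived from a patient's age and sex stratum and how a Poisson likelihood then scores the observed count against it. The toy records, the stratum-mean definition of "expected", and the names `expected_counts`, `log_likelihood`, and `relative_rate` are all hypothetical.

```python
import math
from collections import defaultdict

# Toy records: (patient_id, age_band, sex, {diagnosis: observed_count}).
# All values are invented for illustration only.
PATIENTS = [
    ("p1", "60-69", "F", {"osteoporosis": 2, "hypertension": 1}),
    ("p2", "60-69", "F", {"osteoporosis": 0, "hypertension": 1}),
    ("p3", "70-79", "M", {"osteoporosis": 0, "hypertension": 3}),
    ("p4", "70-79", "M", {"osteoporosis": 1, "hypertension": 1}),
]
DIAGNOSES = ["osteoporosis", "hypertension"]


def expected_counts(patients, diagnoses):
    """Mean count of each diagnosis within each (age band, sex) stratum.

    Assumption: "expected" is taken to be the stratum average.
    """
    totals = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for _, age, sex, counts in patients:
        sizes[(age, sex)] += 1
        for d in diagnoses:
            totals[(age, sex)][d] += counts.get(d, 0)
    return {
        stratum: {d: totals[stratum][d] / n for d in diagnoses}
        for stratum, n in sizes.items()
    }


def poisson_log_pmf(k, mean):
    """log P(K = k) for K ~ Poisson(mean); requires mean > 0."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def log_likelihood(observed, expected, relative_rate):
    """Poisson log-likelihood with mean = expected count * relative rate."""
    return sum(
        poisson_log_pmf(observed.get(d, 0), expected[d] * relative_rate[d])
        for d in expected
    )


expected = expected_counts(PATIENTS, DIAGNOSES)
_, age, sex, observed = PATIENTS[0]

# Hypothetical relative rates: 1.0 means "same as the stratum average".
baseline = {d: 1.0 for d in DIAGNOSES}
elevated = {"osteoporosis": 2.0, "hypertension": 1.0}

ll_baseline = log_likelihood(observed, expected[(age, sex)], baseline)
ll_elevated = log_likelihood(observed, expected[(age, sex)], elevated)
print(ll_baseline, ll_elevated)
```

For patient `p1`, who has more osteoporosis diagnoses than the stratum average, the elevated relative rate yields the higher log-likelihood. Because the comparison is made against the stratum's own expectation, an excess of this kind is measured relative to patients of the same age band and sex rather than to the whole cohort.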

Updated: 2019-12-29