Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record
Arthritis Research & Therapy (IF 4.4), Pub Date: 2019-12-30, DOI: 10.1186/s13075-019-2092-7
Lia Jamian, Lee Wheless, Leslie J. Crofford, April Barnado

Systemic sclerosis (SSc) is a rare disease, and studies of it are limited by small sample sizes. Electronic health records (EHRs) represent a powerful tool to study patients with rare diseases such as SSc, but validated methods are needed. We developed and validated EHR-based algorithms that incorporate billing codes and clinical data to identify SSc patients in the EHR.

We used a de-identified EHR with over 3 million subjects and identified 1899 potential SSc subjects with at least 1 count of the SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes. We randomly selected 200 as a training set for chart review. A subject was a case if diagnosed with SSc by a rheumatologist, dermatologist, or pulmonologist. We selected the following algorithm components based on clinical knowledge and available data: SSc ICD-9 and ICD-10-CM codes, positive antinuclear antibody (ANA) (titer ≥ 1:80), and a keyword of Raynaud’s phenomenon (RP). We used both rule-based and machine learning techniques for algorithm development. Positive predictive values (PPVs), sensitivities, and F-scores (which account for PPVs and sensitivities) were calculated for the algorithms.

PPVs were low for algorithms using only 1 count of the SSc ICD-9 code. As code counts increased, the PPVs increased. PPVs were higher for algorithms using ICD-10-CM codes than for those using the ICD-9 code. Adding a positive ANA and the RP keyword increased the PPVs of algorithms that used only ICD billing codes. Algorithms using ≥ 3 or ≥ 4 counts of the SSc ICD-9 or ICD-10-CM codes and ANA positivity had the highest PPV at 100% but a low sensitivity at 50%. The algorithm with the highest F-score, 91%, was ≥ 4 counts of the ICD-9 or ICD-10-CM codes, with an internally validated PPV of 90%. A machine learning method using random forests yielded an algorithm with a PPV of 84%, sensitivity of 92%, and F-score of 88%. The most important feature was the RP keyword.

Algorithms using only ICD-9 codes did not perform well in identifying SSc patients. The highest performing algorithms incorporated clinical data with billing codes. EHR-based algorithms can identify SSc patients across a healthcare system, enabling researchers to examine important outcomes.
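
The rule-based approach and its evaluation metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Subject` fields, the function names, and the toy data are all hypothetical, and the F-score is assumed to be the standard harmonic mean of PPV and sensitivity (an assumption that agrees with the reported random forest figures, where a PPV of 84% and a sensitivity of 92% give an F-score of 88%).

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """Hypothetical per-subject summary; field names are illustrative only."""
    ssc_code_count: int  # counts of ICD-9 710.1 or ICD-10-CM M34* codes
    ana_positive: bool   # ANA titer >= 1:80
    rp_keyword: bool     # Raynaud's phenomenon keyword found in the record
    is_case: bool        # chart-review label (diagnosed by a specialist)


def rule_predicts_case(s, min_codes=4, require_ana=False, require_rp=False):
    """Rule: a code-count threshold plus optional clinical criteria."""
    if s.ssc_code_count < min_codes:
        return False
    if require_ana and not s.ana_positive:
        return False
    if require_rp and not s.rp_keyword:
        return False
    return True


def f_score(ppv, sensitivity):
    """Harmonic mean of PPV and sensitivity (assumed F1 definition)."""
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def evaluate(subjects, **rule_kwargs):
    """Return (PPV, sensitivity, F-score) of a rule against chart review."""
    predicted = [rule_predicts_case(s, **rule_kwargs) for s in subjects]
    tp = sum(p and s.is_case for p, s in zip(predicted, subjects))
    fp = sum(p and not s.is_case for p, s in zip(predicted, subjects))
    fn = sum((not p) and s.is_case for p, s in zip(predicted, subjects))
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    return ppv, sens, f_score(ppv, sens)


# Made-up toy data, for illustration only (not from the study).
toy = [
    Subject(5, True, True, True),
    Subject(4, False, True, True),
    Subject(1, True, False, True),
    Subject(6, False, False, False),
    Subject(1, False, False, False),
]

codes_only = evaluate(toy, min_codes=4)
codes_and_ana = evaluate(toy, min_codes=4, require_ana=True)
```

On the toy data, requiring ANA positivity removes the one false positive (PPV rises to 1.0) but misses a true case that the codes-only rule caught (sensitivity falls), which mirrors the PPV versus sensitivity trade-off described in the results.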

Updated: 2019-12-31