Natural language processing and machine learning to enable automatic extraction and classification of patients' smoking status from electronic medical records.
Upsala Journal of Medical Sciences (IF 1.5), Pub Date: 2020-07-22, DOI: 10.1080/03009734.2020.1792010
Andrea Caccamisi 1,2, Leif Jørgensen 3, Hercules Dalianis 2, Mats Rosenlund 1,3

Abstract

Background

The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available because they are recorded as unstructured free text. We applied text mining with machine learning to automatically classify unstructured information on smoking status in Swedish EMR data.

Methods

Data on patients’ smoking status from EMRs were used to develop 32 predictive models, trained in Weka on a database of 85,000 classified sentences while varying sentence frequency, classifier type, tokenization, and attribute selection. The models were evaluated by F-score and accuracy on out-of-sample test data comprising 8,500 sentences. An error weight matrix was used to select the best model: a weight was assigned to each type of misclassification and applied to each model's confusion matrix. The best-performing model was then compared with a rule-based method.
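
The selection step can be illustrated with a minimal Python sketch. It is not the authors' Weka pipeline. The class labels, the weight values, and both confusion matrices are made up for illustration, since the abstract gives none of them. The sketch only shows the general idea of multiplying each cell of a confusion matrix by a misclassification weight and preferring the model with the lowest total.

```python
from typing import Dict, List

Matrix = List[List[float]]

# Hypothetical class labels; the abstract does not list the actual categories.
LABELS = ["smoker", "ex-smoker", "non-smoker"]

# Hypothetical error weights: rows are true classes, columns are predicted classes.
# Correct predictions cost 0; confusing smoker and non-smoker is assumed to be worst.
ERROR_WEIGHTS: Matrix = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]


def accuracy(confusion: Matrix) -> float:
    """Share of sentences on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def weighted_error(confusion: Matrix, weights: Matrix) -> float:
    """Sum of the element-wise products of the confusion and weight matrices."""
    return sum(
        count * weight
        for count_row, weight_row in zip(confusion, weights)
        for count, weight in zip(count_row, weight_row)
    )


def select_best(models: Dict[str, Matrix], weights: Matrix) -> str:
    """Return the name of the model with the lowest weighted error."""
    return min(models, key=lambda name: weighted_error(models[name], weights))


# Made-up confusion matrices for two hypothetical models (not the paper's results).
MODELS: Dict[str, Matrix] = {
    "model_a": [[90, 5, 5], [4, 92, 4], [6, 4, 90]],
    "model_b": [[90, 10, 0], [4, 92, 4], [0, 10, 90]],
}

if __name__ == "__main__":
    for name, confusion in MODELS.items():
        print(name, round(accuracy(confusion), 4), weighted_error(confusion, ERROR_WEIGHTS))
    print("selected:", select_best(MODELS, ERROR_WEIGHTS))
```

In this toy example both models have the same accuracy, but `model_b` never confuses smokers with non-smokers, so it has the lower weighted error and is selected. This is the kind of distinction a weight matrix can capture that accuracy alone cannot.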

Results

The best-performing model was based on the Support Vector Machine (SVM) classifier trained with Sequential Minimal Optimization (SMO), using a combination of unigrams and bigrams as tokens. Varying sentence frequency and attribute selection did not improve model performance. The SMO model achieved 98.14% accuracy and an F-score of 0.981, versus 79.32% and 0.756 for the rule-based model.
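
To make the token combination concrete, the following Python sketch turns a sentence into unigram and bigram tokens and counts them. It is an assumption-laden illustration rather than the Weka tokenizer used in the study: the whitespace splitting, lower-casing, function names, and the English example sentence are all hypothetical (the study data were Swedish).

```python
from collections import Counter
from typing import List


def unigrams_and_bigrams(sentence: str) -> List[str]:
    """Lower-case, split on whitespace, and return unigrams followed by bigrams."""
    words = sentence.lower().split()
    # Each bigram joins two adjacent words with a single space.
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams


def bag_of_tokens(sentence: str) -> Counter:
    """Count each token so the sentence becomes a sparse feature vector."""
    return Counter(unigrams_and_bigrams(sentence))


if __name__ == "__main__":
    # Hypothetical example sentence, not taken from the study data.
    print(unigrams_and_bigrams("Patient does not smoke"))
```

A bigram such as `not smoke` keeps a negation attached to the word it modifies, which unigrams alone would lose. That is one plausible reason for combining the two token types in a task like this.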

Conclusion

A model using machine-learning algorithms to automatically classify patients’ smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.




Updated: 2020-07-22