Med7: A transferable clinical natural language processing model for electronic health records
Artificial Intelligence in Medicine (IF 6.1), Pub Date: 2021-05-18, DOI: 10.1016/j.artmed.2021.102086
Andrey Kormilitzin, Nemanja Vaci, Qiang Liu, Alejo Nevado-Holgado

Electronic health record systems are ubiquitous and the majority of patients’ data are now collected electronically in the form of free text. Deep learning has significantly advanced the field of natural language processing, and self-supervised representation learning and transfer learning have become the methods of choice, in particular when high-quality annotated data are limited. Identification of medical concepts and information extraction is a challenging task, yet an important ingredient for parsing unstructured data into a structured, tabulated format for downstream analytical tasks. In this work we introduce a named-entity recognition (NER) model for clinical natural language processing. The model is trained to recognise seven categories: drug names, route of administration, frequency, dosage, strength, form and duration. The model was first pre-trained on the task of predicting the next word, using a collection of 2 million free-text patient records from the MIMIC-III corpus, and then fine-tuned on the named-entity recognition task. The model achieved a micro-averaged F1 score of 0.957 across all seven categories. Additionally, we evaluated the transferability of the developed model from intensive care unit data in the US to secondary care mental health records (CRIS) in the UK. A direct application of the trained NER model to CRIS data resulted in a reduced performance of F1 = 0.762; however, after fine-tuning on a small sample from CRIS, the model achieved a reasonable performance of F1 = 0.944. This demonstrates that, despite a close similarity between the data sets and the NER tasks, it is essential to fine-tune on the target domain data in order to achieve more accurate results. The resulting model and the pre-trained embeddings are available at https://github.com/kormilitzin/med7.
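
As a minimal usage sketch (not taken from the paper): assuming the release at the repository above is a spaCy pipeline published under the package name en_core_med7_lg, as the repository suggests, and that the seven categories are exposed as entity labels along the lines of DRUG, ROUTE, FREQUENCY, DOSAGE, STRENGTH, FORM and DURATION, extracting medication attributes from a free-text note could look like this:

```python
import spacy

# Load the pretrained Med7 pipeline. The package name "en_core_med7_lg" is
# assumed from the project repository and must be installed separately.
nlp = spacy.load("en_core_med7_lg")

text = ("The patient was prescribed amoxicillin 500 mg capsules, "
        "one capsule by mouth three times daily for 7 days.")

doc = nlp(text)

# Print each detected entity with its label; the label set is assumed to
# cover the seven categories (drug name, route, frequency, dosage,
# strength, form, duration).
for ent in doc.ents:
    print(f"{ent.text:25s} {ent.label_}")
```

The transfer step reported above, i.e. adapting the MIMIC-III-trained model with a small annotated sample from the target domain, can be sketched with spaCy's standard fine-tuning loop. The sentence, character offsets and label strings below are invented for illustration only; the actual Med7 training configuration may differ.

```python
import random

import spacy
from spacy.training import Example

# Resume training from the pretrained pipeline (package name assumed).
nlp = spacy.load("en_core_med7_lg")

# Tiny illustrative target-domain sample with character-offset annotations;
# a real fine-tuning run would use a properly annotated set of target records.
TRAIN_DATA = [
    ("Start aspirin 75 mg once daily for 2 weeks.",
     {"entities": [(6, 13, "DRUG"), (14, 19, "STRENGTH"),
                   (20, 30, "FREQUENCY"), (35, 42, "DURATION")]}),
]

optimizer = nlp.resume_training()

for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, drop=0.2, losses=losses)
    print(epoch, losses)
```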



Updated: 2021-06-02