De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of De-Identifiers, arXiv - CS - Information Retrieval



arXiv - CS - Information Retrieval. Pub Date: 2021-01-01, DOI: arxiv-2101.00146
Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, Louisa Jorm

Objective: Electronic Medical Records (EMRs) contain clinical narrative text of great potential value to medical researchers. However, this text is mixed with Protected Health Information (PHI), which poses risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework that automatically removes PHI from hospital discharge summaries.

Materials and Methods: Our corpus comprised 600 hospital discharge summaries extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: 1) Annotation: labelling PHI in the 600 hospital discharge summaries using five pre-defined categories (person, address, date of birth, individual identification number, phone/fax number); 2) Modelling: training and evaluating ensembles of named entity recognition (NER) models built with three natural language processing (NLP) toolkits (Stanza, FLAIR and spaCy) on both balanced and imbalanced datasets; and 3) De-identification: removing PHI from the hospital discharge summaries.

Results: The final model in our framework was an ensemble that combined six single models, trained on both balanced and imbalanced datasets, through majority voting. It achieved a precision of 0.9866, a recall of 0.9862 and an F1 score of 0.9864. Most false positives and false negatives were in the person category.

Discussion: Our study showed that an ensemble of models trained with three different NLP toolkits on balanced and imbalanced datasets can achieve good results even with a relatively small corpus.

Conclusion: Our end-to-end framework provides a robust solution for de-identifying clinical narrative corpora safely. It can be easily applied to any kind of clinical narrative document.
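The abstract names majority voting across six NER models followed by removal of the detected PHI, but it does not give the voting rule or the replacement scheme. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names (majority_vote, redact, f1_score), the label strings, the strict-majority threshold, the bracketed placeholders and the sample sentence are all invented here for illustration. The last function only checks that the reported precision and recall are consistent with the reported F1 score.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], outside: str = "O") -> List[str]:
    """Combine per-token labels from several models by majority vote.

    predictions[m][t] is the label that model m assigned to token t.
    A label wins only with a strict majority of the models; otherwise the
    token falls back to `outside`. The tie rule is an assumption of this
    sketch; the abstract does not state how ties are resolved.
    """
    n_models = len(predictions)
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same tokens")
    voted = []
    for t in range(n_tokens):
        label, count = Counter(p[t] for p in predictions).most_common(1)[0]
        voted.append(label if count > n_models / 2 else outside)
    return voted


def redact(tokens: Sequence[str], labels: Sequence[str], outside: str = "O") -> str:
    """Replace each run of identically labelled PHI tokens with one placeholder."""
    out = []
    previous = outside
    for token, label in zip(tokens, labels):
        if label == outside:
            out.append(token)
        elif label != previous:
            out.append(f"[{label}]")  # first token of a new PHI run
        previous = label
    return " ".join(out)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented example: six hypothetical model outputs for one sentence.
tokens = ["Seen", "by", "Dr", "Jane", "Citizen", "on", "ward", "3"]
model_outputs = [
    ["O", "O", "O", "PERSON", "PERSON", "O", "O", "O"],
    ["O", "O", "O", "PERSON", "PERSON", "O", "O", "O"],
    ["O", "O", "PERSON", "PERSON", "PERSON", "O", "O", "O"],
    ["O", "O", "O", "PERSON", "PERSON", "O", "O", "O"],
    ["O", "O", "O", "O", "PERSON", "O", "O", "O"],
    ["O", "O", "O", "PERSON", "O", "O", "O", "ID"],
]

voted = majority_vote(model_outputs)
print(redact(tokens, voted))
print(round(f1_score(0.9866, 0.9862), 4))
```

In this sketch a token with no strict majority is left unredacted, which favours precision. A de-identification pipeline that prioritises recall might instead redact any token flagged by some minimum number of models; which trade-off the paper uses is not stated in the abstract.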


Updated: 2021-01-05
