A study of deep learning methods for de-identification of clinical notes in cross-institute settings.
BMC Medical Informatics and Decision Making ( IF 3.3 ) Pub Date : 2019-12-05 , DOI: 10.1186/s12911-019-0935-4
Xi Yang 1 , Tianchen Lyu 1 , Qian Li 1 , Chih-Yin Lee 1 , Jiang Bian 1 , William R Hogan 1 , Yonghui Wu 1

BACKGROUND De-identification is a critical technology for facilitating the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for the de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies have often used training and test data collected from the same institution; few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions. METHODS We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies for customizing the deep learning models using UF notes and resources. RESULTS Pre-trained word embeddings from a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores fell from 0.9547 and 0.9646 to 0.8568 and 0.8958, respectively) when applied to another corpus annotated at UF Health. Linguistic features further improved de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.
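The strict and relaxed F1 scores reported above reflect two common span-level evaluation modes for de-identification: strict scoring requires an exact match of both the protected health information (PHI) span boundaries and the PHI type, while relaxed scoring credits any overlapping span of the correct type. A minimal illustrative sketch (not the authors' evaluation code; span and type representations are assumptions):

```python
# Illustrative sketch: strict vs. relaxed span-level F1 for de-identification.
# Gold annotations and predictions are (start, end, type) character spans.

def overlaps(a, b):
    """True if two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def f1_scores(gold, pred):
    # Strict: a prediction counts only if boundaries and type match exactly.
    strict_tp = len(set(gold) & set(pred))
    # Relaxed: a prediction counts if it overlaps a gold span of the same type.
    relax_tp = sum(
        1 for p in pred
        if any(p[2] == g[2] and overlaps(p[:2], g[:2]) for g in gold)
    )

    def f1(tp):
        prec = tp / len(pred) if pred else 0.0
        rec = tp / len(gold) if gold else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    return f1(strict_tp), f1(relax_tp)

gold = [(0, 8, "NAME"), (20, 30, "DATE")]
pred = [(0, 8, "NAME"), (21, 30, "DATE")]  # second span off by one character
strict, relaxed = f1_scores(gold, pred)
print(strict, relaxed)  # the off-by-one DATE span fails strict but passes relaxed
```

A boundary error as small as one character halves the strict F1 here while leaving the relaxed F1 at 1.0, which is why the two metrics can diverge noticeably in cross-institute evaluation.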
CONCLUSIONS De-identification models must be customized using local clinical text and other resources when applied in cross-institute settings. Fine-tuning is a potential solution for re-using pre-trained parameters and reducing the training time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
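The fine-tuning strategy named in the conclusions amounts to initializing a model with parameters learned on the source corpus and continuing training briefly on local notes, rather than training from scratch. A toy sketch of that idea (a hypothetical logistic-regression stand-in, not the authors' deep learning architecture; the data are invented):

```python
import math

def train(weights, data, epochs, lr=0.1):
    """Plain logistic-regression updates on (features, label) pairs."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                w[i] += lr * (y - p) * xi
    return w

source_data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]  # large source ("i2b2-like") corpus
target_data = [([1.0, 1.0], 1)]                   # small local ("UF-like") sample

pretrained = train([0.0, 0.0], source_data, epochs=50)  # expensive pre-training
fine_tuned = train(pretrained, target_data, epochs=5)   # cheap local adaptation
scratch    = train([0.0, 0.0], target_data, epochs=5)   # same local budget, no transfer
```

Under the same small local training budget, the fine-tuned model retains what was learned from the source corpus (here, that the second feature signals the negative class), while the from-scratch model cannot recover it from the tiny local sample; this is the re-use of pre-trained parameters the conclusions point to.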

Updated: 2019-12-05