Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit, Artificial Intelligence in Medicine


Artificial Intelligence in Medicine (IF 6.1), Pub Date: 2021-05-01, DOI: 10.1016/j.artmed.2021.102083
Zeljko Kraljevic ₁, Thomas Searle ₂, Anthony Shek ₃, Lukasz Roguski ₄, Kawsar Noor ₄, Daniel Bean ₅, Aurelie Mascio ₂, Leilei Zhu ₆, Amos A Folarin ₇, Angus Roberts ₈, Rebecca Bendayan ₂, Mark P Richardson ₃, Robert Stewart ₉, Anoop D Shah ₄, Wai Keong Wong ₆, Zina Ibrahim ₁, James T Teo ₁₀, Richard J B Dobson ₁₁

Affiliations

1. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
2. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
3. Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
4. Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
5. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK.
6. Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
7. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
8. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
9. Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
10. Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Department of Neurology, King's College Hospital NHS Foundation Trust, London, UK.
11. Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448–0.738 vs 0.429–0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.


Updated: 2021-05-28
