Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing
Methods of Information in Medicine (IF 1.7), Pub Date: 2020-10-14, DOI: 10.1055/s-0040-1716403
Antje Wulff 1, Marcel Mast 1, Marcus Hassler 2, Sara Montag 1, Michael Marschollek 1, Thomas Jack 3

Abstract

Background Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format, so that the data can be used for multiple purposes, also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, known as natural language processing (NLP), has been researched extensively, at least for the English language, obtaining structured output in an arbitrary format is not enough. NLP techniques need to be combined with clinical information standards such as openEHR so that data that are still unstructured can be reused and exchanged in a meaningful way.

Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.

Methods We constructed a pipeline that allows medical free texts such as pediatric medical histories to be reused in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as the expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
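The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary entries, the regular expression, the mapping table, and the function names `extract` and `to_archetype_entries` are all hypothetical, and the archetype identifiers are placeholders written in the openEHR naming pattern rather than the archetypes actually selected in the study.

```python
import re

# Hypothetical German text markers (dictionary entries) mapped to concept labels.
TEXT_MARKERS = {
    "fieber": "fever",
    "husten": "cough",
    "erbrechen": "vomiting",
}

# Hypothetical regular expression for a temperature value such as "39,5 °C".
TEMPERATURE_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°\s*C")

# Hypothetical mapping rules: concept -> (placeholder archetype ID, element name).
MAPPING_RULES = {
    "fever": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "Symptom/Sign name"),
    "cough": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "Symptom/Sign name"),
    "vomiting": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "Symptom/Sign name"),
    "body_temperature": ("openEHR-EHR-OBSERVATION.body_temperature.v2", "Temperature"),
}


def extract(text):
    """Return findings detected by dictionary lookup and one regex.

    Deliberately naive: there is no negation handling, so "kein Fieber"
    would still be reported as fever.
    """
    findings = []
    lowered = text.lower()
    for marker, concept in TEXT_MARKERS.items():
        if marker in lowered:
            findings.append({"concept": concept, "value": True})
    match = TEMPERATURE_RE.search(text)
    if match:
        # German decimal comma is normalized before conversion.
        value = float(match.group(1).replace(",", "."))
        findings.append({"concept": "body_temperature", "value": value})
    return findings


def to_archetype_entries(findings):
    """Apply the mapping rules; findings without a rule are skipped."""
    entries = []
    for finding in findings:
        rule = MAPPING_RULES.get(finding["concept"])
        if rule is None:
            continue
        archetype_id, element = rule
        entries.append(
            {"archetype_id": archetype_id, "element": element, "value": finding["value"]}
        )
    return entries
```

Calling `to_archetype_entries(extract("Seit gestern Fieber bis 39,5 °C und Husten."))` yields one entry per detected concept. Keeping the dictionary, the regular expressions, and the mapping rules as separate tables mirrors the separation of expert knowledge and mapping described above, so each can be extended independently.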

Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus of 776 entries for the automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
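For readers unfamiliar with the two evaluation metrics, the following Python sketch shows how precision and recall are conventionally computed when extracted items are compared with manual annotations. The function name `precision_recall` and the example items are hypothetical and are not the study's data; how the study matched extractions to annotations is not stated in the abstract.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall from two collections of items.

    gold: items marked by human annotators.
    predicted: items produced by the extraction pipeline.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Made-up (document, concept) pairs for illustration only.
gold = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "vomiting"), ("doc2", "fever")}
predicted = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "vomiting"), ("doc2", "cough")}
precision, recall = precision_recall(gold, predicted)
```

Precision penalizes items the pipeline extracted that the annotators did not mark, whereas recall penalizes annotated items the pipeline missed.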

Conclusion The use of NLP and openEHR archetypes was demonstrated to be a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.

Authors' Contributions

A. W. was responsible for drafting the methodological approach, managed the overall project work, led the proof-of-concept evaluation, and authored the manuscript. M. M. developed the described NLP pipeline, designed the openEHR archetypes and template, and co-authored the manuscript. T. J. and S. M. provided clinical expertise for requirement analysis and dictionary construction. M. H. gave subject-specific advice on the design of NLP pipelines and provided the NLP software. M. M. provided further technical and medical expertise and, together with all authors, co-authored and proofread the manuscript. All authors read and approved the final manuscript.




Publication History

Received: 12 May 2020

Accepted: 18 July 2020

Publication Date:
14 October 2020 (online)

© 2020. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial-License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/).

Georg Thieme Verlag KG
Stuttgart · New York




Updated: 2020-10-16