Automatic classification of scanned electronic health record documents
International Journal of Medical Informatics (IF 3.7), Pub Date: 2020-10-17, DOI: 10.1016/j.ijmedinf.2020.104302
Heath Goodrum, Kirk Roberts, Elmer V Bernstam

Objectives

Electronic health records (EHRs) contain scanned documents of many types and from many sources, including identification cards, radiology reports, and clinical correspondence. We describe the distribution of scanned documents at one health institution, and the design and evaluation of a system that categorizes these documents as clinically relevant or not clinically relevant and then into further sub-classifications. Our objective is to demonstrate that text classification systems can accurately classify scanned documents.

Methods

We extracted text using Optical Character Recognition (OCR). We then created and evaluated multiple machine learning models for text classification, including both “bag of words” and deep learning approaches. We evaluated the system at three levels of classification, using either the entire document or its individual pages as input. Finally, we compared the effects of different text processing methods.
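
To make the “bag of words” idea concrete, here is a minimal sketch in plain Python. It assumes the OCR step has already produced one text string per page. The classifier (multinomial Naive Bayes with add-one smoothing), the toy training snippets, the label names, and the helpers `classify_document` and `classify_pages` are all illustrative assumptions; the abstract does not say which bag-of-words model or labels the authors used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase OCR text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class BagOfWordsNB:
    """Multinomial Naive Bayes over word counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter(labels)      # label -> number of texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_texts in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(n_texts / total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, not taken from the study.
train_texts = [
    "chest radiograph shows no acute cardiopulmonary findings",
    "mri brain impression no acute infarct",
    "member id group number insurance card copay",
    "driver license date of birth address",
]
train_labels = ["clinical", "clinical", "non-clinical", "non-clinical"]
model = BagOfWordsNB().fit(train_texts, train_labels)


def classify_document(pages):
    """Document-level input: join all pages and predict one label."""
    return model.predict(" ".join(pages))


def classify_pages(pages):
    """Page-level input: predict one label per page."""
    return [model.predict(page) for page in pages]
```

Calling `classify_pages` on a list of OCR page strings yields one label per page, while `classify_document` yields a single label for the whole scan; these correspond to the two input granularities the evaluation compares.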

Results

A deep learning model using ClinicalBERT performed best. This model distinguished clinically relevant documents from non-clinically relevant documents with an accuracy of 0.973, distinguished between intermediate sub-classifications with an accuracy of 0.949, and distinguished between individual classes with an accuracy of 0.913.
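
As a sketch of how accuracy can be reported at three levels, the snippet below rolls fine-grained labels up through a label hierarchy and scores each level. The class names and the `HIERARCHY` mapping are hypothetical, and the abstract does not say whether the authors derived the coarser levels this way or trained a separate model per level, so this is one possible reading rather than their procedure.

```python
# Hypothetical hierarchy: individual class -> (intermediate class, top-level class).
HIERARCHY = {
    "radiology_report": ("diagnostic_report", "clinical"),
    "pathology_report": ("diagnostic_report", "clinical"),
    "referral_letter": ("correspondence", "clinical"),
    "insurance_card": ("identification", "non_clinical"),
    "drivers_license": ("identification", "non_clinical"),
}


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def roll_up(labels, level):
    """Map individual classes to level 0 (intermediate) or level 1 (top)."""
    return [HIERARCHY[label][level] for label in labels]


def accuracy_by_level(gold, pred):
    """Score the same predictions at all three levels of the hierarchy."""
    return {
        "individual": accuracy(gold, pred),
        "intermediate": accuracy(roll_up(gold, 0), roll_up(pred, 0)),
        "top": accuracy(roll_up(gold, 1), roll_up(pred, 1)),
    }


# Toy labels for illustration only.
gold = ["radiology_report", "pathology_report", "referral_letter", "insurance_card"]
pred = ["pathology_report", "pathology_report", "insurance_card", "drivers_license"]
scores = accuracy_by_level(gold, pred)
```

An error between two sibling classes (for example, two kinds of diagnostic report) counts against the individual-class score but not against the coarser levels. Under this roll-up scheme, accuracy can only stay the same or rise as the classification becomes coarser, which matches the direction of the reported figures.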

Discussion

Within the EHR, some document categories, such as “external medical records”, may contain hundreds of scanned pages without clear document boundaries. Without further sub-classification, clinicians must view every page or risk missing clinically relevant information. Machine learning can automatically classify these scanned documents to reduce clinician burden.
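
One way page-level predictions could help with a long scan that lacks document boundaries is to merge consecutive pages that share a predicted label into segments, so a clinician can jump straight to the relevant page ranges. The sketch below illustrates that idea only; the function name `segment_pages` and the labels are assumptions, and the abstract does not describe the authors doing this.

```python
from itertools import groupby


def segment_pages(page_labels):
    """Group consecutive pages with the same label.

    Returns (label, first_page, last_page) tuples with 1-based page numbers.
    """
    segments = []
    first_page = 1
    for label, run in groupby(page_labels):
        length = len(list(run))
        segments.append((label, first_page, first_page + length - 1))
        first_page += length
    return segments


# Hypothetical per-page predictions for a five-page scan.
labels = ["clinical", "clinical", "non_clinical", "clinical", "clinical"]
segments = segment_pages(labels)
```

Here `segments` holds three ranges: pages 1 to 2 and pages 4 to 5 are marked clinical, and page 3 is not.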

Conclusion

Machine learning applied to OCR-extracted text has the potential to accurately identify clinically relevant scanned content within EHRs.



Updated: 2020-10-19