UMLS-based data augmentation for natural language processing of clinical research literature
Journal of the American Medical Informatics Association (IF 6.4), Pub Date: 2020-12-23, DOI: 10.1093/jamia/ocaa309
Tian Kang, Adler Perotte, Youlan Tang, Casey Ta, Chunhua Weng

Abstract
Objective
The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.
Materials and Methods
We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
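As a rough illustration of the idea, the sketch below applies one EDA-style operation, synonym replacement, to a BIO-labeled sentence and re-aligns the labels afterwards. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify which operations UMLS-EDA uses or how UMLS is queried, so `TOY_SYNONYMS`, `bio_spans`, and `augment` are hypothetical names, and the small dictionary merely stands in for a real UMLS synonym lookup.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (assumption, not real UMLS data access).
TOY_SYNONYMS = {
    ("heart", "attack"): [("myocardial", "infarction")],
    ("high", "blood", "pressure"): [("hypertension",)],
}


def bio_spans(labels):
    """Return (start, end, type) for each entity in a BIO label list; end is exclusive."""
    spans, start, kind = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # trailing "O" closes a final open span
        if lab.startswith("I-") and start is not None and lab[2:] == kind:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, kind))
            start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return spans


def augment(tokens, labels, synonyms=TOY_SYNONYMS, rng=None):
    """Replace entity mentions with a synonym and rebuild the BIO labels to match."""
    rng = rng or random.Random(0)
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, kind in bio_spans(labels):
        # Copy everything between the previous entity and this one unchanged.
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        mention = tuple(t.lower() for t in tokens[start:end])
        options = synonyms.get(mention)
        replacement = list(rng.choice(options)) if options else tokens[start:end]
        new_tokens += replacement
        # The synonym may have a different length, so the labels are regenerated.
        new_labels += ["B-" + kind] + ["I-" + kind] * (len(replacement) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


tokens = ["Patients", "with", "high", "blood", "pressure", "were", "excluded"]
labels = ["O", "O", "B-Condition", "I-Condition", "I-Condition", "O", "O"]
aug_tokens, aug_labels = augment(tokens, labels)
print(aug_tokens)
print(aug_labels)
```

The point of the sketch is the label re-alignment: for NER, any augmentation that changes the number of tokens in a mention has to regenerate the token-level tags, which plain sentence-level EDA does not need to do.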
Results
UMLS-EDA substantially improves NER performance over the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 9 percentage points, from 0.75 to 0.84, which is better than classifiers with BERT pretraining (0.82).
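For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below shows that calculation; the function name `micro_f1` and the counts are illustrative assumptions and are not taken from the study.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class counts of tp, fp, and fn."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for two entity types, for illustration only.
example_counts = {
    "Condition": {"tp": 8, "fp": 2, "fn": 2},
    "Drug": {"tp": 2, "fp": 3, "fn": 3},
}
print(round(micro_f1(example_counts), 3))
```

Because the counts are pooled, frequent classes dominate the score, which is worth keeping in mind when comparing micro-F1 values across datasets with different class balances.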
Conclusions
This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.


Updated: 2020-12-23