The quest for better clinical word vectors: Ontology based and lexical vector augmentation versus clinical contextual embeddings
Computers in Biology and Medicine (IF 7.0). Pub Date: 2021-04-28. DOI: 10.1016/j.compbiomed.2021.104433
Namrata Nath, Sang-Heon Lee, Mark D. McDonnell, Ivan Lee

Background

Word vectors or word embeddings are n-dimensional representations of words and form the backbone of Natural Language Processing (NLP) of textual data. This research experiments with algorithms that augment word vectors with lexical constraints popular in NLP research and with clinical domain constraints derived from the Unified Medical Language System (UMLS). It also compares the performance of the augmented vectors with Bio + Clinical BERT vectors, which have been trained and fine-tuned on clinical datasets.
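The abstract does not spell out how the UMLS-derived constraints are obtained. As a rough illustration only, the sketch below shows one plausible way to collect clinical synonym pairs by grouping English term strings that share a Concept Unique Identifier (CUI) in the UMLS MRCONSO.RRF file; the file path, the pair limit, and the decision to treat same-CUI strings as synonyms are assumptions, not the paper's exact procedure.

```python
from collections import defaultdict
from itertools import combinations

def umls_synonym_pairs(mrconso_path="MRCONSO.RRF", max_pairs=1_000_000):
    """Collect (term, term) synonym pairs from UMLS by grouping English
    strings that share a Concept Unique Identifier (CUI).

    MRCONSO.RRF is pipe-delimited; field 0 is the CUI, field 1 the
    language code, and field 14 the term string.
    """
    terms_by_cui = defaultdict(set)
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            cui, lang, term = fields[0], fields[1], fields[14]
            if lang == "ENG":
                terms_by_cui[cui].add(term.lower())

    pairs = []
    for terms in terms_by_cui.values():
        for a, b in combinations(sorted(terms), 2):
            pairs.append((a, b))
            if len(pairs) >= max_pairs:
                return pairs
    return pairs
```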

Methods

Word2vec vectors are generated for words in a publicly available de-identified Electronic Health Records (EHR) dataset and augmented with ontology-derived constraints using three algorithms that take fundamentally different approaches to vector augmentation. The augmented vectors, alongside publicly available Bio + Clinical BERT vectors, are then evaluated on their correlation with human-annotated similarity lists using Spearman's correlation coefficient, and on the downstream task of Named Entity Recognition (NER). Quantitative and empirical evaluations are used to highlight the strengths and weaknesses of the different approaches.
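As a minimal sketch of the intrinsic evaluation described above (not the authors' exact pipeline), the snippet below scores a set of word vectors against a human-annotated similarity list by computing cosine similarities with gensim and correlating them with the human ratings via scipy's Spearman coefficient. The file names and the tab-separated layout of the annotated list are assumptions.

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

def evaluate_vectors(vec_path, annotated_list_path):
    """Spearman correlation between model similarities and human ratings.

    Assumes the annotated list is tab-separated: word1, word2, human_score.
    Pairs with out-of-vocabulary words are skipped, which is one reason
    contextual models such as Bio + Clinical BERT can have an advantage.
    """
    kv = KeyedVectors.load_word2vec_format(vec_path, binary=False)

    model_scores, human_scores = [], []
    with open(annotated_list_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in kv and w2 in kv:
                model_scores.append(float(kv.similarity(w1, w2)))
                human_scores.append(float(score))

    rho, p_value = spearmanr(model_scores, human_scores)
    return rho, p_value
```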

Results

The counter-fitted word2vec vectors augmented with information from the UMLS ontology produced the best overall correlation with the human-annotated evaluation lists (Spearman's correlation of 0.733 with the mini Mayo doctors' annotations), while Bio + Clinical BERT produced the best results in the NER task in our experiments (F1 of 0.87 and 0.811 on the i2b2 2010 and i2b2 2012 datasets, respectively).
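Counter-fitting (Mrkšić et al., 2016) post-processes pre-trained vectors so that synonym pairs are pulled together and antonym pairs pushed apart while the vectors stay close to the original space. The simplified numpy sketch below illustrates that idea with plain gradient-style updates; it is not the authors' implementation, and the margins, learning rate, and pair lists are illustrative assumptions.

```python
import numpy as np

def counter_fit(vectors, synonyms, antonyms, epochs=20, lr=0.05,
                syn_margin=0.0, ant_margin=1.0, reg=0.1):
    """Simplified counter-fitting: nudge synonym vectors together,
    antonym vectors apart, and regularise towards the original vectors.

    `vectors` maps word -> np.ndarray; `synonyms` / `antonyms` are lists
    of (word, word) pairs drawn from lexical or UMLS constraints.
    """
    original = {w: v.copy() for w, v in vectors.items()}
    for _ in range(epochs):
        for a, b in synonyms:                      # synonym attract
            if a in vectors and b in vectors:
                diff = vectors[a] - vectors[b]
                if np.linalg.norm(diff) > syn_margin:
                    vectors[a] -= lr * diff
                    vectors[b] += lr * diff
        for a, b in antonyms:                      # antonym repel
            if a in vectors and b in vectors:
                diff = vectors[a] - vectors[b]
                if np.linalg.norm(diff) < ant_margin:
                    vectors[a] += lr * diff
                    vectors[b] -= lr * diff
        for w in vectors:                          # vector space preservation
            vectors[w] -= lr * reg * (vectors[w] - original[w])
    return vectors
```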

Conclusion

Clinically adapted word2vec vectors successfully encapsulate concepts of lexical and clinical synonymy and antonymy and, to a smaller extent, hyponymy and hypernymy. Bio + Clinical BERT vectors perform better at NER and avoid out-of-vocabulary words.




Last updated: 2021-05-15