Medical concept normalization in French using multilingual terminologies and contextual embeddings
Journal of Biomedical Informatics (IF 4.5), Pub Date: 2021-01-12, DOI: 10.1016/j.jbi.2021.103684
Perceval Wajsbürt 1 , Arnaud Sarfati 2 , Xavier Tannier 1

Introduction

Concept normalization is the task of linking terms found in medical text documents to the corresponding concepts in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which is a problem for languages other than English.

Objective

We present a system for concept normalization in French. We assume that textual mentions have already been extracted and labeled by a named entity recognition system, and we assign each mention a UMLS Concept Unique Identifier (CUI). We take advantage of the multilingual nature of available terminologies and embedding models to improve concept normalization in French without translation or direct supervision.

Materials and methods

We treat the task as a highly multiclass classification problem. The terms are encoded with contextualized embeddings and classified via cosine similarity and softmax. A first step uses a subset of the terminology to fine-tune the embeddings and train the model. A second step adds the entire target terminology, and the model is trained further with hard negative selection and softmax sampling.
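The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for contextualized embeddings, the `scale` value and all function names are assumptions, and hard negative selection is approximated here by simply taking the most similar incorrect concepts for each mention.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def cosine_softmax(mention_vecs, concept_vecs, scale=20.0):
    """Return P(concept | mention) with shape (n_mentions, n_concepts).

    `scale` is an assumed temperature-like factor; cosine values lie in
    [-1, 1], so they are stretched before the softmax.
    """
    sims = l2_normalize(mention_vecs) @ l2_normalize(concept_vecs).T
    logits = scale * sims
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def hard_negatives(mention_vecs, concept_vecs, gold, k=2):
    """For each mention, return the k most similar concepts that are NOT the gold one."""
    sims = l2_normalize(mention_vecs) @ l2_normalize(concept_vecs).T
    sims[np.arange(len(gold)), gold] = -np.inf  # exclude the correct concept
    return np.argsort(-sims, axis=1)[:, :k]

# Toy data: 5 concepts in an 8-dimensional space (stand-ins for real embeddings).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(5, 8))
gold = np.array([0, 3])
# Each mention is a slightly perturbed copy of its gold concept vector.
mentions = concepts[gold] + 0.01 * rng.normal(size=(2, 8))

probs = cosine_softmax(mentions, concepts)
predicted = probs.argmax(axis=1)
negatives = hard_negatives(mentions, concepts, gold)
```

In a real system the concept vectors would number in the hundreds of thousands, which is why restricting the softmax to a sampled or hard-negative subset of concepts matters during training.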

Results

On two corpora from the Quaero FrenchMed benchmark, we show that our approach can achieve good results even with no labeled data at all, and that it outperforms existing supervised methods when labeled data is available.

Discussion

Training the system with both French and English terms improves its performance on a French benchmark by a large margin, regardless of how the embeddings were pretrained (French, English, or multilingual). Our distantly supervised method can be applied to any kind of document or medical domain, as it does not require any concept-labeled documents.
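As a rough illustration of distant supervision from a multilingual terminology, the sketch below turns terminology entries into (term, label) training pairs, with or without the English synonyms. The rows, the identifiers, and the function name are hypothetical and chosen only for illustration; they are not taken from the paper and should not be read as verified UMLS entries.

```python
# Hypothetical terminology rows: (concept identifier, language code, term).
TERMINOLOGY = [
    ("C0018681", "ENG", "headache"),
    ("C0018681", "FRE", "céphalée"),
    ("C0018681", "FRE", "mal de tête"),
    ("C0020538", "ENG", "hypertension"),
    ("C0020538", "FRE", "hypertension artérielle"),
]

def build_training_examples(rows, languages):
    """Turn terminology synonyms into (term, class index) pairs.

    No annotated documents are needed: each synonym is a training
    example whose label is the concept it belongs to.
    """
    label_ids = {}
    examples = []
    for cui, lang, term in rows:
        if lang not in languages:
            continue
        label = label_ids.setdefault(cui, len(label_ids))
        examples.append((term.lower(), label))
    return examples, label_ids

french_only, _ = build_training_examples(TERMINOLOGY, {"FRE"})
bilingual, labels = build_training_examples(TERMINOLOGY, {"FRE", "ENG"})
```

Adding the English rows increases the number of synonyms per concept without any translation step, which is the property the discussion above relies on.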

Conclusion

These experiments pave the way for simpler and more effective multilingual approaches to processing medical texts in languages other than English.



Updated: 2021-01-25