TermInformer: unsupervised term mining and analysis in biomedical literature.
Neural Computing and Applications (IF 6), Pub Date: 2020-09-16, DOI: 10.1007/s00521-020-05335-2
Prayag Tiwari, Sagar Uprety, Shahram Dehdashti, M Shamim Hossain

Terminology is the most basic information that researchers and literature analysis systems need to understand. Mining terms and revealing the semantic relationships between them can help biomedical researchers find solutions to major health problems and motivate them to explore new biomedical research questions. However, mining terms from biomedical literature remains a challenge. Research on text segmentation in natural language processing (NLP) has not yet been applied well in the biomedical field. Named entity recognition models usually require a large training corpus, and the types of entities they can recognize are limited. Dictionary-based methods match the text against pre-established vocabularies, but they can only match terms in a specific field, and collecting the terms is time-consuming and labour-intensive. Many scenarios in biomedical research are unsupervised, i.e. the corpora are unlabelled and the system may have little prior knowledge. This paper proposes the TermInformer project, which aims to mine the meaning of terms in an open fashion by calculating terms and to find solutions to some of the significant problems in our society. We propose an unsupervised method that automatically mines terms from text without relying on external resources. The method can generally be applied to any document data. Combined with a word vector training algorithm, it yields reusable term embeddings that can be used in any downstream NLP application. This paper compares the term embeddings with existing word embeddings. The results show that our method better reflects the semantic relationships between terms. Finally, we use the proposed method to find potential factors and treatments for lung cancer, breast cancer, and coronavirus.




Updated: 2020-09-18