Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings (Data & Knowledge Engineering)



Data & Knowledge Engineering (IF 2.7), Pub Date: 2014-09-18, DOI: 10.1016/j.datak.2014.09.002
Ramakanth Kavuluru¹,², Yuan Lu²


Trained indexers at the National Library of Medicine (NLM) manually tag each biomedical abstract with the most suitable terms from the Medical Subject Headings (MeSH) terminology so that it can be indexed by the PubMed information system. MeSH has over 26,000 terms, and indexers consult each article's full text when assigning them. Recent automated attempts have focused on using the article title and abstract text to identify MeSH terms for the corresponding article. Most of these approaches use supervised machine learning techniques trained on already indexed articles and their MeSH terms. In this paper, we present a new indexing approach that leverages term co-occurrence frequencies and latent term associations computed from the MeSH term sets of nearly 18 million articles already indexed by NLM indexers. The main goal of our study is to gauge the potential of output label co-occurrences, latent associations, and relationships extracted from free text in both unsupervised and supervised indexing approaches. Using a novel and purely unsupervised approach, we achieve a micro-F-score comparable to those obtained using supervised machine learning techniques. By incorporating term co-occurrence and latent association features into a supervised learning framework, we also improve over the best results published on two public datasets.
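For readers unfamiliar with the two quantities the abstract names, the sketch below illustrates them in a generic way. It is not the authors' implementation: the function names `cooccurrence_counts` and `micro_f_score` and the toy term sets are assumptions made purely for illustration. The first function counts how often two terms are assigned to the same article; the second pools true positives, false positives, and false negatives across articles before computing the F-score, which is the usual meaning of micro-averaging.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(term_sets):
    """Count how often each unordered pair of terms appears in the same set."""
    pair_counts = Counter()
    for terms in term_sets:
        # sorted() gives each pair a canonical order, so (a, b) and (b, a) coincide
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def micro_f_score(gold_sets, predicted_sets):
    """Micro-averaged F-score: pool TP/FP/FN over all articles, then compute F."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, predicted_sets):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # terms predicted and correct
        fp += len(pred - gold)   # terms predicted but not in the gold set
        fn += len(gold - pred)   # gold terms that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up term sets (hypothetical; not data from the paper)
indexed = [
    {"Humans", "Neoplasms", "Mice"},
    {"Humans", "Neoplasms"},
    {"Humans", "Diabetes Mellitus"},
]
counts = cooccurrence_counts(indexed)
print(counts[("Humans", "Neoplasms")])  # 2

gold = [{"Humans", "Neoplasms"}, {"Humans", "Diabetes Mellitus"}]
pred = [{"Humans", "Mice"}, {"Humans", "Diabetes Mellitus"}]
print(round(micro_f_score(gold, pred), 3))  # 0.75
```

Pair keys are stored in sorted order, so lookups should use the alphabetically smaller term first. How the paper turns such counts into rankings or features is not described in the abstract and is not attempted here.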


Updated: 2019-11-01
