Content-based subject classification at article level in biomedical context, arXiv - CS - Digital Libraries

arXiv - CS - Digital Libraries. Pub Date: 2021-04-30, DOI: arxiv-2104.14800
Eric Jeangirard (M.E.N.E.S.R.)

Subject classification is an important task in the analysis of scholarly publications. Two kinds of approaches are mainly used: classification at the journal level and classification at the article level. We propose a mixed approach that leverages embedding techniques from NLP to train classifiers on article metadata (in particular title, abstract, and keywords) labelled with the journal-level FoR (Fields of Research) classification, and then applies these classifiers at the article level. We use this approach in the context of biomedical publications, using metadata from PubMed. fastText classifiers are trained with FoR codes and used to classify publications based on their available metadata. Results show that using a stratified sampling strategy for training helps reduce the bias due to the unbalanced distribution of fields.
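The data preparation described in the abstract can be illustrated with a minimal sketch. It is not the author's code: the record layout, the helper names `to_fasttext_line` and `stratified_sample`, the placeholder FoR codes, and the choice to cap every label at the same number of examples are all assumptions made for illustration. The only detail taken from fastText itself is its supervised input format, in which each line starts with `__label__<code>` followed by the text.

```python
import random
from collections import Counter, defaultdict


def to_fasttext_line(label, title, abstract="", keywords=()):
    """Format one record as a fastText supervised line: '__label__<code> <text>'."""
    text = " ".join([title, abstract, " ".join(keywords)])
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace and newlines
    return f"__label__{label} {text}"


def stratified_sample(records, per_label, seed=0):
    """Draw at most `per_label` records for each label (assumed strategy)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_label, len(group))))
    rng.shuffle(sample)  # avoid label-ordered training data
    return sample


# Toy, unbalanced corpus: the codes "1103" and "1117" are placeholders.
records = [
    {"label": "1103", "title": f"Clinical trial {i}", "abstract": "", "keywords": []}
    for i in range(90)
] + [
    {"label": "1117", "title": f"Health survey {i}", "abstract": "", "keywords": []}
    for i in range(10)
]

balanced = stratified_sample(records, per_label=10)
lines = [
    to_fasttext_line(r["label"], r["title"], r["abstract"], r["keywords"])
    for r in balanced
]
label_counts = Counter(r["label"] for r in balanced)
```

Written to a text file, one line per record, `lines` could then be passed to a fastText supervised trainer. Capping each label is only one way to stratify; the paper may sample differently.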

Updated: 2021-05-03
