Disambiguation of author entities in ADS using supervised learning and graph theory methods
Scientometrics (IF 3.5), Pub Date: 2021-04-20, DOI: 10.1007/s11192-021-03951-w
Helena Mihaljević, Lucía Santamaría

Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses at the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Usual approaches rely on publications' metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities in the resulting weighted graphs through label propagation, a fast yet effective algorithm that requires no tuning. The resulting procedure has reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach uses the Random Forest classifier and yields micro- and macro-averaged BCubed \(\mathrm{F}_1\) scores of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
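To make the evaluation metric concrete, the following is a minimal sketch of the standard BCubed precision, recall, and F1 computation for one author block, together with one plausible reading of the two averaging modes. It is not the authors' released code. The function names `bcubed` and `micro_macro_f1`, the toy labels, and the interpretation of "micro" as pooling all authorships across blocks and "macro" as averaging per-block F1 scores are assumptions made for illustration only.

```python
from collections import Counter


def bcubed(true_labels, pred_labels):
    """Return BCubed (precision, recall, f1) for two parallel label lists.

    true_labels[i] is the gold author profile of authorship i,
    pred_labels[i] is the cluster the disambiguator assigned to it.
    """
    if not true_labels or len(true_labels) != len(pred_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    true_sizes = Counter(true_labels)                 # size of each gold profile
    pred_sizes = Counter(pred_labels)                 # size of each predicted cluster
    overlap = Counter(zip(true_labels, pred_labels))  # items sharing both labels

    n = len(true_labels)
    pairs = list(zip(true_labels, pred_labels))
    # Per-item precision: share of the item's predicted cluster that has its gold label.
    precision = sum(overlap[(t, p)] / pred_sizes[p] for t, p in pairs) / n
    # Per-item recall: share of the item's gold profile found in its predicted cluster.
    recall = sum(overlap[(t, p)] / true_sizes[t] for t, p in pairs) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def micro_macro_f1(blocks):
    """blocks maps a block name to (true_labels, pred_labels).

    Assumption: micro = BCubed F1 over all authorships pooled together,
    macro = unweighted mean of the per-block BCubed F1 scores.
    """
    pooled_true, pooled_pred, per_block = [], [], []
    for name, (true_labels, pred_labels) in blocks.items():
        # Prefix labels with the block name so clusters never merge across blocks.
        pooled_true += [(name, t) for t in true_labels]
        pooled_pred += [(name, p) for p in pred_labels]
        per_block.append(bcubed(true_labels, pred_labels)[2])
    micro = bcubed(pooled_true, pooled_pred)[2]
    macro = sum(per_block) / len(per_block)
    return micro, macro


# Hypothetical toy data: two author blocks, one imperfectly clustered.
toy_blocks = {
    "Smith, J": (["a", "a", "a", "b"], [0, 0, 1, 1]),
    "Lee, K": (["x", "y"], [5, 6]),
}
micro_f1, macro_f1 = micro_macro_f1(toy_blocks)
```

Under this reading, large blocks dominate the pooled (micro) score while every block counts equally in the macro score, which is one way a gap such as 0.95 versus 0.87 can arise.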




Updated: 2021-04-20