LAGOS-AND: A Large, Gold Standard Dataset for Scholarly Author Name Disambiguation
arXiv - CS - Digital Libraries Pub Date : 2021-04-05 , DOI: arxiv-2104.01821
Li Zhang, Wei Lu, Jinqing Yang

In this paper, we present a method to automatically generate a large-scale labeled dataset for author name disambiguation (AND) in the academic world by leveraging two authoritative sources, ORCID and DOI. Using this method, we built LAGOS-AND, a large, gold-standard dataset for AND that differs substantially from existing ones. It contains 7.5M citations authored by 797K unique authors and shows close similarities to the entire Microsoft Academic Graph (MAG) across six gold-standard validations. In building the dataset, we investigated the long-standing name-synonym problem and, for the first time, quantified the degree of variation in last names. Evidence from PubMed, MAG, and Semantic Scholar all suggests that ~7.5% of authorships carry a last name that differs from the credible last name recorded in the ORCID system, even after ignoring variants introduced by special characters. Furthermore, we provide a classification-based AND benchmark on the new dataset and release our model for disambiguation in general scenarios. If this work proves helpful for future studies, we believe it will challenge (1) the widely accepted block-based disambiguation framework in production environments and (2) the state-of-the-art methods and models for AND. The code, dataset, and pre-trained model are publicly available.
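The core idea of using ORCID–DOI links as ground truth can be sketched as follows: if two citations share an ORCID, they are the same person (a positive pair); otherwise they are a negative pair. The same linkage also exposes last-name variation, since one ORCID may appear under different last names. The snippet below is a minimal toy illustration of this labeling scheme, not the authors' actual pipeline; the record tuples and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy records: (doi, author_name, orcid).
# In the paper's method, ORCID-DOI links from authoritative
# sources supply the ground-truth identity labels.
records = [
    ("10.1000/a", "J. Smith",   "0000-0001"),
    ("10.1000/b", "Jane Smith", "0000-0001"),
    ("10.1000/c", "J. Smith",   "0000-0002"),
    ("10.1000/d", "Jane Doe",   "0000-0001"),  # last-name variant of 0000-0001
]

def labeled_pairs(records):
    """Label every citation pair: 1 if both share an ORCID (same author), else 0."""
    pairs = []
    for (d1, _, o1), (d2, _, o2) in combinations(records, 2):
        pairs.append(((d1, d2), int(o1 == o2)))
    return pairs

def last_name_variants(records):
    """Return ORCIDs whose citations carry more than one last name."""
    by_orcid = {}
    for _, name, orcid in records:
        by_orcid.setdefault(orcid, set()).add(name.split()[-1].lower())
    return {o: names for o, names in by_orcid.items() if len(names) > 1}

pairs = labeled_pairs(records)        # 6 pairs, 3 of them positive
variants = last_name_variants(records)  # {"0000-0001": {"smith", "doe"}}
```

Note that, unlike block-based frameworks, this labeling compares citations across name blocks ("Smith" vs. "Doe"), which is exactly what lets the dataset capture the ~7.5% of authorships with varied last names.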

Updated: 2021-04-06