A framework for constructing a huge name disambiguation dataset: algorithms, visualization and human collaboration
arXiv - CS - Social and Information Networks. Pub Date: 2020-07-04, DOI: arxiv-2007.02086
Zhuoyue Xiao, Yutao Zhang, Bo Chen, Xiaozhao Liu, Jie Tang

We present WhoisWho, a manually labeled author name disambiguation (AND) dataset consisting of 399,255 documents and 45,187 distinct authors across 421 ambiguous author names. To label this much AND data with high accuracy, we propose a novel annotation framework in which humans and computers collaborate efficiently and precisely. Within the framework, we also propose an inductive disambiguation model that classifies whether two documents belong to the same author. We evaluate the proposed method and other state-of-the-art disambiguation methods on WhoisWho. The experimental results show that (1) our model outperforms the other disambiguation algorithms on this challenging benchmark, and (2) the AND problem remains largely unsolved and requires more in-depth research. We believe that such a large-scale benchmark will bring great value to the author name disambiguation task. We also conduct several experiments to demonstrate that our annotation framework helps annotators produce accurate results efficiently and effectively eliminates labeling errors made by human annotators.
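To make the pairwise framing in the abstract concrete, the following is a minimal sketch of how "do these two documents belong to the same author?" decisions can be turned into author clusters. It is not the paper's model: the `same_author` rule (Jaccard overlap of coauthor sets), the `threshold` value, the document fields, and the union-find clustering step are all assumptions chosen for illustration, standing in for whatever learned classifier is actually used.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author(doc_a, doc_b, threshold=0.3):
    # Hypothetical stand-in for a learned pairwise classifier:
    # two documents are linked when their coauthor sets overlap enough.
    return jaccard(doc_a["coauthors"], doc_b["coauthors"]) >= threshold


def cluster_documents(docs, threshold=0.3):
    """Group documents into author clusters from pairwise decisions.

    Uses union-find so that positive pairs are merged transitively.
    """
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if same_author(docs[i], docs[j], threshold):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(docs[i]["id"])
    return sorted(clusters.values())


# Toy documents that all carry the same ambiguous author name (assumed data).
docs = [
    {"id": "d1", "coauthors": {"A. Smith", "B. Lee"}},
    {"id": "d2", "coauthors": {"B. Lee", "C. Wu"}},
    {"id": "d3", "coauthors": {"X. Zhao"}},
]

print(cluster_documents(docs))
```

In this toy input, `d1` and `d2` share one of three distinct coauthors and are merged, while `d3` stays on its own. A real system would replace `same_author` with a trained model and would likely need a more careful clustering step, since transitive merging propagates any false positive pair.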

Updated: 2020-07-07