Ethnicity-based name partitioning for author name disambiguation using supervised machine learning, Journal of the Association for Information Science and Technology

Ethnicity-based name partitioning for author name disambiguation using supervised machine learning
Journal of the Association for Information Science and Technology (IF 3.5), Pub Date: 2021-02-23, DOI: 10.1002/asi.24459
Jinseok Kim, Jenna Kim, Jason Owen-Smith

In several author name disambiguation studies, some ethnic name groups, such as East Asian names, are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing the performance of four machine learning algorithms trained and tested either on the entire data or on individual name groups. Results show that ethnicity-based name partitioning can substantially improve disambiguation performance because the individual models are better suited to their respective name groups. The improvements occur across all ethnic name groups, with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity-specific feature weights to improve prediction for specific ethnic name categories. These findings hold for three labeled datasets with a natural distribution of problem sizes, as well as for one in which all ethnic name groups are controlled to have the same sizes of ambiguous names. This study is expected to motivate scholars to group author names by ethnicity prior to disambiguation.
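
The comparison described in the abstract (one model for all name pairs versus one model per name group) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the group labels "A" and "B", the similarity scores, and the single-feature threshold rule are made up, and the threshold rule stands in for the four machine learning algorithms that the abstract mentions but does not name. Accuracy is measured on the same toy pairs used for fitting, so the numbers say nothing about the paper's results.

```python
from collections import defaultdict

# Toy, made-up name pairs: (name_group, coauthor_similarity, is_same_author).
# Group labels and scores are illustrative assumptions, not data from the paper.
PAIRS = [
    ("A", 0.9, True), ("A", 0.7, True), ("A", 0.4, False), ("A", 0.2, False),
    ("B", 0.5, True), ("B", 0.3, True), ("B", 0.1, False), ("B", 0.05, False),
]

def accuracy(pairs, threshold_for):
    """Share of pairs where 'similarity >= threshold' agrees with the label."""
    hits = sum((sim >= threshold_for(group)) == label
               for group, sim, label in pairs)
    return hits / len(pairs)

def fit_threshold(pairs):
    """Pick the observed similarity value that gives the best accuracy on these pairs."""
    candidates = sorted({sim for _, sim, _ in pairs})
    return max(candidates, key=lambda t: accuracy(pairs, lambda _group: t))

def fit_partitioned(pairs):
    """Fit one threshold per name group (hypothetical stand-in for per-group models)."""
    by_group = defaultdict(list)
    for pair in pairs:
        by_group[pair[0]].append(pair)
    return {group: fit_threshold(members) for group, members in by_group.items()}

# One threshold for all pairs versus one threshold per name group.
global_t = fit_threshold(PAIRS)
group_t = fit_partitioned(PAIRS)

global_acc = accuracy(PAIRS, lambda _group: global_t)
partitioned_acc = accuracy(PAIRS, lambda group: group_t[group])
print(f"global: {global_acc:.2f}, partitioned: {partitioned_acc:.2f}")
```

In this toy setup the two groups separate matched from nonmatched pairs at different similarity levels, so a single cut-off has to compromise while per-group cut-offs do not. A realistic evaluation would use held-out labeled data and report performance on matched and nonmatched pairs separately, as the abstract does.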

Updated: 2021-02-23
