A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits


arXiv - CS - Software Engineering, Pub Date: 2020-03-18, DOI: arxiv-2003.08349
Tanner Fry, Tapajit Dey, Andrey Karnauch, Audris Mockus

Data collected from open source projects provide a means to model large software ecosystems, but they often suffer from data quality issues. In particular, multiple author identification strings in code commits may actually belong to one developer. While many methods have been proposed to address this problem, they either rely on heuristics that require manual tweaking or need too much computation time to perform pairwise comparisons across, for example, the 38M author IDs in the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer across this entire dataset, and we share the list of all author IDs found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found that around 14.8 million of them have an alias; these belong to 5.4 million different developers, with a median of 2 aliases per developer. This dataset can be used to create more accurate models of developer behaviour at the level of the entire OSS ecosystem and to provide a service that rapidly resolves new author IDs.
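
The two-stage idea in the abstract (blocking first, then a pairwise decision only inside each block) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the blocking keys (name tokens and the email user part), the `same_developer` function (a plain string-similarity threshold standing in for the trained machine learning model), the 0.8 threshold, and the final union-find step that merges matched pairs into developer clusters.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def parse_author_id(author_id):
    """Split a 'Name <email>' string into a lowercased (name, email) pair."""
    match = re.match(r"\s*(.*?)\s*<([^>]*)>\s*$", author_id)
    if not match:
        return author_id.strip().lower(), ""
    return match.group(1).lower(), match.group(2).lower()


def blocking_keys(author_id):
    """Hypothetical blocking keys: each name token and the email user part."""
    name, email = parse_author_id(author_id)
    keys = {"tok:" + token for token in name.split()}
    user = email.split("@")[0]
    if user:
        keys.add("user:" + user)
    return keys


def similarity(x, y):
    """String similarity in [0, 1]; empty strings never count as a match."""
    if not x or not y:
        return 0.0
    return SequenceMatcher(None, x, y).ratio()


def same_developer(a, b, threshold=0.8):
    """Stand-in for a trained classifier (assumption, not the paper's model)."""
    name_a, email_a = parse_author_id(a)
    name_b, email_b = parse_author_id(b)
    return max(similarity(name_a, name_b),
               similarity(email_a, email_b)) >= threshold


def resolve(author_ids):
    """Return the sets of author IDs predicted to share one developer."""
    # Stage 1: blocking. Only IDs that share a key are ever compared,
    # which avoids a pairwise comparison over the whole collection.
    blocks = defaultdict(set)
    for author_id in author_ids:
        for key in blocking_keys(author_id):
            blocks[key].add(author_id)

    # Union-find over author IDs, used to merge matched pairs transitively.
    parent = {author_id: author_id for author_id in author_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Stage 2: pairwise decision inside each block.
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if same_developer(a, b):
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for author_id in author_ids:
        clusters[find(author_id)].add(author_id)
    # Keep only IDs that were found to have at least one alias.
    return [cluster for cluster in clusters.values() if len(cluster) > 1]
```

Calling `resolve` on a list of `Name <email>` strings returns only the clusters with more than one member, mirroring the abstract's focus on IDs that have aliases. The cost is governed by block sizes rather than by the total number of IDs, which is the reason for blocking; very common keys (for example a frequent first name) would still produce large blocks and would need extra handling at scale.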


Updated: 2020-03-31
