ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems
Empirical Software Engineering (IF 3.5), Pub Date: 2020-01-03, DOI: 10.1007/s10664-019-09786-7
Sadika Amreen, Audris Mockus, Russell Zaretzki, Christopher Bogart, Yuxia Zhang

Accurate determination of developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors, and existing approaches to correcting these errors are difficult to validate at large scale and cannot be easily improved. We therefore aim to develop a scalable, highly accurate, easy-to-use, and easy-to-improve approach to correcting software developer identity errors.

We first amalgamate developer identities from version control systems in open source software repositories, investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models that predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data.

We find that commits made from different environments, misspellings, organizational IDs, default values, and anonymous IDs are the major sources of errors. We also find that supervised learning methods reduce errors severalfold in comparison with existing research and commercial methods, and that the active learning approach is an effective way to create validated datasets. The results also indicate that correcting developer identities has a large impact on the inferred social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
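
To make the idea of behavioral fingerprints concrete, the following is a minimal sketch of how two author identities might be compared before a supervised model decides whether they belong to the same developer. It is not the authors' implementation: the similarity measures (a `difflib` string ratio, cosine similarity over time-zone counts, Jaccard overlap of file sets), the function names, and the record layout are all assumptions made for illustration, and the commit-message embedding is left out to keep the example dependency-free.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity of two author strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def timezone_similarity(tz_a: list[str], tz_b: list[str]) -> float:
    """Cosine similarity between the time-zone frequency vectors of two authors."""
    fa, fb = Counter(tz_a), Counter(tz_b)
    dot = sum(fa[z] * fb[z] for z in fa.keys() & fb.keys())
    norm = sqrt(sum(v * v for v in fa.values())) * sqrt(sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


def file_similarity(files_a: set[str], files_b: set[str]) -> float:
    """Jaccard overlap of the sets of files each author modified."""
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 0.0


def pair_features(a: dict, b: dict) -> list[float]:
    """Feature vector for one pair of identities (hypothetical record layout)."""
    return [
        name_similarity(a["author"], b["author"]),
        timezone_similarity(a["tz"], b["tz"]),
        file_similarity(a["files"], b["files"]),
    ]


# Hypothetical example records; the names and values are made up.
alice_work = {
    "author": "Jane Doe <jane@example.com>",
    "tz": ["+0100", "+0100", "+0200"],
    "files": {"a.py", "b.py"},
}
alice_home = {
    "author": "jane doe <jdoe@example.org>",
    "tz": ["+0100", "+0100"],
    "files": {"b.py", "c.py"},
}
features = pair_features(alice_work, alice_home)
```

In a pipeline of the kind the abstract describes, feature vectors like `features` would be labeled for a manually validated subset of pairs and used to fit a classifier; which model and which distance measures work best is an empirical question that this sketch does not address.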

Updated: 2020-01-03
