Predicting long-time contributors for GitHub projects using machine learning

Information and Software Technology (IF 3.8), Pub Date: 2021-05-10, DOI: 10.1016/j.infsof.2021.106616
Vijaya Kumar Eluri, Thomas A. Mazzuchi, Shahram Sarkani

Context:

Many organizations develop software systems using open source software (OSS), which is risky because an OSS project can easily lose its support. Contributors are critical to the survival of OSS projects, but very few new contributors stay with a project long enough to become long-time contributors (LTCs). Identifying the factors that contribute to becoming an LTC can help OSS project owners use their limited resources to retain new contributors.

Objective:

In this paper, we investigate whether we can effectively predict which new contributors to OSS repositories will become long-time contributors, based on repository and contributor metadata collected from GitHub.

Method:

We construct a dataset containing 70,899 observations from the 888 most popular repositories, covering 56,766 contributors. Each observation represents a contributor who joined a repository and is labeled as either an LTC or a non-LTC, depending on whether their project tenure is longer than 3 years. Each observation has 31 features, calculated from information about the new contributor and the repository at the time the contributor joins the project. We build several machine learning models, including naive Bayes, k-nearest neighbor, logistic regression, decision tree, and random forest, to predict LTC status, and validate them using 10-fold cross-validation. We compare our best model with the state-of-the-art model in terms of precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the curve (AUC).
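
As an illustration of the labeling rule and of the threshold-based metrics named above, here is a minimal Python sketch. It is not the authors' code. The helper names (label_ltc, binary_metrics), the 365-day year, and the toy tenures and predictions are all assumptions made for the example. AUC is left out because it needs predicted scores rather than hard labels.

```python
import math

# Assumption: the abstract does not say how tenure is measured; days are used here.
DAYS_PER_YEAR = 365


def label_ltc(tenure_days, threshold_years=3):
    """Return 1 (LTC) if tenure is strictly longer than the threshold, else 0."""
    return 1 if tenure_days > threshold_years * DAYS_PER_YEAR else 0


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for 0/1 labels, with LTC (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four cells of the confusion matrix, so it stays informative
    # when one class (here, non-LTC) dominates.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical tenures (in days) and hypothetical model predictions.
tenures = [1500, 200, 30, 1200, 90, 400, 2000, 10]
y_true = [label_ltc(d) for d in tenures]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]

print(binary_metrics(y_true, y_pred))
```

In this toy example the model finds one of the three LTCs and raises one false alarm, so precision is 0.5 and recall is 1/3. A combination of moderate precision and low recall is typical when the positive class is rare.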

Results:

In 10-fold cross-validation, the precision, recall, F1-score, MCC, and AUC of our best model (random forest) are 0.695, 0.079, 0.140, 0.226, and 0.913, respectively. These values are 27.29%, 92.68%, 86.67%, 56.94%, and 0.55% better, respectively, than those of the best state-of-the-art baseline model (also a random forest).
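
The percentages above are easier to read once the underlying arithmetic is explicit. The sketch below assumes that "better" means relative improvement, (ours - baseline) / baseline. The abstract does not state this definition, and the function names are hypothetical. Under that assumption, the baseline value implied by each reported pair can be recovered by inverting the formula.

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def implied_baseline(ours, improvement_pct):
    """Invert relative_improvement: the baseline that yields the given percentage."""
    return ours / (1.0 + improvement_pct / 100.0)


# (value of the best model, reported improvement in percent), as given in the abstract.
reported = {
    "precision": (0.695, 27.29),
    "recall": (0.079, 92.68),
    "f1": (0.140, 86.67),
    "mcc": (0.226, 56.94),
    "auc": (0.913, 0.55),
}

for name, (ours, pct) in reported.items():
    base = implied_baseline(ours, pct)
    print(f"{name}: ours={ours:.3f}, implied baseline ~ {base:.3f}")
```

Under this reading, the large relative gains in recall and F1-score come from small absolute values: the implied baseline recall is about 0.041, against 0.079 for the new model. These implied baselines are derived figures, not numbers reported in the abstract.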

Conclusion:

Compared to state-of-the-art models, the models built using our approach use fewer than half as many features (31 vs. 63), do not need to wait one month after a contributor joins before predicting future LTC status, and produce better results.
