PHANTOM: Curating GitHub for engineered software projects using time-series clustering, Empirical Software Engineering


Empirical Software Engineering (IF 4.1), Pub Date: 2020-05-27, DOI: 10.1007/s10664-020-09825-8
Peter Pickerill, Heiko Joshua Jungen, Mirosław Ochodek, Michał Maćkowiak, Miroslaw Staron

Context: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.

Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.

Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

Results: Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study, which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.

Conclusions: It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
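The pipeline outlined in the abstract (a measure taken from the Git log, turned into a time-series, summarised as a feature vector, clustered with k-means, then scored with Precision and Recall) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the five measures, so weekly commit counts stand in as a single hypothetical measure, and the four summary features, the function names, and the first-k centroid seeding are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def weekly_series(commit_dates):
    """Count commits per week, from the week of the first commit to the last.

    Assumption: weekly commit counts are a stand-in measure; the abstract
    does not say which five measures PHANTOM extracts.
    """
    if not commit_dates:
        return []
    start = min(commit_dates)
    weeks = Counter((d - start).days // 7 for d in commit_dates)
    return [weeks.get(w, 0) for w in range(max(weeks) + 1)]


def features(series):
    """Hypothetical fixed-length summary: length, mean, peak, share of active weeks."""
    n = len(series)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    active = sum(1 for v in series if v > 0)
    return [float(n), sum(series) / n, float(max(series)), active / n]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def precision_recall(predicted, actual):
    """Precision and recall of boolean predictions against boolean ground truth."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Two made-up repositories: one with sustained activity, one with a single burst.
    steady = [date(2020, 1, 1 + 7 * i) for i in range(4)]
    burst = [date(2020, 1, 1), date(2020, 1, 2)]
    vectors = [features(weekly_series(repo)) for repo in (steady, burst)]
    labels, _ = kmeans(vectors, k=2)
    print(labels)
```

Because the features here sit on different scales, a fuller pipeline would probably normalise them before clustering. Which cluster counts as "engineered" would also have to be decided against labelled projects, as the Results paragraph describes.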


Updated: 2020-05-27
