Predicting technical debt from commit contents: reproduction and extension with automated feature selection, Software Quality Journal



Predicting technical debt from commit contents: reproduction and extension with automated feature selection
Software Quality Journal (IF 1.7), Pub Date: 2020-07-04, DOI: 10.1007/s11219-020-09520-3
Leevi Rantala, Mika Mäntylä

Self-admitted technical debt (SATD) refers to sub-optimal development solutions that are expressed in written code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing (NLP) methods: Bag-of-Words, topic modeling, and word embedding vectors. We study 5 open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work, which identified self-admitted technical debt from code-level commits, serves as ground truth. Our approach achieves an area under the ROC curve that is 0.15 higher than the prior work when comparing only commit message features, and 0.03 higher overall when manually selected features are replaced with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work makes four main contributions: comparing different NLP techniques for SATD detection, improving on the results of previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words that correlate with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as our datasets and scripts, to enable replication studies and to aid in the creation of future classifiers.


Updated: 2020-07-04
