REPD: Source code defect prediction as anomaly detection, Journal of Systems and Software



Journal of Systems and Software (IF 3.7), Pub Date: 2020-10-01, DOI: 10.1016/j.jss.2020.110641
Petar Afric, Lucija Sikic, Adrian Satja Kurdija, Marin Silic

Abstract: In this paper, we present a novel approach to within-project source code defect prediction. Because defect prediction datasets are typically imbalanced, with few defective examples, we treat defect prediction as anomaly detection. We present the Reconstruction Error Probability Distribution (REPD) model, which can handle both point and collective anomalies. We compare it on five traditional code feature datasets against five models: Gaussian Naive Bayes, logistic regression, k-nearest neighbors, decision tree, and Hybrid SMOTE-Ensemble. In addition, REPD is compared against the same models on 24 semantic feature datasets. To compare the performance of the competing models, we use the F1-score. By statistical means, we show that our model produces significantly better results, improving the F1-score by up to 7.12%. Additionally, we analyze REPD's robustness to dataset imbalance by creating defect-undersampled and non-defect-oversampled datasets.
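The abstract does not spell out how REPD works internally, so the following is only a rough sketch of the general idea its name suggests: reconstruct each example with a model fitted on non-defective code, then classify by which class's reconstruction-error distribution explains the error better. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-component PCA reconstruction (via power iteration), the Gaussian fitted to each class's errors, the toy two-feature data, and the helper names `fit`, `predict`, and `f1_score`.

```python
import math
import statistics

def fit_principal_axis(rows, iters=50):
    """Mean and first principal direction of `rows`, found by power iteration."""
    dim = len(rows[0])
    mean = [statistics.fmean(r[j] for r in rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    axis = [1.0] * dim  # assumes the data is not exactly orthogonal to this start
    for _ in range(iters):
        proj = [sum(c[j] * axis[j] for j in range(dim)) for c in centered]
        new = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new)) or 1.0
        axis = [v / norm for v in new]
    return mean, axis

def reconstruction_error(row, mean, axis):
    """Squared distance between `row` and its projection onto the fitted axis."""
    centered = [x - m for x, m in zip(row, mean)]
    score = sum(c * a for c, a in zip(centered, axis))
    return sum((c - score * a) ** 2 for c, a in zip(centered, axis))

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit(clean_rows, defective_rows):
    """Fit the reconstruction on clean rows, then one error Gaussian per class."""
    mean, axis = fit_principal_axis(clean_rows)
    params = {}
    for label, rows in ((0, clean_rows), (1, defective_rows)):
        errs = [reconstruction_error(r, mean, axis) for r in rows]
        params[label] = (statistics.fmean(errs), statistics.pstdev(errs) + 1e-9)
    return mean, axis, params

def predict(row, model):
    """Return 1 (defective) or 0 (clean), whichever error density is higher."""
    mean, axis, params = model
    err = reconstruction_error(row, mean, axis)
    return max(params, key=lambda label: gaussian_logpdf(err, *params[label]))

def f1_score(y_true, y_pred):
    """F1 for the positive (defective) class: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy data (made up): clean rows lie near y = 2x, defective rows lie far from it.
noise = [0.3, -0.1, 0.2, -0.3, 0.1, -0.2]
clean = [(float(t), 2.0 * t + noise[t % 6]) for t in range(12)]
defective = [(2.0, 7.5), (6.0, 9.0), (9.0, 22.0)]

model = fit(clean, defective)
held_out = [(4.0, 8.1), (3.0, 9.5)]
print([predict(row, model) for row in held_out])
```

A Gaussian is a crude fit for non-negative reconstruction errors; it is used here only to keep the sketch short. The imbalance between 12 clean and 3 defective rows mirrors the setting the abstract describes, where the reconstruction model sees mostly non-defective code.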


Updated: 2020-10-01
