Unsupervised software defect prediction using median absolute deviation threshold based spectral classifier on signed Laplacian matrix
Journal of Big Data (IF 8.1), Pub Date: 2019-09-27, DOI: 10.1186/s40537-019-0250-z
Aris Marjuni, Teguh B. Adji, Ridi Ferdiana

Area of interest

Software is increasingly moving into the big data era. Many large software systems consist of hundreds to thousands of modules. In software development projects, manually assessing the defect proneness of every module in a large software dataset is likely to be an inefficient use of resources. For this task, software defect prediction models have become a popular solution that is far more cost-effective than manual review. This study presents a specific machine learning algorithm, the spectral classifier, for building a software defect prediction model with an unsupervised learning approach.

Background and objective

The spectral classifier has been used successfully in software defect prediction because it reliably accounts for the similarities between software entities. However, using zero as the partitioning threshold causes problems under certain conditions. When the eigenvector values are mostly positive, the classifier produces one predominant cluster. When the eigenvector contains outliers, it also produces clusters with low compactness. The main objective of this study is to propose an alternative partitioning threshold that addresses these zero-threshold issues. More generally, the proposed method is expected to improve the performance of spectral classifier based software defect prediction.
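The predominant-cluster effect can be illustrated with a few lines of Python. This is only a toy sketch: the eigenvector values in `v` are made up for illustration (they are not taken from the paper), and the split rule `v > 0` is assumed to be what "zero threshold" means here.

```python
import numpy as np

# Hypothetical eigenvector values: mostly positive, with one large outlier.
v = np.array([0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.90, -0.01])

# Zero-threshold partitioning: entities with a positive value go to one
# cluster, all remaining entities go to the other.
positive_side = v > 0.0

cluster_sizes = (int(positive_side.sum()), int((~positive_side).sum()))
print(cluster_sizes)  # (7, 1)
```

Seven of the eight entities fall on the same side of zero, and the outlier 0.90 is grouped with values near 0.03, which is the kind of lopsided, loosely packed split the abstract describes.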

Methods

This study proposes a spectral classifier based on a median absolute deviation threshold to address the zero-threshold issues. The proposed method uses a dispersion measure of the eigenvector values as the new partitioning threshold, rather than a central tendency measure (e.g., zero, mean, median). The baseline method is the spectral classifier with the zero-value threshold. Both methods are applied to the signed Laplacian matrix to satisfy the non-negative graph Laplacian assumption. For classification, the heuristic row sum method is used to assign each entity's class as its prediction label.
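The steps named in this section can be sketched in Python with NumPy. The abstract does not give the exact formulas, so several details below are assumptions rather than the authors' implementation: the signed Laplacian is taken in its common form (degree matrix built from absolute edge weights, minus the signed adjacency matrix), the adjacency matrix is the Gram matrix of z-scored metrics, the eigenvector of the smallest eigenvalue is the one partitioned, the split rule is `v > MAD(v)`, and the row sum heuristic labels the cluster with the larger mean row sum as defective. All function names are hypothetical.

```python
import numpy as np

def signed_laplacian(W):
    """Signed Laplacian D_bar - W, where D_bar[i, i] = sum_j |W[i, j]|.

    Building the degree matrix from absolute weights keeps the matrix
    positive semidefinite even when W has negative entries.
    """
    W = np.asarray(W, dtype=float)
    D_bar = np.diag(np.abs(W).sum(axis=1))
    return D_bar - W

def median_absolute_deviation(v):
    """MAD: median of the absolute deviations from the median."""
    v = np.asarray(v, dtype=float)
    return float(np.median(np.abs(v - np.median(v))))

def partition(v, threshold):
    """Boolean mask: True where the eigenvector value exceeds the threshold."""
    return np.asarray(v, dtype=float) > threshold

def label_by_row_sum(X, in_cluster):
    """Assumed row sum heuristic: the cluster whose entities have the
    larger mean metric row sum is labelled defective (True)."""
    in_cluster = np.asarray(in_cluster, dtype=bool)
    if in_cluster.all() or not in_cluster.any():
        # Degenerate split: nothing to compare, return the mask unchanged.
        return in_cluster.copy()
    row_sums = np.asarray(X, dtype=float).sum(axis=1)
    if row_sums[in_cluster].mean() >= row_sums[~in_cluster].mean():
        return in_cluster
    return ~in_cluster

# Usage on random stand-in data (8 entities, 3 metrics); not real defect data.
rng = np.random.default_rng(0)
X = rng.random((8, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each metric column
W = Z @ Z.T                                 # signed similarity between entities
np.fill_diagonal(W, 0.0)                    # no self-loops
eigvals, eigvecs = np.linalg.eigh(signed_laplacian(W))
v = eigvecs[:, 0]                           # eigenvector of the smallest eigenvalue

baseline_mask = partition(v, 0.0)                           # zero threshold
proposed_mask = partition(v, median_absolute_deviation(v))  # MAD threshold
predicted_defective = label_by_row_sum(X, proposed_mask)
```

One caveat with this sketch: an eigenvector's overall sign is arbitrary, so a nonzero threshold such as the MAD gives a sign-dependent split, and a real implementation would need to fix a sign convention first.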

Results and conclusion

In terms of clustering, the proposed method produces better cluster memberships, which improves both cluster compactness and classifier performance. The average cluster compactness of the proposed and baseline methods is 1.4 DBI and 1.8 DBI, respectively. In classification, the proposed method achieves higher accuracy and a lower error rate than the baseline method. The proposed method also has high precision but lower recall, which means that it detects software defects more precisely, although it detects fewer of them. The proposed method has average accuracy, precision, recall, and error rate of 0.79, 0.84, 0.72, and 0.21, respectively, while the baseline method has average values of 0.74, 0.74, 0.89, and 0.26, respectively. Based on these results, the proposed method is able to provide a viable solution to the zero-threshold issues in the spectral classifier. Hence, this study concludes that the median absolute deviation threshold can improve the spectral based unsupervised software defect prediction method.
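For readers less familiar with the four classification measures, the sketch below computes them from a confusion matrix using their standard definitions. The function name and the counts are hypothetical and are not the paper's data. Under these definitions the error rate is one minus the accuracy, which matches the reported pairs (0.79 and 0.21, 0.74 and 0.26).

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts.

    tp: defective modules predicted defective
    fp: clean modules predicted defective
    fn: defective modules predicted clean
    tn: clean modules predicted clean
    Assumes tp + fp > 0 and tp + fn > 0 so the ratios are defined.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "precision": tp / (tp + fp),   # how many flagged modules are defective
        "recall": tp / (tp + fn),      # how many defective modules are flagged
        "error_rate": 1.0 - accuracy,  # share of misclassified modules
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```

With these counts, precision (0.75) exceeds recall (0.6), which is the same qualitative pattern reported for the proposed method: most flagged modules are truly defective, but some defective modules are missed.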


Updated: 2019-09-27