Code smell detection using feature selection and stacking ensemble: An empirical investigation

https://doi.org/10.1016/j.infsof.2021.106648

Abstract

Context:

Code smell detection is the process of identifying code fragments that are poorly designed and implemented. Recently, more research has been directed towards machine learning-based approaches for code smell detection. Many classifiers have been explored in the literature; yet, an effective model to detect different code smell types has not yet been found.

Objective:

The main objective of this paper is to empirically investigate the capabilities of stacking heterogeneous ensemble models in code smell detection.

Methods:

The gain ratio feature selection technique was applied to select the features relevant to code smell detection. The detection performance of 14 individual classifiers was investigated in the context of two class-level and four method-level code smells. Then, three stacking ensembles were built using all individual classifiers as base classifiers, with three different meta-classifiers (LR, SVM and DT).

Results:

GP, MLP, DT and SVM(Lin) classifiers were among the best performing classifiers in detecting most of the code smells. On the other hand, SVM(Sig), NB(B), NB(M) and SGD were among the least accurate classifiers for most smell types. The stacking ensembles with LR and SVM meta-classifiers achieved consistently high detection performance on class-level and method-level code smells compared to all individual models.

Conclusion:

This paper concludes that the detection performance of the majority of individual classifiers varies from one code smell type to another. However, the stacking ensembles with LR and SVM meta-classifiers were consistently superior to all individual classifiers in detecting different code smell types.

Introduction

Software engineers occasionally make poor implementation decisions under time pressure from tight project deadlines, which leads to the need for code cleaning (i.e. refactoring). Software engineers consider refactoring when the software has been in use for a long time and has the potential for additional development [1]. Refactoring is defined as the process of improving the internal structure of software without altering its functionality [2]. Even though refactoring is a time-consuming and labor-intensive process, it brings many benefits, such as improved software quality and design, increased readability and understandability, easier location of software defects, and a smoother development process [2]. However, refactoring is tricky and raises several challenges, since it requires changes that might introduce new bugs and alter the software’s internal behavior [3]. Therefore, a key element in the success of refactoring is identifying the code fragments that need to be changed (i.e. code smells).

Code smells are poor design and implementation choices that might negatively affect important software quality attributes (e.g. understandability, reusability and maintainability) [4], [5]. Different code smell types exist in the software engineering literature, such as duplicated code, long method, large class and long parameter list [2]. These smells can be considered indicators of refactoring opportunities [6]. Therefore, it is important to detect code smells in order to refactor them. The process of identifying code smells is commonly known as code smell detection. There are three main approaches to code smell detection: metrics-based [7], rule-based [8] and machine learning-based [9]. The metrics-based approach requires defining quality metrics such as inheritance, size and cohesion, and then establishing a threshold value for each metric; however, choosing the right threshold is not a trivial task. In the rule-based approach, domain experts must specify rules that define each code smell. These rules are sometimes generated manually and use a domain-specific language. Due to the effort and cognitive load required from software engineers in the former approaches, recent research has been directed more towards machine learning approaches [9], [10], [11].
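To make the metrics-based approach concrete, the following is a minimal, hypothetical sketch of a threshold-based detection rule for a "large class" smell. The metric names and threshold values are illustrative only; they are not thresholds proposed in this paper.

```python
# Hypothetical metrics-based detection rule for a "large class" smell.
# All thresholds below are illustrative, not taken from this paper.

def is_large_class(loc: int, wmc: int, num_methods: int) -> bool:
    """Flag a class as a 'large class' smell when it exceeds
    illustrative size and complexity thresholds."""
    LOC_THRESHOLD = 500      # lines of code
    WMC_THRESHOLD = 47       # weighted methods per class
    METHODS_THRESHOLD = 20   # number of methods
    return (loc > LOC_THRESHOLD
            and (wmc > WMC_THRESHOLD or num_methods > METHODS_THRESHOLD))

print(is_large_class(loc=820, wmc=60, num_methods=12))   # True
print(is_large_class(loc=120, wmc=60, num_methods=30))   # False
```

The difficulty noted above is visible here: the rule's accuracy hinges entirely on threshold values that must be chosen by hand.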

Recently, the application of machine learning classifiers to code smell detection has been investigated, where the classifiers themselves create the detection rules and thresholds. In this approach, software metrics are extracted from the source code and fed to machine learning classifiers, which produce detection rules and thresholds automatically. This reduces the effort required from software engineers and facilitates the detection process. Based on recently published systematic literature reviews [12], [13], several machine learning classifiers have been utilized to detect code smells [14], [15], [16], including decision trees, support vector machines, random forests and naive Bayes. However, a classifier with effective detection performance across different code smells has not yet been found [12], [13].
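The idea that a classifier derives detection rules and thresholds automatically can be sketched with a decision tree trained on software metrics. The data below is synthetic and the metric names are illustrative; real studies train on metrics extracted from labeled source code.

```python
# Sketch: a decision tree learns a code smell detection rule from
# software metrics. The dataset is synthetic and purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [lines_of_code, num_methods, cohesion]
X = [[900, 35, 0.10], [850, 40, 0.20], [700, 30, 0.15],
     [120,  8, 0.80], [200, 10, 0.70], [ 90,  5, 0.90]]
y = [1, 1, 1, 0, 0, 0]  # 1 = smelly ("large class"), 0 = clean

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned tree is itself a human-readable detection rule whose
# thresholds were derived from the data, not set by an expert:
print(export_text(clf, feature_names=["loc", "methods", "cohesion"]))
print(clf.predict([[800, 33, 0.12]]))  # classified as smelly
```

This is the key contrast with the metrics-based approach: the split thresholds in the printed tree come from the training data rather than from manual tuning.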

Ensemble learning [17] is an active research area in machine learning that can enhance the performance of individual classifiers. Ensembles combine the outputs of individual classifiers into a single output. Ensembles fall into two categories: homogeneous and heterogeneous. Homogeneous ensembles are built from classifiers of the same type trained on different parts of the dataset, while heterogeneous ensembles are built from different classifier types. Ensemble models have proven effective at achieving better performance than individual classifiers in class defect prediction [18], [19] and software maintainability prediction [20]. However, recent systematic literature reviews [12], [13] report that ensemble models have not been widely explored for code smell detection. In this paper, we investigate machine learning-based approaches employing ensemble learning in code smell detection. An empirical study is conducted to investigate to what extent a stacking heterogeneous ensemble [21] increases code smell detection performance over individual classifiers. The use of ensemble learning is expected to improve the performance of detection models and thereby facilitate the refactoring process, leading to overall improved software quality.

Paper Organization. We followed the structured experiment report template and guidelines proposed by Jedlitschka et al. [22] for reporting empirical software engineering investigations. The rest of this paper is organized as follows: Section 2 sheds light on the background needed for this research. Section 3 summarizes the related literature. Section 4 gives a detailed description of the empirical study goal, the code smell datasets used, the data preprocessing performed, and the evaluation measures used to assess the classifiers' detection performance. Section 5 discusses in detail the detection performance results for the individual classifiers and ensemble models. Section 6 discusses the identified threats to the validity of our empirical study. Section 7 concludes the paper with directions for future work.

Section snippets

Background

In this section, we outline the definition of the code smells investigated in this research. Then, we present an overview of the heterogeneous stacking ensemble used for code smell detection.

Literature review

Identifying code smells in source code is an active research area in software refactoring. Different techniques have been proposed to identify and detect code smells, such as metrics-based [7], rule-based [8] and machine learning-based [9] approaches. However, few researchers have investigated the use of machine learning techniques in code smell detection [12].

Different types of machine learning classifiers have been used to detect code smells, and most of the reported empirical studies employ a

Empirical study

In this section, we describe our empirical study in detail. First, the goal of the empirical study is stated. Then, we give an overview of the code smell datasets used, followed by a description of the data preprocessing steps. We then describe the model validation used to validate the built machine learning classifiers. Finally, the detection performance metrics are presented.

Results and discussions

In our experiment, the machine learning pipeline was implemented in Python. A total of 14 classifiers were used: DT, SVM with four kernels (Lin, Poly, Sig and RBF), Bernoulli NB, Gaussian NB, Multinomial NB, LR, MLP, SGD, GP, KNN and LDA. We used these 14 classifiers as base classifiers to build the stacking ensembles. All classifiers and ensemble models were built and trained using the scikit-learn framework [63].
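The stacking setup described above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration, not the paper's exact pipeline: it uses only four base classifiers (rather than all 14) and synthetic data as a stand-in for the code smell metric datasets.

```python
# Minimal sketch of a heterogeneous stacking ensemble: several base
# classifiers whose predictions are combined by a logistic regression
# meta-classifier. Data is synthetic, standing in for metric datasets.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# A subset of heterogeneous base classifiers (the paper uses 14).
base_classifiers = [
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svm_lin", SVC(kernel="linear", random_state=42)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]

# The meta-classifier (here LR) learns how to combine the base
# classifiers' out-of-fold predictions into the final decision.
stack = StackingClassifier(estimators=base_classifiers,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Swapping `final_estimator` for an SVM or a DT yields the other two ensemble variants investigated in the paper.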

Threats to validity

Identifying and assessing threats to validity is necessary to ensure the quality of our empirical study findings. Each threat is discussed with the measures taken to mitigate the identified threats.

Conclusion

This paper empirically investigated the application of stacking ensembles to code smell detection and evaluated to what extent a stacking ensemble increases detection performance over individual classifiers. The paper's main contributions can be summarized as follows: First, we applied the gain ratio feature selection technique to investigate the importance of metrics of different granularity as predictors in code smell detection. Second, we evaluated the application of 14 individual

CRediT authorship contribution statement

Amal Alazba: Conceptualization, Methodology, Software, Data curation, Design, Analysis, Writing, and revision of the manuscript. Hamoud Aljamaan: Conceptualization, Methodology, Software, Data curation, Design, Analysis, Writing, and revision of the manuscript.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.infsof.2021.106648.

Acknowledgment

The authors would like to acknowledge the support of King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia in the development of this work.

References (65)

  • Kim, M. et al., A field study of refactoring challenges and benefits
  • Pérez, J., Refactoring planning for design smell correction: Summary, opportunities and lessons learned
  • Charalampidou, S. et al., Size and cohesion metrics as indicators of the long method bad smell: An empirical study
  • Moha, N. et al., DECOR: A method for the specification and detection of code and design smells, IEEE Trans. Softw. Eng. (2010)
  • Alkharabsheh, K. et al., Software design smell detection: a systematic mapping study, Softw. Qual. J. (2019)
  • Liu, H. et al., Deep learning based feature envy detection
  • Barbez, A. et al., A machine-learning based ensemble method for anti-patterns detection, J. Syst. Softw. (2019)
  • Al-Shaaby, A. et al., Bad smell detection using machine learning techniques: A systematic literature review, Arab. J. Sci. Eng. (2020)
  • Amorim, L., Antunes, N., Fonseca, B., Ribeiro, M., Experience report: Evaluating the effectiveness of decision trees...
  • Arcelli Fontana, F. et al., Comparing and experimenting machine learning techniques for code smell detection, Empir. Softw. Eng. (2016)
  • Rokach, L., Ensemble-based classifiers, Artif. Intell. Rev. (2010)
  • Di Nucci, D., Palomba, F., De Lucia, A., Evaluating the Adaptive Selection of Classifiers for Cross-Project Bug Prediction,...
  • Aljamaan, H.I., Elish, M.O., An empirical study of bagging and boosting ensembles for identifying faulty classes in...
  • Elish, M.O. et al., Three empirical studies on predicting software maintainability using ensemble methods, Soft Comput. (2015)
  • Jedlitschka, A. et al., Reporting experiments in software engineering
  • Brown, W.H. et al., AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis (1998)
  • Safavian, S. et al., A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cybern. (1991)
  • Cortes, C. et al., Support-vector networks, Mach. Learn. (1995)
  • Domingos, P.M. et al., Beyond independence: conditions for the optimality of the simple Bayesian classifier
  • Bishop, C.
  • Dawson, C.W. et al., Hydrological modelling using artificial neural networks, Prog. Phys. Geogr. Earth Environ. (2001)
  • Bottou, L., Large-scale machine learning with stochastic gradient descent

Both authors contributed equally to the work done in this manuscript.