PathPair2Vec: An AST path pair-based code representation method for defect prediction

https://doi.org/10.1016/j.cola.2020.100979

Abstract

Software project defect prediction (SDP) estimates the defect probability of software modules from their features so that testing effort can be allocated accordingly. Existing software defect prediction methods can be divided into two categories: methods based on traditional handcrafted features and methods based on automatically learned abstract features, especially those learned by deep learning. Current research indicates that features learned automatically by deep learning can achieve better performance than handcrafted features.

Code2vec (Alon et al. 2019) is one of the best source code representation models; it leverages deep learning to learn representations automatically from code. In this paper, inspired by code2vec, we propose a new AST path pair-based source code representation method (PathPair2Vec) and apply it to software project defect prediction. We first propose the concept of the short path to describe each terminal node and its control logic. Then, we design a new sequence encoding method to encode the different parts of the terminal node and its control logic. Finally, we describe the semantic information of the code with pairs of short paths and fuse them with an attention mechanism. Experiments on the PROMISE dataset show that our method improves the F1 score by 17.88% over the state-of-the-art SDP method, and that the AST path pair-based source code representation can better identify the defect features of the source code.

Introduction

Software is becoming more complex as user requirements grow, which causes the cost of maintenance and debugging to increase rapidly. Software defect prediction estimates the defect probability of a module's internal logic from the external features of its source code, so that debugging resources can be reallocated to modules with a higher defect probability, which decreases software development costs and improves software reliability. Recently, numerous models have been developed for software defect prediction, mostly based on machine learning. These models can be divided into two categories by the features they use: models that use handcrafted features [1] and models that use automatically learned features [2], [3], [4]. Handcrafted features are selected and defined by human experts from the software source code or the development process. They depend heavily on the experts' experience and are therefore subjective. Common handcrafted features include CK features [5] for object-oriented programs, Halstead features [6] based on operator and operand counts, McCabe features [7] based on dependencies, and MOOD features [7] based on polymorphism factors. Automatic features mainly depend on the feature extraction and transformation ability of deep neural networks (DNNs), which can extract features directly from the source code of a software project and from other additional information. A DNN consists of several network layers connected in sequence, where the output of one layer is used as the input of the next. The output of each layer can be seen as a higher-level abstraction of the raw data. Given sufficient training data and effective training methods, DNNs can learn to extract abstract features of different classes of data, and they have achieved better results than earlier models in multiple areas. Existing results show that automatically generated features have significant advantages over traditional handcrafted features.

Automatic features are objective compared to handcrafted features; they are also more general and can be applied to a variety of domain tasks through an end-to-end model. In the area of software engineering, DNN-based methods have already been applied to code clone detection [8], [9], [10], [11], automatic debugging [12], [13], [14], bug detection [15], [16] and automatic summary generation [17], [18], and have achieved remarkable results.

A typical DNN approach builds an end-to-end model without paying enough attention to the structured data representation at the front end. Such models can still achieve better results given a large amount of computing power. However, under the same architecture, imposing a prior knowledge constraint on the front-end structured data representation directly reduces the solution space of the back-end model. Thus, the learning difficulty is reduced, and both the computational performance and the final result improve. Extracting features directly from source code is an ideal way to represent software, since all the functions of the software are described by the source code. Software source code differs from natural language: it consists of reserved words and manually defined identifiers and has a strong structure. The semantics of a program are expressed through the structured combination of these reserved words and identifiers. How to capture and encode source code tokens and the structural information between them is a critical research topic.

Representation learning for code can operate at a variety of granularities, such as byte code, identifiers, abstract syntax trees (ASTs) or control flow graphs (CFGs) [19]. Raychev et al. [20] proposed a method for building source code dependency graphs from human-defined rules and leveraged a conditional random field (CRF) to model the graph. They achieved good results in predicting variable names and variable types in JavaScript. Allamanis et al. [21] added human-defined data flow edges to the AST to build a syntax structure graph and then learned node embedding vectors with gated graph neural networks (GGNNs). This approach showed improved performance in predicting variable names and detecting variable misuse. Alon et al. [22], [23] proposed an AST path-based source code representation method that uses paths between nodes in ASTs as building blocks. They explored path representations using word2vec, CRFs and an attention mechanism, with paths constrained in length and width. Results on several programming languages (JavaScript, Java, Python, C#) showed that the AST path-based representation achieves better performance in predicting variable, method and type names.
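To make the idea of AST paths between terminal nodes concrete, the following minimal sketch enumerates leaf-to-leaf paths with Python's standard `ast` module. It is an illustration only, not the extraction procedure of [22], [23] or of this paper: the choice of `ast.Name`, `ast.Constant` and `ast.arg` as terminals, the helper names, and the `max_length` limit are all assumptions made for the example.

```python
import ast
from itertools import combinations

# Assumption for this sketch: these node types count as terminal nodes.
TERMINALS = (ast.Name, ast.Constant, ast.arg)


def terminal_token(node):
    """Return the source token carried by a terminal node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    return repr(node.value)


def collect_terminals(tree):
    """Return (token, root-to-terminal node chain) for every terminal."""
    found = []

    def visit(node, ancestors):
        chain = ancestors + [node]
        if isinstance(node, TERMINALS):
            found.append((terminal_token(node), chain))
            return
        for child in ast.iter_child_nodes(node):
            visit(child, chain)

    visit(tree, [])
    return found


def leaf_to_leaf_paths(source, max_length=8):
    """Enumerate (token_a, node type sequence, token_b) for terminal pairs."""
    terminals = collect_terminals(ast.parse(source))
    paths = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(terminals, 2):
        # Count the ancestors shared by both terminals (at least the root).
        shared = 0
        while (shared < min(len(chain_a), len(chain_b))
               and chain_a[shared] is chain_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(chain_a[shared:])]
        top = type(chain_a[shared - 1]).__name__  # lowest common ancestor
        down = [type(n).__name__ for n in chain_b[shared:]]
        nodes = up + [top] + down
        if len(nodes) <= max_length:
            paths.append((tok_a, nodes, tok_b))
    return paths


paths = leaf_to_leaf_paths("x = y + 1")
```

Each resulting triple climbs from the first terminal to the lowest common ancestor and then descends to the second terminal; in this sketch the path includes the node types of the two terminals themselves.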

Inspired by [23], we propose a new AST path pair-based source code representation method. The main contributions are as follows:

  • Improvement to the representation of terminal nodes. The existing terminal node representation method treats identifiers as a whole. Since different software projects follow different naming rules for identifiers, the learned vectors generalize poorly. Inspired by Alon et al. [24], we split identifiers into subtoken sequences, which greatly enhances the generalization ability of the embedding vectors. Additionally, the terminal nodes are heterogeneous; we encode the AST syntax type information and represent it together with the identifier information. We discuss how to encode and fuse the identifier subtoken vectors and the type information, and find that Bi-LSTM-based encoding with a concatenation of identifier and type information is better than elementwise addition.

  • Improvement to the representation of path sequence. The existing model treats the internal nodes between two terminal nodes as one sequence. However, they actually correspond to two separate source code grammar sequences. We first propose the concept of the short path to describe each terminal node and its control logic and encode each control logic by the same Bi-LSTM. Then, we unite the two paths as a path pair to represent the source code. Experiments show that the proposed method enhances the ability to express the source code control logic and improves the performance of the model.

  • Discussion on how to represent source code files for software defect prediction tasks. Most existing software defect prediction methods identify defects at the level of the whole source file. We discuss whether to build path-based representations from the AST of the whole source code file or from the sub-ASTs of the methods in the source code, and compare the two options in experiments.
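
The first two contributions can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's implementation: the excerpt does not give a formal definition of the short path, so the sketch assumes it is the sequence of internal node types from a terminal's parent up to the lowest common ancestor shared with the other terminal. The function names, the `(node_id, node_type)` chain format, and the splitting regex are all hypothetical.

```python
import re


def split_subtokens(identifier):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    subtokens = []
    for part in re.split(r"[_\W]+", identifier):
        subtokens.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
        )
    return [s.lower() for s in subtokens]


def common_prefix_length(chain_a, chain_b):
    """Count the leading nodes shared by two root-to-terminal chains."""
    shared = 0
    for node_a, node_b in zip(chain_a, chain_b):
        if node_a != node_b:
            break
        shared += 1
    return shared


def short_path(token, chain, shared):
    """Describe one terminal: subtokens, syntax type, and control sequence.

    Assumption: the control sequence runs from the terminal's parent up to
    the lowest common ancestor (the last shared node).
    """
    return {
        "subtokens": split_subtokens(token),
        "terminal_type": chain[-1][1],
        "control": [kind for _, kind in reversed(chain[shared - 1:-1])],
    }


def path_pair(token_a, chain_a, token_b, chain_b):
    """Split one leaf-to-leaf path into a pair of short paths."""
    shared = common_prefix_length(chain_a, chain_b)
    return (short_path(token_a, chain_a, shared),
            short_path(token_b, chain_b, shared))


# Hypothetical chains of (node_id, node_type) for `totalCount = item_price + 1`.
chain_left = [(0, "Module"), (1, "Assign"), (2, "Name")]
chain_right = [(0, "Module"), (1, "Assign"), (3, "BinOp"), (4, "Name")]
left, right = path_pair("totalCount", chain_left, "item_price", chain_right)
```

In a full model, each subtoken sequence and each control sequence would be passed to a sequence encoder such as a Bi-LSTM, and the encoded pairs would then be fused by attention; those learned components are omitted here.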

Section snippets

Software defect prediction

Software defect prediction is an important research topic in software engineering. It assumes that there is an essential relationship between software defects and the features exhibited by the software, so a prediction model can be built by studying this relationship. Defect prediction can be divided into two categories by how the training and testing data are selected, which are within-project defect prediction

Proposed model

Computer programs are data-centric. All program code is essentially a series of operations on the data to make the data meet human needs. The programmer first obtains the raw data by some input method, then analyzes the data and human requirements, writes the source code to manipulate the data to obtain the resulting data that meets the demand, and finally outputs it in some form. In the source code, the data are stored in a data structure and manipulated by a programmer-named identifier. The

Experimental setup

In this section, we introduce specific settings of our experiment.

Experimental results and discussion

In this section, we explore how to better encode the terminal node and internal node sequence of the AST path through experiments. First, we determine the source of the AST path.

Conclusion and future work

This paper systematically analyzes the shortcomings of existing path-based representation methods in terms of path extraction, internal node sequence encoding, and terminal node encoding. Then, we propose a new software source code representation method based on path pairs in the AST. The proposed model achieves the best F1 value in the software defect prediction task, exceeding the current state-of-the-art model by 17.88%. Our model also outperforms code2vec by 3.01% on F1 and 5.45% on

CRediT authorship contribution statement

Ke Shi: Conceptualization, Methodology, Software, Writing - original draft. Yang Lu: Supervision, Writing - review & editing. Jingfei Chang: Investigation, Validation, Visualization. Zhen Wei: Data curation, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program, China (Grant No. 2018YFC0604404, 2016YFC0801804), the National Natural Science Foundation of China (Grant No. 61806067) and the Fundamental Research Funds for the Central Universities (Grant No. PA2019GDPK0079).

References (35)

  • Fu W. et al.

    Revisiting unsupervised learning for defect prediction

  • Wang S. et al.

    Automatically learning semantic features for defect prediction

  • Dam H.K. et al.

    Lessons learned from using a deep tree-based model for software defect prediction in practice

  • Dam H.K. et al.

    Automatic feature learning for vulnerability prediction

    (2017)

  • Chidamber S.R. et al.

    A Metrics Suite for Object Oriented Design

    (1994)

  • Halstead M.H.

    Elements of Software Science (Operating and Programming Systems Series)

    (1978)

  • McCabe T.J.

    A complexity measure

    IEEE Trans. Softw. Eng.

    (2006)

  • Chilowicz M. et al.

    Syntax tree fingerprinting for source code similarity detection

  • Yang J. et al.

    Classification model for code clones based on machine learning

    Empir. Softw. Eng.

    (2015)

  • Zibran M.F. et al.

    Towards flexible code clone detection, management, and refactoring in IDE

  • Sajnani H. et al.

    SourcererCC: Scaling code clone detection to big-code

  • Le Goues C. et al.

    GenProg: A generic method for automatic software repair

    IEEE Trans. Softw. Eng.

    (2012)

  • Böhme M. et al.

    Where is the bug and how is it fixed? An experiment with practitioners

  • Jeffrey D. et al.

    BugFix: A learning-based tool to assist developers in fixing bugs

  • Pradel M. et al.

    DeepBugs: A learning approach to name-based bug detection

    Proc. ACM Program. Lang.

    (2018)

  • Steidl D. et al.

    Feature-based detection of bugs in clones

  • M. Allamanis, H. Peng, C. Sutton, A convolutional attention network for extreme summarization of source code, in:...