PathPair2Vec: An AST path pair-based code representation method for defect prediction

doi:10.1016/j.cola.2020.100979

Journal of Computer Languages

Volume 59, August 2020, 100979

https://doi.org/10.1016/j.cola.2020.100979

Abstract

Software defect prediction (SDP) estimates the probability that software modules contain bugs from their features, so that testing effort can be allocated accordingly. Existing SDP methods fall into two categories: methods based on traditional handcrafted features and methods based on automatically learned abstract features, especially those learned by deep learning. Current research indicates that automatically learned deep features can achieve better performance than handcrafted features.

Code2vec (Alon et al. 2019) is one of the best source code representation models; it leverages deep learning to learn representations automatically from code. In this paper, inspired by code2vec, we propose a new AST path pair-based source code representation method (PathPair2Vec) and apply it to software defect prediction. We first propose the concept of the short path to describe each terminal node and its control logic. Then, we design a new sequence encoding method to encode the different parts of the terminal node and its control logic. Finally, we describe the semantic information of code by pairs of short paths and fuse them with an attention mechanism. Experiments on the PROMISE dataset show that our method improves the F1 score by 17.88% over the state-of-the-art SDP method, and that the AST path pair-based representation better identifies the defect features of source code.
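The attention-based fusion mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which each path-pair vector is scored against a learned attention vector, the scores are normalized with a softmax, and the code vector is the weighted sum. The function name `attention_pool` and all numeric values are hypothetical, and the paper's actual architecture may differ.

```python
import math

def attention_pool(vectors, attention):
    """Fuse context vectors into one vector using softmax attention weights."""
    # Score each context vector by its dot product with the attention vector.
    scores = [sum(v_i * a_i for v_i, a_i in zip(v, attention)) for v in vectors]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The fused vector is the attention-weighted sum of the context vectors.
    dim = len(vectors[0])
    fused = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return fused, weights

# Toy path-pair vectors; in a trained model these would come from the encoders.
path_pair_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The attention vector would be learned; it is fixed here for illustration.
attention_vector = [0.5, -0.5]

code_vector, weights = attention_pool(path_pair_vectors, attention_vector)
print(weights)
print(code_vector)
```

The weights sum to one, so path pairs that score higher against the attention vector contribute more to the final code vector.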

Introduction

Software is becoming more complex as user requirements grow, which causes the cost of maintenance and debugging to increase rapidly. Software defect prediction estimates the bug probability of a module's internal logic from the external features of its source code, so that debugging resources can be reallocated to modules with a higher bug probability, reducing development costs and improving software reliability. Recently, numerous models have been developed for software defect prediction, most of them based on machine learning. These models can be divided into two categories by the features they use: models that use handcrafted features [1] and models that use automatically learned features [2], [3], [4]. Handcrafted features are selected and defined by human experts from the software source code or the development process. They depend heavily on the experts’ experience and are therefore subjective. Common handcrafted features include CK features [5] for object-oriented programs, Halstead features [6] based on operator and operand counts, McCabe features [7] based on dependencies, and MOOD features [7] based on polymorphism factors.

Automatic features mainly depend on the feature extraction and transformation ability of deep neural networks (DNNs), which can extract features directly from the source code of a software project and from other additional information. A DNN consists of several network layers connected in sequence, with the output of each layer used as the input of the next. The output of each layer can be seen as a higher-level abstraction of the raw data. Given sufficient training data and effective training methods, DNNs can learn to extract abstract features of different classes of data, and they have therefore outperformed earlier models in multiple areas. Existing results show that automatic feature generation has significant advantages over traditional handcrafted features.
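To make the notion of a handcrafted feature concrete, the sketch below computes a McCabe-style complexity count as one plus the number of decision points in a piece of code. Python's standard `ast` module is used purely for convenience. The set of node types counted as decision points and the name `decision_point_complexity` are assumptions for illustration, not the metric definition used in the cited work.

```python
import ast

# Node types treated as decision points (an illustrative choice).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def decision_point_complexity(source: str) -> int:
    """Return 1 plus the number of branching nodes found in the parsed source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

SAMPLE = '''
def classify(x):
    if x < 0:
        return "negative"
    for i in range(x):
        if i % 2 == 0 and i > 2:
            return "found"
    return "none"
'''

print(decision_point_complexity(SAMPLE))
```

A metric of this kind reduces a whole function to one number chosen by an expert, which is the subjectivity that learned features are meant to avoid.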

Automatic features are more objective than handcrafted features; they are also more general and can be applied to a variety of domain tasks through an end-to-end model. In the area of software engineering, DNN-based methods have already been applied to code clone detection [8], [9], [10], [11], automatic debugging [12], [13], [14], bug detection [15], [16] and automatic summary generation [17], [18], and have achieved remarkable results.

A typical DNN approach builds an end-to-end model without paying enough attention to the structured data representation at the front end. Such models can still achieve good results given a large amount of computing power. However, under the same architecture, imposing prior-knowledge constraints on the front-end structured data representation directly reduces the solution space of the back-end model. The learning difficulty is thus reduced, and both the computational performance and the final result improve. Analyzing the source code directly to obtain features is an ideal way to represent software, since all the functions of the software are described by the source code. Software source code differs from natural language: it consists of reserved words and manually defined identifiers and has a strong structure. The semantics of a program are expressed through the structured combination of these reserved words and identifiers. How to capture and encode source code tokens and the structural information between them is a critical research topic.

Representation learning for code can operate at a variety of granularities, such as byte code, identifiers, abstract syntax trees (ASTs) or control flow graphs (CFGs) [19]. Raychev et al. [20] proposed a method for building source code dependency graphs from human-defined rules and used a conditional random field (CRF) to model the graph. They achieved good results in predicting variable names and variable types in JavaScript. Allamanis et al. [21] added human-defined data flow edges to the AST to build a syntax structure graph and then learned node embedding vectors with gated graph neural networks (GGNNs). Their approach showed better performance in predicting variable names and detecting variable misuse. Alon et al. [22], [23] proposed an AST path-based source code representation method that uses paths between nodes in ASTs as building blocks. They explored path representations using the word2vec method, the CRF method and an attention mechanism, with paths constrained in length and width. The results on several programming languages (JavaScript, Java, Python, C#) showed that the AST path-based representation achieves better performance in predicting variable, method and type names.
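As a rough illustration of a path-based representation, the sketch below extracts leaf-to-leaf paths from a Python AST using the standard `ast` module. Treating `Name` and `Constant` nodes as terminals, the `^` and `_` separators for the upward and downward parts, and the helper names are all assumptions for illustration. This is not the extractor used by Alon et al. or by this paper.

```python
import ast
from itertools import combinations

def leaf_chains(tree):
    """Return (token, [nodes from root to leaf]) for every Name/Constant leaf."""
    chains = []

    def visit(node, prefix):
        prefix = prefix + [node]
        if isinstance(node, ast.Name):
            chains.append((node.id, prefix))
        elif isinstance(node, ast.Constant):
            chains.append((repr(node.value), prefix))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, prefix)

    visit(tree, [])
    return chains

def path_contexts(source):
    """Return (token_a, path, token_b) triples for every pair of terminals."""
    contexts = []
    chains = leaf_chains(ast.parse(source))
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(chains, 2):
        # Count the shared prefix; its last node is the lowest common ancestor.
        shared = 0
        while chain_a[shared] is chain_b[shared]:
            shared += 1
        up = [type(n).__name__ for n in reversed(chain_a[shared:])]
        top = type(chain_a[shared - 1]).__name__
        down = [type(n).__name__ for n in chain_b[shared:]]
        path = "^".join(up + [top]) + "".join("_" + d for d in down)
        contexts.append((tok_a, path, tok_b))
    return contexts

for context in path_contexts("x = y + 1"):
    print(context)
```

Each triple pairs two terminal tokens with the sequence of syntax node types that connects them, which is the building block that path-based models embed.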

Inspired by [23], we propose a new AST path pair-based source code representation method. The main contributions are as follows:

• Improvement to the representation of terminal nodes. The existing terminal node representation method treats identifiers as a whole. Since different software projects follow different identifier naming rules, the learned vectors generalize poorly. Inspired by Alon et al. [24], we split identifiers into sequences of subtokens, which greatly enhances the generalization ability of the embedding vectors. Additionally, because terminal nodes are heterogeneous, we encode the AST syntax type information and represent it together with the identifier information. We discuss how to encode and fuse the subtoken vectors of the identifier with the type information, and find that a Bi-LSTM-based encoding that concatenates identifier and type information performs better than elementwise addition.
• Improvement to the representation of the path sequence. The existing model treats the internal nodes between two terminal nodes as one sequence. However, they actually correspond to two separate source code grammar sequences. We first propose the concept of the short path to describe each terminal node and its control logic, and encode each control logic with the same Bi-LSTM. Then, we combine the two paths into a path pair to represent the source code. Experiments show that the proposed method enhances the ability to express the source code control logic and improves the performance of the model.
• Discussion of how to represent source code files for software defect prediction tasks. Most existing software defect prediction methods identify defects at the level of the whole source file. We discuss building path-based representations either from the AST of the whole source file or from the sub-ASTs of the methods in the source code, and compare the two options experimentally.
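The first two contributions can be sketched in a few lines. The example below assumes that subtokens are obtained by splitting on underscores and camelCase boundaries, and that a short path is the chain of internal nodes from a terminal up to the lowest common ancestor of the two terminals. This excerpt does not give the exact definitions, so both are assumptions, and the node identifiers and type names are hypothetical.

```python
import re

def split_subtokens(identifier):
    """Split an identifier on underscores and camelCase boundaries, lowercased."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", identifier)
    return [part.lower() for part in parts]

def short_path_pair(chain_a, chain_b):
    """Split the path between two terminals at their lowest common ancestor.

    Each chain lists (node_id, node_type) from the root down to the
    terminal's parent. Each returned short path runs upward from that
    parent to the common ancestor.
    """
    shared = 0
    limit = min(len(chain_a), len(chain_b))
    while shared < limit and chain_a[shared][0] == chain_b[shared][0]:
        shared += 1
    if shared == 0:
        raise ValueError("terminals must belong to the same tree")
    short_a = [node_type for _, node_type in reversed(chain_a[shared - 1:])]
    short_b = [node_type for _, node_type in reversed(chain_b[shared - 1:])]
    return short_a, short_b

# Hypothetical chains for the statement `totalPrice = unitPrice + tax_rate`.
chain_total = [(0, "Block"), (1, "Assign")]
chain_unit = [(0, "Block"), (1, "Assign"), (2, "BinaryExpr")]

print(split_subtokens("totalPrice"), split_subtokens("tax_rate"))
print(short_path_pair(chain_total, chain_unit))
```

Keeping the two upward sequences separate, rather than concatenating them into one path, lets the same sequence encoder process each side before the pair is combined.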

Section snippets

Software defect prediction

Software defect prediction is an important research topic in software engineering. It assumes that there is an essential relationship between software defects and the features exhibited by the software, so a prediction model can be built by studying this relationship. Defect prediction can be divided into two categories according to how the training and testing data are selected, which are within-project defect prediction

Proposed model

Computer programs are data-centric. All program code is essentially a series of operations on the data to make the data meet human needs. The programmer first obtains the raw data by some input method, then analyzes the data and human requirements, writes the source code to manipulate the data to obtain the resulting data that meets the demand, and finally outputs it in some form. In the source code, the data are stored in a data structure and manipulated by a programmer-named identifier. The

Experimental setup

In this section, we introduce specific settings of our experiment.

Experimental results and discussion

In this section, we explore how to better encode the terminal node and internal node sequence of the AST path through experiments. First, we determine the source of the AST path.

Conclusion and future work

This paper systematically analyzes the shortcomings of existing path-based representation methods in terms of path extraction, internal node sequence encoding, and terminal node encoding. We then propose a new software source code representation method based on path pairs in the AST. The proposed model achieves the best F1 value in the software defect prediction task, exceeding the current state-of-the-art model by 17.88%. Our model also outperforms code2vec by 3.01% on F1 and 5.45% on

CRediT authorship contribution statement

Ke Shi: Conceptualization, Methodology, Software, Writing - original draft. Yang Lu: Supervision, Writing - review & editing. Jingfei Chang: Investigation, Validation, Visualization. Zhen Wei: Data curation, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program, China (Grant No. 2018YFC0604404, 2016YFC0801804), the National Natural Science Foundation of China (Grant No. 61806067) and the Fundamental Research Funds for the Central Universities (Grant No. PA2019GDPK0079).

References (35)

Fu W. et al. Revisiting unsupervised learning for defect prediction
Wang S. et al. Automatically learning semantic features for defect prediction
Dam H.K. et al. Lessons learned from using a deep tree-based model for software defect prediction in practice
Dam H.K. et al. Automatic feature learning for vulnerability prediction (2017)
Chidamber S.R. et al. A Metrics Suite for Object Oriented Design (1994)
Halstead M.H. Elements of Software Science (Operating and Programming Systems Series) (1978)
McCabe T.J. A complexity measure. IEEE Trans. Softw. Eng. (2006)
Chilowicz M. et al. Syntax tree fingerprinting for source code similarity detection
Yang J. et al. Classification model for code clones based on machine learning. Empir. Softw. Eng. (2015)
Zibran M.F. et al. Towards flexible code clone detection, management, and refactoring in IDE
Sajnani H. et al. SourcererCC: Scaling code clone detection to big-code
Le Goues C. et al. GenProg: A generic method for automatic software repair. IEEE Trans. Softw. Eng. (2012)
Böhme M. et al. Where is the bug and how is it fixed? An experiment with practitioners
Jeffrey D. et al. BugFix: A learning-based tool to assist developers in fixing bugs
Pradel M. et al. DeepBugs: A learning approach to name-based bug detection. Proc. ACM Program. Lang. (2018)
Steidl D. et al. Feature-based detection of bugs in clones
M. Allamanis, H. Peng, C. Sutton, A convolutional attention network for extreme summarization of source code, in:...

