Towards automatically generating block comments for code snippets

https://doi.org/10.1016/j.infsof.2020.106373

Abstract

Code commenting is a common programming practice of practical importance that helps developers review and comprehend source code. There are two main types of code comments for a method: header comments, which are located before a method and summarize its functionality, and block comments, which describe the functionality of code snippets within a method. Inspired by the effectiveness of deep learning techniques in the NLP field, many studies use machine translation models to automatically generate comments for source code. Because a data set of block comments is difficult to collect, current studies focus more on the automatic generation of header comments than on that of block comments. However, block comments are important for program comprehension because they explain the code snippets within a method. To fill this gap, in our previous study we proposed an approach that combines heuristic rules and a learning-based method to collect a large number of comment-code pairs from 1,032 open-source projects. In this paper, we propose a reinforcement learning-based method, RL-BlockCom, to automatically generate block comments for code snippets based on the collected comment-code pairs. Specifically, we utilize the abstract syntax tree (AST) of a code snippet to generate a token sequence with a statement-based traversal. Then we propose a composite learning model, which combines the actor-critic algorithm of reinforcement learning with the encoder-decoder architecture, to generate block comments. On the data set of comment-code pairs, the BLEU-4 score of our method is 24.28, outperforming the baselines and the state-of-the-art in comment generation.

Introduction

Code commenting, an integral part of software development [1], has been a standard practice in industry [2], [3]. Comments improve software maintainability by helping developers understand the source code [4], [5]. High-quality code comments therefore play an important role in most software activities, such as code review [6], [7], [8] and program comprehension [9], [10].

Owing to the importance of code comments, a number of approaches [11], [12], [13], [14] have been proposed to automatically generate comments by "translating" programming language source code into natural language comments, benefitting from the effectiveness of deep learning techniques. There are two types of comments used for annotating a method, i.e., header comments and block comments [15], [16]. A header comment is located before a method and summarizes the method's functionality (e.g., the comment at line 3 in Code Example 1), while a block comment describes the functionality of a code snippet within a method (e.g., the comment at lines 9 and 10 in Code Example 1). The scope of a header comment is the whole method, while the scope of a block comment is variable, i.e., from one code line to several consecutive code lines.
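For illustration, the following minimal Java sketch (our own example with an invented class and values, not the paper's Code Example 1) shows both kinds of comments and their different scopes:

    import java.util.List;

    public class OrderService {

        /**
         * Header comment: placed before the method, it summarizes the whole
         * method, i.e., computing the total price of an order including tax.
         */
        public double computeTotal(List<Double> prices, double taxRate) {
            double subtotal = 0.0;
            for (double price : prices) {
                subtotal += price;
            }

            // Block comment: placed inside the method, its scope is only the
            // following snippet, which applies the tax and rounds the result.
            double total = subtotal * (1 + taxRate);
            return Math.round(total * 100) / 100.0;
        }
    }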

Due to the variability of the scope of a block comment, it is difficult to automatically collect a data set of block comments for training a block comment generation model. Meanwhile, collecting such a data set manually (i.e., by manually labeling data) would involve a considerable amount of work. As a result, current studies mainly focus on the automatic generation of header comments [11], [12], [13], [14], because pairs of header comments and methods are easy to collect. However, block comments are as important as header comments in code understanding [17], [18]. In most cases, header comments summarize the functionality of a method at a higher level, while block comments describe the detailed implementation of the code, and the two complement each other [17].

Therefore, it is necessary to build a data set of block comments and to train a model on it to automatically generate block comments. Recently, some studies [11], [12], [13] have adopted deep learning approaches to generate header comments by combining machine translation models with the structural and semantic information within Java methods. Although deep learning techniques have succeeded as a first step toward automatic header comment generation, it is difficult to directly apply these models to the automatic generation of block comments.

Intuitively, the source code covered by block comments is more fragmented than that covered by header comments, i.e., a code snippet vs. a complete method. A method is a unit with complete semantics and structure, while a code snippet lacks context information when we extract it from the source code to form a comment-code pair. As a result, it is more difficult to employ a general machine translation algorithm to generate comments on such a "fragmented" data set, because the mapping space between code statements and comment words becomes very large.

In this paper, to address the challenge of collecting a data set of block comments, we employ the approach proposed in our previous work [19], which combines heuristic rules and a learning-based method to identify the scope of a block comment. The heuristic rules segment the source code into initial comment-code pairs, which reduces the manual validation effort. Our approach then extracts three dimensions of features, including comment features, code features and comment-code relationship features, to further identify the scope of a block comment. After that, we apply this approach to automatically collect a data set that contains more than 123,900 comment-code pairs (after filtering) from 1,032 open-source Java projects. Our approach achieves an accuracy of 81.15% in detecting the scope of block comments.

To address the challenge of automatically generating block comments, we propose a composite learning model, RL-BlockCom, which combines the actor-critic algorithm of reinforcement learning with the encoder-decoder architecture to generate block comments. The combination of the two algorithms resolves the loss-evaluation mismatch issue. More importantly, the reinforcement learning algorithm can learn effective strategies from a very large mapping space to decide the profit-maximizing actions, which makes it well suited to learning the mappings between comment words and fragmented code. Specifically, RL-BlockCom utilizes the abstract syntax tree of a code snippet to generate a token sequence with a statement-based traversal, and we then train RL-BlockCom on the block comment data set. In the evaluation, the BLEU-4 score of RL-BlockCom is 24.28, which outperforms the baseline and state-of-the-art methods. To facilitate research and application, RL-BlockCom is available at https://github.com/huangshh/RLComGen.
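As a rough illustration of what a statement-based traversal of the AST might look like, the sketch below linearizes a Java snippet statement by statement, emitting a structural token (the statement's node type) followed by the identifiers it contains. The use of the JavaParser library, the token vocabulary and the traversal order are assumptions made only for this sketch; the actual algorithm used by RL-BlockCom is described in Section 4.

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.Node;
    import com.github.javaparser.ast.expr.SimpleName;
    import com.github.javaparser.ast.stmt.BlockStmt;
    import com.github.javaparser.ast.stmt.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class StatementTraversal {

        /** Linearize a code snippet statement by statement: for each statement,
         *  emit its node type followed by the identifiers it contains. */
        public static List<String> tokenize(String snippet) {
            BlockStmt block = StaticJavaParser.parseBlock("{" + snippet + "}");
            List<String> tokens = new ArrayList<>();
            for (Statement stmt : block.getStatements()) {
                tokens.add(stmt.getClass().getSimpleName()); // structural token, e.g. IfStmt
                collectIdentifiers(stmt, tokens);            // lexical tokens, e.g. variable names
            }
            return tokens;
        }

        private static void collectIdentifiers(Node node, List<String> tokens) {
            if (node instanceof SimpleName) {
                tokens.add(((SimpleName) node).getIdentifier());
            }
            for (Node child : node.getChildNodes()) {
                collectIdentifiers(child, tokens);
            }
        }

        public static void main(String[] args) {
            System.out.println(tokenize(
                "int total = price * count; if (total > limit) { total = limit; }"));
        }
    }

On the example snippet, the sketch prints roughly the sequence [ExpressionStmt, total, price, count, IfStmt, total, limit, total, limit], i.e., the statement boundaries are preserved while the identifiers supply the lexical information.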

The contributions of our work are as follows:

  • We propose a statement-based AST traversal algorithm to generate the code token sequence while preserving the semantic, syntactic and structural information in the code snippet.

  • We propose a composite approach named RL-BlockCom that generates block comments by combining the actor-critic algorithm of reinforcement learning and the encoder-decoder algorithm.

  • We implement RL-BlockCom and conduct careful experiments on the collected block comment data set to evaluate its performance. The experimental results show that RL-BlockCom outperforms the baselines and the state-of-the-art in comment generation.

The rest of this paper is organized as follows. Section 2 shows the overall framework of our approach. Section 3 presents the data collection and the main method of learning to identify the scope of block comments. The detailed steps of learning to generate block comments are discussed in Section 4. Section 5 presents the experiment setting, and Section 6 shows the experiment results. More discussions are presented in Section 7, while Section 8 discusses related work. Section 9 outlines the threats to validity and Section 10 summarizes our approach and outlines directions of future work.

Section snippets

Overall framework

Fig. 1 shows the overall framework of the proposed approach. The framework includes two phases: the data collection phase and the automatic comment generation phase. In the data collection phase, our goal is to build a prediction model to identify the scope of a comment and automatically collect the comment-code pairs from the source code. In the comment generation phase, our generative model extracts features from the code token sequence and automatically generates a block comment for a code snippet.

Data collection

We employ the approach proposed in our previous work [19] to collect the data set. The data set collection consists of three phases: 1) we first use heuristic rules to roughly identify the scope of each block comment; 2) we invite several participants to manually validate the accurate scope of each block comment via an online portal; 3) we propose a learning-based approach to further determine the scope of each block comment based on the manually validated data set.
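The concrete heuristic rules of [19] are not repeated here. Purely to illustrate the flavor of the first phase, the hypothetical sketch below assigns to a block comment the statements that follow it until a blank line or another comment is reached; this rule, like the method name roughScope, is our own simplification, and it produces exactly the kind of rough guess that the manual validation and the learning-based phase are meant to refine.

    import java.util.ArrayList;
    import java.util.List;

    public class CommentScopeHeuristic {

        /** Given the source lines of a method body and the index of a block
         *  comment, return the line indices tentatively assigned to its scope:
         *  the statements following the comment until a blank line, another
         *  comment, or the end of the method body is reached. */
        public static List<Integer> roughScope(List<String> lines, int commentLine) {
            List<Integer> scope = new ArrayList<>();
            for (int i = commentLine + 1; i < lines.size(); i++) {
                String line = lines.get(i).trim();
                if (line.isEmpty()) {
                    break; // a blank line ends the tentative scope
                }
                if (line.startsWith("//") || line.startsWith("/*")) {
                    break; // another comment starts a new scope
                }
                scope.add(i);
            }
            return scope;
        }
    }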

Basic concepts

We first introduce the basic concepts of neural machine translation in this section, then describe the encoder-decoder model, and finally describe the combination of reinforcement learning and neural machine translation used in our study.
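As background for this section: an encoder-decoder model is usually trained with a word-level cross-entropy loss, while the generated comment is evaluated with a sequence-level metric such as BLEU, which is the loss-evaluation mismatch mentioned above. In the standard reinforcement-learning formulation sketched below (the exact reward and critic design of RL-BlockCom may differ), the decoder acts as an agent that is rewarded for the complete generated sequence:

    % Cross-entropy objective of the encoder-decoder for a comment y = (y_1, ..., y_T)
    % given the code token sequence x:
    \mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)

    % Reinforcement-learning objective: maximize the expected sequence-level reward r
    % (e.g., a BLEU score) of a sampled comment \hat{y}, with the policy gradient
    \mathcal{J}(\theta) = \mathbb{E}_{\hat{y} \sim p_\theta}\big[\, r(\hat{y}) \,\big],
    \qquad
    \nabla_\theta \mathcal{J}(\theta) \approx \sum_{t} \nabla_\theta \log p_\theta(\hat{y}_t \mid \hat{y}_{<t}, x)\,\big(r(\hat{y}) - b_t\big)

    % In an actor-critic setting, the baseline b_t is predicted by the critic network.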

Data preparation and analysis

The goal of this paper is to generate comments for code snippets; hence we need to collect block comments from the source code of open-source projects. To our knowledge, it is difficult to automatically judge the quality of the block comments in a project (and manual verification would be a heavy task), so what we can do is pick open-source projects that are as popular as possible. Most of the popular Java projects are well-known projects and have a large user base and

Results

We train and optimize the comment generation model RL-BlockCom on the data set identified by the comment scope detection algorithm, and compare its performance with TL-CodeSum [14], a state-of-the-art code comment generation approach. With the help of API call sequences in Java methods, TL-CodeSum transfers API knowledge to the comment generation model and generates header comments for methods. Specifically, TL-CodeSum uses the approach proposed by Gu et al.
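For reference, BLEU-4, the metric reported throughout the evaluation, is the standard corpus-level score that combines the modified n-gram precisions p_n for n = 1, ..., 4 with a brevity penalty BP:

    \mathrm{BLEU\text{-}4} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \Big),
    \qquad
    \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}

    % c: total length of the generated comments, r: effective reference length.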

Comparison to header comment generation baselines

In theory, all source code inside a method can be seen as a code snippet, i.e., the entire method can be regarded as a code snippet. We therefore investigate whether RL-BlockCom can still perform well on a data set of entire methods. To this end, we compare RL-BlockCom with three state-of-the-art baselines: DeepCom [13], Code2Seq [29] and TL-CodeSum [14].

TL-CodeSum was introduced in the previous section, so we briefly introduce the other two models. DeepCom [13] is often

Related work

In recent years, with the development of information retrieval technology [35] and deep learning [36], automatic comment generation has drawn a lot of attention. There are three kinds of methods to automatically generate code comments: template filling based, information retrieval based and deep learning based.

The template filling based methods [37], [38], [39] first extract keywords from the source code via static analysis techniques, and then define a number of comment templates via the

Threats to validity

In this section we focus on the threats that could affect the results of our case studies.

Threats to internal validity relate to the scale of the data set used for training the comment scope detection model. Since we need to extract the existing comment scopes as the ground truth from the open-source projects, we have to manually validate the comment scopes. However, manual validation of comment scopes is tedious and time-consuming, and it is difficult to validate the comment scopes for all

Conclusion

Code comments play an important role in program comprehension, and automatic comment generation is a goal pursued by software developers. This paper proposes a novel method, RL-BlockCom, for automatically generating block comments for code snippets. To build a data set of block comments, we design an approach that combines heuristic rules and machine learning algorithms to identify the scope of each block comment. Then, we propose a composite approach named RL-BlockCom that generates block comments by combining the actor-critic algorithm of reinforcement learning with the encoder-decoder algorithm.

Funding agency

  • Sun Yat-sen University.

  • The Hong Kong Polytechnic University.

CRediT authorship contribution statement

Yuan Huang: Methodology, Validation, Writing - original draft. Shaohao Huang: Writing - review & editing, Software, Validation. Huanchao Chen: Data curation, Software, Resources. Xiangping Chen: Project administration. Zibin Zheng: Supervision, Investigation. Xiapu Luo: Writing - review & editing. Nan Jia: Methodology, Software. Xinyu Hu: Data curation, Resources. Xiaocong Zhou: Conceptualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported by the Key-Area Research and Development Program of Guangdong Province (2020B010164002), National Natural Science Foundation of China (61902441, 61672545, 61722214), Hong Kong RGC Projects (No. 152223/17E, 152239/18E), Guangdong Basic and Applied Basic Research Foundation (2020A1515010973), China Postdoctoral Science Foundation (2018M640855), Fundamental Research Funds for the Central Universities (20wkpy06, 20lgpy129). Xiangping Chen is the corresponding author.

References (44)

  • H. Chen et al., Automatically detecting the scopes of source code comments, J. Syst. Softw. (2019).
  • J. Gosling et al., The Java Language Specification (Java SE 8 Edition) (2014).
  • E. Wong et al., AutoComment: Mining question and answer sites for automatic comment generation, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2013).
  • M.A. de Freitas Farias et al., Identifying self-admitted technical debt through code comment analysis with a contextualized vocabulary, Inform. Softw. Technol. (2020).
  • J. Keyes, Software Engineering Handbook (2002).
  • A. Kuhn et al., Semantic clustering: Identifying topics in source code, Inform. Softw. Technol. (2007).
  • L. Moreno et al., Automatic generation of release notes, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (2014).
  • M.-A. Storey et al., TODO or to bug: Exploring how task annotations play a role in the work practices of software developers, Proceedings of the 30th International Conference on Software Engineering (2008).
  • Y. Huang et al., Mining version control system for automatically generating commit comment, 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2017).
  • Y. Tao et al., How do software engineers understand code changes? An exploratory study in industry, Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (2012).
  • D. Steidl et al., Quality analysis of source code comments, 2013 21st International Conference on Program Comprehension (ICPC) (2013).
  • S. Iyer et al., Summarizing source code using a neural attention model, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016).
  • M. Allamanis et al., A convolutional attention network for extreme summarization of source code, Proceedings of the 33rd International Conference on Machine Learning, PMLR (2016).
  • X. Hu et al., Deep code comment generation, Proceedings of the 26th IEEE International Conference on Program Comprehension (2018).
  • X. Hu et al., Summarizing source code with transferred API knowledge, Proceedings of the 27th International Joint Conference on Artificial Intelligence (2018).
  • B. Fluri et al., Do code and comments co-evolve? On the relation between source code and comment changes, Proceedings of the 14th Working Conference on Reverse Engineering (2007).
  • Y. Huang et al., Does your code need comment?, Softw. Pract. Exper. (2020).
  • L. Pascarella et al., Classifying code comments in Java software systems, Empir. Softw. Eng. (2019).
  • Y. Huang et al., Learning code context information to predict comment locations, IEEE Trans. Reliab. (2020).
  • P. Koehn, Neural Machine Translation (2020).
  • Y. Goldberg, A primer on neural network models for natural language processing, J. Artif. Intell. Res. (2016).
  • T. Mikolov et al., Recurrent neural network based language model, Eleventh Annual Conference of the International Speech Communication Association (2010).