Towards automatically generating block comments for code snippets
Introduction
Code commenting, an integral part of software development [1], is a standard practice in industry [2], [3]. Comments improve software maintainability by helping developers understand the source code [4], [5]. High-quality code comments therefore play an important role in many software activities, such as code review [6], [7], [8] and program comprehension [9], [10].
Owing to the importance of code comments, a number of approaches [11], [12], [13], [14] have been proposed to automatically generate comments by "translating" source code into natural-language comments, benefiting from the effectiveness of deep learning techniques. Two types of comments are used to annotate a method: header comments and block comments [15], [16]. A header comment, located before a method (e.g., the comment at line 3 in Code Example 1), summarizes the functionality of the method, while a block comment describes the functionality of a code snippet inside a method (e.g., the comment at lines 9 and 10 in Code Example 1). The scope of a header comment is the whole method, whereas the scope of a block comment is variable, ranging from a single code line to several consecutive code lines.
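To make the distinction concrete without reproducing Code Example 1, the sketch below (in Python rather than Java, purely for illustration) classifies a comment by its position: a comment immediately preceding a function definition plays the role of a header comment, while one inside the body plays the role of a block comment. This positional rule is a deliberate simplification of the Java notions discussed here.

```python
def classify_comment(lines, i):
    """Classify the comment at index i as 'header' or 'block'.

    A comment immediately preceding a function definition is treated as
    a header comment; any other comment is a block comment. This is a
    simplified analogue of the header/block distinction in the paper.
    """
    assert lines[i].lstrip().startswith("#"), "line i must be a comment"
    # Skip over further comment lines belonging to the same comment.
    j = i
    while j + 1 < len(lines) and lines[j + 1].lstrip().startswith("#"):
        j += 1
    nxt = lines[j + 1].lstrip() if j + 1 < len(lines) else ""
    return "header" if nxt.startswith("def ") else "block"

source = [
    "# Compute the sum of squares of a list.",       # header comment
    "def sum_squares(xs):",
    "    total = 0",
    "    # accumulate the square of every element",  # block comment
    "    for x in xs:",
    "        total += x * x",
    "    return total",
]
print(classify_comment(source, 0))  # header
print(classify_comment(source, 3))  # block
```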
Due to the variability of the scope of a block comment, it is difficult to automatically collect a data set of block comments for training a block comment generation model, and collecting such a data set manually (i.e., by hand-labeling the data) would require a considerable amount of work. As a result, current studies mainly focus on the automatic generation of header comments [11], [12], [13], [14], because header comment-method pairs are easy to collect. However, block comments are as important as header comments in code understanding [17], [18]. In most cases, header comments summarize the functionality of a method at a higher level, while block comments describe the detailed implementation of the code, so the two complement each other [17].
Therefore, it is necessary to build a data set for block comments and to train a model on it to automatically generate block comments. Recently, some studies [11], [12], [13] have adopted deep learning approaches to generate header comments by combining a machine translation model with the structural and semantic information within Java methods. Although deep learning techniques have taken a successful first step toward automatic header comment generation, it is difficult to apply these models directly to the automatic generation of block comments.
Intuitively, the source code covered by a block comment is more fragmented than that covered by a header comment, i.e., a code snippet vs. a complete method. A method is a unit with complete semantics and structure, whereas a code snippet lacks context information once it is extracted from the source code to form a comment-code pair. As a result, it is more difficult to employ a general machine translation algorithm to generate comments on such a "fragmented" data set, because the mapping space between code statements and comment words becomes extremely large.
In this paper, to address the challenge of collecting a data set of block comments, we employ the approach proposed in our previous work [19], which combines heuristic rules and a learning-based method to identify the scope of a block comment. The heuristic rules segment the source code into initial comment-code pairs, which reduces the manual validation effort. Our approach then extracts three dimensions of features, i.e., comment features, code features and comment-code relationship features, to further identify the scope of a block comment. We apply this approach to automatically collect a data set containing more than 123,900 comment-code pairs (after filtering) from 1,032 open-source Java projects, and it achieves an accuracy of 81.15% in detecting the scope of block comments.
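As a rough illustration of the three feature dimensions, the sketch below computes one toy feature from each dimension for a (comment, statement) pair and applies a hand-written rule in place of the learned classifier. The actual features and model in [19] are richer; everything here is an invented stand-in.

```python
import re

def scope_features(comment, stmt, blank_line_between):
    """Toy versions of the three feature dimensions used for scope detection."""
    comment_words = set(re.findall(r"[a-zA-Z]+", comment.lower()))
    code_tokens = set(re.findall(r"[a-zA-Z]+", stmt.lower()))
    overlap = len(comment_words & code_tokens) / max(len(comment_words), 1)
    return {
        "comment_len": len(comment_words),   # comment feature
        "separated": blank_line_between,     # code feature
        "lexical_overlap": overlap,          # comment-code relationship feature
    }

def in_scope(features):
    # Hand-written stand-in for the learned classifier: keep a statement
    # in the comment's scope if it is not separated by a blank line and
    # shares some vocabulary with the comment.
    return (not features["separated"]) and features["lexical_overlap"] > 0

f = scope_features("# sort the user list", "users.sort()", blank_line_between=False)
print(in_scope(f))  # True
```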
To address the challenge of automatically generating block comments, we propose a composite learning model, RL-BlockCom, which combines the actor-critic algorithm of reinforcement learning with the encoder-decoder algorithm. The combination of the two algorithms resolves the loss-evaluation mismatch issue. More importantly, the reinforcement learning algorithm can learn effective strategies over an enormous mapping space to choose profit-maximizing actions, which makes it well suited to learning the mappings between comment words and fragmented code. Specifically, RL-BlockCom uses a statement-based traversal of the abstract syntax tree of a code snippet to generate a token sequence, and we then train RL-BlockCom on the block comment data set. In the evaluation, RL-BlockCom achieves a BLEU-4 score of 24.28, outperforming the baseline and state-of-the-art methods. To facilitate research and application, RL-BlockCom is available at https://github.com/huangshh/RLComGen.
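The statement-based traversal can be sketched with Python's `ast` module (the paper operates on Java ASTs; the node names below are Python's, not the paper's vocabulary): each statement subtree is linearized in pre-order and an explicit boundary token is inserted so statement structure survives the flattening.

```python
import ast

def statement_tokens(node):
    """Linearize one subtree into a token sequence (pre-order)."""
    tokens = [type(node).__name__]
    if isinstance(node, ast.Name):
        tokens.append(node.id)          # keep identifier names
    elif isinstance(node, ast.Constant):
        tokens.append(repr(node.value))  # keep literal values
    for child in ast.iter_child_nodes(node):
        tokens.extend(statement_tokens(child))
    return tokens

def snippet_to_sequence(code):
    """Traverse the AST statement by statement, concatenating the
    per-statement token sequences with a separator token."""
    seq = []
    for stmt in ast.parse(code).body:
        seq.extend(statement_tokens(stmt))
        seq.append("<EOS>")  # statement boundary marker
    return seq

seq = snippet_to_sequence("total = 0\ntotal += x * x")
print(seq)
```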
The contributions of our work are as follows:
- We propose a statement-based AST traversal algorithm that generates the code token sequence while preserving the semantic, syntactic and structural information in the code snippet.
- We propose a composite approach named RL-BlockCom that generates block comments by combining the actor-critic algorithm of reinforcement learning with the encoder-decoder algorithm.
- We implement RL-BlockCom and conduct careful experiments on the collected block comment data set to evaluate its performance. The experimental results show that RL-BlockCom outperforms the baselines and state-of-the-art approaches in comment generation.
The rest of this paper is organized as follows. Section 2 shows the overall framework of our approach. Section 3 presents the data collection and the main method of learning to identify the scope of block comments. The detailed steps of learning to generate block comments are discussed in Section 4. Section 5 presents the experiment setting, and Section 6 shows the experiment results. More discussions are presented in Section 7, while Section 8 discusses related work. Section 9 outlines the threats to validity, and Section 10 summarizes our approach and outlines directions for future work.
Overall framework
Fig. 1 shows the overall framework of the proposed approach. The framework includes two phases: the data collection phase and the comment generation phase. In the data collection phase, our goal is to build a prediction model that identifies the scope of a comment and automatically collects comment-code pairs from the source code. In the comment generation phase, our generative model extracts features from the code token sequence and automatically generates a block comment for a code snippet.
Data collection
We employ the approach proposed in our previous work [19] to collect the data set. The collection consists of three phases: 1) we first use heuristic rules to roughly identify the scope of each block comment; 2) we invite several participants to manually validate the accurate scope of each block comment via an online portal; 3) we propose a learning-based approach that further determines the scope of each block comment based on the manually validated data set.
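The heuristic segmentation of phase 1 can be sketched as follows (a much-simplified stand-in for the actual rules of [19]): pair each comment with the contiguous code lines below it, stopping at a blank line, another comment, or the end of the file; the learned model then refines these initial scopes.

```python
def initial_pairs(lines):
    """Roughly segment source lines into (comment, code lines) pairs."""
    pairs, i = [], 0
    while i < len(lines):
        if lines[i].lstrip().startswith("#"):
            comment, i = lines[i].strip(), i + 1
            code = []
            # Collect code lines until a blank line, another comment, or EOF.
            while (i < len(lines) and lines[i].strip()
                   and not lines[i].lstrip().startswith("#")):
                code.append(lines[i].rstrip())
                i += 1
            pairs.append((comment, code))
        else:
            i += 1
    return pairs

src = [
    "# open the file and read all lines",
    "f = open(path)",
    "data = f.readlines()",
    "",
    "# close the handle",
    "f.close()",
]
for comment, code in initial_pairs(src):
    print(comment, "->", code)
```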
Basic concepts
We first introduce the basic concepts of neural machine translation in this section, then describe the encoder-decoder model, and finally describe the combination of reinforcement learning and neural machine translation used in our study.
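As a structural illustration of the encoder-decoder interface only (not a neural model: the "encoder" below is a bag of tokens, the "decoder" a hand-written scoring table with invented numbers, and a real decoder would also condition on previously emitted tokens), greedy decoding generates the comment one token at a time from the encoded source:

```python
def encode(src_tokens):
    """Toy 'encoder': summarize the source as a bag of tokens.
    A real encoder (e.g., an RNN) would produce dense context vectors."""
    return set(src_tokens)

# Toy 'decoder' scores: how strongly each source token votes for each
# comment word. These numbers are invented for illustration only.
VOTES = {
    "sort":  {"sorts": 3.0},
    "users": {"users": 1.0, "the": 2.0},
}
VOCAB = ["sorts", "the", "users", "<eos>"]

def decode(context, max_len=4):
    """Greedy decoding: emit the highest-scoring word at each step."""
    out = []
    for _ in range(max_len):
        scores = {w: sum(VOTES.get(t, {}).get(w, 0.0) for t in context)
                  for w in VOCAB}
        for w in out:          # penalize repeats so the toy terminates
            scores[w] = -1.0
        best = max(VOCAB, key=lambda w: scores[w])
        if best == "<eos>" or scores[best] <= 0:
            break
        out.append(best)
    return out

print(decode(encode(["users", "sort"])))  # ['sorts', 'the', 'users']
```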
Data preparation and analysis
The goal of this paper is to generate comments for code snippets; hence we need to collect block comments from the source code of open-source projects. To our knowledge, it is difficult to automatically judge the quality of the block comments in a project, and manual verification would be prohibitively laborious. Therefore, we select open-source projects that are as popular as possible. Most popular Java projects are well-known projects with a large user base and
Results
We train and optimize the comment generation model RL-BlockCom on the data set identified by the comment scope detection algorithm, and compare its performance with TL-CodeSum [14], a state-of-the-art code comment generation approach. With the help of API call sequences in Java methods, TL-CodeSum transfers API knowledge into the comment generation model and generates header comments for methods. Specifically, TL-CodeSum uses the approach proposed by Gu et al.
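The performance comparison relies on the BLEU-4 metric reported in the Introduction. A minimal smoothed, sentence-level implementation (the paper's evaluation likely uses a corpus-level script, so this is illustrative only) can be computed as:

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with add-one smoothing on zero n-gram counts."""
    log_precisions = []
    for n in range(1, 5):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        match = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if match == 0:  # smooth zero counts so the log stays defined
            match, total = 1, total + 1
        log_precisions.append(math.log(match / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(0.25 * lp for lp in log_precisions))

cand = "sorts the user list in place".split()
ref = "sorts the user list in place".split()
print(round(bleu4(cand, ref), 2))  # 1.0 for an exact match
```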
Comparison to header comment generation baselines
In theory, all source code inside a method can be seen as a code snippet, i.e., the entire method can be regarded as a code snippet. We therefore investigate whether RL-BlockCom can still perform well on a data set of entire methods. To this end, we compare RL-BlockCom with three state-of-the-art baselines: Deepcom [13], Code2Seq [29] and TL-CodeSum [14].
TL-CodeSum has been introduced in the previous section, so we briefly introduce the other two models. Deepcom [13] is often
Related work
In recent years, with the development of information retrieval technology [35] and deep learning [36], automatic comment generation has drawn a lot of attention. There are three kinds of approaches to automatically generating code comments: template-filling-based, information-retrieval-based and deep-learning-based.
Template-filling-based methods [37], [38], [39] first extract keywords from the source code via static analysis techniques, then define a number of comment templates via the
Threats to validity
In this section we focus on the threats that could affect the results of our case studies.
Threats to internal validity relate to the scale of the data set used for training the comment scope detection model. Since we need to extract existing comment scopes as the ground truth from open-source projects, the comment scopes must be validated manually. However, manual validation of comment scopes is tedious and time-consuming, so it is difficult to validate the comment scopes for all
Conclusion
Code comments play an important role in program comprehension, and automatic comment generation is a goal pursued by software developers. This paper proposes a novel method, RL-BlockCom, for automatically generating block comments for code snippets. To build a data set of block comments, we design an approach that combines heuristic rules and machine learning algorithms to identify the scope of a block comment. Then, we propose a composite approach named RL-BlockCom that generates block comments.
Funding agency
1. Sun Yat-sen University.
2. The Hong Kong Polytechnic University.
CRediT authorship contribution statement
Yuan Huang: Methodology, Validation, Writing - original draft. Shaohao Huang: Writing - review & editing, Software, Validation. Huanchao Chen: Data curation, Software, Resources. Xiangping Chen: Project administration. Zibin Zheng: Supervision, Investigation. Xiapu Luo: Writing - review & editing. Nan Jia: Methodology, Software. Xinyu Hu: Data curation, Resources. Xiaocong Zhou: Conceptualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research is supported by the Key-Area Research and Development Program of Guangdong Province (2020B010164002), National Natural Science Foundation of China (61902441, 61672545, 61722214), Hong Kong RGC Projects (No. 152223/17E, 152239/18E), Guangdong Basic and Applied Basic Research Foundation (2020A1515010973), China Postdoctoral Science Foundation (2018M640855), Fundamental Research Funds for the Central Universities (20wkpy06, 20lgpy129). Xiangping Chen is the corresponding author.
References (44)
- et al., Automatically detecting the scopes of source code comments, J. Syst. Softw., 2019.
- et al., The Java language specification (Java SE 8 edition), 2014.
- et al., AutoComment: Mining question and answer sites for automatic comment generation, in: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013.
- et al., Identifying self-admitted technical debt through code comment analysis with a contextualized vocabulary, Inform. Softw. Technol., 2020.
- Software Engineering Handbook, 2002.
- et al., Semantic clustering: Identifying topics in source code, Inform. Softw. Technol., 2007.
- et al., Automatic generation of release notes, in: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014.
- et al., TODO or to bug: Exploring how task annotations play a role in the work practices of software developers, in: Proceedings of the 30th International Conference on Software Engineering, 2008.
- et al., Mining version control system for automatically generating commit comment, in: 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017.
- et al., How do software engineers understand code changes? An exploratory study in industry, in: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, 2012.
- Quality analysis of source code comments, in: 2013 21st International Conference on Program Comprehension (ICPC).
- Summarizing source code using a neural attention model, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- A convolutional attention network for extreme summarization of source code, in: Proceedings of the 33rd International Conference on Machine Learning, PMLR.
- Deep code comment generation, in: Proceedings of the 26th IEEE International Conference on Program Comprehension.
- Summarizing source code with transferred API knowledge, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence.
- Do code and comments co-evolve? On the relation between source code and comment changes, in: Proceedings of the 14th Working Conference on Reverse Engineering.
- Does your code need comment?, Software.
- Classifying code comments in Java software systems, Empir. Softw. Eng.
- Learning code context information to predict comment locations, IEEE Trans. Reliab.
- Neural Machine Translation.
- A primer on neural network models for natural language processing, J. Artif. Intell. Res.
- Recurrent neural network based language model, in: Eleventh Annual Conference of the International Speech Communication Association.