Context-aware Retrieval-based Deep Commit Message Generation
ACM Transactions on Software Engineering and Methodology (IF 6.6), Pub Date: 2021-07-23, DOI: 10.1145/3464689
Haoye Wang, Xin Xia, David Lo, Qiang He, Xinyu Wang, John Grundy

Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor-quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies apply neural machine translation algorithms to translate git diffs into commit messages and have achieved promising results. However, these learning-based methods tend to generate high-frequency words while ignoring low-frequency ones. In addition, they suffer from the exposure bias problem, which creates a gap between the training and testing phases. In this article, we propose CoRec to address these two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects either the previous output of the decoder or the embedding vector of a ground-truth word as context, so that the model gradually becomes aware of its previous alignment choices. Given a diff at test time, the trained model is reused to retrieve the most similar diff from the training set. Finally, the retrieved diff is used to guide the probability distribution over the generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and on a large-scale dataset crawled from 10K popular Java repositories on GitHub. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.
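The abstract describes two mechanisms only at a high level. The sketch below is a minimal, hypothetical illustration (not the authors' released code; all class names, hyperparameters, and the bag-of-words retrieval prior are assumptions) of how a decoder can randomly mix ground-truth embeddings with its own previous predictions during training to reduce exposure bias, and how the model's word distribution can be blended with one derived from the commit message of the most similar retrieved diff.

import random
import torch
import torch.nn as nn

class ContextAwareDecoder(nn.Module):
    """Toy GRU decoder with scheduled sampling; illustrative only."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRUCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, target_ids, hidden, teacher_forcing_ratio=0.5):
        # target_ids: (batch, seq_len) ground-truth message tokens starting with <sos>
        # hidden:     (batch, hid_dim) initial state, e.g. from a diff encoder
        logits_per_step = []
        prev_input = target_ids[:, 0]  # <sos> tokens
        for t in range(1, target_ids.size(1)):
            hidden = self.rnn(self.embed(prev_input), hidden)
            logits = self.out(hidden)
            logits_per_step.append(logits)
            # Scheduled sampling: randomly feed back the ground-truth word
            # or the model's own previous prediction as the next context.
            use_ground_truth = random.random() < teacher_forcing_ratio
            prev_input = target_ids[:, t] if use_ground_truth else logits.argmax(dim=-1)
        return torch.stack(logits_per_step, dim=1), hidden

def blend_with_retrieval(model_probs, retrieved_msg_ids, vocab_size, lam=0.3):
    """Shift probability mass toward words appearing in the commit message of the
    most similar retrieved diff (a simple bag-of-words prior; illustrative only)."""
    prior = torch.zeros(vocab_size)
    prior[retrieved_msg_ids] = 1.0
    prior = prior / prior.sum()
    return (1.0 - lam) * model_probs + lam * prior

In this sketch the interpolation weight lam and the teacher-forcing ratio are placeholders; the paper itself should be consulted for how CoRec actually weights the retrieved diff and schedules the sampling.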
