Contextualized Code Representation Learning for Commit Message Generation
arXiv - CS - Software Engineering. Pub Date: 2020-07-14, DOI: arxiv-2007.06934
Lun Yiu Nie, Cuiyun Gao, Zhicong Zhong, Wai Lam, Yang Liu and Zenglin Xu

Automatic generation of high-quality commit messages for code commits can substantially facilitate developers' work and coordination. However, the semantic gap between source code and natural language poses a major challenge for the task. Several approaches have been proposed to alleviate the challenge, but none explicitly involves code contextual information during commit message generation. Specifically, existing research adopts static embeddings for code tokens, which map a token to the same vector regardless of its context. In this paper, we propose a novel Contextualized code representation learning method for commit message Generation (CoreGen). CoreGen first learns contextualized code representations that exploit the contextual information behind code commit sequences. The learned representations of code commits, built upon a Transformer, are then transferred to downstream commit message generation. Experiments on the benchmark dataset demonstrate the superiority of our model over baseline models, with an improvement of 28.18% in BLEU-4 score. Furthermore, we highlight future opportunities in training contextualized code representations on larger code corpora as a solution to low-resource settings, and in adapting the pretrained code representations to other downstream code-to-text generation tasks.
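To make the contrast the abstract draws concrete, the sketch below compares a static token embedding (one vector per token id, regardless of context) with a contextualized representation produced by a small Transformer encoder over a code commit sequence. This is a minimal PyTorch illustration under assumed settings; the vocabulary size, dimensions, and class names are hypothetical and do not come from the CoreGen paper.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters only; not taken from the CoreGen paper.
VOCAB_SIZE, D_MODEL, N_HEAD, N_LAYERS, MAX_LEN = 1000, 128, 4, 2, 64

class ContextualCodeEncoder(nn.Module):
    """Maps a code-commit token sequence to contextualized vectors:
    the same token id can receive a different representation
    depending on its surrounding tokens and position."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, D_MODEL)   # static lookup table
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)      # learned positions
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, N_LAYERS)

    def forward(self, ids):                            # ids: (batch, seq)
        positions = torch.arange(ids.size(1), device=ids.device)
        h = self.tok(ids) + self.pos(positions)        # static embeddings
        return self.encoder(h)                         # contextualized output

enc = ContextualCodeEncoder().eval()                   # eval() disables dropout
commit = torch.randint(0, VOCAB_SIZE, (1, 8))
commit[0, 2] = commit[0, 5] = 7                        # same token id, two contexts
out = enc(commit)

# Static embedding: identical vectors for identical token ids.
print(torch.allclose(enc.tok(commit)[0, 2], enc.tok(commit)[0, 5]))  # True
# Contextualized: the vectors differ, reflecting the surrounding tokens.
print(torch.allclose(out[0, 2], out[0, 5]))                          # False
```

In CoreGen's framing, representations like these, learned from code commit sequences, are transferred to the downstream commit message generation task; the paper's actual pretraining objective and architecture details are in the full text.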

Updated: 2020-07-15