Learning and Evaluating Contextual Embedding of Source Code
arXiv - CS - Software Engineering. Pub Date: 2019-12-21, DOI: arxiv-2001.00059
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

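As an illustration of the fine-tuning step described in the abstract, the sketch below fine-tunes a BERT-style encoder on a toy binary code-classification example (e.g., flagging a buggy snippet). This is only an assumption-laden sketch: it uses the Hugging Face `transformers` API with a generic `bert-base-uncased` checkpoint as a stand-in, whereas CuBERT itself is released with its own code-specific vocabulary and TensorFlow tooling, and the labels here are hypothetical.

```python
# Minimal sketch: fine-tuning a BERT-style encoder for a binary
# code-classification task. NOT the authors' CuBERT pipeline; the
# checkpoint name, labels, and hyperparameters are illustrative stand-ins.
import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # stand-in vocabulary
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled examples: Python snippets paired with 0/1 labels (1 = buggy).
snippets = [
    "def add(a, b):\n    return a - b",   # hypothetical buggy example
    "def add(a, b):\n    return a + b",   # hypothetical correct example
]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In the paper's setting, the same fine-tuning recipe is applied to each of the five classification tasks and the program-repair task, starting from the pre-trained CuBERT checkpoint rather than a natural-language BERT.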
Last updated: 2020-08-19