Learning and Evaluating Contextual Embedding of Source Code
arXiv - CS - Software Engineering. Pub Date: 2019-12-21, DOI: arxiv-2001.00059
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

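As an illustration of the fine-tuning step described in the abstract, the sketch below fine-tunes a BERT-style encoder on a toy binary code-classification example (e.g., flagging a buggy snippet). This is only an assumption-laden sketch: it uses the Hugging Face `transformers` API with a generic `bert-base-uncased` checkpoint as a stand-in, whereas CuBERT itself is released with its own code-specific vocabulary and TensorFlow tooling, and the labels here are hypothetical.

```python
# Minimal sketch: fine-tuning a BERT-style encoder for a binary
# code-classification task. NOT the authors' CuBERT pipeline; the
# checkpoint name, labels, and hyperparameters are illustrative stand-ins.
import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # stand-in vocabulary
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled examples: Python snippets paired with 0/1 labels (1 = buggy).
snippets = [
    "def add(a, b):\n    return a - b",   # hypothetical buggy example
    "def add(a, b):\n    return a + b",   # hypothetical correct example
]
labels = torch.tensor([1, 0])

batch = tokenizer(snippets, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In the paper's setting, the same fine-tuning recipe is applied to each of the five classification tasks and the program-repair task, starting from the pre-trained CuBERT checkpoint rather than a natural-language BERT.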
Last updated: 2020-08-19