Pre-trained Contextual Embedding of Source Code
arXiv - CS - Software Engineering. Pub Date: 2019-12-21, DOI: arxiv-2001.00059
Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi

The source code of a program not only serves as a formal description of an executable task, but it also serves to communicate developer intent in a human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation. This makes it possible to successfully apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and other works have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%. We also show its superiority when fine-tuned with smaller datasets, and over fewer epochs. We further evaluate CuBERT's effectiveness on a joint classification, localization and repair task involving prediction of two pointers.
Updated: 2020-01-04

 
