Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
arXiv - CS - Software Engineering. Pub Date: 2020-03-17. arXiv: 2003.07914
Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
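
The abstract does not spell out the mechanism, but the paper's open-vocabulary model is built on subword units learned with byte-pair encoding (BPE): frequent identifier fragments become vocabulary entries, and anything unseen falls back to smaller fragments, ultimately single characters. As a rough illustration of why this sidesteps the out-of-vocabulary problem, below is a minimal, self-contained BPE sketch in Python; the toy corpus and helper names are invented for illustration and are not taken from the paper's released artifacts.

    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs across the segmented vocabulary."""
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def apply_merge(symbols, pair):
        """Replace each occurrence of `pair` with its concatenation."""
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return tuple(out)

    def learn_bpe(word_freqs, num_merges):
        """Greedily learn merge operations from a {word: frequency} corpus."""
        vocab = Counter({tuple(w): f for w, f in word_freqs.items()})
        merges = []
        for _ in range(num_merges):
            pairs = pair_counts(vocab)
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                new_vocab[apply_merge(symbols, best)] += freq
            vocab = new_vocab
            merges.append(best)
        return merges

    def segment(word, merges):
        """Split a (possibly never-seen) word by replaying the learned merges."""
        symbols = tuple(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        return list(symbols)

    # Hypothetical toy corpus of identifier frequencies, for illustration only.
    corpus = {"getlist": 10, "getname": 8, "setname": 6, "listsize": 5}
    merges = learn_bpe(corpus, num_merges=12)
    print(segment("getlistsize", merges))
    # -> ['getlist', 's', 'i', 'z', 'e']: the frequent fragment "getlist" is
    # kept whole, while the unseen tail falls back to single characters, so no
    # token is ever out-of-vocabulary.

Because every word decomposes into units from a small, fixed subword inventory, the model's softmax layer stays bounded regardless of how many fresh identifier names the corpus introduces, which is what lets such a model scale to the 13,362-project corpus the abstract describes.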

Updated: 2020-03-19