Memorization and Generalization in Neural Code Intelligence Models
arXiv - CS - Software Engineering. Pub Date: 2021-06-16, DOI: arxiv-2106.08704
Md Rafiqul Islam Rabin, Aftab Hussain, Vincent J. Hellendoorn, Mohammad Amin Alipour

Deep Neural Networks (DNNs) are increasingly used in software engineering and code intelligence tasks. These powerful tools can learn highly generalizable patterns from large datasets through millions of parameters. At the same time, training DNNs means walking a knife's edge, because their large capacity also renders them prone to memorizing data points. While traditionally thought of as an aspect of over-training, recent work suggests that the memorization risk manifests especially strongly when the training data is noisy and memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as GitHub, which, due to their sheer size, cannot be manually inspected and evaluated. We evaluate the memorization and generalization tendencies of neural code intelligence models through a case study across several benchmarks and model families, leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset. In addition to reinforcing prior general findings about the extent of memorization in DNNs, our results shed light on the impact of noisy datasets on training.
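The "targeted noise" approach mentioned above is commonly realized by relabeling a random subset of training examples: since the corrupted labels carry no learnable signal, the model's accuracy on that subset directly measures memorization. Below is a minimal, hypothetical Python sketch of this idea; the dataset shape (one label per code snippet, as in a method-name-prediction task), the `inject_label_noise` helper, and the `noise_rate` parameter are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of targeted label-noise injection for measuring memorization.
# Assumptions (not from the paper): a flat list of labeled examples and a
# fixed fraction `noise_rate` of labels to corrupt.
import random

def inject_label_noise(examples, labels, noise_rate=0.1, seed=42):
    """Relabel a random `noise_rate` fraction of examples with a wrong label.

    Returns the noisy labels and the corrupted indices, so that training
    accuracy on the corrupted subset can later be read as a direct
    measure of memorization.
    """
    rng = random.Random(seed)
    label_set = sorted(set(labels))  # assumes at least two distinct labels
    n_noisy = int(len(labels) * noise_rate)
    noisy_indices = rng.sample(range(len(labels)), n_noisy)

    noisy_labels = list(labels)
    for i in noisy_indices:
        # Pick a replacement label that differs from the original one.
        noisy_labels[i] = rng.choice([l for l in label_set if l != labels[i]])
    return noisy_labels, noisy_indices

if __name__ == "__main__":
    # Toy usage: three code snippets labeled with their method names.
    snippets = ["def add(a, b): ...", "def read_file(p): ...", "def max3(a, b, c): ..."]
    labels = ["add", "read_file", "max3"]
    noisy, idx = inject_label_noise(snippets, labels, noise_rate=0.34)
    print(noisy, idx)
```

A model trained on the noisy set can only fit the examples at `noisy_indices` by memorizing them, so tracking its accuracy on that subset over training (versus held-out generalization accuracy) separates the two tendencies.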

Updated: 2021-06-17