Memorization and Generalization in Neural Code Intelligence Models
arXiv - CS - Software Engineering Pub Date : 2021-06-16 , DOI: arxiv-2106.08704 Md Rafiqul Islam Rabin, Aftab Hussain, Vincent J. Hellendoorn, Mohammad Amin Alipour
Deep Neural Networks (DNN) are increasingly commonly used in software
engineering and code intelligence tasks. These are powerful tools that are
capable of learning highly generalizable patterns from large datasets through
millions of parameters. At the same time, training DNNs means walking a knife's
edge, because their large capacity also renders them prone to memorizing data
points. While traditionally thought of as an aspect of over-training, recent
work suggests that the memorization risk manifests especially strongly when the
training datasets are noisy and memorization is the only recourse.
Unfortunately, most code intelligence tasks rely on rather noise-prone and
repetitive data sources, such as GitHub, which, due to their sheer size, cannot
be manually inspected and evaluated. We evaluate the memorization and
generalization tendencies in neural code intelligence models through a case
study across several benchmarks and model families by leveraging established
approaches from other fields that use DNNs, such as introducing targeted noise
into the training dataset. In addition to reinforcing prior general findings
about the extent of memorization in DNNs, our results shed light on the impact
of noisy datasets in training.
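One established approach the abstract mentions is introducing targeted noise into the training dataset and observing whether the model can only fit the corrupted examples by memorizing them. The sketch below is a minimal, hypothetical illustration of that idea, randomly reassigning a fraction of labels in a toy dataset; the function name, label space, and noise rate are assumptions for illustration, not the paper's actual procedure.

```python
import random

def inject_label_noise(labels, noise_rate, label_space, seed=0):
    """Reassign a fraction `noise_rate` of labels to a different label
    drawn from `label_space` (hypothetical helper; the paper's actual
    noise-injection setup may differ)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = int(len(noisy) * noise_rate)
    # Pick distinct positions to corrupt, then force each one to change.
    for i in rng.sample(range(len(noisy)), n_noisy):
        alternatives = [lab for lab in label_space if lab != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

# Example: corrupt 20% of method-name labels in a toy code dataset.
labels = ["add", "sub", "mul", "div", "add", "sub", "mul", "div", "add", "sub"]
noisy = inject_label_noise(labels, 0.2, ["add", "sub", "mul", "div"])
changed = sum(a != b for a, b in zip(labels, noisy))
```

Because each corrupted position is guaranteed to receive a different label, exactly `noise_rate` of the examples carry wrong labels; a model that still reaches near-perfect training accuracy on such a set must be memorizing those examples rather than generalizing.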
Updated: 2021-06-17