Improve Language Modeling for Code Completion Through Learning General Token Repetition of Source Code with Optimized Memory
International Journal of Software Engineering and Knowledge Engineering (IF 0.6), Pub Date: 2020-02-12, DOI: 10.1142/s0218194019400229
Yixiao Yang, Xiang Chen, Jiaguang Sun

In recent years, applying language models to source code has become the state-of-the-art approach to code completion. Compared with natural language, however, code exhibits much stronger repetition: a variable declared once is typically used many times in the code that follows, and cloned code and templates likewise repeat tokens. Capturing this token repetition is therefore important. At the same time, variables and types are usually named differently across projects, so a model trained on a finite data set will encounter many unseen variables and types in another data set. How to model the semantics of such unseen data, and how to predict it from the patterns of token repetition, are two challenges in code completion. Hence, in this paper, token repetition is modelled as a graph, and we propose a novel REP model, based on a deep graph neural network, to learn the token repetition of code. The REP model identifies the edge connections of the graph in order to recognize repeated tokens. To predict whether a given token repeats an earlier one, the information of all preceding tokens must be considered, so we use a memory neural network (MNN) to model the semantics of each distinct token and make the REP framework more targeted. The experiments indicate that the REP model performs better than an LSTM model. Comparing against the Attention-Pointer network, we also find that the attention mechanism does not help in all situations: the proposed REP model achieves similar or slightly better prediction accuracy than the Attention-Pointer network while consuming less training time. We also identify another attention mechanism that could further improve prediction accuracy.
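
The abstract gives no implementation details, but the idea of modelling token repetition as a graph can be illustrated with a minimal sketch. The Python snippet below (our own illustration; the function and variable names are hypothetical, not the paper's) links each occurrence of a lexical token to its earlier occurrences, producing the edge set whose connections a repetition model would learn to predict:

```python
from collections import defaultdict

def repetition_graph(tokens):
    """Build a token-repetition graph: nodes are token positions, and an
    edge (i, j) with i < j connects two occurrences of the same lexical
    token. Predicting these edges amounts to predicting which earlier
    token the next token repeats."""
    seen = defaultdict(list)        # token text -> positions seen so far
    edges = []
    for j, tok in enumerate(tokens):
        for i in seen[tok]:         # connect to every earlier occurrence
            edges.append((i, j))
        seen[tok].append(j)
    return edges

tokens = ["int", "count", "=", "0", ";", "count", "=", "count", "+", "1", ";"]
print(repetition_graph(tokens))
# [(1, 5), (2, 6), (1, 7), (5, 7), (4, 10)]
```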
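
The memory-network component can be sketched in a similar spirit: keep one memory vector per distinct token, update it each time the token occurs, and score the current context against these memories to decide which earlier token, if any, is being repeated. The PyTorch sketch below is our own minimal reading of that description, not the paper's implementation; the class name, the GRU-cell update rule, and the dot-product scoring are all assumptions made for illustration:

```python
import torch
import torch.nn as nn

class TokenMemory(nn.Module):
    """Minimal per-distinct-token memory: each token id owns a memory
    vector updated at every occurrence; the current contextual state is
    matched against all memories to score candidate repetitions."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.update = nn.GRUCell(hidden_dim, hidden_dim)  # memory update rule (assumed)
        self.hidden_dim = hidden_dim

    def forward(self, hidden_states, token_ids):
        # hidden_states: (seq_len, hidden_dim) contextual states, e.g. from an RNN
        # token_ids: the lexical token id at each position
        memory = {}                       # token id -> memory vector
        scores_per_step = []
        for h, tok in zip(hidden_states, token_ids):
            if memory:
                keys = torch.stack(list(memory.values()))   # (n_distinct, d)
                scores = keys @ h                           # repetition scores
                scores_per_step.append(dict(zip(memory.keys(), scores.tolist())))
            else:
                scores_per_step.append({})                  # nothing to repeat yet
            # update (or create) this token's memory from the current context
            prev = memory.get(tok, torch.zeros(self.hidden_dim))
            memory[tok] = self.update(h.unsqueeze(0), prev.unsqueeze(0)).squeeze(0)
        return scores_per_step

mem = TokenMemory(hidden_dim=8)
states = torch.randn(4, 8)                # stand-in for real contextual states
print(mem(states, token_ids=[3, 7, 3, 7])[3])  # scores when predicting step 3
```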
