Improving the Robustness to Data Inconsistency between Training and Testing for Code Completion by Hierarchical Language Model
arXiv - CS - Software Engineering. Pub Date: 2020-03-18, DOI: arxiv-2003.08080
Yixiao Yang

In software engineering, applying language models to the token sequence of source code is the state-of-the-art approach for building code recommendation systems. The syntax tree of source code has a hierarchical structure, and ignoring the characteristics of this tree structure decreases model performance. The standard LSTM model handles sequential data, and its performance drops sharply when noisy, unseen data is distributed throughout the test suite. Because code has free naming conventions, a model trained on one project commonly encounters many unknown words on another project. If many unseen words are marked as UNK, as is typically done in natural language processing, the number of UNK tokens will far exceed the combined count of the most frequent words. In an extreme case, simply predicting UNK everywhere could achieve very high prediction accuracy, so such a solution cannot reflect the true performance of a model when it encounters noisy, unseen data. In this paper, we mark only a small number of rare words as UNK and report the prediction performance of models under both in-project and cross-project evaluation. We propose a novel Hierarchical Language Model (HLM) to improve the robustness of the LSTM model so that it can handle the inconsistency of data distribution between training and testing. HLM takes the hierarchical structure of the code tree into consideration when predicting code: it uses a BiLSTM to generate an embedding for each sub-tree according to its hierarchy, and collects the embeddings of the sub-trees in the context to predict the next code token. Experiments on in-project and cross-project data sets indicate that HLM performs better than the state-of-the-art LSTM model in dealing with the data inconsistency between training and testing, achieving an average improvement of 11.2% in prediction accuracy.
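The following is a minimal PyTorch sketch of the hierarchical idea described in the abstract, not the authors' implementation: each sub-tree is encoded bottom-up by a BiLSTM into a single embedding, and an LSTM over the sequence of sub-tree embeddings in the context predicts the next code token. The class name, layer sizes, and toy input are illustrative assumptions.

import torch
import torch.nn as nn


class HierarchicalLanguageModelSketch(nn.Module):
    """Illustrative sketch of a hierarchical language model over code sub-trees."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM that summarizes the tokens of one sub-tree into a single vector.
        self.subtree_encoder = nn.LSTM(embed_dim, hidden_dim // 2,
                                       batch_first=True, bidirectional=True)
        # LSTM over the sequence of sub-tree embeddings (the hierarchical context).
        self.context_rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode_subtree(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (subtree_len,) token ids of one flattened sub-tree.
        emb = self.embed(token_ids).unsqueeze(0)           # (1, len, embed_dim)
        _, (h_n, _) = self.subtree_encoder(emb)            # h_n: (2, 1, hidden_dim // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)         # (1, hidden_dim)

    def forward(self, subtrees: list) -> torch.Tensor:
        # subtrees: list of token-id tensors, one per sub-tree already seen in context.
        ctx = torch.stack([self.encode_subtree(t) for t in subtrees], dim=1)  # (1, n, hidden_dim)
        out, _ = self.context_rnn(ctx)
        return self.out(out[:, -1, :])                     # logits for the next code token


if __name__ == "__main__":
    model = HierarchicalLanguageModelSketch(vocab_size=1000)
    # Two toy sub-trees of token ids; real input would come from the parsed syntax tree.
    context = [torch.tensor([4, 17, 9]), torch.tensor([22, 5])]
    logits = model(context)
    print(logits.shape)  # torch.Size([1, 1000])

Concatenating the final forward and backward hidden states of the BiLSTM gives one fixed-size vector per sub-tree, so the context model sees the program one hierarchy level at a time rather than as a flat token stream.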

Updated: 2020-03-19