Towards Demystifying Dimensions of Source Code Embeddings, arXiv - CS - Programming Languages


arXiv - CS - Programming Languages, Pub Date: 2020-08-29, DOI: arxiv-2008.13064
Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

Source code representations are key to applying machine learning techniques for processing and analyzing programs. A popular approach to representing source code is neural source code embeddings, which represent programs as high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although these embeddings have been successful, little is known about the contents of the vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with that of classifiers built on handcrafted features. Our results suggest that the handcrafted features can perform very close to the high-dimensional code2vec embeddings, and that the information gains are more evenly distributed in the code2vec embeddings than in the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve as a stepping stone toward principled analysis and evaluation of these code representations.
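
The dimension-level analysis the abstract mentions (ranking dimensions by information gain and dropping the low-gain ones) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the vectors are synthetic stand-ins rather than real code2vec embeddings, each dimension is binarized at its median before computing the gain, the SVM training step is omitted, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one dimension, binarized at its median (an assumption)."""
    threshold = sorted(column)[len(column) // 2]
    low = [y for x, y in zip(column, labels) if x < threshold]
    high = [y for x, y in zip(column, labels) if x >= threshold]
    total = len(labels)
    conditional = (len(low) / total) * entropy(low) + (len(high) / total) * entropy(high)
    return entropy(labels) - conditional


def rank_dimensions(vectors, labels):
    """Return (gain, dimension index) pairs sorted from most to least informative."""
    dims = len(vectors[0])
    gains = [(information_gain([v[d] for v in vectors], labels), d) for d in range(dims)]
    return sorted(gains, reverse=True)


def keep_top_dimensions(vectors, ranked, k):
    """Project every vector onto the k dimensions with the highest gain."""
    kept = sorted(d for _, d in ranked[:k])
    return [[v[d] for d in kept] for v in vectors], kept


# Synthetic stand-in data (NOT real code2vec vectors): dimension 0 carries
# the label signal, the remaining seven dimensions are pure noise.
random.seed(0)
labels = [i % 2 for i in range(200)]
vectors = [[y + random.gauss(0, 0.1)] + [random.gauss(0, 1) for _ in range(7)] for y in labels]

ranked = rank_dimensions(vectors, labels)
reduced, kept = keep_top_dimensions(vectors, ranked, k=3)
```

In a study like the one described, the reduced vectors would then be passed to a binary classifier, and its accuracy compared as more low-gain dimensions are removed.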


Updated: 2020-09-30
