Embedding Java Classes with code2vec: Improvements from Variable Obfuscation, arXiv - CS - Programming Languages


arXiv - CS - Programming Languages, Pub Date: 2020-04-06, DOI: arxiv-2004.02942
Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay

Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code.
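The two ideas in the abstract, obfuscating variable names and aggregating method embeddings into a class-level embedding, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `VAR_<index>` placeholder scheme, and the choice of element-wise mean or max as the aggregation are all assumptions made for illustration. The regex-based renaming is also a simplification. It would rename matching words inside strings and comments too, so a real pipeline would more likely rename identifiers through a Java parser.

```python
import re
from statistics import fmean


def obfuscate_variables(source, variable_names):
    """Replace each listed variable name with a placeholder (VAR_0, VAR_1, ...).

    Hypothetical helper: the placeholder scheme is an assumption, and the
    whole-word regex match does not distinguish code from strings or comments.
    """
    mapping = {name: f"VAR_{i}" for i, name in enumerate(variable_names)}
    if not mapping:
        return source
    # \b anchors ensure that only whole identifiers are replaced.
    alternatives = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], source)


def aggregate_method_embeddings(method_vectors, how="mean"):
    """Combine per-method vectors into one class-level vector.

    Hypothetical helper: element-wise mean and max are two simple choices
    of aggregation, assumed here for illustration.
    """
    if not method_vectors:
        raise ValueError("a class needs at least one method embedding")
    dimension = len(method_vectors[0])
    if any(len(vector) != dimension for vector in method_vectors):
        raise ValueError("all method embeddings must have the same length")
    columns = list(zip(*method_vectors))
    if how == "mean":
        return [fmean(column) for column in columns]
    if how == "max":
        return [max(column) for column in columns]
    raise ValueError(f"unknown aggregation: {how}")


if __name__ == "__main__":
    java_snippet = "int count = 0; for (int i = 0; i < n; i++) { count += i; }"
    print(obfuscate_variables(java_snippet, ["count", "i", "n"]))
    # Toy 2-dimensional vectors standing in for real method embeddings.
    print(aggregate_method_embeddings([[1.0, 4.0], [3.0, 2.0]], how="mean"))
```

In this sketch the obfuscated source would be what the embedding model sees during training, so it cannot rely on the original names. The aggregation step assumes every method of a class has already been mapped to a fixed-length vector.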


Updated: 2020-04-08
