Exploring Software Naturalness through Neural Language Models
arXiv - CS - Programming Languages. Pub Date: 2020-06-22, DOI: arxiv-2006.12641
Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, Giacomo Domeniconi

The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis by using a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of the AST. Our results show that transformer-based language models achieve high accuracy on the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our approach obtains vulnerability identification results comparable to graph-based approaches that rely heavily on compilers for feature extraction.
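The sequence labeling task described above assigns each source token a label derived from the AST. As a minimal sketch of that idea (not the paper's actual setup, which probes a pre-trained transformer; the function name and label scheme here are illustrative), the following uses Python's standard-library `ast` module to produce per-token AST-type labels that such a probe would try to predict from raw code alone:

```python
import ast


def ast_tags(source: str):
    """Label each identifier/literal in `source` with the type name of
    the AST node that produced it. These (token, AST-type) pairs are
    the kind of supervision an AST-tagging probe could train against."""
    tree = ast.parse(source)
    tags = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            tags.append((node.id, type(node).__name__))
        elif isinstance(node, ast.Constant):
            tags.append((str(node.value), type(node).__name__))
    return tags


print(ast_tags("x = y + 1"))
# → [('x', 'Name'), ('y', 'Name'), ('1', 'Constant')]
```

A model that has implicitly learned syntactic structure should predict these labels from the token sequence without ever seeing a parser's output.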

Updated: 2020-06-25