Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy,arXiv - CS - Software Engineering



arXiv - CS - Software Engineering. Pub Date: 2021-09-17, DOI: arxiv-2109.08780
Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, Nan Duan, Neel Sundaresan, Alexey Svyatkovskiy

Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they cannot leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach that leverages the syntactic hierarchies of source code to incorporate entire file-level context into a fixed-length window. Using concrete syntax trees of each source file, we extract syntactic hierarchies and integrate them into the context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in the Python programming language, achieving a new state of the art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, and method body completion/code summarization conditioned on file-level context.
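
The idea of fitting file-level context into a fixed window by hiding more specific scopes can be illustrated with a minimal sketch. This is not the authors' implementation: it uses Python's built-in `ast` module (an abstract, not concrete, syntax tree), counts whitespace-separated words as a stand-in for model tokens, and only handles top-level functions. The names `render`, `fit_to_window`, `focus`, `budget`, and the four detail levels are all hypothetical choices made for illustration.

```python
import ast


def render(source: str, focus: str, level: int) -> str:
    """Render `source`, always keeping the function named `focus` in full.

    Other top-level functions are reduced according to `level` (assumed scheme):
    0 = full text, 1 = signature + docstring, 2 = signature only, 3 = dropped.
    Non-function statements (imports, globals, classes) are kept unchanged.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    out = []
    for node in tree.body:
        # Include decorator lines, which start before node.lineno.
        decorators = getattr(node, "decorator_list", [])
        start = min([d.lineno for d in decorators] + [node.lineno])
        segment = lines[start - 1 : node.end_lineno]
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if not is_func or node.name == focus or level == 0:
            out.extend(segment)
            continue
        if level == 3:
            continue  # least relevant scope: removed from view entirely
        first = node.body[0]
        signature = lines[start - 1 : first.lineno - 1]
        if not signature:
            # One-line function such as `def f(): return 1`: keep it as-is.
            out.extend(segment)
            continue
        out.extend(signature)
        if level == 1 and ast.get_docstring(node) is not None:
            out.extend(lines[first.lineno - 1 : first.end_lineno])
        # Placeholder marking the hidden body.
        out.append(" " * (node.col_offset + 4) + "...")
    return "\n".join(out)


def fit_to_window(source: str, focus: str, budget: int) -> str:
    """Return the most detailed rendering whose word count fits `budget`."""
    text = source
    for level in range(4):
        text = render(source, focus, level)
        if len(text.split()) <= budget:
            return text
    return text  # even the sparsest rendering may exceed the budget


# Small illustrative input (hypothetical).
SOURCE = '''import math

def helper(x):
    """Double x."""
    y = x * 2
    return y

def target(a):
    return helper(a) + math.pi
'''
```

Calling `fit_to_window(SOURCE, "target", budget)` with a shrinking `budget` first hides the body of `helper`, then its docstring, and finally the function itself, while `target` and the import stay visible. A real system would use the model's tokenizer for the budget and a concrete syntax tree to preserve comments and formatting.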


Updated: 2021-09-21
