Learning to Format Coq Code Using Language Models,arXiv - CS - Computation and Language


arXiv - CS - Computation and Language. Pub Date: 2020-06-18, DOI: arxiv-2006.16743
Pengyu Nie, Karl Palmskog, Junyi Jessy Li, Milos Gligoric

Should the final right bracket in a record declaration be on a separate line? Should arguments to the rewrite tactic be separated by a single space? Coq code tends to be written in distinct manners by different people and teams. The expressiveness, flexibility, and extensibility of Coq's languages and notations mean that Coq projects have a wide variety of recognizable coding styles, sometimes explicitly documented as conventions on naming and formatting. In particular, even inexperienced users can distinguish vernacular using the standard library and plain Ltac from idiomatic vernacular using the Mathematical Components (MathComp) library and SSReflect. While coding conventions are important for comprehension and maintenance, they are costly to document and enforce. Rule-based formatters, such as Coq's beautifier, have limited flexibility and only capture small fractions of desired conventions in large verification projects. We believe that application of language models, a class of Natural Language Processing (NLP) techniques for capturing regularities in corpora, can provide a solution to this conundrum. More specifically, we believe that an approach based on automatically learning conventions from existing Coq code, and then suggesting idiomatic code to users in the proper context, can be superior to manual approaches and static analysis tools, both in terms of effort and results. As a first step, we here outline initial models to learn and suggest space formatting in Coq files, with a preliminary implementation for Coq 8.10, evaluated on a corpus based on MathComp 1.9.0 which comprises 164k lines of Coq code from four core projects.
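
To make the idea of learning space formatting from existing code concrete, here is a toy sketch in Python. It is not the models outlined in the paper: it is a plain count-based bigram model that records which whitespace most often separates two adjacent token kinds in a corpus and then suggests that whitespace for new code. The tokenizer is a rough regular expression rather than Coq's lexer, and every name (`tokens_with_gaps`, `kind`, `train`, `suggest_gap`, `reformat`) is hypothetical and introduced only for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a crude tokenizer (identifiers, numbers, or any single
# non-space character). A real tool would use Coq's own lexer.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_']*|\d+|\S")


def tokens_with_gaps(source):
    """Yield (gap, token) pairs; gap is the whitespace preceding token."""
    pos = 0
    for match in TOKEN.finditer(source):
        yield source[pos:match.start()], match.group()
        pos = match.end()


def kind(token):
    """Abstract identifiers and numbers so counts generalize across names."""
    if token[0].isalpha() or token[0] == "_":
        return "ID"
    if token[0].isdigit():
        return "NUM"
    return token  # punctuation is kept as-is


def train(corpus):
    """Count the whitespace seen between each pair of adjacent token kinds."""
    counts = defaultdict(Counter)
    for source in corpus:
        prev = "<BOF>"  # marker for the beginning of a file
        for gap, tok in tokens_with_gaps(source):
            counts[(prev, kind(tok))][gap] += 1
            prev = kind(tok)
    return counts


def suggest_gap(counts, prev_tok, next_tok, default=" "):
    """Return the most frequent whitespace for this pair, or a default."""
    seen = counts.get((kind(prev_tok), kind(next_tok)))
    return seen.most_common(1)[0][0] if seen else default


def reformat(counts, source):
    """Re-space source using the learned gaps (toy: ignores wider context)."""
    out = []
    prev = None
    for _gap, tok in tokens_with_gaps(source):
        if prev is not None:
            out.append(suggest_gap(counts, prev, tok))
        out.append(tok)
        prev = tok
    return "".join(out)


corpus = ["Lemma foo : a = b.\nProof. by rewrite x y. Qed.\n"]
counts = train(corpus)
print(reformat(counts, "rewrite   x    y ."))
```

Because the sketch conditions only on the two neighbouring token kinds, it cannot decide things such as where line breaks belong in a record declaration; capturing that kind of convention would need the richer context that a learned language model is meant to provide.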


Updated: 2020-07-01
