Learning lenient parsing & typing via indirect supervision
Empirical Software Engineering (IF 4.1), Pub Date: 2021-03-05, DOI: 10.1007/s10664-021-09942-y
Toufique Ahmed, Premkumar Devanbu, Vincent J. Hellendoorn

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common on StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or the students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes the code more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse and type imperfect code requires a large training set with many pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel, indirectly supervised approach to training a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of code fragments and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic the corruptions found in StackOverflow and student data. Using this data, we train high-capacity Transformer models to overcome both fragmentation and corruption. With this novel approach, we achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach performs well on shorter student error programs and achieves best-in-class performance on longer programs with more than 400 tokens. We also show that by blending DeepFix and our tool, we achieve 77% accuracy, outperforming all previously reported student error-correction tools.
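To make the error-seeding idea in the abstract concrete, the following is a minimal sketch: starting from a correct GitHub fragment, randomly drop or duplicate delimiter tokens (the kinds of small corruptions common in StackOverflow and student code), while keeping the untouched fragment as the required output. The corruption types, probabilities, and function names are illustrative assumptions, not the paper's exact procedure.

    import random

    # Hypothetical sketch of the error-seeding step: given a correct code
    # fragment (as a token list), produce a corrupted copy while keeping the
    # original as the required output. The corruption rules here (dropping or
    # duplicating delimiters) are assumptions mimicking common StackOverflow
    # and student errors, not the paper's exact rules.
    DELIMITERS = {";", "{", "}", "(", ")", ","}

    def seed_errors(tokens, p=0.05, rng=random):
        """Return a corrupted copy of `tokens`; the caller pairs it with the
        untouched original to form one (input, target) training example."""
        corrupted = []
        for tok in tokens:
            r = rng.random()
            if tok in DELIMITERS and r < p:
                continue                      # drop a delimiter (e.g. missing ';')
            corrupted.append(tok)
            if tok in DELIMITERS and r > 1 - p:
                corrupted.append(tok)         # duplicate a delimiter (e.g. '}}')
        return corrupted

    # Training pairs map corrupted fragments to the correct fragment
    # (plus tree/type annotations in the paper's setting).
    fragment = ["int", "x", "=", "foo", "(", "y", ")", ";"]
    pair = (seed_errors(fragment), fragment)

In this setup the supervision is indirect: no human ever labels a broken fragment with its repair; the correct code and its synthetic corruption together supply the training signal.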




Updated: 2021-03-07