Learning lenient parsing & typing via indirect supervision
Empirical Software Engineering (IF 4.1), Pub Date: 2021-03-05, DOI: 10.1007/s10664-021-09942-y
Toufique Ahmed, Premkumar Devanbu, Vincent J. Hellendoorn

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common on StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or the students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes the code more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse and type imperfect code requires a large training set with many pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel, indirectly supervised approach to training a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of code fragments and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic the corruptions found in StackOverflow and student data. Using this data, we train high-capacity Transformer models to overcome both fragmentation and corruption. With this novel approach, we achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach performs well on shorter student error programs and achieves best-in-class performance on longer programs with more than 400 tokens. We also show that by blending DeepFix and our tool, we achieve 77% accuracy, outperforming all previously reported student error-correction tools.
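To make the error-seeding idea in the abstract concrete, the following is a minimal sketch: starting from a correct GitHub fragment, randomly drop or duplicate delimiter tokens (the kinds of small corruptions common in StackOverflow and student code), while keeping the untouched fragment as the required output. The corruption types, probabilities, and function names are illustrative assumptions, not the paper's exact procedure.

    import random

    # Hypothetical sketch of the error-seeding step: given a correct code
    # fragment (as a token list), produce a corrupted copy while keeping the
    # original as the required output. The corruption rules here (dropping or
    # duplicating delimiters) are assumptions mimicking common StackOverflow
    # and student errors, not the paper's exact rules.
    DELIMITERS = {";", "{", "}", "(", ")", ","}

    def seed_errors(tokens, p=0.05, rng=random):
        """Return a corrupted copy of `tokens`; the caller pairs it with the
        untouched original to form one (input, target) training example."""
        corrupted = []
        for tok in tokens:
            r = rng.random()
            if tok in DELIMITERS and r < p:
                continue                      # drop a delimiter (e.g. missing ';')
            corrupted.append(tok)
            if tok in DELIMITERS and r > 1 - p:
                corrupted.append(tok)         # duplicate a delimiter (e.g. '}}')
        return corrupted

    # Training pairs map corrupted fragments to the correct fragment
    # (plus tree/type annotations in the paper's setting).
    fragment = ["int", "x", "=", "foo", "(", "y", ")", ";"]
    pair = (seed_errors(fragment), fragment)

In this setup the supervision is indirect: no human ever labels a broken fragment with its repair; the correct code and its synthetic corruption together supply the training signal.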




Updated: 2021-03-07