The effectiveness of data augmentation in code readability classification
Information and Software Technology (IF 3.8) Pub Date: 2020-07-20, DOI: 10.1016/j.infsof.2020.106378
Qing Mi, Yan Xiao, Zhi Cai, Xibin Jia

Context: Training deep learning models for code readability classification requires large datasets of high-quality pre-labeled data. However, acquiring manually labeled readability data is almost always time-consuming and expensive.

Objective: We therefore propose to introduce data augmentation approaches that artificially enlarge the training set, with the aim of reducing the risk of overfitting caused by the scarcity of readability data and, ultimately, of improving classification accuracy.

Method: Based on domain-specific knowledge, we create transformed versions of code snippets by manipulating the original data along aspects such as comments, indentation, and the names of classes, methods, and variables. Beyond these basic transformations, we also explore the use of Auxiliary Classifier GANs (ACGANs) to produce synthetic data.

Results: To evaluate the proposed approach, we conduct a set of experiments. The results show that the classification performance of deep neural networks improves significantly when they are trained on the augmented corpus, reaching a state-of-the-art accuracy of 87.38%.

Conclusion: We consider the findings of this study to be primary evidence of the effectiveness of data augmentation in the field of code readability classification.
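The abstract describes the rule-based transformations only at a high level. As a concrete illustration of the mechanics, the following is a minimal Python sketch, not the authors' implementation: the function names, the naive regex-based processing, and the toy Java snippet are all assumptions made for illustration.

    import re

    def strip_comments(snippet: str) -> str:
        """Drop /* ... */ block comments and // line comments
        (naive: ignores comment markers inside string literals)."""
        snippet = re.sub(r"/\*.*?\*/", "", snippet, flags=re.DOTALL)
        return re.sub(r"//[^\n]*", "", snippet)

    def reindent(snippet: str, width: int = 8) -> str:
        """Rewrite each line's leading whitespace, treating a tab or
        four spaces as one indentation level."""
        out = []
        for line in snippet.splitlines():
            body = line.lstrip(" \t")
            prefix = line[: len(line) - len(body)]
            levels = prefix.count("\t") + prefix.count("    ")
            out.append(" " * (width * levels) + body)
        return "\n".join(out)

    def rename_identifiers(snippet: str, mapping: dict[str, str]) -> str:
        """Consistently rename class/method/variable names by
        whole-word substitution."""
        for old, new in mapping.items():
            snippet = re.sub(rf"\b{re.escape(old)}\b", new, snippet)
        return snippet

    # A toy Java snippet and one augmented variant derived from it.
    original = """\
    /* Sum the elements of an array. */
    public int sumValues(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;  // accumulate
        }
        return total;
    }
    """

    variant = rename_identifiers(
        reindent(strip_comments(original), width=8),
        {"sumValues": "m1", "values": "a1", "total": "t1", "v": "x1"},
    )
    print(variant)

Whether a given transformation preserves or changes the readability label (stripping comments, for instance, plausibly lowers readability) is a labeling decision made from domain knowledge; the sketch only shows how variants are generated. Likewise, the abstract does not specify the ACGAN architecture or how code snippets are represented. Assuming snippets have already been encoded as fixed-length feature vectors (the feat_dim of 256 below is hypothetical), a skeletal PyTorch ACGAN pair illustrating the auxiliary-classifier idea might look like this:

    import torch
    import torch.nn as nn

    class ACGANGenerator(nn.Module):
        """Generator conditioned on the target class (readable / less readable)."""
        def __init__(self, z_dim: int = 100, n_classes: int = 2, feat_dim: int = 256):
            super().__init__()
            self.embed = nn.Embedding(n_classes, z_dim)
            self.net = nn.Sequential(
                nn.Linear(z_dim, 512), nn.ReLU(),
                nn.Linear(512, feat_dim), nn.Tanh(),
            )

        def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            # One common ACGAN conditioning choice: modulate the noise
            # with a learned class embedding.
            return self.net(z * self.embed(labels))

    class ACGANDiscriminator(nn.Module):
        """Discriminator with ACGAN's auxiliary head: besides real/fake,
        it also predicts the class label of its input."""
        def __init__(self, feat_dim: int = 256, n_classes: int = 2):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(feat_dim, 512), nn.LeakyReLU(0.2))
            self.adv_head = nn.Linear(512, 1)          # real vs. generated
            self.cls_head = nn.Linear(512, n_classes)  # auxiliary classifier

        def forward(self, x: torch.Tensor):
            h = self.body(x)
            return self.adv_head(h), self.cls_head(h)

    # Generate four class-conditioned synthetic feature vectors.
    G, D = ACGANGenerator(), ACGANDiscriminator()
    z = torch.randn(4, 100)
    y = torch.randint(0, 2, (4,))          # desired readability labels
    adv_logits, cls_logits = D(G(z, y))

During training, the discriminator is optimized on both heads, and the generator is rewarded when its samples are both judged real and classified as the intended readability label; the class-conditioned samples can then augment the training corpus.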


