How well do pre-trained contextual language representations recommend labels for GitHub issues?
Knowledge-Based Systems (IF 8.8), Pub Date: 2021-09-10, DOI: 10.1016/j.knosys.2021.107476
Jun Wang 1, Xiaofang Zhang 1, Lin Chen 2

Motivation:

Open-source organizations use GitHub issues to collect user feedback, bug reports, and feature requests. Many issues do not have labels, which makes labeling a time-consuming task for maintainers. Recently, some researchers have applied deep learning to improve automated tagging of software artifacts. However, these studies rely on static pre-trained word vectors, which cannot represent the different meanings a word takes on in different contexts. Pre-trained contextual language representations, by contrast, have been shown to achieve outstanding performance on many NLP tasks.

Description:

In this paper, we study whether pre-trained contextual language models really outperform earlier language models for label recommendation in the GitHub issue scenario, and we offer suggestions for fine-tuning pre-trained contextual language representation models. First, we compare four deep learning models, three of which use traditional pre-trained word embeddings. Furthermore, we compare their performance when different corpora are used for pre-training.
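
The abstract does not include code, but as a rough illustration of the setup it describes, fine-tuning a pre-trained contextual model for issue label recommendation can be framed as multi-label text classification over the issue title and body. In the sketch below, the checkpoint (bert-base-uncased), the label set, the threshold, and the recommend_labels helper are illustrative assumptions, not the paper's actual experimental configuration.

```python
# Hedged sketch: recommending GitHub issue labels with a pre-trained contextual
# model (BERT), treated as multi-label classification. The model would first
# need fine-tuning on labeled issues; all names below are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["bug", "enhancement", "question"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid + BCE loss per label
)

def recommend_labels(issue_title: str, issue_body: str, threshold: float = 0.5):
    """Return every label whose predicted probability exceeds the threshold."""
    text = issue_title + " " + issue_body
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# Example call (meaningful only after fine-tuning on labeled issue data):
print(recommend_labels("App crashes on startup", "Stack trace attached ..."))
```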

Results:

The experimental results show that: (1) with large training data, the BERT model outperforms other deep learning language models such as Bi-LSTM, CNN, and RCNN, whereas with a small training set, CNN performs better than BERT; (2) further pre-training on domain-specific data can indeed improve model performance.
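
The "further pre-training on domain-specific data" mentioned in result (2) corresponds, in the usual setup, to continuing the masked-language-modeling objective on an in-domain corpus before fine-tuning. The sketch below shows one way this could look; the corpus file path, hyper-parameters, and output directory are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of domain-specific further pre-training: continue masked
# language modeling on GitHub issue text, then fine-tune the resulting
# checkpoint as in the earlier sketch. Paths and settings are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One issue (title + body) per line in a plain-text file -- hypothetical path.
corpus = load_dataset("text", data_files={"train": "github_issues.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-issues-further-pretrained",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```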

Conclusions:

When recommending labels for GitHub issues, pre-trained contextual language representations are the better choice provided the training dataset is large enough. Moreover, we discuss the experimental results and provide implications for improving label recommendation performance for GitHub issues.




Updated: 2021-09-15