How well do pre-trained contextual language representations recommend labels for GitHub issues?
Knowledge-Based Systems (IF 8.8), Pub Date: 2021-09-10, DOI: 10.1016/j.knosys.2021.107476
Jun Wang 1, Xiaofang Zhang 1, Lin Chen 2

Motivation:

Open-source organizations use GitHub issues to collect user feedback, bug reports, and feature requests. Many issues do not have labels, which makes labeling a time-consuming task for maintainers. Recently, some researchers have applied deep learning to improve automated tagging of software artifacts. However, these studies rely on static pre-trained word vectors, which cannot represent the different meanings a word takes on in different contexts. Pre-trained contextual language representations, by contrast, have been shown to achieve outstanding performance on many NLP tasks.

Description:

In this paper, we study whether pre-trained contextual language models really outperform earlier language models for label recommendation in the GitHub issue scenario, and we offer suggestions for fine-tuning pre-trained contextual language representation models. First, we compare four deep learning models, three of which use traditional pre-trained word embeddings. Furthermore, we compare their performance when different corpora are used for pre-training.
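
The abstract does not include code, but as a rough illustration of the setup it describes, fine-tuning a pre-trained contextual model for issue label recommendation can be framed as multi-label text classification over the issue title and body. In the sketch below, the checkpoint (bert-base-uncased), the label set, the threshold, and the recommend_labels helper are illustrative assumptions, not the paper's actual experimental configuration.

```python
# Hedged sketch: recommending GitHub issue labels with a pre-trained contextual
# model (BERT), treated as multi-label classification. The model would first
# need fine-tuning on labeled issues; all names below are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["bug", "enhancement", "question"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid + BCE loss per label
)

def recommend_labels(issue_title: str, issue_body: str, threshold: float = 0.5):
    """Return every label whose predicted probability exceeds the threshold."""
    text = issue_title + " " + issue_body
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# Example call (meaningful only after fine-tuning on labeled issue data):
print(recommend_labels("App crashes on startup", "Stack trace attached ..."))
```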

Results:

The experimental results show that: (1) with large training data, the BERT model outperforms other deep learning language models such as Bi-LSTM, CNN, and RCNN, whereas with a small training set, CNN performs better than BERT; (2) further pre-training on domain-specific data can indeed improve model performance.
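
The "further pre-training on domain-specific data" mentioned in result (2) corresponds, in the usual setup, to continuing the masked-language-modeling objective on an in-domain corpus before fine-tuning. The sketch below shows one way this could look; the corpus file path, hyper-parameters, and output directory are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of domain-specific further pre-training: continue masked
# language modeling on GitHub issue text, then fine-tune the resulting
# checkpoint as in the earlier sketch. Paths and settings are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One issue (title + body) per line in a plain-text file -- hypothetical path.
corpus = load_dataset("text", data_files={"train": "github_issues.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-issues-further-pretrained",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```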

Conclusions:

When recommending labels for GitHub issues, pre-trained contextual language representations are the better choice provided the training dataset is large enough. Moreover, we discuss the experimental results and provide implications for improving label recommendation performance for GitHub issues.




Updated: 2021-09-15