Changeset-Based Topic Modeling of Software Repositories
IEEE Transactions on Software Engineering (IF 7.4), Pub Date: 2020-10-01, DOI: 10.1109/tse.2018.2874960
Christopher S. Corley, Kostadin Damevski, Nicholas A. Kraft

The standard approach to applying text retrieval models to code repositories is to train models on documents representing program elements. However, code changes lead to model obsolescence and to the need to retrain the model from the latest snapshot. To address this, we previously introduced an approach that trains a model on documents representing changesets from a repository and demonstrated its feasibility for feature location. In this paper, we expand our work by investigating a second task (developer identification), the effects of including different changeset parts in the model, the repository characteristics that affect the accuracy of our approach, and the effects of the time invariance assumption on evaluation results. Our results demonstrate that our approach is as accurate as the standard approach for projects with most changes localized to a subset of the code, but less accurate when changes are highly distributed throughout the code. Moreover, our results demonstrate that context and messages are key to the accuracy of changeset-based models and that the time invariance assumption has a statistically significant effect on evaluation results, making them overly optimistic. Our findings indicate that our approach is a suitable alternative to the standard approach, providing comparable accuracy while eliminating retraining costs.
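
To make the idea of "changeset parts" concrete, the following is a minimal sketch of how a single changeset might be turned into a text document with selectable parts (added lines, removed lines, context lines, and the commit message). It is an illustration under stated assumptions, not the authors' pipeline: the function name `changeset_document`, the part names, the sample diff, and the letters-only tokenizer are all hypothetical, and the input is assumed to be a unified diff.

```python
import re

def changeset_document(message, diff_text,
                       parts=("additions", "removals", "context", "message")):
    """Build a token list from one changeset (hypothetical helper).

    `parts` selects which changeset parts contribute text to the document.
    """
    selected = []
    if "message" in parts:
        selected.append(message)

    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            in_hunk = False          # a new file section starts with headers
            continue
        if line.startswith("@@"):
            in_hunk = True           # hunk marker: body lines follow
            continue
        if not in_hunk or line.startswith("\\"):
            continue                 # skip file headers and "\ No newline" notes
        if line.startswith("+"):
            kind = "additions"
        elif line.startswith("-"):
            kind = "removals"
        else:
            kind = "context"
        if kind in parts:
            selected.append(line[1:])  # drop the one-character diff prefix

    # Simplified tokenizer (an assumption): lowercase runs of letters only.
    return re.findall(r"[a-z]+", " ".join(selected).lower())


# Hypothetical sample changeset in unified-diff form.
example_diff = """diff --git a/app.py b/app.py
index 111..222 100644
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 def load_config(path):
-    return parse(path)
+    return parse_yaml(path)
"""

full_doc = changeset_document("Fix config loading", example_diff)
additions_only = changeset_document("Fix config loading", example_diff,
                                    parts=("additions",))
```

Documents built this way could then be fed, one changeset at a time, to a topic model that supports incremental updates, which is what would avoid retraining from a fresh snapshot. Varying `parts` is one simple way to compare configurations such as those with and without context lines and messages.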

Updated: 2020-10-01