Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities, Journal of Computer Science and Technology


Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities
Journal of Computer Science and Technology ( IF 1.9 ) Pub Date : 2021-01-30 , DOI: 10.1007/s11390-020-9935-1
Zhi-Xing Li , Yue Yu , Tao Wang , Gang Yin , Xin-Jun Mao , Huai-Min Wang

Communication and coordination among open source software (OSS) developers who do not work in the same physical location have always been challenging. The pull-based development model, the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors’ work. However, because the model is parallel and uncoordinated, more than one contributor may still submit duplicate contributions that solve the same problem. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach that combines textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a newly arriving contribution, we first compute the textual similarity and the change similarity between it and each existing contribution. Our method then returns a list of candidate duplicate contributions that are most similar to the new contribution in terms of the combined textual and change similarity. The evaluation shows that, on average, 83.4% of the duplicates can be found using the combined textual and change similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
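The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two similarities are computed or combined, so everything concrete here is an assumption made for illustration: textual similarity is taken as TF-IDF cosine similarity over the title and description, change similarity as Jaccard overlap of the changed file paths, and the combination as a weighted average. The `PullRequest` record, the `rank_candidates` function, and the `top_k` and `text_weight` parameters are hypothetical names, not the authors' implementation.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical record; field names are illustrative, not from the paper.
    number: int
    text: str                 # title plus description
    changed_files: frozenset  # paths touched by the change


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(new_pr, existing, top_k=5, text_weight=0.5):
    """Return up to top_k (combined_score, pr_number) pairs, best first.

    Assumed measures: TF-IDF cosine for text, Jaccard on changed files,
    and a weighted average to combine them.
    """
    # Term frequencies for the new pull request followed by the existing ones.
    docs = [Counter(tokenize(pr.text)) for pr in [new_pr, *existing]]
    n = len(docs)
    # Document frequency: iterating a Counter yields each term once per doc.
    df = Counter(t for d in docs for t in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]

    scored = []
    for pr, vec in zip(existing, vecs[1:]):
        textual = cosine(vecs[0], vec)
        change = jaccard(new_pr.changed_files, pr.changed_files)
        combined = text_weight * textual + (1 - text_weight) * change
        scored.append((combined, pr.number))

    # Highest score first; ties broken by the lower pull-request number.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:top_k]
```

Calling `rank_candidates(new_pr, existing_prs)` at submission time would give a reviewer a short list of likely duplicates to inspect. The equal weighting is only a placeholder; in practice the weight would need to be tuned on labeled duplicate pairs.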


Updated: 2021-02-07
