Same-Same But Different: On Understanding Duplicates in Stack Overflow
Informatik Spektrum, Pub Date: 2019-07-17, DOI: 10.1007/s00287-019-01185-y
Mathias Ellmann

Stack Overflow (SO) is one of the most popular online sites for asking and answering developers' questions. New posts that cover exactly the same knowledge as previously posted questions are closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insights into their properties and content and to understand how the community distinguishes useful redundant knowledge from useless redundant knowledge (i.e., knowledge to be deleted). We identified several interesting trends. Unique questions are significantly longer than others. Original questions are answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25% of the question text and 40% of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6% repeated text.

Updated: 2019-07-17