Source Code Comments: Overlooked in the Realm of Code Clone Detection

arXiv - CS - Software Engineering, Pub Date: 2020-06-25, DOI: arxiv-2006.14505
Sandeep Kaur Kuttal, Akash Ghosh

Reusing code can produce duplicate or near-duplicate code clones in code repositories. Current code clone detection techniques, such as those based on Program Dependence Graphs, rely on code structure and its dependencies to detect clones. These techniques are expensive, consuming large amounts of processing power, time, and memory. In practice, programmers often rely on code comments to comprehend and reuse code, as comments carry important domain knowledge. Yet current clone detection techniques ignore code comments, mainly because of the ambiguity of the English language. Recent advances in information retrieval may make it possible to use code comments for clone detection. We investigated this by empirically comparing the accuracy of detecting clones using only comments against using only source code (without comments) on the JHotDraw package, which contains 315 classes and 27K lines of code. To detect clones at the file level, we used a topic modeling technique, Latent Dirichlet Allocation, to analyze code comments, and GRAPLE, which uses Program Dependence Graphs, to analyze code. Our results show 94.86% recall and 84.21% precision with Latent Dirichlet Allocation, and 28.7% recall and 55.39% precision with GRAPLE. We found that Latent Dirichlet Allocation generated false positives where programs lacked quality comments. This limitation can be addressed with a hybrid approach: using code comments at the file level to reduce the clone set, then applying Program Dependence Graph-based techniques at the method level to detect precise clones. Our further analysis across Java and Python packages, Java Swing and PyGUI, found a recall of 74.86% and a precision of 84.21%. Our findings call for reexamining the assumptions regarding the use of code comments in current clone detection techniques.
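To make the file-level idea concrete, here is a minimal Python sketch of comparing files by their comments alone and then scoring the result with precision and recall. It is not the authors' method: a plain bag-of-words cosine similarity stands in for Latent Dirichlet Allocation, and the Program Dependence Graph step of the hybrid approach is not shown. The function names, the 0.5 threshold, and the toy file contents are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Naive matcher for Java-style comments: /* ... */ blocks (including Javadoc)
# and // line comments. It does not handle comment markers inside string literals.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
WORD_RE = re.compile(r"[a-z]{3,}")  # lowercase words of three or more letters


def comment_words(source):
    """Return a bag of words drawn only from the comments in `source`."""
    comments = " ".join(COMMENT_RE.findall(source))
    return Counter(WORD_RE.findall(comments.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count bags (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_clone_pairs(files, threshold=0.5):
    """Return file pairs whose comment similarity reaches `threshold` (assumed value)."""
    bags = {name: comment_words(src) for name, src in files.items()}
    return {
        frozenset((x, y))
        for x, y in combinations(sorted(bags), 2)
        if cosine(bags[x], bags[y]) >= threshold
    }


def precision_recall(predicted, actual):
    """Precision and recall of predicted clone pairs against a ground-truth set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented toy input: file names and contents are hypothetical.
files = {
    "CircleFigure.java": "/** Draws a circle figure on the canvas. */\nclass CircleFigure {}",
    "EllipseFigure.java": "// draws an ellipse figure on the canvas\nclass EllipseFigure {}",
    "UndoManager.java": "/* Keeps the undo history of edits. */\nclass UndoManager {}",
}
predicted = candidate_clone_pairs(files)
```

On this toy input only the two figure files share enough comment vocabulary to be paired. In a hybrid setup of the kind the abstract describes, such file-level pairs would then be handed to a more expensive method-level detector.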


Updated: 2020-06-26
