Retrieving and classifying instances of source code plagiarism, Information Retrieval Journal



Retrieving and classifying instances of source code plagiarism
Information Retrieval Journal (IF 1.7), Pub Date: 2017-09-13, DOI: 10.1007/s10791-017-9313-y
Debasis Ganguly, Gareth J. F. Jones, Aarón Ramírez-de-la-Cruz, Gabriela Ramírez-de-la-Rosa, Esaú Villatoro-Tello

Automatic detection of source code plagiarism is an important research field for both the commercial software industry and the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well to large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to retrieve many false positives, because documents share identical programming-language-specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) to the top-ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST-based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST-based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier is shown to filter the ranked list effectively and thus further improve the list of retrieved candidate plagiarized documents.
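The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which AST features or ranking function they use, so the sketch assumes a bag of AST node-type names, TF-IDF weighting, and cosine similarity. It also parses Python source with the standard-library `ast` module for convenience, whereas the paper works on Java files. The function names (`ast_terms`, `tfidf_vectors`, `rank_candidates`) and the toy corpus are hypothetical. The supervised classifier stage is omitted.

```python
from __future__ import annotations

import ast
import math
from collections import Counter


def ast_terms(source: str) -> Counter:
    """Bag of AST node-type names for one file (an assumed representation)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def tfidf_vectors(docs: dict[str, Counter]) -> dict[str, dict[str, float]]:
    """Weight terms by tf * idf so node types found in every file count less."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in terms)
    return {
        name: {t: tf * math.log(1.0 + n / df[t]) for t, tf in terms.items()}
        for name, terms in docs.items()
    }


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query: str, vectors: dict[str, dict[str, float]], k: int = 10):
    """Use one document as a pseudo-query and rank all others against it."""
    scores = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus: b.py is a.py with renamed identifiers, c.py is unrelated.
corpus = {
    "a.py": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "b.py": "def add_all(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n",
    "c.py": "class Greeter:\n    def hello(self, name):\n        print('hi ' + name)\n",
}
vectors = tfidf_vectors({name: ast_terms(src) for name, src in corpus.items()})
print(rank_candidates("a.py", vectors, k=2))
```

Because identifier names do not appear among the node types, the renamed copy ranks above the unrelated file, which a bag of raw tokens would not guarantee. Node-type counts discard tree structure, so a real system would likely need richer AST features than this sketch uses.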


Updated: 2017-09-13
