Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation

Informatics in Education (IF 2.1), Pub Date: 2019-10-16, DOI: 10.15388/infedu.2019.15
Oscar KARNALIM, Setia BUDI, Hapnes TOBA, Mike JOY

Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to handle it. Comparing these techniques is challenging, however, because each is evaluated on its own private dataset(s). This paper contributes a public dataset for comparing them. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. It consists of 467 source code files covering seven introductory programming assessment tasks. Unique to this dataset, both the intention to plagiarise and advanced plagiarism attacks are considered in its construction. The dataset’s characteristics were observed by comparing three IR-based detection techniques, and it is clear that most IR-based techniques are less effective than a baseline technique that relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
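To make the IR perspective concrete, the following is a minimal sketch of how a suspected file could be used as a query and the remaining files ranked by similarity. It is not one of the techniques evaluated in the paper: the crude regex tokenizer, the choice of token bigrams, cosine similarity as the score, and the names `tokenize`, `ngram_counts`, `cosine`, and `rank` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude lexer (an assumption): identifiers/keywords, digit runs,
    # and any other single non-whitespace character. Whitespace is ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def ngram_counts(tokens, n=2):
    # Bag of token n-grams with their frequencies.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    # Cosine similarity between two n-gram frequency vectors.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, corpus, n=2):
    # Score every file in the corpus against the query, most similar first.
    q = ngram_counts(tokenize(query), n)
    scored = [(name, cosine(q, ngram_counts(tokenize(src), n)))
              for name, src in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy submissions, not taken from the dataset.
    query = "int sum = 0; for (int i = 0; i < n; i++) sum += i;"
    corpus = {
        "copy": "int sum=0;\nfor(int i=0;i<n;i++)\n  sum+=i;",
        "other": "print('hello world')",
    }
    for name, score in rank(query, corpus):
        print(name, round(score, 3))
```

A vector-space score of this kind is cheap to compute, which is in line with the time-efficiency remark above, but it ignores token order beyond the n-gram window, whereas a string-tiling baseline searches for long shared token runs.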

Last updated: 2019-10-16
