Flowchart-Based Cross-Language Source Code Similarity Detection, Scientific Programming

Flowchart-Based Cross-Language Source Code Similarity Detection
Scientific Programming, Pub Date: 2020-12-17, DOI: 10.1155/2020/8835310
Feng Zhang¹,², Guofan Li¹, Cong Liu³, Qian Song¹

Source code similarity detection has applications in code plagiarism detection and software intellectual property protection. In programming courses, students may convert source code written in one programming language into another language and submit it as their assignment. Similarity measures designed for source code written in a single language do not apply to cross-language code similarity detection because of the syntactic differences among programming languages. Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing a control structure with an equivalent one and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. An SCFC is language-independent and can therefore serve as the intermediate structure for source code similarity detection. We also give transformation techniques that convert source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD achieves higher accuracy in cross-language source code similarity detection. Furthermore, CLCSD can not only handle the common source code obfuscation techniques used by students in programming courses but also achieves nearly 90% accuracy on some complex obfuscation techniques.
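
To make the idea of comparing flowcharts with a shortest path graph kernel more concrete, the following is a minimal sketch of a generic shortest-path kernel over small labeled directed graphs. It is not the SCFC-SPGK algorithm from the paper, whose details are not given in the abstract. The node labels (`start`, `process`, `decision`, `end`), the toy graphs, and the function names `shortest_path_features`, `sp_kernel`, and `similarity` are all assumptions made for illustration.

```python
from collections import Counter, deque
from math import sqrt


def shortest_path_features(nodes, edges):
    """Count (source label, target label, distance) triples.

    nodes: dict mapping node id -> label (hypothetical flowchart node types)
    edges: list of directed (u, v) pairs, each with unit length
    """
    adjacency = {n: [] for n in nodes}
    for u, v in edges:
        adjacency[u].append(v)

    features = Counter()
    for start in nodes:
        # Breadth-first search gives shortest distances for unit-length edges.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adjacency[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for target, d in dist.items():
            if target != start:
                features[(nodes[start], nodes[target], d)] += 1
    return features


def sp_kernel(f1, f2):
    """Kernel value: number of matching shortest-path triples across both graphs."""
    return sum(count * f2[key] for key, count in f1.items())


def similarity(g1, g2):
    """Cosine-normalized kernel, so a graph compared with itself scores 1.0."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    denom = sqrt(sp_kernel(f1, f1) * sp_kernel(f2, f2))
    return sp_kernel(f1, f2) / denom if denom else 0.0


# Toy flowchart of a loop, e.g. a `for` loop in one language (assumed shape).
loop_a = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 2), (2, 3), (3, 2), (2, 4)],
)
# The same control flow from a `while` loop in another language, different node ids.
loop_b = (
    {"s": "start", "init": "process", "test": "decision", "body": "process", "e": "end"},
    [("s", "init"), ("init", "test"), ("test", "body"), ("body", "test"), ("test", "e")],
)
# Straight-line code with no loop.
straight = (
    {0: "start", 1: "process", 2: "process", 3: "end"},
    [(0, 1), (1, 2), (2, 3)],
)

if __name__ == "__main__":
    print(similarity(loop_a, loop_b))
    print(similarity(loop_a, straight))
```

In this sketch, the two loop flowcharts have identical structure and labels and therefore receive a similarity of 1.0 even though their node identifiers differ, while the straight-line flowchart scores strictly lower. This is the kind of language independence the abstract attributes to an intermediate flowchart representation, shown here only on hand-built graphs.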

Updated: 2020-12-17
