SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification, IEEE Transactions on Software Engineering



IEEE Transactions on Software Engineering (IF 6.5), Pub Date: 2022-05-25, DOI: 10.1109/tse.2022.3177228
Weihan Ou₁, Steven H.H. Ding₁, Yuan Tian₂, Leo Song₁


In recent years, the number of anonymous script-based fileless malware attacks and software copyright disputes has increased rapidly. In the literature, automated Code Authorship Analysis (CAA) techniques have been proposed to reduce the manual effort of investigating those attacks and disputes. Most CAA techniques aim to solve the task of Authorship Attribution (AA), i.e., identifying the actual author of a source code fragment from a given set of candidate authors. However, in many real-world scenarios, investigators do not have a predefined candidate set that contains the actual author at the time of investigation, which contradicts AA's assumption. Additionally, existing AA techniques ignore the influence of code functionality when identifying authorship, which leads to biased matching based simply on code functionality. Different from AA, the task of (extreme) Authorship Verification (AV) is to decide whether two texts were written by the same person. AV techniques do not need a predefined author set and thus could be applied in more code authorship-related applications than AA. To our knowledge, no previous work has attempted to solve the AV problem for source code. To fill the gap, we propose a novel adversarial neural network, namely SCS-Gan, that can learn a stylometric representation of code for automated AV. With the multi-head attention mechanism, SCS-Gan focuses on the code parts that are most informative regarding personal styles and generates functionality-agnostic stylometric representations through adversarial training. We benchmark SCS-Gan and two state-of-the-art code representation models on four out-of-sample datasets collected from a real-world programming competition. Our experimental results show that SCS-Gan outperforms the baselines on all four out-of-sample datasets.
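
To make the AV setting described in the abstract concrete, the following is a minimal sketch of verification as a thresholded similarity between two style vectors. It is not SCS-Gan: the hand-picked layout and naming features stand in for a learned representation, and they make no attempt to be functionality-agnostic. The names `style_features`, `cosine`, and `same_author`, the sample snippets, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math
import re


def style_features(code):
    """Toy hand-crafted style vector (an assumed stand-in for a learned embedding)."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    m = max(len(idents), 1)
    return [
        sum(ln.startswith("\t") for ln in lines) / n,   # tab-indented lines
        sum(ln.startswith("  ") for ln in lines) / n,   # space-indented lines
        sum("_" in w for w in idents) / m,              # snake_case identifiers
        sum(any(c.isupper() for c in w[1:]) for w in idents) / m,  # camelCase-like
        sum(ln.rstrip().endswith("{") and ln.strip() != "{" for ln in lines) / n,  # brace ends a line
        sum(ln.strip() == "{" for ln in lines) / n,     # brace on its own line
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_author(code_a, code_b, threshold=0.8):
    """Verification decision: no candidate author set is needed, only the pair."""
    return cosine(style_features(code_a), style_features(code_b)) >= threshold


# Hypothetical snippets: a and b share a style but differ in functionality;
# c has the same functionality as a but a different style.
sample_a = (
    "int sum_all(int n) {\n"
    "    int total_sum = 0;\n"
    "    return total_sum;\n"
    "}\n"
)
sample_b = (
    "int max_of(int a_val, int b_val) {\n"
    "    int best_val = a_val;\n"
    "    return best_val;\n"
    "}\n"
)
sample_c = (
    "int sumAll(int n)\n"
    "{\n"
    "\tint totalSum = 0;\n"
    "\treturn totalSum;\n"
    "}\n"
)

print(same_author(sample_a, sample_b))  # True
print(same_author(sample_a, sample_c))  # False
```

The decision function takes only a pair of code fragments, which is the property that distinguishes AV from AA. A learned model would replace `style_features` with an encoder, and the threshold would normally be tuned on held-out pairs rather than fixed by hand.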


Updated: 2024-08-26
