Large-scale and Robust Code Authorship Identification with Deep Feature Learning, ACM Transactions on Privacy and Security


ACM Transactions on Privacy and Security (IF 3.0), Pub Date: 2021-07-19, DOI: 10.1145/3461666
Mohammed Abuhamad, Tamer Abuhmed, David Mohaisen, Daehun Nyang


Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is very challenging because software comes in many formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: the number of samples per author and the size of each sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach learns deep authorship attribution features using a recurrent neural network and uses an ensemble random forest classifier for scalability to de-anonymize programmers. Comprehensive experiments evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach identifies 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics: it can identify authors across four programming languages (C, C++, Java, and Python) as well as authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves an accuracy of 95.74% in identifying 1,500 programmers of software binaries. Similar results were obtained when the binaries were generated with different compilation options and optimization levels, and with symbol information removed. Moreover, our approach achieves 93.86% in identifying 1,500 programmers of obfuscated binaries using all features adopted in the Obfuscator-LLVM tool.
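
The two-stage design described in the abstract (a learned feature representation followed by an ensemble classifier) can be illustrated with a toy sketch. This is not the authors' implementation, and the stand-ins are deliberate simplifications: hashed character-trigram frequencies replace the features learned by the recurrent neural network, and a bagged random-subspace ensemble of nearest-centroid classifiers replaces the random forest. All names (`DIM`, `featurize`, `train_ensemble`, `predict`) and the sample corpus are hypothetical.

```python
import random
import zlib
from collections import Counter

DIM = 32  # length of the fixed-size feature vector (an arbitrary choice)


def featurize(code):
    """Hash character trigrams into a normalised frequency vector.

    Stand-in for the learned deep representation; crc32 keeps it deterministic.
    """
    vec = [0.0] * DIM
    for i in range(len(code) - 2):
        vec[zlib.crc32(code[i:i + 3].encode("utf-8")) % DIM] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def train_member(samples, rng, n_feats):
    """Fit one nearest-centroid member on a random feature subset."""
    feats = rng.sample(range(DIM), n_feats)
    by_author = {}
    for author, vec in samples:
        by_author.setdefault(author, []).append(vec)
    centroids = {}
    for author, vecs in by_author.items():
        # Stratified bootstrap: resample each author's vectors with replacement.
        boot = [rng.choice(vecs) for _ in vecs]
        centroids[author] = [sum(v[j] for v in boot) / len(boot) for j in feats]
    return feats, centroids


def train_ensemble(labelled_code, n_members=15, n_feats=8, seed=0):
    """Train the ensemble from (author, source_code) pairs."""
    rng = random.Random(seed)
    samples = [(author, featurize(code)) for author, code in labelled_code]
    return [train_member(samples, rng, n_feats) for _ in range(n_members)]


def predict(ensemble, code):
    """Return the author that receives the most member votes."""
    vec = featurize(code)
    votes = Counter()
    for feats, centroids in ensemble:
        def dist(author):
            return sum((vec[j] - c) ** 2 for j, c in zip(feats, centroids[author]))
        votes[min(centroids, key=dist)] += 1
    return votes.most_common(1)[0][0]


corpus = [
    ("alice", "for (int i = 0; i < n; ++i) { sum += a[i]; }"),
    ("alice", "for (int j = 0; j < m; ++j) { total += b[j]; }"),
    ("bob", "int k=0;while(k<n){s=s+a[k];k=k+1;}"),
    ("bob", "int t=0;while(t<m){r=r+b[t];t=t+1;}"),
]
model = train_ensemble(corpus)
guess = predict(model, "for (int q = 0; q < n; ++q) { acc += v[q]; }")
```

The split mirrors the scalability argument in the abstract: the representation is computed once per sample, while the ensemble is cheap to retrain as authors are added. A realistic system would replace both stand-ins with trained models.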


Updated: 2021-07-19
