Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned,arXiv - CS - Software Engineering

arXiv - CS - Software Engineering Pub Date : 2020-11-21 , DOI: arxiv-2011.10749
Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, Yongdae Kim

Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, conducting new research in this field is significantly challenging for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they use their own benchmarks, sharing neither the source code nor the entire dataset. Finally, researchers often use different terminology, or even use the same technique without properly citing the previous literature, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or feature show better results than others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve results comparable to those of recent deep learning-based approaches. Furthermore, we show that the way binaries are compiled and the correctness of the underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.

Updated: 2020-11-25
