Improving malicious PDF classifier with feature engineering: A data-driven approach,Future Generation Computer Systems


Future Generation Computer Systems (IF 7.5), Pub Date: 2020-09-22, DOI: 10.1016/j.future.2020.09.015
Ahmed Falah, Lei Pan, Shamsul Huda, Shiva Raj Pokhrel, Adnan Anwar

Several approaches and tools have been developed to analyse and detect malicious content in PDF files; however, the fundamental approach to designing these tools and techniques has not been fully thought through. Existing tools are built around the available datasets and the observations made during manual maldoc analysis, which leaves them susceptible to attacks such as mimicry and parser confusion. We aim to enhance PDF maldoc classification by identifying the most conclusive feature set required to classify PDF maldocs accurately. We extract features using two popular PDF analysis tools and derive an additional set of data-backed features that complements classification. We then evaluate all features through a wrapper function. The features with the highest importance values are used to construct a classifier that outperforms the baseline models in both classification accuracy and efficiency. The proposed method identifies a useful set of tool-independent features that prolongs the lifespan and usability of current tools, and it provides an in-depth understanding of how the chosen features cumulatively affect classification. In addition, we evaluate our findings using real-world samples from VirusTotal. Using the proposed technique, we reduced the size of the feature set by more than 60% while increasing classification accuracy by around 2%.
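The abstract does not say which wrapper algorithm or classifier the authors used, so the following is only a minimal sketch of the general idea of wrapper-based feature selection. It assumes greedy forward selection around a nearest-centroid classifier, and it runs on synthetic stand-in data rather than PDF features. Every name in it (make_data, centroid_accuracy, forward_select, min_gain) is hypothetical and chosen for illustration.

```python
import random
from statistics import mean


def make_data(n=200, seed=0):
    """Synthetic stand-in data (not PDF features): columns 0 and 1 carry
    class signal, columns 2-5 are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(2.0 * label, 1.0), rng.gauss(-2.0 * label, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(4)]
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Accuracy of a nearest-centroid classifier that only sees `cols`."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]
    correct = 0
    for x, t in zip(X_test, y_test):
        dists = {
            label: sum((x[c] - m) ** 2 for c, m in zip(cols, cen))
            for label, cen in centroids.items()
        }
        correct += min(dists, key=dists.get) == t
    return correct / len(y_test)


def forward_select(X, y, min_gain=0.01):
    """Greedy wrapper: repeatedly add the feature that most improves
    held-out accuracy; stop when the gain falls below `min_gain`."""
    split = len(X) // 2
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = {
            c: centroid_accuracy(X_tr, y_tr, X_te, y_te, selected + [c])
            for c in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best < min_gain:
            break
        selected.append(candidate)
        best = scores[candidate]
        remaining.remove(candidate)
    return selected, best


X, y = make_data()
selected, best = forward_select(X, y)
print("selected feature indices:", selected)
print("held-out accuracy:", round(best, 3))
```

The point of the wrapper design is that features are scored by the classifier's own held-out performance rather than by a standalone statistic, so a feature is kept only if it helps the model that will actually be deployed. A single train/test split is used here for brevity; cross-validation would be the more careful choice.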


Updated: 2020-09-24
