Optimal feature configuration for dynamic malware detection, Computers & Security



Optimal feature configuration for dynamic malware detection
Computers & Security (IF 4.8) Pub Date: 2021-02-27, DOI: 10.1016/j.cose.2021.102250
David Escudero García, Noemí DeCastro-García

Applying machine learning techniques to malware detection is a common approach to try to overcome the limitations of signature-based methods. However, it is difficult to engineer a set of features that characterizes the samples properly, especially when various file types may be a vector of infection. In this work, we configure several feature sets for dynamic malware detection extracted from API calls, including an alternative scheme grouping calls in categories, network activity, signatures from the Cuckoo sandbox report, and some interactions with the file system and registry. We test combinations of these feature sets to ascertain whether they are good enough to distinguish between benign and malicious samples from a dataset containing several file types, obtained from public sources. We apply statistical inference to measure the differences in the performance between the feature sets, and the hyperparameter optimization algorithms applied to construct the models. We also unbalance the datasets to evaluate the model performance on more realistic scenarios in which not many malware samples are available. Although all studied feature configurations provide accuracies greater than 0.98, and several of them a Matthews correlation coefficient greater than 0.95 in the unbalanced datasets, statistically meaningful differences appear, so we analyze the results to determine which is the optimal set of features. We obtain a model that achieves an accuracy of 0.9937 in the balanced dataset and a Matthews correlation coefficient of 0.964 in the unbalanced dataset with 5% of malware.
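The abstract reports accuracy on the balanced dataset but the Matthews correlation coefficient (MCC) on the unbalanced one. A minimal sketch of why that distinction matters is given here, using only the standard definitions of both metrics. The toy labels, the 1000-sample size, and the two detectors are hypothetical and are not taken from the paper; only the 5% malware proportion mirrors the abstract.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = malware, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical unbalanced dataset: 1000 samples, 5% malware.
y_true = [1] * 50 + [0] * 950

# Hypothetical detector A: labels every sample as benign.
all_benign = [0] * 1000

# Hypothetical detector B: misses 5 malware samples, raises 5 false alarms.
detector_b = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 945

for name, y_pred in [("all-benign", all_benign), ("detector B", detector_b)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(*counts):.4f}, MCC={mcc(*counts):.4f}")
```

In this toy setting the all-benign detector still scores 0.95 accuracy while its MCC is 0, which is one reason MCC is a reasonable choice when malware samples are scarce.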


Updated: 2021-03-15
