Mutual Information and Feature Importance Gradient Boosting: Automatic byte n-gram feature reranking for Android malware detection
Software: Practice and Experience (IF 3.5), Pub Date: 2021-04-05, DOI: 10.1002/spe.2971
Mahmood Yousefi‐Azar, Vijay Varadharajan, Len Hamey, Shiping Chen

The rapid evolution of Android malware demands highly efficient detection strategies. That is, a malware detection scheme needs to be resilient across a range of malware types and to perform efficiently and precisely with minimal computation. In this paper, we propose Mutual Information and Feature Importance Gradient Boosting (MIFIBoost), a tool that uses byte n-gram frequency. MIFIBoost consists of four steps in the model construction phase and two steps in the prediction phase. For training, first, n-grams of both the classes.dex and AndroidManifest.xml binary files are obtained. Then, MIFIBoost uses Mutual Information (MI) to determine the most informative items from the entire n-gram vocabulary. In the third step, MIFIBoost uses the Gradient Boosting algorithm to re-rank these top n-grams. For testing, MIFIBoost computes the term frequency (tf) of the learned byte n-gram vocabulary and feeds it into the classifier for prediction. Thus, MIFIBoost does not require reverse engineering. A key insight from this work is that filtering with XGBoost improves detection of obfuscated malware, which is a hard problem, while having a negligible impact on nonobfuscated malware. We have conducted a wide range of experiments on four different datasets, one of which is obfuscated, and MIFIBoost outperforms state-of-the-art tools. MIFIBoost's F1-score on the Drebin, DexShare, and AMD datasets is 99.1%, 98.87%, and 99.62%, respectively, with a False Positive Rate of 0.41% on the AMD dataset. On average, the False Negative Rate of MIFIBoost is 2.1% on the PRAGuard dataset, in which seven different obfuscation techniques are implemented. In addition to fast run-time performance and resilience against obfuscated malware, the experiments show that MIFIBoost performs well on five zero-day families, with 99.78% AUC.
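The first two training steps and the tf lookup at prediction time can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of n = 2 in the usage example, and scoring MI on n-gram presence rather than on raw frequency are all assumptions made for illustration. The Gradient Boosting re-ranking step is omitted here because it depends on an external library.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every contiguous n-byte window in data (term frequency)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def mutual_information(present: list[bool], labels: list[int]) -> float:
    """MI in bits between a binary presence feature and a class label."""
    total = len(labels)
    joint = Counter(zip(present, labels))
    feat_counts = Counter(present)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_xy = count / total
        p_x = feat_counts[f] / total
        p_y = label_counts[y] / total
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_k_ngrams(samples: list[bytes], labels: list[int], n: int, k: int) -> list[bytes]:
    """Rank the whole n-gram vocabulary by MI and keep the k best.

    Assumption: MI is computed on presence/absence of each n-gram.
    Ties are broken by byte order so the result is deterministic.
    """
    counts = [byte_ngrams(s, n) for s in samples]
    vocab = set().union(*counts)
    scored = []
    for gram in vocab:
        present = [gram in c for c in counts]
        scored.append((mutual_information(present, labels), gram))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [gram for _, gram in scored[:k]]


def tf_vector(data: bytes, vocab: list[bytes], n: int) -> list[int]:
    """Term-frequency vector of data over a previously learned vocabulary."""
    counts = byte_ngrams(data, n)
    return [counts[g] for g in vocab]


# Toy usage: label 1 = malware, 0 = benign (made-up byte strings).
samples = [b"\x01\x02\x03", b"\x01\x02\x04", b"\x09\x02\x03", b"\x09\x02\x04"]
labels = [1, 1, 0, 0]
vocab = top_k_ngrams(samples, labels, n=2, k=2)
features = tf_vector(b"\x01\x02\x01\x02", vocab, n=2)
```

In this toy data the n-grams that separate the two classes perfectly receive the highest MI score, while n-grams shared evenly across both classes score zero and are dropped. A real pipeline would then pass the resulting tf vectors to a gradient boosting classifier and re-rank the retained n-grams by its feature importance.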

Updated: 2021-06-07