Using fake text vectors to improve the sensitivity of minority class for macro malware detection
Journal of Information Security and Applications (IF 5.6), Pub Date: 2020-08-19, DOI: 10.1016/j.jisa.2020.102600
Mamoru Mimura

To detect new malware, machine learning approaches require many training samples, which contribute to building an accurate model. To maintain accuracy, it is very important to collect comprehensive samples continuously. However, new malicious samples appear one after another, which makes this difficult. Hence, a small set of actual training samples is unlikely to represent the entire population adequately. Despite this gap between the ideal and reality, few studies have addressed this practical problem for macro malware. In the field of image recognition, data augmentation is an effective way to enhance small training sets, and data augmentation with Generative Adversarial Networks (GANs) is a reasonable approach for oversampling the minority class. A major difficulty with GANs is generating fake samples that represent the context. This paper attempts to generate fake text vectors with Paragraph Vector to enhance small training samples. Paragraph Vector is a model that converts text into vectors that represent context and numerical distance. These properties make it possible to vary each element of the vectors directly. Our method adds random noise to the vectors to generate fake text vectors that represent the context. This paper applies the technique to the detection of new malicious VBA (Visual Basic for Applications) macros to address the practical problem. This generic technique could be used not only for malware detection but also for any imbalanced and contextual data. To simulate small training samples, we reduce the number of malicious samples and generate fake samples from the reduced set. The experimental results show that the fake samples enhance our model and improve the detection rate.
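The noise-based oversampling step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state the noise distribution, its scale, or the number of fake samples per real sample, so the Gaussian noise, `noise_scale`, `n_fake_per_sample`, and the function name `make_fake_vectors` are all assumptions made for illustration. Random arrays stand in for the vectors that a Paragraph Vector model would produce from macro source text.

```python
import numpy as np


def make_fake_vectors(vectors, n_fake_per_sample=3, noise_scale=0.05, seed=0):
    """Return noisy copies of each row of `vectors`.

    Hypothetical helper: Gaussian noise and its scale are assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    # Repeat every real vector, then perturb each element independently.
    repeated = np.repeat(vectors, n_fake_per_sample, axis=0)
    noise = rng.normal(loc=0.0, scale=noise_scale, size=repeated.shape)
    return repeated + noise


# Stand-ins for Paragraph Vector output (assumed 8-dimensional here).
rng = np.random.default_rng(42)
benign = rng.normal(size=(50, 8))      # majority class
malicious = rng.normal(size=(5, 8))    # reduced minority class

# Oversample the minority class with fake vectors near the real ones.
fake = make_fake_vectors(malicious, n_fake_per_sample=9)

# Assemble a balanced training set: label 0 = benign, 1 = malicious.
X = np.vstack([benign, malicious, fake])
y = np.concatenate([np.zeros(len(benign)), np.ones(len(malicious) + len(fake))])
```

The resulting `X` and `y` could then be passed to any classifier. Because each fake vector stays close to a real malicious vector, it is expected to keep a similar position in the embedding space, which is the property the abstract relies on.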




Updated: 2020-08-19