Byte embeddings for file fragment classification, Future Generation Computer Systems



Byte embeddings for file fragment classification
Future Generation Computer Systems (IF 7.5), Pub Date: 2021-10-01, DOI: 10.1016/j.future.2021.09.019
Md Enamul Haque, Mehmet Engin Tozal


In digital forensics, file carving is the process of recovering files from a storage medium, in part or in whole, without any file system information. An important problem in file carving is identifying the type of each fragment. Many fragment classification studies in the literature employ inflexible and indiscernible feature selection methods, such as various statistics of byte frequency distributions. Moreover, the strengths and weaknesses of some approaches are difficult to assess because they are specific to certain file types, such as graphics. In this paper, we propose a novel feature generation model using byte embeddings (Byte2Vec), which maps fragments to dense vector representations. The proposed model extends the word2vec and doc2vec embedding models to bytes and fragments, respectively. We use Byte2Vec for feature extraction and $k$-Nearest Neighbors ($k$NN) for classification. We demonstrate the effectiveness of Byte2Vec+$k$NN in file fragment classification using a publicly available digital forensics dataset and a random web search dataset. Our experimental results show that Byte2Vec+$k$NN reaches an accuracy of 72%, along with 74% precision and 72% recall. Compared to other feature extraction techniques, such as n-grams, byte distributions, byte statistics, byte distances, and sparse dictionaries for byte n-grams, combined with different classifiers, Byte2Vec+$k$NN achieves absolute improvements of 3% in accuracy and 12% in precision.
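The pipeline described in the abstract (turn each fragment into a vector, then classify it by its nearest neighbours) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the Byte2Vec training details, so the featurizer below is a stand-in that uses a normalised byte-frequency histogram, one of the baseline feature families the abstract mentions. A trained Byte2Vec model would supply a dense embedding at that step instead. The names `fragment_vector` and `knn_predict`, the cosine similarity measure, the choice of `k = 3`, and the toy fragments are all assumptions made for illustration.

```python
import math
from collections import Counter


def fragment_vector(fragment: bytes) -> list:
    """Stand-in featurizer (assumption): unit-length 256-bin byte histogram.

    A trained Byte2Vec model would return a dense embedding here instead.
    """
    counts = Counter(fragment)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return [counts.get(b, 0) / norm for b in range(256)]


def cosine(u, v) -> float:
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))


def knn_predict(train, query: bytes, k: int = 3) -> str:
    """Majority vote among the k training vectors most similar to the query.

    `train` is a list of (vector, label) pairs.
    """
    q = fragment_vector(query)
    nearest = sorted(train, key=lambda item: cosine(item[0], q), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, hand-made fragments (hypothetical data, not the datasets from the paper).
text_frags = [
    b"the quick brown fox jumps over the lazy dog " * 12,
    b"file carving recovers files without metadata " * 12,
    b"digital forensics examines storage media " * 12,
]
# Deterministic byte sequences that cover all 256 values evenly.
binary_frags = [bytes((i * m + 11) % 256 for i in range(512)) for m in (37, 91, 173)]

train = [(fragment_vector(f), "text") for f in text_frags]
train += [(fragment_vector(f), "binary") for f in binary_frags]

text_query = b"fragment classification with nearest neighbours " * 10
binary_query = bytes((i * 59 + 3) % 256 for i in range(512))

print(knn_predict(train, text_query))    # text
print(knn_predict(train, binary_query))  # binary
```

Only `fragment_vector` would need to change to try a learned embedding; the nearest-neighbour vote works on whatever vectors it returns.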


Updated: 2021-10-10
