UMUDGA: a dataset for profiling DGA-based botnet, Computers & Security

UMUDGA: a dataset for profiling DGA-based botnet
Computers & Security (IF 4.8), Pub Date: 2020-05-01, DOI: 10.1016/j.cose.2020.101719
Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez

Abstract: Advanced botnet threats natively deploy concealment techniques to prevent detection and sinkholing. To tackle them, machine learning solutions have become a standard approach, especially when dealing with Algorithmically Generated Domain (AGD) names. Nevertheless, the machine learning state of the art is non-specialist at best, with multiple issues in terms of rigour, reproducibility and, ultimately, credibility. This research focuses on the first critical step of the training phase, that is, the collection of data suitable for analysis by algorithms. We have detected a common lack of scientific rigour in the literature on AGD analysis and, therefore, we put forward two major contributions in this article: i) a thorough analysis of the cyber panorama in terms of botnets that use Domain Generation Algorithms (DGAs) as evasive techniques, which flows into ii) a full-fledged, machine-learning-ready labelled dataset that features over 30 million AGDs sorted into 50 malware variant classes. This mature dataset aims to fill the gap in comparability between the different studies published in the literature. Lastly, two minor contributions are also included: iii) an exploratory analysis of the proposed dataset that provides both data characteristics and potential future research lines, which eventually yields iv) a collection of suggested guidelines. When proposing a machine learning solution, researchers should adhere to these guidelines in order to achieve scientific rigour.

Updated: 2020-05-01
