
A Close Look at a Daily Dataset of Malware Samples

Authors:
Xabier Ugarte-Pedrero, Cisco Systems, Inc.
Mariano Graziano, Cisco Systems, Inc.
Davide Balzarotti, Eurecom, France

ACM Transactions on Privacy and Security, Volume 22, Issue 1, Article No. 6, pp. 1–30. https://doi.org/10.1145/3291061

Published: 22 January 2019

Abstract

The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its various feeds alone. These files are automatically stored and processed to extract actionable information through static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and warrants the attention of a human expert.

To the best of our knowledge, nobody has systematically dissected these datasets to understand precisely what they contain. The security community generally dismisses the problem because of the alleged prevalence of uninteresting samples.

In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document what this gigantic dataset really contains. We present the filtering steps, and we discuss in detail how samples can be grouped according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources required to get to the bottom of the catch of the day.
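As a rough illustration of grouping samples by behavior to support manual verification, the sketch below buckets samples whose observed behaviors are identical and picks one representative per bucket for a human to inspect. It is not the article's method: the function names, the behavior labels, and the exact-match grouping criterion are all assumptions made for this example.

```python
from collections import defaultdict


def group_by_behavior(samples):
    """Group sample identifiers that share the exact same set of behaviors.

    `samples` maps a sample identifier to a list of behavior labels
    (hypothetical labels; a real sandbox report would be far richer).
    """
    groups = defaultdict(list)
    for sample_id, behaviors in samples.items():
        # frozenset ignores ordering and duplicates in the behavior list
        groups[frozenset(behaviors)].append(sample_id)
    return groups


def representatives(groups):
    """Return one sample per group, largest groups first."""
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [sorted(members)[0] for members in ordered]


# Hypothetical toy data: three samples, two distinct behavior profiles
samples = {
    "a1": ["create_file", "dns_query"],
    "b2": ["dns_query", "create_file"],
    "c3": ["write_registry"],
}
groups = group_by_behavior(samples)
reps = representatives(groups)
```

Under this assumption, an analyst would review only the entries in `reps` rather than every sample, which is the kind of reduction that makes a per-day human effort estimate possible.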


Index Terms

1. Security and privacy

Published in
ACM Transactions on Privacy and Security Volume 22, Issue 1
February 2019
226 pages
ISSN: 2471-2566
EISSN: 2471-2574
DOI: 10.1145/3287762
Editor: David Basin, ETH Zurich, Switzerland
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 January 2019
- Revised: 1 October 2018
- Accepted: 1 October 2018
- Received: 1 January 2018
Author Tags
Malware
classification
measurement
prioritization
Qualifiers
- research-article
- Research
- Refereed