A Close Look at a Daily Dataset of Malware Samples

Published: 22 January 2019

Abstract

The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its feeds alone. These files are automatically stored and processed to extract actionable information through static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and warrants the attention of a human expert.

To the best of our knowledge, nobody has systematically dissected these datasets to understand precisely what they contain. The security community generally dismisses the problem because of the alleged prevalence of uninteresting samples.

In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document what this gigantic dataset really contains. We present the filtering steps, and we discuss in detail how samples can be grouped according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources required to get to the bottom of the catch of the day.

        • Published in

          ACM Transactions on Privacy and Security  Volume 22, Issue 1
          February 2019
          226 pages
          ISSN:2471-2566
          EISSN:2471-2574
          DOI:10.1145/3287762
          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 January 2019
          • Revised: 1 October 2018
          • Accepted: 1 October 2018
          • Received: 1 January 2018

          Qualifiers

          • research-article
          • Research
          • Refereed
