Abstract
The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its feeds alone. These files are automatically stored and processed to extract actionable information through static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and ever receives the attention of a human expert.
To the best of our knowledge, nobody has systematically dissected these datasets to understand precisely what they contain. The security community generally dismisses the problem because of the alleged prevalence of uninteresting samples.
In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document what this gigantic dataset really contains. We present the filtering steps and discuss in detail how samples can be grouped according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources required to get to the bottom of the catch of the day.