Digital Waste Disposal: an automated framework for analysis of spam emails

  • Regular Contribution
International Journal of Information Security

Abstract

Automated analysis and classification of spam emails is a challenging task that is vital for identifying botnet structures and fighting cybercrime. In this work, we propose an automated methodology, and the resulting framework, based on an innovative categorical divisive clustering algorithm that is used both to group and to classify spam messages. Grouping identifies campaigns of similar spam emails, while classification labels individual emails according to the spammer's goal (e.g., phishing, malware distribution, or advertisement). This work introduces the CCTree algorithm as both a clustering and a classification algorithm, in two operating modes, batch and dynamic, to handle both large data sets and data streams. The CCTree is then applied to large sets of spam emails for campaign identification and labeling. The performance of the algorithm is reported for both clustering and classification, and the batch and dynamic approaches are compared and discussed.
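
To make the idea of categorical divisive clustering concrete, the following is a minimal sketch and not the authors' CCTree implementation. The abstract does not state the split or stopping rules, so the sketch assumes that a node is split on the attribute with the highest Shannon entropy and becomes a leaf when it is small enough or pure enough. The function names, the parameters `max_impurity` and `min_size`, and the email features are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def divisive_cluster(records, attributes, max_impurity=0.0, min_size=1):
    """Top-down clustering of categorical records (assumed entropy criterion).

    records:    list of dicts mapping attribute name -> categorical value
    attributes: attribute names considered for splitting
    Returns a list of clusters, each a list of records.
    """
    # Impurity of the node: sum of per-attribute entropies (an assumption).
    impurity = {a: entropy([r[a] for r in records]) for a in attributes}
    if len(records) <= min_size or sum(impurity.values()) <= max_impurity:
        return [records]  # leaf: small enough or pure enough

    # Split on the most heterogeneous attribute, one child per value.
    split_attr = max(attributes, key=lambda a: impurity[a])
    children = {}
    for r in records:
        children.setdefault(r[split_attr], []).append(r)

    clusters = []
    for child in children.values():
        clusters.extend(
            divisive_cluster(child, attributes, max_impurity, min_size)
        )
    return clusters


# Hypothetical categorical features extracted from six spam emails.
emails = [
    {"language": "en", "attachment": "none", "links": "many"},
    {"language": "en", "attachment": "none", "links": "many"},
    {"language": "en", "attachment": "zip", "links": "few"},
    {"language": "en", "attachment": "zip", "links": "few"},
    {"language": "it", "attachment": "none", "links": "few"},
    {"language": "it", "attachment": "none", "links": "few"},
]

campaigns = divisive_cluster(emails, ["language", "attachment", "links"])
for campaign in campaigns:
    print(len(campaign), campaign[0])
```

With `max_impurity=0.0` the recursion continues until every leaf contains only identical records, so each leaf can be read as one candidate campaign. Raising `max_impurity` or `min_size` would stop the splitting earlier and yield coarser groups.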

Funding

This study was funded by H2020 C3ISP Project (GA 700294).

Author information

Corresponding author

Correspondence to Andrea Saracino.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sheikhalishahi, M., Saracino, A., Martinelli, F. et al. Digital Waste Disposal: an automated framework for analysis of spam emails. Int. J. Inf. Secur. 19, 499–522 (2020). https://doi.org/10.1007/s10207-019-00470-x

  • DOI: https://doi.org/10.1007/s10207-019-00470-x
