Digital Waste Disposal: an automated framework for analysis of spam emails

Sheikhalishahi, Mina; Saracino, Andrea; Martinelli, Fabio; La Marra, Antonio; Mejri, Mohammed; Tawbi, Nadia

doi:10.1007/s10207-019-00470-x

Regular Contribution
Published: 25 September 2019

Volume 19, pages 499–522, (2020)

International Journal of Information Security

Mina Sheikhalishahi¹,
Andrea Saracino² (ORCID: orcid.org/0000-0001-8149-9322),
Fabio Martinelli²,
Antonio La Marra²,
Mohammed Mejri³ &
Nadia Tawbi³

Abstract

Automated analysis and classification of spam emails is a challenging task, and it is vital for identifying botnet structures and fighting cybercrime. In this work, we propose an automated methodology, and the resulting framework, based on an innovative categorical divisive clustering algorithm that is used both to group and to classify spam messages. In particular, grouping is exploited to identify campaigns of similar spam emails, while classification is used to label specific emails according to the spammer's goal (e.g., phishing, malware distribution, or advertisement). This work introduces the CCTree algorithm, both as a clustering algorithm and as a classification algorithm, in two operating modes, batch and dynamic, to handle both large data sets and data streams. The CCTree is then applied to large sets of spam emails for campaign identification and labeling. The performance of the algorithm is reported for both clustering and classification, and the batch and dynamic approaches are compared and discussed.
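
To make the idea of categorical divisive clustering more concrete, the following is a minimal Python sketch. It is not the CCTree algorithm itself, whose details are not given in this abstract. It only illustrates one generic way to split records described by categorical attributes: pick the attribute with the highest Shannon entropy, partition on its values, and stop when a group is small or nearly homogeneous. The function names, the `max_impurity` and `min_size` parameters, and the email attributes in the usage example are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def divisive_clusters(records, attributes, max_impurity=0.1, min_size=2):
    """Recursively split records (dicts) on the highest-entropy attribute.

    Hypothetical stopping rule: a group becomes a leaf (one cluster) when it
    has at most min_size records, when no attributes remain, or when the mean
    entropy over the remaining attributes is at most max_impurity.
    """
    if len(records) <= min_size or not attributes:
        return [records]
    scores = {a: entropy([r[a] for r in records]) for a in attributes}
    impurity = sum(scores.values()) / len(scores)
    if impurity <= max_impurity:
        return [records]
    # Split on the attribute whose values are most mixed in this group.
    split_attr = max(attributes, key=lambda a: scores[a])
    remaining = [a for a in attributes if a != split_attr]
    groups = {}
    for r in records:
        groups.setdefault(r[split_attr], []).append(r)
    clusters = []
    for value in sorted(groups):
        clusters.extend(
            divisive_clusters(groups[value], remaining, max_impurity, min_size)
        )
    return clusters


# Usage with made-up structural features of six emails.
emails = (
    [{"language": "en", "attachment": "no", "links": "many"}] * 3
    + [{"language": "ru", "attachment": "yes", "links": "few"}] * 3
)
clusters = divisive_clusters(emails, ["language", "attachment", "links"])
print(len(clusters))
```

In this toy input the two groups differ on every attribute, so a single split separates them and each resulting group is homogeneous. Real spam features would need to be discretized into categories before a scheme like this could be applied.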

Funding

This study was funded by the H2020 C3ISP project (GA 700294).

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
Mina Sheikhalishahi
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
Andrea Saracino, Fabio Martinelli & Antonio La Marra
Université Laval, Quebec City, QC, Canada
Mohammed Mejri & Nadia Tawbi

Corresponding author

Correspondence to Andrea Saracino.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sheikhalishahi, M., Saracino, A., Martinelli, F. et al. Digital Waste Disposal: an automated framework for analysis of spam emails. Int. J. Inf. Secur. 19, 499–522 (2020). https://doi.org/10.1007/s10207-019-00470-x

Published: 25 September 2019
Issue Date: October 2020
DOI: https://doi.org/10.1007/s10207-019-00470-x
