Skip to main content
Log in

Investigating the performance of Hadoop and Spark platforms on machine learning algorithms

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

One of the most challenging issues in the big data research area is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular and general platform for big data processing. Because of the in-memory programming model, Spark as an open-source framework is suitable for processing iterative algorithms. In this paper, Hadoop and Spark frameworks, the big data processing platforms, are evaluated and compared in terms of runtime, memory and network usage, and central processor efficiency. Hence, the K-nearest neighbor (KNN) algorithm is implemented on datasets with different sizes within both Hadoop and Spark frameworks. The results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than Hadoop. Evaluations show that Hadoop uses more sources, including central processor and network. It is concluded that the CPU in Spark is more effective than Hadoop. On the other hand, the memory usage in Hadoop is less than Spark.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22

Similar content being viewed by others

References

  1. Chen M, Mao S, Liu Y (2014) Big data: a survey. Mob Netw Appl 19(2):171–209

    Article  Google Scholar 

  2. Wu C, Zapevalova E, Chen Y, Zeng D, Liu F (2018) Optimal model of continuous knowledge transfer in the big data environment. Computr Model Eng Sci 116(1):89–107

    Google Scholar 

  3. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

    Article  Google Scholar 

  4. Tang Z, Jiang L, Yang L, Li K, Li K (2015) CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework. Clust Comput 18(2):493–505

    Article  Google Scholar 

  5. Tang Z, Liu K, Xiao J, Yang L, Xiao Z (2017) A parallel k-means clustering algorithm based on redundance elimination and extreme points optimization employing MapReduce. Concurr Comput Pract Exp 29(20):e4109

    Article  Google Scholar 

  6. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Michael J, Franklin SS, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as Part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12), pp 15–28

  7. Cobb AN, Benjamin AJ, Huang ES, Kuo PC (2018) Big data: more than big data sets. Surgery 164(4):640–642

    Article  Google Scholar 

  8. Qin SJ, Chiang LH (2019) Advances and opportunities in machine learning for process data analytics. Comput Chem Eng 126:465–473

    Article  Google Scholar 

  9. Jordan MI, Mitchell TM (2015) Machine learning: trends, perspectives, and prospects. Science 349(6245):255–260

    Article  MathSciNet  Google Scholar 

  10. Wu C, Zapevalova E, Li F, Zeng D (2018) Knowledge structure and its impact on knowledge transfer in the big data environment. J Internet Technol 19(2):581–590

    Google Scholar 

  11. Zhou L, Pan S, Wang J, Vasilakos AV (2017) Machine learning on big data: opportunities and challenges. Neurocomputing 237:350–361

    Article  Google Scholar 

  12. Russell SJ, Norvig P (2016) Artificial intelligence: a modern approach. Pearson Education Limited, Kuala Lumpur

    MATH  Google Scholar 

  13. Aziz K, Zaidouni D, Bellafkih M (2018) Real-time data analysis using Spark and Hadoop. In: 2018 4th International Conference on Optimization and Applications (ICOA). IEEE, pp 1–6

  14. Hazarika AV, Ram GJSR, Jain E (2017) Performance comparison of Hadoop and spark engine. In: 2017 International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). IEEE, pp 671–674

  15. Gopalani S, Arora R (2015) Comparing apache spark and map reduce with performance analysis using k-means. Int J Comput Appl 113(1):8–11

    Google Scholar 

  16. Wang H, Wu B, Yang S, Wang B, Liu Y (2014) Research of decision tree on yarn using mapreduce and Spark. In: Proceedings of the 2014 World Congress in Computer Science, Computer Engineering, and Applied Computing, pp 21–24

  17. Liang F, Feng C, Lu X, Xu Z (2014) Performance benefits of DataMPI: a case study with BigDataBench. In: Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Springer, Cham, pp 111–123

  18. Pirzadeh P (2015) On the performance evaluation of big data systems. Doctoral dissertation, UC Irvine

  19. Mavridis I, Karatza H (2017) Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. J Syst Softw 125:133–151

    Article  Google Scholar 

  20. Im S, Moseley B (2019) A conditional lower bound on graph connectivity in mapreduce. arXiv preprint arXiv:1904.08954

  21. Kodali S, Dabbiru M, Rao BT, Patnaik UKC (2019) A k-NN-based approach using MapReduce for meta-path classification in heterogeneous information networks. In: Soft Computing in Data Analytics. Springer, Singapore, pp 277–284

  22. Li Y, Eldawy A, Xue J, Knorozova N, Mokbel MF, Janardan R (2019) Scalable computational geometry in MapReduce. VLDB J 28(4):523–548

    Article  Google Scholar 

  23. Li F, Chen J, Wang Z (2019) Wireless MapReduce distributed computing. IEEE Trans Inf Theory 65(10):6101–6114

    Article  MathSciNet  Google Scholar 

  24. Liu J, Wang P, Zhou J, Li K (2020) McTAR: a multi-trigger check pointing tactic for fast task recovery in MapReduce. IEEE Trans Serv Comput. https://doi.org/10.1109/TSC.2019.2904270

    Article  Google Scholar 

  25. Glushkova D, Jovanovic P, Abelló A (2019) Mapreduce performance model for Hadoop 2.x. Inf Syst 79:32–43

    Article  Google Scholar 

  26. Saxena A, Chaurasia A, Kaushik N, Kaushik N (2019) Handling big data using MapReduce over hybrid cloud. In: International Conference on Innovative Computing and Communications. Springer, Singapore, pp 135–144

  27. Kuo A, Chrimes D, Qin P, Zamani H (2019) A Hadoop/MapReduce based platform for supporting health big data analytics. In: ITCH, pp 229–235

  28. Kumar DK, Bhavanam D, Reddy L (2020) Usage of HIVE tool in Hadoop ECO system with loading data and user defined functions. Int J Psychosoc Rehabil 24(4):1058–1062

    Google Scholar 

  29. Alnasir JJ, Shanahan HP (2020) The application of hadoop in structural bioinformatics. Brief Bioinform 21(1):96–105

    Google Scholar 

  30. Park HM, Park N, Myaeng SH, Kang U (2020) PACC: large scale connected component computation on Hadoop and Spark. PLoS ONE 15(3):e0229936

    Article  Google Scholar 

  31. Xu Y, Wu S, Wang M, Zou Y (2020) Design and implementation of distributed RSA algorithm based on Hadoop. J Ambient Intell Humaniz Comput 11(3):1047–1053

    Article  Google Scholar 

  32. Wang J, Li X, Ruiz R, Yang J, Chu D (2020) Energy utilization task scheduling for MapReduce in heterogeneous clusters. IEEE Trans Serv Comput. https://doi.org/10.1109/TSC.2020.2966697

    Article  Google Scholar 

  33. Wei P, He F, Li L, Shang C, Li J (2020) Research on large data set clustering method based on MapReduce. Neural Comput Appl 32(1):93–99

    Article  Google Scholar 

  34. Souza A, Garcia I (2020) A preemptive fair scheduler policy for disco MapReduce framework. In: Anais do XV Workshop em Desempenho de Sistemas Computacionais e de Comunicação. SBC, pp 1–12

  35. Jang S, Jang YE, Kim YJ, Yu H (2020) Input initialization for inversion of neural networks using k-nearest neighbor approach. Inf Sci 519:229–242

    Article  MathSciNet  Google Scholar 

  36. Chen Y, Hu X, Fan W, Shen L, Zhang Z, Liu X et al (2020) Fast density peak clustering for large scale data based on kNN. Knowl-Based Syst 187:104824

    Article  Google Scholar 

  37. Janardhanan PS, Samuel P (2020) Optimum parallelism in Spark framework on Hadoop YARN for maximum cluster resource. In: First International Conference on Sustainable Technologies for Computational Intelligence: Proceedings of ICTSCI 2019, vol 1045. Springer Nature, p 351

  38. Qin Y, Tang Y, Zhu X, Yan C, Wu C, Lin D (2020) Zone-based resource allocation strategy for heterogeneous spark clusters. In: Artificial Intelligence in China. Springer, Singapore, pp 113–121

  39. Hussain DM, Surendran D (2020) The efficient fast-response content-based image retrieval using spark and MapReduce model framework. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-020-01775-9

    Article  Google Scholar 

  40. Nguyen MC, Won H, Son S, Gil MS, Moon YS (2019) Prefetching-based metadata management in advanced multitenant Hadoop. J Supercomput 75(2):533–553

    Article  Google Scholar 

  41. Javanmardi AK, Yaghoubyan SH, Bagherifard K et al (2020) A unit-based, cost-efficient scheduler for heterogeneous Hadoop systems. J Supercomput. https://doi.org/10.1007/s11227-020-03256-4

    Article  Google Scholar 

  42. Guo A, Jiang A, Lin J, Li X (2020) Data mining algorithms for bridge health monitoring: Kohonen clustering and LSTM prediction approaches. J Supercomput 76(2):932–947

    Article  Google Scholar 

  43. Cheng F, Yang Z (2019) FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark. J Supercomput 75(5):2497–2517

    Article  Google Scholar 

  44. Kang M, Lee J (2020) Effect of garbage collection in iterative algorithms on Spark: an experimental analysis. J Supercomput. https://doi.org/10.1007/s11227-020-03150-z

    Article  Google Scholar 

  45. Xiao W, Hu J (2020) SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming. J Supercomput. https://doi.org/10.1007/s11227-020-03190-5

    Article  Google Scholar 

  46. Massie M, Li B, Nicholes B, Vuksan V, Alexander R, Buchbinder J, Costa F, Dean A, Josephsen D, Phaal P, Pocock D (2012) Monitoring with Ganglia: tracking dynamic host and application metrics at scale. O’Reilly Media Inc, Newton

    Google Scholar 

  47. Whiteson D (2014) Higgs data set. https://archive.ics.uci.edu/ml/datasets/HIGGS. Accessed 2016

  48. Harrington P (2012) Machine learning in action. Manning Publications Co, New York

    Google Scholar 

  49. Masarat S, Sharifian S, Taheri H (2016) Modified parallel random forest for intrusion detection systems. J Supercomput 72(6):2235–2258

    Article  Google Scholar 

  50. Lai WK, Chen YU, Wu TY, Obaidat MS (2014) Towards a framework for large-scale multimedia data storage and processing on Hadoop platform. J Supercomput 68(1):488–507

    Article  Google Scholar 

  51. Won H, Nguyen MC, Gil MS, Moon YS, Whang KY (2017) Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS. J Supercomput 73(6):2657–2681

    Article  Google Scholar 

  52. Lee ZJ, Lee CY (2020) A parallel intelligent algorithm applied to predict students dropping out of university. J Supercomput 76(2):1049–1062

    Article  Google Scholar 

  53. Sandrini M, Xu B, Volochayev R, Awosika O, Wang WT, Butman JA, Cohen LG (2020) Transcranial direct current stimulation facilitates response inhibition through dynamic modulation of the fronto-basal ganglia network. Brain Stimul 13(1):96–104

    Article  Google Scholar 

  54. Jiang W, Fu J, Chen F, Zhan Q, Wang Y, Wei M, Xiao B (2020) Basal ganglia infarction after mild head trauma in pediatric patients with basal ganglia calcification. Clin Neurol Neurosurg 192:105706

    Article  Google Scholar 

  55. Kowalski CW, Lindberg JE, Fowler DK, Simasko SM, Peters JH (2020) Contributing mechanisms underlying desensitization of CCK-induced activation of primary nodose ganglia neurons. Am J Physiol Cell Physiol 318:C787–C796

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Joshuva Arockia Dhanraj.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mostafaeipour, A., Jahangard Rafsanjani, A., Ahmadi, M. et al. Investigating the performance of Hadoop and Spark platforms on machine learning algorithms. J Supercomput 77, 1273–1300 (2021). https://doi.org/10.1007/s11227-020-03328-5

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-020-03328-5

Keywords

Navigation