Investigating the performance of Hadoop and Spark platforms on machine learning algorithms
The Journal of Supercomputing (IF 2.5), Pub Date: 2020-05-13, DOI: 10.1007/s11227-020-03328-5
Ali Mostafaeipour, Amir Jahangard Rafsanjani, Mohammad Ahmadi, Joshuva Arockia Dhanraj

One of the most challenging issues in big data research is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular, general-purpose platform for big data processing. Spark is an open-source framework whose in-memory programming model makes it well suited to iterative algorithms. In this paper, the two platforms are evaluated and compared in terms of runtime, memory usage, network usage, and CPU efficiency. To this end, the K-nearest neighbor (KNN) algorithm is implemented in both frameworks and run on datasets of different sizes. The results show that the KNN algorithm runs 4 to 4.5 times faster on Spark than on Hadoop. The evaluations also show that Hadoop uses more resources, including CPU and network, and it is concluded that Spark uses the CPU more efficiently than Hadoop. On the other hand, Hadoop uses less memory than Spark.
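
For readers unfamiliar with the benchmark workload, the following is a minimal single-machine sketch of how KNN classification splits into a map step and a reduce step. It is not the authors' Hadoop or Spark implementation. The function names, the partition layout, and the toy data are all hypothetical, and plain Python lists stand in for distributed partitions.

```python
import heapq
import math
from collections import Counter


def local_top_k(partition, query, k):
    """Map step: return the k nearest (distance, label) pairs in one partition."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def knn_predict(partitions, query, k):
    """Reduce step: merge per-partition candidates, then take a majority vote."""
    candidates = [pair for part in partitions for pair in local_top_k(part, query, k)]
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, split into two partitions as a cluster might hold it.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
]
prediction = knn_predict(partitions, (0.05, 0.05), k=3)
```

Each partition only has to report its own k best candidates, so the data exchanged between workers stays small even when the partitions are large. How the intermediate results are stored and moved between steps is framework-specific and is not modeled here.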

Updated: 2020-05-13