Abstract
The MapReduce framework is an effective method for parallel processing of big data. Improving the performance of MapReduce clusters and reducing job execution time are fundamental challenges for this approach. Two problems arise in particular: how to maximize the execution overlap between jobs, and how to construct an optimal job schedule. Addressing both requires a precise model for estimating job execution time, which is difficult to develop because of the large number and volume of submitted jobs, the limited resources available, and the need for proper Hadoop configuration. This paper presents a model, based on the MapReduce phases, for predicting the execution time of jobs in a heterogeneous cluster. In addition, a novel heuristic method is designed that significantly reduces the makespan of the jobs. The method proceeds in three steps. First, a job profiling tool extracts the execution details of the MapReduce phases through log analysis. Second, machine learning methods and statistical analysis are used to build a model that predicts runtime. Finally, a job submission and monitoring tool is used to calculate the makespan. Experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the proposed method achieves a higher average makespan speedup than the unoptimized case.
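To make the terms makespan and makespan speedup concrete, the following is a minimal sketch and not the method proposed in this paper. It assumes that each job can be summarized by two predicted stage times (map and reduce), that the stages of consecutive jobs overlap as in a two-stage pipeline, and it uses Johnson's rule, a classical ordering rule for the two-stage flow shop, as a stand-in for the heuristic. The job names and times are hypothetical.

```python
from typing import List, Tuple

# (name, predicted map time, predicted reduce time); all values are hypothetical
Job = Tuple[str, float, float]


def makespan(order: List[Job]) -> float:
    """Completion time of the last job under an assumed two-stage pipeline.

    Map stages run back to back; a job's reduce stage starts only after its
    own map stage and the previous job's reduce stage have both finished.
    """
    map_end = 0.0
    reduce_end = 0.0
    for _, map_t, reduce_t in order:
        map_end += map_t
        reduce_end = max(reduce_end, map_end) + reduce_t
    return reduce_end


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Johnson's rule: jobs with map <= reduce go first by ascending map time,
    the remaining jobs go last by descending reduce time."""
    first = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    last = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: j[2], reverse=True)
    return first + last


jobs: List[Job] = [
    ("sort", 40.0, 10.0),
    ("wordcount", 5.0, 30.0),
    ("join", 20.0, 20.0),
]

baseline = makespan(jobs)                  # submission order, no reordering
optimized = makespan(johnson_order(jobs))  # reordered to increase overlap
speedup = baseline / optimized             # makespan speedup over the baseline

print(f"baseline={baseline}, optimized={optimized}, speedup={speedup:.2f}")
```

In this toy model the reordering lets the long reduce stage of one job overlap with the map stages of the jobs that follow it, which is the kind of execution overlap the abstract refers to. In practice the stage times would come from a runtime prediction model rather than being fixed by hand.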
Cite this article
Gandomi, A., Movaghar, A., Reshadi, M. et al. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach. J Supercomput 76, 7177–7203 (2020). https://doi.org/10.1007/s11227-020-03162-9
DOI: https://doi.org/10.1007/s11227-020-03162-9