A Survey on Automatic Parameter Tuning for Big Data Processing Systems

Authors:
Herodotos Herodotou, Cyprus University of Technology, Limassol, Cyprus (ORCID: 0000-0002-8717-1691)
Yuxing Chen, University of Helsinki, Helsinki, Finland (ORCID: 0000-0002-6220-2535)
Jiaheng Lu, University of Helsinki, Helsinki, Finland (ORCID: 0000-0003-2067-454X)

ACM Computing Surveys, Volume 53, Issue 2, Article No. 43, pp. 1–37. https://doi.org/10.1145/3381027

Published: 26 April 2020

Abstract

Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.

Index Terms

A Survey on Automatic Parameter Tuning for Big Data Processing Systems
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Self-organizing autonomic computing
2. Information systems
  1. Information systems applications
    1. Computing platforms

Published in
ACM Computing Surveys Volume 53, Issue 2
March 2021
848 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3388460
Editor: Albert Zomaya, University of Sydney, Australia
Copyright © 2020 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 April 2020
- Accepted: 1 January 2020
- Revised: 1 December 2019
- Received: 1 March 2019
Author Tags
MapReduce
Parameter tuning
Spark
Storm
self-tuning
stream
Qualifiers
- survey
- Research
- Refereed