Abstract
Big data processing systems (e.g., Hadoop, Spark, Storm) expose a vast number of configuration parameters that control parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
Index Terms
- A Survey on Automatic Parameter Tuning for Big Data Processing Systems