survey
Open Access

A Survey on Automatic Parameter Tuning for Big Data Processing Systems

Published: 26 April 2020

Abstract

Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.

References

  1. Sean T. Allen, Matthew Jankowski, and Peter Pathirana. 2015. Storm Applied: Strategies for Real-time Event Processing. Manning Publications Co.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. AMD Hadoop Tuning. 2012. AMD Hadoop Performance Tuning Guide. Retrieved from https://developer.amd.com/wordpress/media/2012/10/Hadoop_Tuning_Guide-Version5.pdf.Google ScholarGoogle Scholar
  3. Apache Flink. 2019. Apache Flink. Retrieved from https://flink.apache.org/.Google ScholarGoogle Scholar
  4. Apache Hadoop. 2019. Apache Hadoop. Retrieved from https://hadoop.apache.org/.Google ScholarGoogle Scholar
  5. Apache Samza. 2019. Apache Samza. Retrieved from http://samza.apache.org/.Google ScholarGoogle Scholar
  6. ApacheSpark. 2019. Apache Spark. Retrieved from https://spark.apache.org/.Google ScholarGoogle Scholar
  7. Apache Spark Streaming. 2019. Apache Spark Streaming. Retrieved from https://spark.apache.org/streaming/.Google ScholarGoogle Scholar
  8. Apache Spark Tuning. 2017. Apache Spark Tuning - DZone. Retrieved from https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel.Google ScholarGoogle Scholar
  9. Apache Spark Tuning Course. 2018. Apache Spark Tuning and Best Practices. Retrieved from https://databricks.com/training-overview/instructor-led-training/courses/apache-spark-tuning-and-best-practices.Google ScholarGoogle Scholar
  10. Apache Spark Tuning Guide. 2019. Apache Spark Tuning Guide. Retrieved from https://spark.apache.org/docs/latest/tuning.html.Google ScholarGoogle Scholar
  11. Apache Storm. 2019. Apache Storm. Retrieved from https://storm.apache.org/.Google ScholarGoogle Scholar
  12. Apache Storm Performance Tuning. 2019. Apache Storm Performance Tuning. Retrieved from https://storm.apache.org/releases/current/Performance.html.Google ScholarGoogle Scholar
  13. Apache Storm Trident. 2019. Apache Storm Trident. Retrieved from http://storm.apache.org/releases/current/Trident-tutorial.html.Google ScholarGoogle Scholar
  14. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng et al. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’15). ACM, 1383--1394.Google ScholarGoogle Scholar
  15. Shivnath Babu. 2010. Towards automatic optimization of MapReduce programs. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC’10). ACM, 137--142.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Shivnath Babu and Herodotos Herodotou. 2013. Massively parallel databases and MapReduce systems. Found. Trends® Datab. 5, 1 (2013), 1--104.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Manu Bansal, Eyal Cidon, Arjun Balasingam, Aditya Gudipati, Christos Kozyrakis, and Sachin Katti. 2018. Trevor: Automatic configuration and scaling of stream processing pipelines. CoRR abs/1812.09442 (2018).Google ScholarGoogle Scholar
  18. Liang Bao, Xin Liu, and Weizhao Chen. 2018. Learning-based automatic parameter tuning for big data analytics frameworks. In Proceedings of the IEEE International Conference on Big Data. IEEE, 181--190.Google ScholarGoogle ScholarCross RefCross Ref
  19. Mike Barlow. 2013. Real-time Big Data Analytics: Emerging Architecture. O’Reilly Media, Inc.Google ScholarGoogle Scholar
  20. Ivan Bedini, Sherif Sakr, Bart Theeten, Alessandra Sala, and Peter Cogan. 2013. Modeling performance of a parallel streaming engine: Bridging theory and costs. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering. ACM, 173--184.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Zhendong Bei, Zhibin Yu, Huiling Zhang, Wen Xiong, Chengzhong Xu, Lieven Eeckhout, and Shengzhong Feng. 2016. RFHOC: A random-forest approach to auto-tuning Hadoop’s configuration. IEEE Transactions on Parallel and Distributed Systems (TPDS) 27, 5 (2016), 1470--1483.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Muhammad Bilal and Marco Canini. 2017. Towards automatic parameter tuning of stream processing systems. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, 189--200.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. BTrace. 2018. BTrace: A Dynamic Instrumentation Tool for Java. Retrieved from https://github.com/btraceio/btrace.Google ScholarGoogle Scholar
  24. Rajkumar Buyya and Manzur Murshed. 2002. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurr. Comput. Pract. Exper. 14, 13–15 (2002), 1175--1220.Google ScholarGoogle ScholarCross RefCross Ref
  25. Chi-Ou Chen, Ye-Qi Zhuo, Chao-Chun Yeh, Che-Min Lin, and Shih-Wei Liao. 2015. Machine learning-based configuration parameter tuning on Hadoop system. In Proceedings of the IEEE International Congress on Big Data. IEEE, 386--392.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Keke Chen, James Powers, Shumin Guo, and Fengguang Tian. 2014. CRESP: Towards optimal resource provisioning for MapReduce computing in public clouds. IEEE Trans. Parallel Distrib. Syst. 25, 6 (2014), 1403--1412.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Yuxing Chen, Peter Goetsch, Mohammad A. Hoque, Jiaheng Lu, and Sasu Tarkoma. 2019. d-Simplexed: Adaptive Delaunay triangulation for performance modeling and prediction on big data analytics. IEEE Trans. Big Data (2019). https://ieeexplore.ieee.org/document/8878273.Google ScholarGoogle Scholar
  28. Yuxing Chen, Jiaheng Lu, Chen Chen, Mohammad Hoque, and Sasu Tarkoma. 2019. Cost-effective resource provisioning for Spark workloads. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM’19). ACM, 2477--2480.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Dazhao Cheng, Jia Rao, Yanfei Guo, and Xiaobo Zhou. 2014. Improving MapReduce performance in heterogeneous environments with adaptive task tuning. In Proceedings of the 15th International Middleware Conference. ACM, 97--108.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. ClouderaSparkTuning. 2018. Cloudera Performance Management - Tuning Spark Applications. Retrieved from https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_spark_tuning.html.Google ScholarGoogle Scholar
  31. ClouderaYarnTuning. 2018. Cloudera Performance Management - Tuning YARN. Retrieved from https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_yarn_tuning.html.Google ScholarGoogle Scholar
  32. Tathagata Das, Yuan Zhong, Ion Stoica, and Scott Shenker. 2014. Adaptive stream processing using dynamic batch sizing. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC’14). ACM, 16:1–16:13.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Databricks. 2019. Databricks. Retrieved from https://sparkhub.databricks.com/.Google ScholarGoogle Scholar
  34. Miyuru Dayarathna and Srinath Perera. 2018. Recent advancements in event processing. Comput. Surv. 51, 2 (2018), 33.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Xiaoan Ding, Yi Liu, and Depei Qian. 2015. Jellyfish: Online performance tuning with adaptive configuration and elastic container in Hadoop YARN. In Proceedings of the 21st International Conference on Parallel and Distributed Systems. IEEE, 831--836.Google ScholarGoogle Scholar
  37. Shlomi Dolev, Patricia Florissi, Ehud Gudes, Shantanu Sharma, and Ido Singer. 2017. A survey on geographically distributed big-data processing using MapReduce. IEEE Trans. Big Data 5, 1 (2017), 60--80.Google ScholarGoogle ScholarCross RefCross Ref
  38. Christos Doulkeridis and Kjetil Nørvåg. 2014. A survey of large-scale analytical query processing in MapReduce. VLDB J. 23, 3 (June 2014), 355--380. DOI:https://doi.org/10.1007/s00778-013-0319-9Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. 2009. Tuning database configuration parameters with iTuned. PVLDB 2, 1 (2009), 1246--1257.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Mostafa Ead, Herodotos Herodotou, Ashraf Aboulnaga, and Shivnath Babu. 2014. PStorM: Profile storage and matching for feedback-based tuning of MapReduce jobs. In Proceedings of the 17th International Conference on Extending Database Technology (EDBT’14). 1--12.Google ScholarGoogle Scholar
  41. Lorenz Fischer, Shen Gao, and Abraham Bernstein. 2015. Machines tuning machines: Configuring distributed stream processors with Bayesian optimization. In Proceedings of the International Conference on Cluster Computing (CLUSTER’15). IEEE, 22--31.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Avrilia Floratou, Ashvin Agrawal, Bill Graham, Sriram Rao, and Karthik Ramasamy. 2017. Dhalion: Self regulating stream processing in Heron. PVLDB 10, 12 (2017), 1825--1836.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Tom Z. J. Fu, Jianbing Ding, Richard T. B. Ma, Marianne Winslett, Yin Yang, and Zhenjie Zhang. 2015. DRS: Dynamic resource scheduling for real-time analytics over fast streams. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS’15). IEEE, 411--420.Google ScholarGoogle ScholarCross RefCross Ref
  44. Jyoti V. Gautam, Harshadkumar B. Prajapati, Vipul K. Dabhi, and Sanjay Chaudhary. 2015. A survey on job scheduling algorithms in big data processing. In Proceedings of the International Conference on Electrical, Computer and Communication Technologies. IEEE, 1--11.Google ScholarGoogle ScholarCross RefCross Ref
  45. Mikhail Genkin, Frank Dehne et al. 2016. Automatic, on-line tuning of YARN container memory and CPU parameters. In Proceedings of the International Conference on High Performance Computing and Communications (HPCC’16). IEEE, 317--324.Google ScholarGoogle Scholar
  46. Anastasios Gounaris, Georgia Kougka, Ruben Tous, Carlos Tripiana Montes, and Jordi Torres. 2017. Dynamic configuration of partitioning in Spark applications. IEEE Trans. Parallel Distrib. Syst. 28, 7 (2017), 1891--1904.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Anastasios Gounaris and Jordi Torres. 2017. A methodology for Spark parameter tuning. Big Data Res. 11 (Mar. 2017), 22--32.Google ScholarGoogle Scholar
  48. HadoopClusterSetup. 2019. Hadoop Cluster Setup. Retrieved from https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html.Google ScholarGoogle Scholar
  49. HadoopPerfUI. 2011. Hadoop Perf Monitoring UI. Retrieved from http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring.Google ScholarGoogle Scholar
  50. HadoopTuning. 2015. Hadoop Performance Tuning Tutorial. Retrieved from http://hadooptutorial.info/hadoop-performance-tuning/.Google ScholarGoogle Scholar
  51. HadoopTutorial. 2018. Hadoop MapReduce Tutorial. Retrieved from https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html.Google ScholarGoogle Scholar
  52. HadoopVaidya. 2011. Hadoop Vaidya. Retrieved from http://hadoop.apache.org/mapreduce/docs/r0.21.0/vaidya.html.Google ScholarGoogle Scholar
  53. Suhel Hammoud, Maozhen Li, Yang Liu, Nasullah Khalid Alham, and Zelong Liu. 2010. MRSim: A discrete event based MapReduce simulator. In Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD’10), Vol. 6. IEEE, 2993--2997.Google ScholarGoogle ScholarCross RefCross Ref
  54. Dominique Heger. 2013. Hadoop Performance Tuning—A Pragmatic 8 Iterative Approach. Retrieved from https://www.cmg.org/wp-content/uploads/2013/04/m_97_3.pdf.Google ScholarGoogle Scholar
  55. Álvaro Brandón Hernández, María S. Perez, Smrati Gupta, and Victor Muntés-Mulero. 2017. Using machine learning to optimize parallelism in big data applications. Fut. Gen. Comput. Syst. 86 (2018), 1076–1092. https://www.sciencedirect.com/science/article/abs/pii/S0167739X17314668?via%3Dihub.Google ScholarGoogle Scholar
  56. Herodotos Herodotou. 2011. Hadoop performance models. CoRR abs/1106.0940 (2011).Google ScholarGoogle Scholar
  57. Herodotos Herodotou. 2012. Automatic Tuning of Data-intensive Analytical Workloads. Ph.D. Dissertation. Duke University.Google ScholarGoogle Scholar
  58. Herodotos Herodotou and Shivnath Babu. 2011. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4, 11 (2011), 1111--1122.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Herodotos Herodotou and Shivnath Babu. 2013. A what-if engine for cost-based MapReduce optimization. IEEE Data Eng. Bull. 36, 1 (2013), 5--14.Google ScholarGoogle Scholar
  60. Herodotos Herodotou, Fei Dong, and Shivnath Babu. 2011. No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC’11).Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. 2011. Starfish: A self-tuning system for big data analytics. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR’11). 261--272.Google ScholarGoogle Scholar
  62. Wilson A. Higashino, Miriam A. M. Capretz, and Luiz F. Bittencourt. 2016. CEPSim: Modelling and simulation of complex event processing systems in cloud environments. Fut. Gen. Comput. Syst. 65 (2016), 122--139.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A catalog of stream processing optimizations. Comput. Surv. 46, 4 (2014), 46.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Fred Howell and Ross McNab. 1998. SimJava: A discrete event simulation library for Java. Simul. Series 30 (1998), 51--56.Google ScholarGoogle Scholar
  65. Markus C. Huebscher and Julie A. McCann. 2008. A survey of autonomic computing—degrees, models, and applications. Comput. Surv. 40, 3 (2008), 7:1–7:28.Google ScholarGoogle Scholar
  66. Pooyan Jamshidi and Giuliano Casale. 2016. An uncertainty-aware approach to optimal configuration of stream processing systems. In Proceedings of the IEEE/ACM International Symposium on Modeling, Analysis, and Simulation on Computer and Telecommunication Systems (MASCOTS’16). IEEE, 39--48.Google ScholarGoogle ScholarCross RefCross Ref
  67. Zhen Jia, Chao Xue, Guancheng Chen, Jianfeng Zhan, Lixin Zhang, Yonghua Lin, and Peter Hofstee. 2016. Auto-tuning Spark big data workloads on POWER8: Prediction-based dynamic SMT threading. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT’16). IEEE, 387--400.Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. 2010. The performance of MapReduce: An in-depth study. PVLDB 3, 1–2 (2010), 472--483.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’17). ACM, 463--478.Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Selvi Kadirvel and José A. B. Fortes. 2012. Grey-box approach for performance prediction in MapReduce based platforms. In Proceedings of the 21st International Conference on Computer Communications and Networks (ICCCN’12). IEEE, 1--9.Google ScholarGoogle Scholar
  71. Faria Kalim, Thomas Cooper, Huijun Wu, Yao Li, Ning Wang, et al. 2019. Caladrius: A performance modelling service for distributed stream processing systems. In Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE’19). IEEE, 1886--1897.Google ScholarGoogle ScholarCross RefCross Ref
  72. Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. An analysis of traces from a production MapReduce cluster. In Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE, 94--103.Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang, and Changjun Jiang. 2016. Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans. Parallel Distrib. Syst. 27, 2 (2016), 441--454.Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Johannes Kroß and Helmut Krcmar. 2017. Model-based performance evaluation of batch and stream applications for big data. In Proceedings of the IEEE/ACM International Symposium on Modeling, Analysis, and Simulation on Computer and Telecommunication Systems (MASCOTS’17). IEEE Computer Society, 80--86.Google ScholarGoogle Scholar
  75. Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, et al. 2015. Twitter Heron: Stream processing at scale. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’15). ACM, 239--250.Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Palden Lama and Xiaobo Zhou. 2012. AROMA: Automated resource allocation and configuration of MapReduce environment in the cloud. In Proceedings of the 9th International Conference on Autonomic Computing (ICAC’12). ACM, 63--72.Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. 2012. Parallel data processing with MapReduce: A survey. ACM SIGMOD Record 40, 4 (Jan. 2012), 11--20. DOI:https://doi.org/10.1145/2094114.2094118.Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Min Li, Liangzhao Zeng, Shicong Meng, Jian Tan, Li Zhang, et al. 2014. MROnline: MapReduce online performance tuning. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (HPDC’14). ACM, 165--176.Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Teng Li, Jian Tang, and Jielong Xu. 2016. Performance modeling and predictive scheduling for distributed stream data processing. IEEE Trans. Big Data 2, 4 (2016), 353--364.Google ScholarGoogle ScholarCross RefCross Ref
  80. Guangdeng Liao, Kushal Datta, and Theodore L. Willke. 2013. Gunther: Search-based auto-tuning of MapReduce. In Proceedings of the European Conference on Parallel Processing (Euro-Par’13). Springer, 406--419.Google ScholarGoogle Scholar
  81. Jia-Chun Lin, Ming-Chang Lee, Ingrid Chieh Yu, and Einar Broch Johnsen. 2018. Modeling and simulation of Spark streaming. In Proceedings of the International Conference on Advanced Information Networking and Applications (AINA’18). IEEE, 407--413.Google ScholarGoogle ScholarCross RefCross Ref
  82. Jia-Chun Lin, Ingrid Chieh Yu, Einar Broch Johnsen, and Ming-Chang Lee. 2016. ABS-YARN: A formal framework for modeling Hadoop YARN clusters. In Proceedings of the Fundamental Approaches to Software Engineering Conference (FASE’16) (Lecture Notes in Computer Science), Vol. 9633. Springer, 49--65.Google ScholarGoogle ScholarCross RefCross Ref
  83. Xuelian Lin, Zide Meng, Chuan Xu, and Meng Wang. 2012. A practical performance model for Hadoop MapReduce. In Proceedings of the IEEE International Conference on Cluster Computing Workshops (CLUSTER WORKSHOPS’12). IEEE, 231--239.Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Chao Liu, Deze Zeng, Hong Yao, Chengyu Hu, Xuesong Yan, and Yuanyuan Fan. 2015. MR-COF: A genetic MapReduce configuration optimization framework. In Proceedings of the International Conference on Algorithms and Architecture for Parallel Processing. Springer, 344--357.Google ScholarGoogle ScholarCross RefCross Ref
  85. Jun Liu, Nishkam Ravi, Srimat Chakradhar, and Mahmut Kandemir. 2012. Panacea: Towards holistic optimization of MapReduce applications. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, 33--43.Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Xunyun Liu, Amir Vahid Dastjerdi, Rodrigo N. Calheiros, et al. 2018. A stepwise auto-profiling method for performance optimization of streaming applications. ACM Trans. Auton. Adapt. Syst. 12, 4 (2018), 24:1–24:33.Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Yang Liu, Maozhen Li, Nasullah Khalid Alham, and Suhel Hammoud. 2013. HSim: A MapReduce simulator in enabling cloud computing. Fut. Gen. Comput. Syst. 29, 1 (2013), 300--308.Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Jiaheng Lu, Yuxing Chen, Herodotos Herodotou, and Shivnath Babu. 2019. Speedup your analytics: Automatic parameter tuning for databases and big data systems. PVLDB 12, 12 (2019), 1970--1973.Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Michael Malak and Robin East. 2016. Spark GraphX in Action. Manning Publications Co.Google ScholarGoogle Scholar
  90. Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. J. Mach. Learn. Res. 17, 1 (2016), 1235--1241.Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Matt Morgan. 2015. Ensuring the Best Performance from Your Hadoop Clusters, Proactively. Retrieved from https://hortonworks.com/blog/ensuring-the-best-performance-from-your-hadoop-clusters-proactively/.Google ScholarGoogle Scholar
  92. Mumak. 2010. Mumak: Map-Reduce Simulator. Retrieved from https://issues.apache.org/jira/browse/MAPREDUCE-728.Google ScholarGoogle Scholar
  93. Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas. 2014. Of streams and storms. IBM White Paper (2014), 1--31.Google ScholarGoogle Scholar
  94. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, et al. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’09). ACM, 165--178.Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Panagiotis Petridis, Anastasios Gounaris, and Jordi Torres. 2016. Spark parameter tuning via trial-and-error. In Proceedings of the INNS Conference on Big Data. Springer, 226--237.Google ScholarGoogle Scholar
  96. Max Petrov, Nikolay Butakov, Denis Nasonov, and Mikhail Melnik. 2018. Adaptive performance model for dynamic scaling apache spark streaming. Procedia Comput. Sci. 136 (2018), 109--117.Google ScholarGoogle ScholarCross RefCross Ref
  97. Jorda Polo, David Carrera, Yolanda Becerra, Jordi Torres, Eduard Ayguadé, Malgorzata Steinder, and Ian Whalley. 2010. Performance-driven task co-scheduling for MapReduce environments. In Proceedings of the IEEE Network Operations and Management Symposium (NOMS’10). IEEE, 373--380.Google ScholarGoogle ScholarCross RefCross Ref
  98. José Ignacio Requeno, José Merseguer, and Simona Bernardi. 2017. Performance analysis of Apache Storm applications using stochastic petri nets. In Proceedings of the International Conference on Information Reuse and Integration (IRI’17). IEEE, 411--418.Google ScholarGoogle ScholarCross RefCross Ref
  99. Henriette Röger and Ruben Mayer. 2019. A comprehensive survey on parallelization and elasticity in stream processing. Comput. Surv. 52, 2 (2019), 36.Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Rumen 2009. Rumen: A Tool to Extract Job Characterization Data from Job Tracker Logs. Retrieved from https://issues.apache.org/jira/browse/MAPREDUCE-751.Google ScholarGoogle Scholar
  101. Matthias J. Sax, Malu Castellanos, Qiming Chen, and Meichun Hsu. 2013. Performance optimization for distributed intra-node-parallel streaming systems. In Proceedings of the 29th International Conference on Data Engineering Workshops (ICDEW’13). IEEE, 62--69.Google ScholarGoogle ScholarCross RefCross Ref
  102. Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, and Chen Wang. 2014. MRTuner: A toolkit to enable holistic optimization for mapreduce jobs. PVLDB 7, 13 (2014), 1319--1330.Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Rekha Singhal and Praveen Singh. 2017. Performance assurance model for applications on SPARK platform. In Proceedings of the Technology Conference on Performance Evaluation and Benchmarking (TPCTC’17). Springer, 131--146.Google ScholarGoogle Scholar
  104. SparkCoreParameter 2019. Spark Core Parameters. Retrieved from https://spark.apache.org/docs/latest/configuration.html.Google ScholarGoogle Scholar
  105. Nicoleta Tantalaki, Stavros Souravlas, and Manos Roumeliotis. 2019. A review on big data real-time stream processing and its scheduling techniques. Int. J. Parallel Emerg. Distrib. Syst. (2019), 1--31. https://www.tandfonline.com/doi/abs/10.1080/17445760.2019.1585848.Google ScholarGoogle Scholar
  106. Fei Teng, Lei Yu, and Frederic Magoulès. 2011. SimMapReduce: A simulator for modeling MapReduce framework. In Proceedings of the 5th FTRA International Conference on Multimedia and Ubiquitous Engineering (MUE’11). IEEE, 277--282.Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. TheNS2. 2011. The Network Simulator - ns-2. Retrieved from https://www.isi.edu/nsnam/ns/.Google ScholarGoogle Scholar
  108. Michael Trotter, Guyue Liu, and Timothy Wood. 2017. Into the storm: Descrying optimal configurations using genetic algorithms and Bayesian optimization. In Proceedings of the IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W’17). IEEE Computer Society, 175--180.Google ScholarGoogle ScholarCross RefCross Ref
  109. Michael Trotter, Timothy Wood, and Jinho Hwang. 2019. Forecasting a storm: Divining optimal configurations using genetic algorithms and supervised learning. In Proceedings of the International Conference on Autonomic Computing (ICAC’19). IEEE, 136--146.Google ScholarGoogle ScholarCross RefCross Ref
  110. Luis M. Vaquero and Félix Cuadrado. 2018. Auto-tuning distributed stream processing systems using reinforcement learning. CoRR abs/1809.05495 (2018).Google ScholarGoogle Scholar
  111. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, et al. 2013. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th ACM Symposium on Cloud Computing (SoCC’13). ACM, 5.Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, and Ion Stoica. 2017. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’17). ACM, 374--389.Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Shivaram Venkataraman, Zongheng Yang, Michael J. Franklin, Benjamin Recht, and Ion Stoica. 2016. Ernest: Efficient performance prediction for large-scale advanced analytics. In Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI’16). USENIX Association, 363--378.Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. 2011. ARIA: Automatic resource inference and allocation for MapReduce environments. In Proceedings of the 8th International Conference on Autonomic Computing (ICAC’11). ACM, 235--244. DOI:https://doi.org/10.1145/1998582.1998637Google ScholarGoogle Scholar
  115. Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. 2011. Play it again, SimMR! In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’11). IEEE, 253--261.Google ScholarGoogle Scholar
  116. Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. 2011. Resource provisioning framework for MapReduce jobs with performance goals. In Proceedings of the ACM/IFIP/USENIX 12th International Middleware Conference. Springer, 165--186.Google ScholarGoogle Scholar
  117. Chunkai Wang, Xiaofeng Meng, Qi Guo, Zujian Weng, and Chen Yang. 2017. Automating characterization deployment in distributed data stream management systems. IEEE Trans. Knowl. Data Eng. 29, 12 (2017), 2669--2681.Google ScholarGoogle ScholarCross RefCross Ref
  118. Guanying Wang, Ali R. Butt, Prashant Pandey, and Karan Gupta. 2009. A simulation approach to evaluating design decisions in MapReduce setups. In Proceedings of the IEEE/ACM International Symposium on Modeling, Analysis, and Simulation on Computer and Telecommunication Systems (MASCOTS’09). IEEE, 1--11.Google ScholarGoogle Scholar
  119. Guolu Wang, Jungang Xu, and Ben He. 2016. A novel method for tuning configuration parameters of Spark based on machine learning. In Proceedings of the International Conference on High Performance Computing and Communications (HPCC’16). IEEE, 586--593.Google ScholarGoogle ScholarCross RefCross Ref
  120. Kewen Wang and Mohammad Maifi Hasan Khan. 2015. Performance prediction for Apache Spark platform. In Proceedings of the International Conference on High Performance Computing and Communications (HPCC’15). IEEE, 166--173.Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. Kewen Wang, Xuelian Lin, and Wenzhong Tang. 2012. Predator—an experience guided configuration optimizer for Hadoop MapReduce. In Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CloudCom’12). IEEE, 419--426.Google ScholarGoogle ScholarDigital LibraryDigital Library
  122. Thomas Weise. 2009. Global Optimization Algorithms—Theory and Application. Self-published. http://www.it-weise.de/projects/book.pdf.Google ScholarGoogle Scholar
  123. Tom White. 2012. Hadoop: The Definitive Guide. O’Reilly Media, Inc.Google ScholarGoogle ScholarDigital LibraryDigital Library
  124. Dili Wu and Aniruddha Gokhale. 2013. A self-tuning system based on application profiling and performance analysis for optimizing Hadoop MapReduce cluster configuration. In Proceedings of the 20th International Conference on High Performance Computing (HiPC’13). IEEE, 89--98.Google ScholarGoogle ScholarCross RefCross Ref
  125. Jielong Xu, Zhenhua Chen, Jian Tang, and Sen Su. 2014. T-Storm: Traffic-aware online scheduling in Storm. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS’14). IEEE, 535--544.Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Hailong Yang, Zhongzhi Luan, Wenjun Li, Depei Qian, and Gang Guan. 2012. Statistics-based workload modeling for MapReduce. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops 8 PhD Forum. IEEE, 2043--2051.Google ScholarGoogle ScholarDigital LibraryDigital Library
  127. Tao Ye and Shivkumar Kalyanaraman. 2003. A recursive random search algorithm for large-scale network parameter configuration. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems. ACM, 196--205.
  128. Nezih Yigitbasi, Theodore L. Willke, Guangdeng Liao, and Dick Epema. 2013. Towards machine learning–based auto-tuning of MapReduce. In Proceedings of the IEEE/ACM International Symposium on Modeling, Analysis, and Simulation On Computer and Telecommunication Systems (MASCOTS’13). IEEE, 11--20.
  129. Zhibin Yu, Zhendong Bei, and Xuehai Qian. 2018. Datasize-aware high dimensional configurations auto-tuning of in-memory cluster computing. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’18). ACM, 564--577.
  130. Nikos Zacheilas, Vana Kalogeraki, Nikolaos Zygouras, Nikolaos Panagiotou, and Dimitrios Gunopulos. 2015. Elastic complex event processing exploiting prediction. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). IEEE Computer Society, 213--222.
  131. Nikos Zacheilas, Stathis Maroulis, and Vana Kalogeraki. 2017. Dione: Profiling Spark applications exploiting graph similarity. In Proceedings of the IEEE International Conference on Big Data (BigData’17). IEEE, 389--394.
  132. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, 2--14. Retrieved from http://dl.acm.org/citation.cfm?id=2228298.2228301.
  133. Zhuoyao Zhang, Ludmila Cherkasova, and Boon Thau Loo. 2014. Parameterizable benchmarking framework for designing a MapReduce performance model. Concurr. Comput. Pract. Exper. 26, 12 (2014), 2005--2026.
  134. Yuqing Zhu, Jianxun Liu, Mengying Guo, Yungang Bao, Wenlong Ma, et al. 2017. BestConfig: Tapping the performance potential of systems via automatic configuration tuning. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, 338--350.

      • Published in

        ACM Computing Surveys, Volume 53, Issue 2
        March 2021
        848 pages
        ISSN:0360-0300
        EISSN:1557-7341
        DOI:10.1145/3388460

        Copyright © 2020 Owner/Author

        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 April 2020
        • Accepted: 1 January 2020
        • Revised: 1 December 2019
        • Received: 1 March 2019

        Qualifiers

        • survey
        • Research
        • Refereed
