
Speeding up AutoTuning of the Memory Management Options in Data Analytics

Distributed and Parallel Databases

Abstract

Many of today's approaches to building autonomous (or self-driving) data processing systems rely on the “black-box” algorithm of Bayesian Optimization (BO), both for its wide applicability and for the theoretical guarantees it provides on the quality of results. The black-box approach, however, can be time- and labor-intensive, or can get stuck in a local minimum. We study the important problem of auto-tuning memory allocation for applications running on modern distributed data processing systems. We develop a simple “white-box” model that can quickly separate good configurations from bad ones. To combine the benefits of the two tuning approaches, we build a framework called Guided Bayesian Optimization (GBO), which uses the white-box model as a guide during the BO exploration process. An evaluation on Apache Spark using industry-standard benchmark applications shows that GBO consistently provides performance speedups across the application workload, with savings close to 2x.
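The abstract does not say how the white-box model guides the search, so the following is only a minimal sketch of one possible reading: a cheap analytical score prunes the candidate configurations, and a small Gaussian-process surrogate with a lower-confidence-bound acquisition picks which of the survivors to run next. Everything here is a hypothetical stand-in: `white_box_score`, `run_application`, the single "memory fraction" knob, and all constants are illustrative assumptions, not the paper's model, objective, or parameters.

```python
import math
import random

def white_box_score(x):
    """Hypothetical white-box model: a cheap, rough cost estimate (lower is better)."""
    return (x - 0.6) ** 2

def run_application(x):
    """Stand-in for an expensive benchmark run; synthetic objective, not real data."""
    return (x - 0.55) ** 2 + 0.05 * math.sin(20 * x)

def rbf(a, b, amp=0.01, length=0.15):
    """Squared-exponential kernel with assumed amplitude and length scale."""
    return amp * math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a GP fitted to (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def guided_bo(n_iters=8, keep_fraction=0.5, kappa=2.0, seed=0):
    rng = random.Random(seed)
    grid = [i / 50 for i in range(1, 50)]  # candidate memory fractions
    # Guide step (assumption): keep only the candidates the white-box model rates best.
    ranked = sorted(grid, key=white_box_score)
    candidates = ranked[: max(2, int(len(ranked) * keep_fraction))]
    xs = rng.sample(candidates, 2)  # two initial expensive runs
    ys = [run_application(x) for x in xs]
    for _ in range(n_iters):
        remaining = [c for c in candidates if c not in xs]
        if not remaining:
            break

        def lcb(c):
            # Lower confidence bound: favors low predicted cost or high uncertainty.
            m, s = gp_posterior(xs, ys, c)
            return m - kappa * s

        nxt = min(remaining, key=lcb)
        xs.append(nxt)
        ys.append(run_application(nxt))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)

if __name__ == "__main__":
    best_x, best_y, n_runs = guided_bo()
    print(f"best fraction {best_x:.2f}, cost {best_y:.4f}, after {n_runs} runs")
```

Pruning before the surrogate is fitted is the simplest way to let a cheap model steer an expensive search; other designs, such as feeding the white-box estimate to the surrogate as an extra input, would fit the same description.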



Acknowledgements

This article was supported by the National Science Foundation (Grant No. CNS-1423128). The author would like to thank Dr. Shivnath Babu for his expert feedback on the system design.

Author information

Corresponding author

Correspondence to Mayuresh Kunjir.


About this article


Cite this article

Kunjir, M. Speeding up AutoTuning of the Memory Management Options in Data Analytics. Distrib Parallel Databases 38, 841–863 (2020). https://doi.org/10.1007/s10619-019-07281-y


  • DOI: https://doi.org/10.1007/s10619-019-07281-y
