
A Task-Aware Fine-Grained Storage Selection Mechanism for In-Memory Big Data Computing Frameworks

Published in: International Journal of Parallel Programming

Abstract

In-memory big data computing, widely used in hot areas such as deep learning and artificial intelligence, can meet the demands of ultra-low-latency services and real-time data analysis. However, existing in-memory computing frameworks usually use memory aggressively: memory space is quickly exhausted, leading to severe performance degradation or even task failure. Meanwhile, the growing volumes of raw and intermediate data impose huge memory demands, which further exacerbate the memory shortage. To relieve memory pressure, these frameworks provide various storage scheme options that determine where and how data is cached. But their storage scheme selection mechanisms are simple and insufficient, and are usually set manually by users. Moreover, such coarse-grained storage mechanisms cannot match the memory access pattern of each computing unit, which works on only part of the data. In this paper, we propose a novel task-aware, fine-grained storage scheme auto-selection mechanism. It automatically determines the storage scheme for caching each data block, the smallest unit during computation. The caching decision considers future tasks, real-time resource utilization, and storage costs, including block creation, I/O, and serialization costs under each storage scenario. Experiments show that, compared with the default storage setting, our mechanism offers substantial performance improvement, by as much as 78% in memory-constrained circumstances.
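
To make the selection policy concrete, below is a minimal, illustrative sketch in Scala (a natural choice for Spark-like in-memory frameworks) of how a per-block, cost-based storage decision could be structured. The types, cost fields, and thresholds are assumptions introduced for this sketch only; they are not the paper's actual model, parameter names, or code.

    // Illustrative sketch only: a hypothetical per-block storage-scheme selector.
    // The cost model and all names below are assumptions, not the paper's algorithm.
    object StorageScheme extends Enumeration {
      val MemoryDeserialized, MemorySerialized, DiskOnly, NoCache = Value
    }

    // Per-block profile; in practice these estimates would come from runtime statistics.
    final case class BlockProfile(
      sizeBytes: Long,        // estimated deserialized (in-memory) size of the block
      serializedBytes: Long,  // estimated size once serialized
      futureRefs: Int,        // number of upcoming tasks expected to read this block
      creationCost: Double,   // estimated cost to (re)compute the block
      ioCost: Double,         // estimated cost to write the block to disk and read it back
      serCost: Double         // estimated serialization + deserialization cost
    )

    object BlockStorageSelector {
      // Choose a storage scheme for one block given the executor's currently free memory.
      def select(b: BlockProfile, freeMemoryBytes: Long): StorageScheme.Value = {
        if (b.futureRefs == 0) return StorageScheme.NoCache  // never reused: do not cache

        // Aggregate cost of each option over all expected future accesses.
        val recomputeCost = b.creationCost * b.futureRefs    // recompute on every access
        val diskCost      = b.ioCost * b.futureRefs          // hit disk on every access
        val serPenalty    = b.serCost * b.futureRefs         // (de)serialize on every access

        if (b.sizeBytes <= freeMemoryBytes) {
          StorageScheme.MemoryDeserialized                   // enough memory: cache as plain objects
        } else if (b.serializedBytes <= freeMemoryBytes && recomputeCost > serPenalty) {
          StorageScheme.MemorySerialized                     // trade CPU (serialization) for space
        } else if (recomputeCost > diskCost) {
          StorageScheme.DiskOnly                             // spilling is cheaper than recomputing
        } else {
          StorageScheme.NoCache                              // recomputation is the cheapest option
        }
      }
    }

In such a sketch, freeMemoryBytes would be read from the framework's real-time resource monitor and futureRefs derived from the remaining task DAG, which is where the task-aware part of the mechanism comes in.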




Acknowledgements

This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, The New Generation of Artificial Intelligence In Guangdong Grant No. 2018B0107003, Doctoral Startup Program of Guangdong Natural Science Foundation Grant No. 2018A030310408, The Key Science and Technology Program of Henan Province Grant No. 202102210152, and National Science Foundation of China under Grant Nos. 61370062, 61866038.

Author information

Correspondence to Jie Tang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, B., Tang, J., Zhang, R. et al. A Task-Aware Fine-Grained Storage Selection Mechanism for In-Memory Big Data Computing Frameworks. Int J Parallel Prog 49, 25–50 (2021). https://doi.org/10.1007/s10766-020-00662-2

