Abstract
In-memory big data computing, widely used in hot areas such as deep learning and artificial intelligence, meets the demands of ultra-low-latency services and real-time data analysis. However, existing in-memory computing frameworks usually use memory aggressively: memory space is quickly exhausted, leading to severe performance degradation or even task failure. Meanwhile, the growing volumes of raw and intermediate data impose huge memory demands, further exacerbating the memory shortage. To relieve the pressure on memory, these frameworks provide various storage scheme options for caching data, which determine where and how data is cached. However, their storage scheme selection mechanisms are simple and insufficient, typically requiring manual configuration by users. Moreover, such coarse-grained data storage mechanisms cannot match the memory access patterns of individual computing units, each of which operates on only part of the data. In this paper, we propose a novel task-aware, fine-grained storage scheme auto-selection mechanism. It automatically determines the storage scheme for caching each data block, the smallest unit during computing. The caching decision takes into account future tasks, real-time resource utilization, and the storage costs under each scheme, including block creation costs, I/O costs, and serialization costs. Experiments show that, compared with the default storage setting, our mechanism offers substantial performance improvement: as much as 78% in memory-constrained circumstances.
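The per-block decision described above can be sketched as a simple cost comparison. The following is an illustrative sketch only: the `BlockStats` class, the cost constants, and the candidate scheme names (modeled after Spark storage levels) are assumptions for exposition, not the paper's actual cost model.

```python
from dataclasses import dataclass

@dataclass
class BlockStats:
    size_mb: float    # deserialized in-memory size of the block
    ser_ratio: float  # serialized size / deserialized size
    future_refs: int  # future tasks that will read this block

# Hypothetical per-MB cost weights (illustrative constants, not measured values).
SER_COST = 0.5      # CPU cost to (de)serialize one MB
DISK_IO_COST = 2.0  # cost to write and later read one MB from disk

def choose_scheme(block: BlockStats, free_mem_mb: float) -> str:
    """Pick the cheapest storage scheme that fits the current free memory."""
    candidates = {}
    if block.size_mb <= free_mem_mb:
        candidates["MEMORY_ONLY"] = 0.0  # cached deserialized: no per-access cost
    if block.size_mb * block.ser_ratio <= free_mem_mb:
        # cached serialized: pay deserialization on every future access
        candidates["MEMORY_ONLY_SER"] = SER_COST * block.size_mb * block.future_refs
    # disk is always available: serialize once, then read back per access
    candidates["DISK_ONLY"] = (SER_COST + DISK_IO_COST * block.future_refs) * block.size_mb
    return min(candidates, key=candidates.get)
```

For example, a 100 MB block with a 0.4 serialization ratio and three future readers would be kept deserialized when 200 MB of memory is free, kept serialized when only 50 MB is free, and spilled to disk when even its serialized form does not fit.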
Acknowledgements
This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, The New Generation of Artificial Intelligence In Guangdong Grant No. 2018B0107003, Doctoral Startup Program of Guangdong Natural Science Foundation Grant No. 2018A030310408, The Key Science and Technology Program of Henan Province Grant No. 202102210152, and National Science Foundation of China under Grant Nos. 61370062, 61866038.
Cite this article
Wang, B., Tang, J., Zhang, R. et al. A Task-Aware Fine-Grained Storage Selection Mechanism for In-Memory Big Data Computing Frameworks. Int J Parallel Prog 49, 25–50 (2021). https://doi.org/10.1007/s10766-020-00662-2