Abstract
In-memory big data computing, widely used in hot areas such as deep learning and artificial intelligence, meets the demands of ultra-low-latency services and real-time data analysis. However, existing in-memory computing frameworks usually use memory aggressively: memory space is quickly exhausted, leading to severe performance degradation or even task failure. Meanwhile, the growing volumes of raw and intermediate data impose huge memory demands, further exacerbating the memory shortage. To relieve the pressure on memory, these frameworks provide various storage scheme options for caching data, which determine where and how data is cached. However, their storage scheme selection mechanisms are simple and insufficient, typically requiring manual configuration by users. Moreover, such coarse-grained data storage mechanisms cannot match the memory access patterns of individual computing units, each of which operates on only part of the data. In this paper, we propose a novel task-aware, fine-grained storage scheme auto-selection mechanism. It automatically determines the storage scheme for caching each data block, the smallest unit during computing. The caching decision takes into account future tasks, real-time resource utilization, and the storage costs under each scheme, including block creation costs, I/O costs, and serialization costs. Experiments show that, compared with the default storage setting, our mechanism offers substantial performance improvement: as much as 78% in memory-constrained circumstances.
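The per-block decision described above can be sketched as a simple cost comparison. The following is an illustrative sketch only: the `BlockStats` class, the cost constants, and the candidate scheme names (modeled after Spark storage levels) are assumptions for exposition, not the paper's actual cost model.

```python
from dataclasses import dataclass

@dataclass
class BlockStats:
    size_mb: float    # deserialized in-memory size of the block
    ser_ratio: float  # serialized size / deserialized size
    future_refs: int  # future tasks that will read this block

# Hypothetical per-MB cost weights (illustrative constants, not measured values).
SER_COST = 0.5      # CPU cost to (de)serialize one MB
DISK_IO_COST = 2.0  # cost to write and later read one MB from disk

def choose_scheme(block: BlockStats, free_mem_mb: float) -> str:
    """Pick the cheapest storage scheme that fits the current free memory."""
    candidates = {}
    if block.size_mb <= free_mem_mb:
        candidates["MEMORY_ONLY"] = 0.0  # cached deserialized: no per-access cost
    if block.size_mb * block.ser_ratio <= free_mem_mb:
        # cached serialized: pay deserialization on every future access
        candidates["MEMORY_ONLY_SER"] = SER_COST * block.size_mb * block.future_refs
    # disk is always available: serialize once, then read back per access
    candidates["DISK_ONLY"] = (SER_COST + DISK_IO_COST * block.future_refs) * block.size_mb
    return min(candidates, key=candidates.get)
```

For example, a 100 MB block with a 0.4 serialization ratio and three future readers would be kept deserialized when 200 MB of memory is free, kept serialized when only 50 MB is free, and spilled to disk when even its serialized form does not fit.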
Acknowledgements
This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, The New Generation of Artificial Intelligence In Guangdong Grant No. 2018B0107003, Doctoral Startup Program of Guangdong Natural Science Foundation Grant No. 2018A030310408, The Key Science and Technology Program of Henan Province Grant No. 202102210152, and National Science Foundation of China under Grant Nos. 61370062, 61866038.
Cite this article
Wang, B., Tang, J., Zhang, R. et al. A Task-Aware Fine-Grained Storage Selection Mechanism for In-Memory Big Data Computing Frameworks. Int J Parallel Prog 49, 25–50 (2021). https://doi.org/10.1007/s10766-020-00662-2