Clustering-based data placement in cloud computing: a predictive approach

Sellami, Mokhtar; Mezni, Haithem; Hacid, Mohand Said; Gammoudi, Mohamed Moshen

doi:10.1007/s10586-021-03332-1

Published: 16 June 2021

Volume 24, pages 3311–3336, (2021)

Cluster Computing

Mokhtar Sellami¹, Haithem Mezni²,³ (ORCID: orcid.org/0000-0001-9932-8433), Mohand Said Hacid⁴ & Mohamed Moshen Gammoudi⁵


Abstract

Cloud computing environments have become a natural choice for hosting and processing huge volumes of data. Combining cloud computing with big data frameworks is an effective way to run data-intensive applications and tasks. An optimal arrangement of data partitions can also improve task execution, but most big data frameworks do not provide one. For example, the default distribution of data partitions in Hadoop-based clouds causes several problems, mainly related to load balancing and resource usage. In addition, most existing data placement solutions are static and lack precision in the placement of data partitions. To overcome these issues, we propose a data placement approach based on predicting future resource usage. We exploit Kernel Density Estimation (KDE) and Fuzzy Formal Concept Analysis (Fuzzy FCA) to, first, forecast the future resource consumption of workers and tasks and, second, cluster data partitions and intensive jobs according to the estimated resource usage. Fuzzy FCA is also used to exclude partitions and jobs that require fewer resources, which reduces needless migrations. To allow monitoring and prediction of the workers’ states and the data partitions’ consumption, we modeled the big data cluster as an autonomic service-based system. The obtained results show that our solution outperforms existing approaches in terms of migration rate and resource consumption.
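The forecasting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their Java code is linked in the Notes); it only shows the general idea of a Gaussian kernel density estimate over past usage observations. The function names, the bandwidth of 0.05, the 0.5 threshold, and the sample data are all hypothetical, and the plain threshold filter is only a crude stand-in for the Fuzzy FCA step, which it does not reproduce.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from past observations (Gaussian kernel)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def most_likely_usage(samples, bandwidth=0.05, steps=100):
    """Forecast usage as the grid point in [0, 1] where the density peaks."""
    density = gaussian_kde(samples, bandwidth)
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=density)


def select_intensive(history, threshold=0.5):
    """Keep only items whose forecast usage reaches the (assumed) threshold."""
    forecasts = {name: most_likely_usage(obs) for name, obs in history.items()}
    return {name: f for name, f in forecasts.items() if f >= threshold}


# Hypothetical usage ratios (0 = idle, 1 = saturated) for three partitions.
history = {
    "partition_a": [0.82, 0.78, 0.85, 0.80],
    "partition_b": [0.10, 0.15, 0.12, 0.11],
    "partition_c": [0.55, 0.60, 0.58, 0.62],
}
intensive = select_intensive(history)
```

Under these assumed values, only the partitions with a high forecast remain as candidates for clustering and migration, which mirrors the stated goal of avoiding needless migrations of lightly used partitions.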


Notes

http://www.lsst.org/.
The Java source code of our approach is available following this link: https://goo.gl/HNWkkC.
https://weka.sourceforge.io/doc.dev/weka/estimators/KernelEstimator.html.
http://www.iro.umontreal.ca/~galicia/.
https://goo.gl/Umsrgr.
https://hadoop.apache.org/docs/current/hadoop-resourceestimator/ResourceEstimator.html.


Author information

Authors and Affiliations

University of Jendouba, Jendouba, Tunisia
Mokhtar Sellami
Taibah University, Madinah, Saudi Arabia
Haithem Mezni
SMART Lab, ISG de Tunis, Tunis, Tunisia
Haithem Mezni
Univ. Lyon, University Claude Bernard Lyon 1, LIRIS, Lyon, France
Mohand Said Hacid
Higher Institute of Multimedia Arts of Manouba, RIADI, Manouba, Tunisia
Mohamed Moshen Gammoudi

Corresponding author

Correspondence to Mokhtar Sellami.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sellami, M., Mezni, H., Hacid, M.S. et al. Clustering-based data placement in cloud computing: a predictive approach. Cluster Comput 24, 3311–3336 (2021). https://doi.org/10.1007/s10586-021-03332-1

Received: 28 May 2020
Revised: 14 May 2021
Accepted: 02 June 2021
Published: 16 June 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s10586-021-03332-1
