Skip to main content
Log in

Clustering-based data placement in cloud computing: a predictive approach

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

Nowadays, cloud computing environments have become a natural choice to host and process a huge volume of data. The combination of cloud computing and big data frameworks is an effective way to run data-intensive applications and tasks. Also, an optimal arrangement of data partitions can improve the tasks executions, which is not the case in most big data frameworks. For example, the default distribution of data partitions in Hadoop-based clouds causes several problems, which are mainly related to the load balancing and the resource usage. In addition, most existing data placement solutions are static and lack precision in the placement of data partitions. To overcome these issues, we propose a data placement approach based on the prediction of the future resources usage. We exploit Kernel Density Estimation (KDE) and Fuzzy FCA techniques to, first, forecast the workers’ and tasks’ future resource consumption and, second, cluster data partitions and intensive jobs according to the estimated resource usage. Fuzzy FCA is also used to exclude partitions and jobs that require less resources, which will reduce the needless migrations. To allow monitoring and predicting the workers’ states and the data partitions’ consumption, we modeled the big data cluster as an autonomic service-based system. The obtained results have shown that our solution outperformed existing approaches in terms of migrations rate and resource consumption.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

Notes

  1. http://www.lsst.org/.

  2. The Java source code of our approach is available following this link: https://goo.gl/HNWkkC.

  3. https://weka.sourceforge.io/doc.dev/weka/estimators/KernelEstimator.html.

  4. http://www.iro.umontreal.ca/~galicia/.

  5. https://goo.gl/Umsrgr.

  6. https://hadoop.apache.org/docs/current/hadoop-resourceestimator/ResourceEstimator.html.

References

  1. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise of “big data’’ on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)

    Article  Google Scholar 

  2. Kaur, A., Gupta, P., Singh, M., Nayyar, A.: Data placement in era of cloud computing: a survey, taxonomy and open research issues. Scalable Comput. Pract. Exp. 20(2), 377–398 (2019)

    Article  Google Scholar 

  3. Anjos, J.C., Carrera, I., Kolberg, W., Tibola, A.L., Arantes, L.B., Geyer, C.R.: Mra++: scheduling and data placement on MapReduce for heterogeneous environments. Future Gen. Comput. Syst. 42, 22–35 (2015)

    Article  Google Scholar 

  4. Tang, Z., Zhang, X., Li, K., Li, K.: An intermediate data placement algorithm for load balancing in spark computing environment. Future Gen. Comput. Syst. 78, 287–301 (2018)

    Article  Google Scholar 

  5. Liu, G., Zhu, X., Wang, J., Guo, D., Bao, W., Guo, H.: SP-Partitioner: a novel partition method to handle intermediate data skew in spark streaming. Future Gen. Comput. Syst. (2017). https://doi.org/10.1016/j.future.2017.07.014

    Article  Google Scholar 

  6. Shi, Y., Dong, M., Zhang, W., Liu, L., Zheng, Y., Cui, L., Zhang, J.: AdaptScale: an adaptive data scaling controller for improving the multiple performance requirements in clouds. Future Gen. Comput. Syst. 105, 814–823 (2020)

    Article  Google Scholar 

  7. Li, X., Zhang, L., Wu, Y., Liu, X., Zhu, E., Yi, H., Wang, F., Zhang, C., Yang, Y.: A novel workflow-level data placement strategy for data-sharing scientific cloud workflows. IEEE Trans. Serv. Comput. 12(70), 370–383 (2019)

    Article  Google Scholar 

  8. Wu, J.-X., Zhang, C.-S., Zhang, B., Wang, P.: A new data-grouping-aware dynamic data placement method that take into account jobs execute frequency for Hadoop. Microprocess. Microsyst. 47, 161–169 (2016)

    Article  Google Scholar 

  9. Kumar, S., Tiwari, R.: An efficient content placement scheme based on normalized node degree in content centric networking. Clust. Comput. 24(4), 1–15 (2020)

    Google Scholar 

  10. Hosseinzadeh, M., Masdari, M., Rahmani, A.M., Mohammadi, M., Aldalwie, A.H.M., Majeed, M.K., Karim, S.H.T.: Improved butterfly optimization algorithm for data placement and scheduling in edge computing environments. J. Grid Comput. 19(2), 1–27 (2021)

    Article  Google Scholar 

  11. Abad, C.L., Lu, Y., Campbell, R.H.: Dare: Adaptive data replication for efficient cluster scheduling. In: 2011 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, pp. 159–168 (2011)

  12. Jin, H., Yang, X., Sun, X.-H., Raicu, I.: Adapt: Availability-aware MapReduce data placement for non-dedicated distributed computing. In: 2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS), IEEE, pp. 516–525 (2012)

  13. Kristan, M., Leonardis, A.: Online discriminative kernel density estimator with Gaussian kernels. IEEE Trans. Cybern. 44(3), 355–365 (2014)

    Article  Google Scholar 

  14. Poelmans, J., Ignatov, D.I., Kuznetsov, S.O., Dedene, G.: Formal concept analysis in knowledge processing: a survey on applications. Expert Syst. Appl. 40(16), 6538–6560 (2013)

    Article  Google Scholar 

  15. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: CoHadoop: flexible data placement and its exploitation in Hadoop. Proc. VLDB Endow. 4(9), 575–585 (2011)

    Article  Google Scholar 

  16. Xu, M., Alamro, S., Lan, T., Subramaniam, S.: CRED: cloud right-sizing with execution deadlines and data locality. IEEE Trans. Parallel Distrib. Syst. 28(12), 3389–3400 (2017)

    Article  Google Scholar 

  17. Guo, Z., Fox, G., Zhou, M.: Investigation of data locality in MapReduce. In: 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE, pp. 419–426 (2012)

  18. Shakarami, A., Ghobaei-Arani, M., Shahidinejad, A., Masdari, M., Shakarami, H.: Data replication schemes in cloud computing: a survey. Clust. Comput. (2021). https://doi.org/10.1007/s10586-021-03283-7

    Article  Google Scholar 

  19. Kchaou, H., Kechaou, Z., Alimi, A.M.: Interval type-2 fuzzy c-means data placement optimization in scientific cloud workflow applications. Simul. Model. Pract. Theory 107, 102217 (2021)

    Article  Google Scholar 

  20. Khalajzadeh, H., Yuan, D., Zhou, B.B., Grundy, J., Yang, Y.: Cost effective dynamic data placement for efficient access of social networks. J. Parallel Distrib. Comput. 141, 82–98 (2020)

    Article  Google Scholar 

  21. Fan, Y., Wang, C., Zhang, B., Gu, S., Wu, W., Du, D.: Data placement in distributed data centers for improved SLA and network cost. J. Parallel Distrib. Comput. 146, 189–200 (2020)

    Article  Google Scholar 

  22. Xu, X., Fu, S., Li, W., Dai, F., Gao, H., Chang, V.: Multi-objective data placement for workflow management in cloud infrastructure using NSGA-II. IEEE Trans. Emerg. Top. Comput. Intell. 4(5), 605–615 (2020)

    Article  Google Scholar 

  23. Chen, W., Liu, B., Paik, I., Li, Z., Zheng, Z.: QoS-aware data placement for MapReduce applications in geo-distributed data centers. IEEE Trans. Eng. Manage. 68(1), 120–136 (2020)

    Article  Google Scholar 

  24. Khan, A.A., Goens, A., Hameed, F., Castrillon, J.: Generalized data placement strategies for racetrack memories. In: Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE 2020, pp. 1502–1507 (2020)

  25. Li, C., Bai, J., Tang, J.: Joint optimization of data placement and scheduling for improving user experience in edge computing. J. Parallel Distrib. Comput. 125, 93–105 (2019)

    Article  Google Scholar 

  26. Liu, K., Peng, J., Wang, J., Yu, B., Liao, Z., Huang, Z., Pan, J.: A learning-based data placement framework for low latency in data center networks. IEEE Trans. Cloud Comput. (2019). https://doi.org/10.1109/TCC.2019.2940953

    Article  Google Scholar 

  27. Lin, B., Zhu, F., Zhang, J., Chen, J., Chen, X., Xiong, N.N., Mauri, J.L.: A time-driven data placement strategy for a scientific workflow combining edge computing and cloud computing. IEEE Trans. Ind. Inf. 15(7), 4254–4265 (2019)

    Article  Google Scholar 

  28. Xu, X., Fu, S., Qi, L., Zhang, X., Liu, Q., He, Q., Li, S.: An IoT-oriented data placement method with privacy preservation in cloud environment. J. Netw. Comput. Appl. 124, 148–157 (2018)

    Article  Google Scholar 

  29. Naas, M.I., Boukhobza, J., Parvedy, P.R., Lemarchand, L.: An extension to iFogSim to enable the design of data placement strategies. In: 2018 IEEE 2nd International Conference on Fog and Edge Computing (ICFEC), IEEE, pp. 1–8 (2018)

  30. Wang, S., Wang, J., Chung, F.-L.: Kernel density estimation, kernel methods, and fast learning in large data sets. IEEE Trans. Cybern. 44(1), 1–20 (2014)

    Article  Google Scholar 

  31. Borthakur, D., et al.: HDFS architecture guide. Hadoop Apache Project 53(1–13), 2 (2008)

    Google Scholar 

  32. Tallada, P., Carretero, J., Casals, J., Acosta-Silva, C., Serrano, S., Caubet, M., Castander, F.J., César, E., Crocce, M., Delfino, M., et al.: CosmoHub: interactive exploration and distribution of astronomical data on Hadoop. Astron. Comput. 32, 100391 (2020)

    Article  Google Scholar 

  33. Brazier, F.M., Kephart, J.O., Parunak, H.V.D., Huhns, M.N.: Agents and service-oriented computing for autonomic computing: a research agenda. IEEE Internet Comput. 13(3), 82–87 (2009)

    Article  Google Scholar 

  34. Inoubli, W., Aridhi, S., Mezni, H., Maddouri, M., Nguifo, E.M.: An experimental survey on big data frameworks. Future Gen. Comput. Syst. 86, 546–564 (2018)

    Article  Google Scholar 

  35. Farahnakian, F., Liljeberg, P., Plosila, J.: Lircup: Linear regression based cpu usage prediction algorithm for live migration of virtual machines in data centers. In: 2013 39th Euromicro Conference on Software Engineering and Advanced Applications, IEEE, pp. 357–364 (2013)

  36. Jyothi, S. A., Curino, C., Menache, I., Narayanamurthy, S.M., Tumanov, A., Yaniv, J., Mavlyutov, R., Goiri, I., Krishnan, S., Kulkarni, J., et al.: Morpheus: Towards automated SLOS for enterprise clusters. In: OSDI, pp. 117–134 (2016)

  37. Fu, X., Gao, Y., Luo, B., Du, X., Guizani, M.: Security threats to Hadoop: data leakage attacks and investigation. IEEE Netw. 31(2), 67–71 (2017)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mokhtar Sellami.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sellami, M., Mezni, H., Hacid, M.S. et al. Clustering-based data placement in cloud computing: a predictive approach. Cluster Comput 24, 3311–3336 (2021). https://doi.org/10.1007/s10586-021-03332-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-021-03332-1

Keywords

Navigation