Using data mining techniques to improve replica management in cloud environment

  • Methodologies and Application
Soft Computing

Abstract

Effective data management is a crucial problem in distributed systems such as data grids and clouds. It can be achieved by replicating files judiciously, which reduces data access time and improves data availability, reliability, and system load balancing. Determining a reasonable number of replicas and appropriate locations for them is an essential decision in cloud computing. In this paper, a new dynamic replication strategy called Data Mining-based Data Replication (DMDR) is proposed, which determines the correlation among accessed data files from the file access history. We focus particularly on how knowledge extracted by maximal frequent correlated pattern mining improves data replication: files with high dependency can be grouped in the same replica set. Through the DMDR strategy, replicas can be stored in suitable locations, chosen according to the centrality factor, so that access latency is reduced. In addition, because the storage space of each node is finite, replicas that are useful for future tasks can be wastefully deleted and replaced with less beneficial ones. Simulation results obtained with CloudSim indicate that the DMDR strategy has a relative advantage over current methods in effective network usage, average response time, and hit ratio. It can be concluded from this investigation that data mining techniques are effective and helpful in discovering users' future access behavior in a cloud environment.
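
The two ideas named in the abstract, grouping correlated files and choosing a central node, can be illustrated with a small Python sketch. This is not the DMDR algorithm itself, whose details are not given here. It assumes all-confidence as the correlation measure, brute-force enumeration instead of an efficient miner, and closeness centrality on an unweighted graph as a stand-in for the centrality factor. All names, thresholds, the access sessions, and the topology are hypothetical.

```python
from collections import deque
from itertools import combinations


def all_confidence(itemset, sessions):
    """support(itemset) / max single-file support (assumed correlation measure)."""
    n_all = sum(1 for s in sessions if itemset <= s)
    n_max = max(sum(1 for s in sessions if f in s) for f in itemset)
    return n_all / n_max if n_max else 0.0


def maximal_correlated_sets(sessions, min_support, min_all_conf):
    """Brute-force search for maximal frequent correlated file sets (toy inputs only)."""
    files = sorted(set().union(*sessions))
    found = []
    for k in range(2, len(files) + 1):
        for combo in combinations(files, k):
            candidate = frozenset(combo)
            support = sum(1 for s in sessions if candidate <= s) / len(sessions)
            if support >= min_support and all_confidence(candidate, sessions) >= min_all_conf:
                found.append(candidate)
    # A set is maximal if no other qualifying set strictly contains it.
    return [s for s in found if not any(s < t for t in found)]


def closeness(graph, node):
    """Closeness centrality via BFS on an unweighted, connected graph."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical access history: each session is the set of files one task read.
sessions = [
    frozenset({"f1", "f2", "f3"}),
    frozenset({"f1", "f2", "f3"}),
    frozenset({"f1", "f2"}),
    frozenset({"f4", "f5"}),
    frozenset({"f4", "f5"}),
    frozenset({"f3", "f6"}),
]
groups = maximal_correlated_sets(sessions, min_support=0.3, min_all_conf=0.6)

# Hypothetical star-like topology; node "B" is adjacent to every other node.
topology = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
best_node = max(topology, key=lambda n: closeness(topology, n))
```

With these toy inputs, the files f1, f2, f3 form one candidate replica set and f4, f5 another, and node B is the most central candidate location. A real deployment would need an efficient pattern miner and a centrality measure that reflects link bandwidth and latency.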

Author information

Corresponding author

Correspondence to N. Mansouri.

Ethics declarations

Conflict of interest

N. Mansouri declares that he has no conflict of interest. M.M. Javidi declares that he has no conflict of interest. B. Mohammad Hasani Zade declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mansouri, N., Javidi, M.M. & Mohammad Hasani Zade, B. Using data mining techniques to improve replica management in cloud environment. Soft Comput 24, 7335–7360 (2020). https://doi.org/10.1007/s00500-019-04357-w

  • DOI: https://doi.org/10.1007/s00500-019-04357-w
