Abstract
Effective data management is a crucial problem in distributed systems such as data grids and clouds. It can be achieved by replicating files judiciously, which reduces data access time and improves data availability, reliability, and system load balancing. Determining a reasonable number of replicas and appropriate locations for them is an essential decision in cloud computing. In this paper, a new dynamic replication strategy called Data Mining-based Data Replication (DMDR) is proposed, which uses the file access history to determine the correlation among the accessed data files. We focus particularly on how knowledge extracted by maximal frequent correlated pattern mining improves data replication: files with high dependency can be grouped in the same replica set. With the DMDR strategy, replicas are stored in suitable locations, chosen according to the centrality factor, which reduces access latency. In addition, because each node has finite storage space, replicas that would be useful for future tasks can be wastefully deleted and replaced with less beneficial ones. Simulation results obtained with CloudSim indicate that the DMDR strategy has a relative advantage over current methods in effective network usage, average response time, and hit ratio. It can be concluded from this investigation that data mining techniques are effective and helpful for discovering users' future access behavior in a cloud environment.
Ethics declarations
Conflict of interest
N. Mansouri declares that he has no conflict of interest. M.M. Javidi declares that he has no conflict of interest. B. Mohammad Hasani Zade declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Mansouri, N., Javidi, M.M. & Mohammad Hasani Zade, B. Using data mining techniques to improve replica management in cloud environment. Soft Comput 24, 7335–7360 (2020). https://doi.org/10.1007/s00500-019-04357-w
DOI: https://doi.org/10.1007/s00500-019-04357-w