Weight Based Deduplication for Minimizing Data Replication in Public Cloud Storage
Journal of Scientific & Industrial Research ( IF 0.6 ) Pub Date : 2021-03-11
E Pugazhendi, M R Sumalatha, Lakshmi P Harika

Optimizing data replication across multiple instances in public cloud storage is a challenging problem when processing text data. The volume of digital data is growing exponentially, so there is a need to reduce storage consumption by storing data efficiently. In a cloud storage environment, data replication provides high availability together with fault tolerance. This work proposes an effective weight-based deduplication system applied at the target level to reclaim wasted storage space in the cloud; target-level deduplication consumes less processing power than source-level deduplication. Storage space is used efficiently by removing unpopular files from the secondary servers. Multiple input text documents are stored in Dropbox cloud storage. The top text features are detected using Term Frequency (TF) and Named Entity Recognition (NER) and stored in a text database. Once the top features are in the database, fresh text documents are collected to identify popular and unpopular files and thereby optimize the existing text corpus in cloud storage. Top text features of the freshly collected documents are likewise detected with TF and NER; after duplicate features are removed, the remaining unique features are compared with the features stored in the database, and the relevant text documents are listed. For the listed documents, the document frequency, document weight, and threshold factor are computed, and files are classified as popular or unpopular according to the average threshold value. Popular files are retained on all storage nodes to achieve full data availability, while unpopular files are removed from all secondary servers, leaving only the copy on the primary server. Before deduplication, the stored files occupied 8.09 MB in the Dropbox cloud; after deduplication, with unpopular files removed from the secondary storage nodes, the occupied space is reduced to 4.82 MB. Overall, data replication is minimized and 45.60% of the cloud storage space is saved by applying the weight-based deduplication system.
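The pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`extract_features`, `classify_documents`), the overlap-count definition of document weight, and the threshold rule are assumptions made for the example, and NER is approximated by a capitalized-token heuristic so the code stays dependency-free (a real system would use an NER library).

```python
# Hypothetical sketch of the weight-based deduplication pipeline.
# All names and the exact weight/threshold formulas are illustrative
# assumptions, not the method as published.
from collections import Counter

def extract_features(text, top_k=5):
    """Top text features: high term-frequency tokens plus a crude
    NER stand-in (capitalized, non-sentence-initial tokens)."""
    tokens = text.split()
    tf = Counter(t.lower().strip('.,') for t in tokens)
    top_tf = {t for t, _ in tf.most_common(top_k)}
    entities = {t.strip('.,') for t in tokens[1:] if t[:1].isupper()}
    return top_tf | entities

def classify_documents(docs, stored_features):
    """Weight each fresh document by its feature overlap with the
    database, then split popular/unpopular at the average weight."""
    weights = {}
    for name, text in docs.items():
        feats = extract_features(text)
        # document weight = number of features shared with the database
        weights[name] = len(feats & stored_features)
    threshold = sum(weights.values()) / len(weights)  # average threshold
    popular = [n for n, w in weights.items() if w >= threshold]
    unpopular = [n for n, w in weights.items() if w < threshold]
    return popular, unpopular

# Unpopular files would then be deleted from every secondary storage
# node, keeping a single copy on the primary server for availability.
```

In this sketch, popular files (weight at or above the average threshold) stay replicated everywhere, while unpopular ones survive only on the primary server, which is what reduces the occupied cloud storage.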
