An independent time optimized hybrid infrastructure for big data analytics, Modern Physics Letters B


Modern Physics Letters B (IF 1.9), Pub Date: 2020-07-21, DOI: 10.1142/s021798492050311x
Satvik Vats, B. B. Sagar


In the big data domain, platform dependency can alter the behavior of a business, because data come in different kinds (structured, semi-structured and unstructured) with different characteristics. With traditional infrastructure, different kinds of data cannot be processed simultaneously, since each task depends on a particular platform, so the responsibility for selecting suitable tools lies with the user. The variety of data generated by different sources calls for suitable tools to be selected without human intervention. These tools also face limited resources when dealing with a large volume of data, which degrades their performance in terms of execution time. In this work, we therefore propose a model in which different data analytics tools share a common infrastructure to provide data independence and a resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) among three Name-Nodes (master nodes), three Data-Nodes and one Client-Node, operating under a demilitarized zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop and Splunk on a shared HDFS. Using our model, we ran k-means clustering, Naïve Bayes and recommender algorithms on three datasets (movie rating, newsgroup and spam SMS), representing structured, semi-structured and unstructured data, respectively. Our model selected the appropriate tool in each case, e.g. Mahout for the newsgroup dataset, since the other tools cannot run on this data, which shows that the model provides data independence. The results of the proposed model are further compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.
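The tool-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_tools` and the capability table are hypothetical. The only pairing taken from the abstract is that Mahout is the sole tool able to run on the semi-structured newsgroup dataset. The structured and unstructured entries are placeholders.

```python
from enum import Enum


class DataKind(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Datasets named in the abstract and the kind of data each represents.
DATASET_KIND = {
    "movie_rating": DataKind.STRUCTURED,
    "newsgroup": DataKind.SEMI_STRUCTURED,
    "spam_sms": DataKind.UNSTRUCTURED,
}

# ASSUMPTION: illustrative capability table, not taken from the paper.
# Only "Mahout is the one tool that handles the newsgroup data" comes
# from the abstract; every other entry is a placeholder.
TOOL_SUPPORT = {
    "Mahout": {DataKind.SEMI_STRUCTURED},
    "R-Hadoop": {DataKind.STRUCTURED},
    "Splunk": {DataKind.UNSTRUCTURED},
}


def select_tools(dataset: str) -> list[str]:
    """Return the tools whose declared capabilities cover the dataset's kind."""
    if dataset not in DATASET_KIND:
        raise ValueError(f"unknown dataset: {dataset!r}")
    kind = DATASET_KIND[dataset]
    return [tool for tool, kinds in TOOL_SUPPORT.items() if kind in kinds]


if __name__ == "__main__":
    for name in DATASET_KIND:
        print(name, "->", select_tools(name))
```

Because every tool reads from the same HDFS in the proposed model, a dispatcher of this kind only has to choose the tool. The data do not need to be moved between platforms.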


Updated: 2020-07-21
