Performance Evaluation of an Independent Time Optimized Infrastructure for Big Data Analytics that Maintains Symmetry


Symmetry (IF 2.2), Pub Date: 2020-08-02, DOI: 10.3390/sym12081274
Satvik Vats, Bharat Bhushan Sagar, Karan Singh, Ali Ahmadian, Bruno A. Pansera

Traditional data analytics tools are designed to deal with asymmetrical types of data, i.e., structured, semi-structured, and unstructured. The diverse behavior of data produced by different sources requires the selection of suitable tools. Limited resources for handling huge volumes of data are a challenge for these tools and degrade their execution time. Therefore, in the present paper, we propose a time optimization model that shares a common HDFS (Hadoop Distributed File System) between three Name-nodes (master nodes), three Data-nodes, and one Client-node. These nodes work under a demilitarized zone (DMZ) to maintain symmetry. Machine learning jobs are explored from an independent platform to realize this model. On the first node (Name-node 1), Mahout is installed with all machine learning libraries through the Maven repositories. On the second node (Name-node 2), R is connected to Hadoop and runs through Shiny Server. Splunk is configured on the third node (Name-node 3) and is used to analyze the logs. Experiments comparing the proposed model with the legacy model evaluate response time, execution time, and throughput. K-means clustering, Naive Bayes, and recommender algorithms are run on three different data sets, i.e., movie rating, newsgroup, and Spam SMS, representing structured, semi-structured, and unstructured data, respectively. The selection of tools defines data independence; e.g., the Newsgroup data set runs on Mahout because the other tools are not compatible with this data. The experimental results support the hypothesis that the proposed model overcomes the resource limitations of the legacy model. In addition, the proposed model can process any kind of algorithm on different sets of data, each residing in its native format.
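The three evaluation metrics named in the abstract can be illustrated with a small timing harness. This is only a sketch under stated assumptions: the abstract does not define the metrics, so response time is taken here as the delay until the first result, execution time as the delay until the job completes, and throughput as records processed per second. The names `JobMetrics`, `measure_job`, and `toy_job` are hypothetical and do not come from the paper, and the toy workload stands in for a real Mahout, R, or Splunk job.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class JobMetrics:
    response_time_s: float   # assumed: submission until the first result arrives
    execution_time_s: float  # assumed: submission until the job finishes
    throughput: float        # assumed: records processed per second


def measure_job(job: Callable[[Iterable], Iterator], records: list) -> JobMetrics:
    """Time a job that yields one result per input record (assumed job shape)."""
    start = time.perf_counter()
    first_result_at = None
    processed = 0
    for _ in job(records):
        if first_result_at is None:
            first_result_at = time.perf_counter()
        processed += 1
    end = time.perf_counter()

    execution = end - start
    # If the job produced nothing, fall back to the full execution time.
    response = (first_result_at - start) if first_result_at is not None else execution
    throughput = processed / execution if execution > 0 else 0.0
    return JobMetrics(response, execution, throughput)


def toy_job(records):
    # Stand-in workload; not one of the paper's algorithms.
    for r in records:
        yield r * 2


metrics = measure_job(toy_job, list(range(1000)))
print(metrics)
```

Running the same harness against a proposed and a legacy configuration would give directly comparable numbers, provided both are fed the same records.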


Updated: 2020-08-02
