A Big Data Lake for Multilevel Streaming Analytics,arXiv - CS - Databases

A Big Data Lake for Multilevel Streaming Analytics
arXiv - CS - Databases. Pub Date: 2020-09-25, DOI: arxiv-2009.12415
Ruoran Liu, Haruna Isah, Farhana Zulkernine

Large organizations are seeking new architectures and scalable platforms to handle data management challenges caused by explosive data growth on a scale rarely seen in the past. These challenges stem largely from streaming data that arrives at high velocity from various sources in multiple formats. These changes in the data paradigm have led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw format in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open-source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we present a real-world data lake development use case covering data stream ingestion, staging, and multilevel streaming analytics that combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.

Updated: 2020-09-29
