Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake, arXiv - CS - Databases


arXiv - CS - Databases. Pub Date: 2021-09-03. DOI: arxiv-2109.01374
Pegdwendé Sawadogo (ERIC), Jérôme Darmont (ERIC), Camille Noûs

The concept of a data lake emerged in 2010 as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although popular in both industry and academia, the concept is still maturing, and there are still few methodological approaches to data lake design. We therefore introduce a new approach to designing a data lake and propose an extensive metadata system that enables richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit both textual documents and tabular data, in contrast with the structured and/or semi-structured data typically processed in data lakes from the literature. We also innovate by leveraging metadata to support both data retrieval and content analysis, including Text-OLAP and SQL querying. Finally, we show the feasibility of our approach using a real-world use case on the one hand and a benchmark on the other.
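The schema-on-read approach mentioned in the abstract can be illustrated with a minimal sketch. This is not AUDAL's implementation: the in-memory `RAW_STORE`, the file name `sales.csv`, and the helpers `read_with_schema` and `total_by` are all hypothetical names chosen for illustration. The point is only that raw data is stored untouched and a schema is supplied by the reader at query time.

```python
import csv
import io

# Hypothetical raw store: files are kept exactly as ingested,
# with no schema enforced at write time.
RAW_STORE = {
    "sales.csv": "region,amount\nnorth,120\nsouth,80\nnorth,30\n",
}


def read_with_schema(name, schema):
    """Parse a raw file at query time, casting each column named in `schema`."""
    reader = csv.DictReader(io.StringIO(RAW_STORE[name]))
    return [
        {column: cast(row[column]) for column, cast in schema.items()}
        for row in reader
    ]


def total_by(rows, key, value):
    """Sum the `value` column of `rows`, grouped by the `key` column."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


# The schema is chosen by the reader, not fixed when the data was stored.
rows = read_with_schema("sales.csv", {"region": str, "amount": int})
totals = total_by(rows, "region", "amount")
```

A different analysis could read the same raw file with a different schema (for example, casting `amount` to `float`) without changing anything that was stored, which is the flexibility the abstract attributes to data lakes.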


Updated: 2021-09-06
