Array databases: concepts, standards, implementations, Journal of Big Data


Array databases: concepts, standards, implementations
Journal of Big Data (IF 8.6), Pub Date: 2021-02-02, DOI: 10.1186/s40537-020-00399-2
Peter Baumann, Dimitar Misev, Vlad Merticariu, Bang Pham Huu

Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation output, or statistics “datacubes”. As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and fail to keep up with increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represents valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, practical usability, and performance aspects. The size of this comparison also differentiates our study: 19 systems are compared and four are benchmarked, to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed so that systems cannot be tuned specifically to these queries. It is hoped that this gives a representative overview to all who want to immerse themselves in the field, as well as clear guidance to those who need to choose the best-suited datacube tool for their application. This article presents results of the Research Data Alliance (RDA) Array Database Assessment Working Group (ADA:WG), a subgroup of the Big Data Interest Group. The group has elicited the state of the art in Array Databases, technically supported by IEEE GRSS and CODATA Germany, to answer the question: how can data scientists and engineers benefit from Array Database technology? As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, extensibility, as well as performance and scalability. In total, the database approach of offering analysis-ready “datacubes” heralds a new level of service quality. Investigation shows that there is a lively ecosystem of technology with increasing uptake, and proven array analytics standards are in place. Consequently, such approaches have to be considered a serious option for datacube services in science, engineering and beyond. Tools, though, vary greatly in functionality and performance.


Updated: 2021-02-03
