
Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL
arXiv - CS - Databases. Pub Date: 2021-09-16, DOI: arxiv-2109.08053
Justus Henneberg, Felix Schuhknecht, Philipp Reutter, Nils Brast, Peter Spichtinger

Performing data-intensive analytics is an essential part of modern Earth science. As such, research in atmospheric physics and meteorology frequently requires the processing of very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e., contain various measurements per spatiotemporal point, and (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks performed on these datasets are structurally complex. Over the years, the binary format NetCDF has become established as a de facto standard for distributing and exchanging such multi-dimensional datasets in the Earth science community -- along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored to the particularities of these datasets. As a result, Earth science researchers (who are typically not computer scientists) struggle unnecessarily to work efficiently with these datasets on a daily basis. In this work, we aim to resolve these issues. Instead of proposing yet another specialized tool and interface for working with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine -- resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored to NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6.

Updated: 2021-09-17