Evaluating Query Languages and Systems for High-Energy Physics Data



arXiv - CS - Databases. Pub Date: 2021-04-26, DOI: arxiv-2104.12615
Dan Graur (Department of Computer Science, ETH Zurich), Ingo Müller (Department of Computer Science, ETH Zurich), Ghislain Fourny (Department of Computer Science, ETH Zurich), Gordon T. Watts (Department of Physics, University of Washington), Mason Proffitt (Department of Physics, University of Washington), Gustavo Alonso (Department of Computer Science, ETH Zurich)

Query languages in general, and SQL in particular, are arguably among the most successful programming interfaces. Yet in the domain of high-energy physics (HEP), they have found limited acceptance. This is surprising, since data analysis in HEP matches the SQL model well: the data is fully structured and is queried using combinations of selections, projections, joins, and reductions. To understand why this is the case, in this paper we perform an exhaustive performance and functionality analysis of several data processing platforms (Amazon Athena, Google BigQuery, Presto, Rumble) and compare them to the new RDataFrame interface of the ROOT framework, the system most commonly used by particle physicists today. The goal of the analysis is to identify the potential advantages and shortcomings of each system, considering not only performance but also cost for cloud deployments, suitability of the query dialect, and resulting query complexity. The analysis is done using a HEP workload: the Analysis Description Languages (ADL) benchmark, created by physicists to capture representative aspects of their data processing tasks. The evaluation of these systems yields an interesting and rather complex picture of existing solutions: those offering the best possibilities in terms of expressiveness, conciseness, and usability turn out to be the slowest and most expensive; the fastest ones are not the most cost-efficient and involve complex queries; RDataFrame, the baseline we use as a reference, is often faster and cheaper but currently faces scalability issues on large multi-core machines. In the paper, we analyze all the aspects that lead to these results and discuss how systems should evolve to better support HEP workloads. In the process, we identify several weaknesses of existing systems that should be relevant to a wide range of use cases beyond particle physics.
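To make the claim about selections, projections, and reductions concrete, here is a minimal Python sketch of that query shape over nested event data. It is an illustration only: the toy events, the field names (`jets`, `pt`, `eta`, `met`), and the function `jet_pt_histogram` are assumptions made for this example and are not taken from the paper or from the ADL benchmark.

```python
from collections import Counter

# Hypothetical toy events: each event carries a nested list of jets.
events = [
    {"event_id": 1, "met": 12.5,
     "jets": [{"pt": 45.0, "eta": 0.3}, {"pt": 18.0, "eta": -1.2}]},
    {"event_id": 2, "met": 30.1, "jets": []},
    {"event_id": 3, "met": 8.7,
     "jets": [{"pt": 62.0, "eta": 2.1}, {"pt": 33.0, "eta": 0.9}]},
]


def jet_pt_histogram(events, min_pt, bin_width):
    """Count jets per pT bin, keeping only jets with pt > min_pt."""
    histogram = Counter()
    for event in events:
        for jet in event["jets"]:        # unnest the per-event jet list
            if jet["pt"] > min_pt:       # selection
                pt = jet["pt"]           # projection
                bin_low = int(pt // bin_width) * bin_width
                histogram[bin_low] += 1  # reduction (count per bin)
    return dict(histogram)


result = jet_pt_histogram(events, min_pt=20.0, bin_width=20)
print(result)
```

In a SQL dialect, the same shape would roughly correspond to unnesting the jet array, filtering on `pt`, and grouping by the bin. How concisely each system can express that is one of the aspects the paper evaluates.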


Updated: 2021-04-27
