Analysing billion-objects catalogue interactively: Apache Spark for physicists
Astronomy and Computing (IF 1.9), Pub Date: 2019-07-31, DOI: 10.1016/j.ascom.2019.100305
S. Plaszczynski, J. Peloton, C. Arnault, J.E. Campagne

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, it remains little used in the academic community, or is often restricted to software engineers. The goal of this paper is to show, with practical use cases, that the technology is mature enough to be used by astronomers or cosmologists without excessive programming skills, in order to perform standard analyses over large datasets such as those originating from future galaxy surveys.

To demonstrate this, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billion galaxies). We then design, optimize and benchmark a set of Spark Python algorithms to perform standard operations such as adding photometric redshift errors, measuring the selection function, or computing power spectra over tomographic bins. Most of the commands execute on the full 110 GB dataset within tens of seconds and can therefore be performed interactively in order to design full-scale cosmological analyses. A Jupyter notebook summarizing the analysis is available at https://github.com/astrolabsoftware/1807.03078.




Updated: 2019-07-31