Astronomy and Computing

Volume 34, January 2021, 100447

Full length article
Development of a high throughput cloud-based data pipeline for 21 cm cosmology

https://doi.org/10.1016/j.ascom.2021.100447

Abstract

We present a case study of a cloud-based computational workflow for processing large astronomical data sets from the Murchison Widefield Array (MWA) cosmology experiment. Cloud computing is well-suited to large-scale, episodic computation because it offers extreme scalability in a pay-for-use model. This facilitates fast turnaround times for testing computationally expensive analysis techniques. We describe how we have used the Amazon Web Services (AWS) cloud platform to efficiently and economically test and implement our data analysis pipeline. We discuss the challenges of working with the AWS spot market, which reduces costs at the expense of longer processing turnaround times, and we explore this tradeoff with a Monte Carlo simulation.

Introduction

The formation of the first stars and galaxies, and their later acceleration due to dark energy, can be probed by measuring the large-scale distribution of neutral hydrogen at high redshift (for reviews see Morales and Wyithe, 2010 and Liu and Shaw, 2020). Several arrays have been developed to measure the power spectrum of this cosmological signal, including the Low Frequency Array (LOFAR; van Haarlem et al., 2013), the Donald C. Backer Precision Array for Probing the Epoch of Reionization (PAPER; Parsons et al., 2010), the Murchison Widefield Array (MWA; Tingay et al., 2013), and the Hydrogen Epoch of Reionization Array (HERA; DeBoer et al., 2017). The hundreds of antennas and thousands of frequency channels that provide the needed sensitivity and redshift coverage generate a significant amount of data. For example, the MWA has accumulated a 28 PB data archive since observations began in 2013.

Data analysis follows a traditional calibration and imaging approach, a compute-intensive operation that is not trivially parallelizable. Analysis is further complicated by the need to distinguish foregrounds from the faint spectral signature of the cosmological background. This challenge manifests as a need to control systematic errors throughout the experiment and analysis to one part in 100,000; custom analysis codes are required to control for systematics in calibration, synthesis imaging, and error propagation. To date, all published limits on the cosmological power spectrum have been set by systematic biases that degrade measurement precision. These systematic floors are reached only after processing hours or days of data. Each analysis iteration yields better identification of systematics and allows integration of more data for a deeper measurement, so the iteration cycle benefits from testing on large amounts of data.

Recently, cloud computing has emerged as an alternative to traditional computing clusters for high-performance academic research computing, particularly for large astronomical data sets. Dodson et al. (2016) describe using the Amazon Web Services (AWS) cloud computing service to analyze the CHILES data set, an example of parallelization used to process repeated measurements. Sabater et al. (2017) similarly calibrate LOFAR data with AWS. A related effort outside radio astronomy, Warren et al. (2016), describes processing satellite imagery in the cloud.

Cloud computing is particularly well-suited to episodic computation, where users require short periods of high computational throughput interspersed with periods of low usage. Dedicated clusters or small shared clusters can be expensive to maintain during periods of minimal usage and limited in their scalability during periods of heavy computation. The development of analysis techniques for radio cosmology measurements requires highly episodic computation as we identify systematics and test new analysis approaches on large data sets. The speed of this development cycle is limited by the testing turnaround time.

Here we discuss how we have used cloud computing to routinely test analyses of data from the MWA. We have used AWS to execute jobs on hundreds of parallel nodes, performing calibration, synthesis imaging, mosaicking, and power spectrum analysis on hundreds of TB of data. We describe our cloud pipeline and report findings on its efficiency, cost, and failure modes. We note that while the spot market reduces costs, it extends testing turnaround times. To better understand this tradeoff we present a simple model that simulates the impact of the spot market on a typical analysis run. The simulation indicates that improvements in checkpointing and restart automation would offer faster overall execution time while retaining the spot market’s cost savings.
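The essence of this tradeoff can be sketched with a toy Monte Carlo. The following is not the paper’s actual model; the job length, interruption rate, and restart overhead are invented for illustration, and the simulation assumes the worst case of no checkpointing, so each spot interruption restarts the job from scratch:

```python
import random

def spot_job_time(job_hours, interrupt_rate, restart_overhead, rng):
    """Wall-clock hours for one job on interruptible spot instances.
    Interruptions arrive as a Poisson process (interrupt_rate per hour);
    without checkpointing, each interruption discards all progress and
    adds a fixed re-queue overhead before the job restarts from scratch."""
    elapsed = 0.0
    while True:
        t_interrupt = rng.expovariate(interrupt_rate)
        if t_interrupt >= job_hours:
            return elapsed + job_hours  # finished before the next interruption
        elapsed += t_interrupt + restart_overhead  # lose progress, pay overhead, retry

rng = random.Random(42)
trials = [spot_job_time(4.0, 0.05, 0.5, rng) for _ in range(10_000)]
mean_hours = sum(trials) / len(trials)
print(f"mean turnaround: {mean_hours:.2f} h (vs. 4.00 h uninterrupted on-demand)")
```

Adding checkpointing to such a model amounts to charging each interruption only the progress since the last checkpoint, which is why better checkpointing recovers most of the on-demand turnaround time at spot prices.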

Background on cloud computing with AWS

While a number of cloud computing platforms exist (Microsoft Azure, Google Cloud, etc.), this paper focuses on a workflow developed with AWS. We primarily use two AWS tools: Elastic Compute Cloud (EC2) for computation and Simple Storage Service (S3) for data storage. In this section we describe the basic functionality of these tools and define terminology used throughout the paper.
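As a concrete illustration of the EC2 terminology used in this paper, the sketch below assembles the parameters for a spot-instance request with the boto3 Python SDK. The AMI ID, instance type, count, and price cap are hypothetical placeholders, and the live API call (which requires AWS credentials) appears only in a comment:

```python
def spot_request_params(ami_id, instance_type, count, max_price):
    """Keyword arguments for boto3's EC2 run_instances call, requesting
    `count` spot instances capped at `max_price` (USD per instance-hour).
    All argument values used here are illustrative placeholders."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {"MaxPrice": max_price},
        },
    }

params = spot_request_params("ami-0123456789abcdef0", "c5.4xlarge", 10, "0.30")
# To launch for real (requires credentials and the boto3 package):
#   import boto3
#   boto3.client("ec2").run_instances(**params)
```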

Data processing

We describe processing data from the MWA radio observatory. The MWA is an array of 128 stations, each comprising a grid of 16 dipole antennas phased to form a steerable 15-degree field of view. The interferometric output is a measure of the correlation between all pairs of stations, or baselines, as a function of frequency, polarization, and time. Data volumes therefore scale as the number of independent baselines, or N(N+1)/2, where N denotes the number of stations, meaning that larger arrays
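The baseline count is straightforward to evaluate; a minimal sketch (station counts other than the MWA’s 128 are hypothetical):

```python
def n_baselines(n_stations, include_autos=True):
    """Correlation products for an n-station interferometer:
    N(N+1)/2 with autocorrelations, N(N-1)/2 without."""
    if include_autos:
        return n_stations * (n_stations + 1) // 2
    return n_stations * (n_stations - 1) // 2

print(n_baselines(128))   # the 128-station MWA: 8256 correlation products
print(n_baselines(256))   # doubling the stations roughly quadruples the data volume
```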

The AWS cloud workflow

In this section we present the cloud-based data processing workflow we developed to process cosmological data sets from the MWA. The workflow supports high throughput parallelized processing of many observations. It is efficient, economical, and relatively simple to operate and train others to use. Since its development, we have trained five new users on the workflow. It is currently heavily used by three graduate students at the University of Washington.

Developing this workflow consisted of

Working with the spot market

The AWS spot market allows users to trade reliability for cost savings: EC2 instances can be purchased at steeply reduced cost if the user can tolerate some probability of unexpected instance termination. There is extensive literature exploring the potential savings from using spot instances (Yi et al., 2010, Javadi et al., 2011, Jung et al., 2011, Mazzucco and Dumas, 2011, Voorsluys and Buyya, 2012, Ben-Yehuda et al., 2013, Poola et al., 2014, He et al., 2015, Karunakaran and Sundarraj, 2015,

Discussion

Cloud computing has matured within the last decade, and Infrastructure as a Service (IaaS) has taken root in everyday technologies. Over the lifetime of this project, which began in 2015, the scope and complexity of offerings have grown dramatically. Much of this development was driven by commercial needs; cloud computing tools for academic research have lagged behind private-sector advancement. Tools such as AWS’s ParallelCluster have brought much-needed new investment to academic use cases.

Even so,

Conclusion

Using the AWS cloud computing platform, we have produced an efficient processing workflow for radio cosmology data. Our workflow is highly scalable, which permits faster testing turnaround times than with typical academic computing clusters. This enables rapid development of the novel analysis techniques needed to mitigate systematics in our data processing pipeline.

We note that substituting spot instances for on-demand instances can reduce computational costs at the expense of longer

CRediT authorship contribution statement

R. Byrne: Conceptualization, Software, Investigation, Writing - original draft. D. Jacobs: Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank Nichole Barry, Jon Ringuette, and Michael Wilensky for their contributions to the AWS cloud workflow. Computation on AWS was supported in part by the University of Washington student-led Research Computing Club with funding provided by the University of Washington, United States Student Technology Fee Committee. This project was made possible by computing credits from the Amazon/Square Kilometer Array Astrocompute initiative. We thank Lori Clithero, Aaron Bucher, and Sean Smith of AWS

References (27)

  • Dodson, R., et al., 2016. Imaging SKA-scale data in three different computing environments. Astron. Comput.
  • Barry, N., et al., 2019. The FHD/Eppsilon epoch of reionisation power spectrum pipeline. Publ. Astron. Soc. Aust.
  • Ben-Yehuda, O., et al., 2013. Deconstructing Amazon EC2 spot instance pricing. ACM Trans. Econ. Comput.
  • DeBoer, D.R., et al., 2017. Hydrogen Epoch of Reionization Array (HERA). Publ. Astron. Soc. Pac.
  • Haghshenas, H., et al., 2019. Parasite cloud service providers: On-demand prices on top of spot prices. Heliyon.
  • He, X., et al. Cutting the cost of hosting online services using cloud spot markets.
  • Hurley-Walker, N., et al., 2017. Galactic and extragalactic all-sky Murchison Widefield Array (GLEAM) survey - I. A low-frequency extragalactic catalogue. Mon. Not. R. Astron. Soc.
  • Jacobs, D.C., et al., 2016. The Murchison Widefield Array 21 cm power spectrum analysis methodology. Astrophys. J.
  • Javadi, B., et al. Statistical modeling of spot instance prices in public cloud environments.
  • Jordan, C., et al., 2017. Characterization of the ionosphere above the Murchison Radio Observatory using the Murchison Widefield Array. Mon. Not. R. Astron. Soc.
  • Jung, D., et al.
  • Karunakaran, S., et al., 2015. Bidding strategies for spot instances in cloud computing markets. IEEE Internet Comput.
  • Khandelwal, V., et al. Bidding strategies for Amazon EC2 spot instances - a comprehensive review.