
A Catalog of Broad Morphology of Pan-STARRS Galaxies Based on Deep Learning


Published 2020 December 10. © 2020 The American Astronomical Society. All rights reserved.
Citation: Hunter Goddard and Lior Shamir 2020 ApJS 251 28. DOI: 10.3847/1538-4365/abc0ed


Abstract

Autonomous digital sky surveys such as Pan-STARRS have the ability to image a very large number of galactic and extragalactic objects, and the large and complex nature of the image data makes automation essential. Here we describe the design and implementation of a data analysis process for automatic broad morphology annotation of galaxies, and apply it to the data of Pan-STARRS DR1. The process is based on filters followed by a two-step convolutional neural network (CNN) classification. Training samples are generated by using an augmented and balanced set of manually classified galaxies. The results are evaluated for accuracy by comparison with the annotations of galaxies that also appear in a previous broad morphology catalog of Sloan Digital Sky Survey (SDSS) galaxies. Our analysis shows that a CNN combined with several filters is an effective approach for annotating the galaxies and removing unclean images. The catalog contains morphology labels for 1,662,190 galaxies with ∼95% accuracy. The accuracy can be further improved by selecting labels above certain confidence thresholds. The catalog is publicly available.


1. Introduction

With their ability to generate very large databases, autonomous digital sky surveys have enabled research tasks that were not possible in the pre-information era, and have become increasingly pivotal in astronomy. The ability of digital sky surveys to image large parts of the sky, combined with the concept of virtual observatories that make these data publicly accessible (Djorgovski et al. 2001), has introduced a new form of astronomy research, and that trend is bound to continue (Borne 2013; Djorgovski et al. 2013).

The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; Kaiser 2004; Flewelling et al. 2020) is a comprehensive digital sky survey covering ∼10³ deg² per night using an array of two 1.8 m telescopes. Among other celestial objects, Pan-STARRS images a very large number of galaxies. Due to the complexity of galaxy morphology, the ability of current photometric pipelines to analyze these galaxy images is limited, and substantial information that is visible to the human eye is practically unavailable to users of digital sky survey data.

To automate the analysis of galaxy images, several methods have been proposed, including GALFIT (Peng et al. 2002), GIM2D (Simard 1999), CAS (Conselice 2003), the Gini coefficient of the light distribution (Abraham et al. 2003), Ganalyzer (Shamir 2011), and SpArcFiRe (Davis & Hayes 2014). However, the ability of these methods to analyze large numbers of real-world galaxy images and produce clean data products is limited, and therefore catalogs of galaxy morphology have been prepared manually by professional astronomers (Nair & Abraham 2010; Baillard et al. 2011).

Due to the high volumes of data, the available pool of professional astronomers is not able to provide sufficient labor to analyze databases generated by modern digital sky surveys, leading to the use of crowdsourcing for that task (Lintott et al. 2008, 2011; Willett et al. 2013). The main crowdsourcing campaign for the analysis of galaxy morphology was Galaxy Zoo (Lintott et al. 2011), providing annotations of the broad morphology of galaxies imaged by the Sloan Digital Sky Survey (SDSS), as well as other surveys such as the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS). However, analyzing the broad morphology of SDSS galaxies required ∼3 yr of work performed by over 10⁵ volunteers, and led to ∼7 × 10⁴ galaxies considered "superclean." Given the huge databases of current and future sky surveys, it is clear that even when using crowdsourcing, the throughput of manual classification might not be sufficient for an exhaustive analysis of such databases.

The use of machine learning provided more effective methods for the purpose of galaxy image classification (Shamir 2009; Huertas-Company et al. 2009; Banerji et al. 2010; Shamir et al. 2013; Kuminski et al. 2014; Dieleman et al. 2015; Schutter & Shamir 2015; Hocking et al. 2017; Kuminski & Shamir 2018; Silva et al. 2018), and the use of such methods also provided computer-generated catalogs of galaxy morphology (Huertas-Company et al. 2011; Simard et al. 2011; Shamir & Wallin 2014; Kuminski & Shamir 2016; Huertas-Company et al. 2015a, 2015b; Timmis & Shamir 2017; Paul et al. 2018; Shamir 2019). Automatic annotation methods were also tested on Pan-STARRS data by using photometric measurements of colors and moments classified by a random forest classifier, achieving an accuracy of ∼89% (Baldeschi et al. 2020).

Here we use automatic image analysis to prepare a catalog of the broad morphology of ∼1.7 × 10⁶ Pan-STARRS Data Release 1 (DR1) galaxies. The catalog was generated by using a data analysis process that involves several filtering steps and two convolutional neural networks (CNNs), automating the annotation process to handle the high volume of data.

2. Data

The galaxy image data are sourced from DR1 of Pan-STARRS (Hodapp et al. 2004; Chambers et al. 2016; Flewelling et al. 2020). First, all objects with a Kron r magnitude of less than 19 that were identified by the Pan-STARRS photometric pipeline as extended in all bands were selected.

To filter out objects that are too small for their morphology to be identified, objects with a Petrosian radius smaller than 5.5″ were removed. To remove stars, objects whose point-spread function (PSF) i magnitude minus their Kron i magnitude was greater than 0.05 were also removed. That led to a data set of 2,394,452 objects (Timmis & Shamir 2017). Objects that were flagged by the Pan-STARRS photometric pipeline as artifacts, or as having a brighter neighbor, a defect, a double PSF, or a blend in any of the bands, were excluded from the data set. That led to a data set of 2,131,371 objects assumed to be sufficiently large and clean to allow morphological analysis. Figure 1 shows the distribution of the r Kron magnitude of the galaxies in the data set.
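The selection reduces to a set of simple cuts. The following is a minimal sketch, assuming the Pan-STARRS detections have been exported to a pandas DataFrame; the column names (kron_r, petro_radius, psf_i, kron_i) are illustrative and do not reflect the actual Pan-STARRS schema.

    import pandas as pd

    def select_candidates(df: pd.DataFrame) -> pd.DataFrame:
        """Apply the magnitude, size, and star/galaxy cuts described above."""
        bright = df["kron_r"] < 19            # Kron r magnitude brighter than 19
        resolved = df["petro_radius"] >= 5.5  # Petrosian radius of at least 5.5 arcsec
        # Star/galaxy cut as quoted in the text: objects with
        # PSF i - Kron i greater than 0.05 are removed.
        kept = (df["psf_i"] - df["kron_i"]) <= 0.05
        return df[bright & resolved & kept]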

Figure 1. Distribution of the r Kron magnitude of the galaxies in the data set.

The galaxy images were then downloaded using the Pan-STARRS cutout service. The images are in the JPG format and have dimensions of 120 × 120 pixels, as in Kuminski & Shamir (2016). The Pan-STARRS cutout service provides JPG images for each of the bands. Here we use the g-band images, as the color images composed of the y, i, and g bands are in many cases noisy and do not allow effective analysis of the morphology. The process of downloading the data was completed in 62 days.

The initial scale of the cutout was set to 0.25″ per pixel. For each image that was downloaded, all bright pixels (grayscale value higher than 125) located on the edge of the frame were counted. If more than 25% of the pixels on the edge of the frame were bright, that was taken as an indication that the object did not fully fit inside the frame. In that case, the scale was increased by 0.05″ per pixel, and the image was downloaded again. That was repeated until fewer than 25% of the edge pixels were bright, meaning that the object fit inside the frame; a code sketch of this loop is given below.

The JPG images are far smaller than the FITS images. A 120 × 120 JPG image retrieved through the Pan-STARRS cutout service is normally ∼3 kB in size, while an image of the same dimensions in the FITS format is ∼76 kB. Although the FITS files provide more information, downloading files in the FITS format is substantially slower. While downloading the JPG images took 62 days, downloading the same number of images in the much larger FITS format would have required a far longer period of time. The JPG images do not allow photometry, but they are smaller than the FITS files and provide visual information about the shape of the galaxy, which is the information required for the morphological classification of the galaxies. As explained in Section 3.1, the training of the neural network was done with images retrieved from Pan-STARRS, with the exact same size and format as the images that were annotated by the neural network after it was trained.
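The following is a minimal sketch of the adaptive rescaling loop, assuming a caller-supplied function fetch_cutout(ra, dec, scale) (a hypothetical helper standing in for the Pan-STARRS cutout service) that returns the 120 × 120 g-band JPG cutout as a 2D grayscale numpy array; the max_scale guard is an added safeguard, not part of the described procedure.

    import numpy as np

    def download_fitted_cutout(fetch_cutout, ra, dec, scale=0.25, step=0.05, max_scale=2.0):
        """Grow the pixel scale until the object fits inside the frame."""
        while scale <= max_scale:
            img = fetch_cutout(ra, dec, scale)
            # Gather the pixels on the four edges of the frame.
            edge = np.concatenate([img[0, :], img[-1, :], img[1:-1, 0], img[1:-1, -1]])
            # Fewer than 25% bright edge pixels means the object fits.
            if np.mean(edge > 125) < 0.25:
                break
            scale += step  # enlarge the field of view by 0.05"/pixel and retry
        return img, scale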

3. Image Analysis Method

The filtering of the data described in Section 2 aims to remove objects that are not clean galaxy images. That reduces the number of images that need to be downloaded and classified in the next step with the deep neural network. The removal of objects that are not galaxy images also makes the neural network more accurate, due to the higher consistency of the data it is trained with.

To remove saturated images and images that have too few features to allow morphological classification, two additional filters are used. The first filter computes the ratio of fully saturated pixels (a grayscale value of 255 in the JPG image) to the total number of pixels, and discards the image if that ratio is higher than 1.5%. Since a high number of saturated pixels is not expected in a clean galaxy image, this simple threshold is sufficient to identify and reject saturated images that are not clean galaxy images. This step rejected 30,220 objects that were identified as saturated.

The second filter uses the Otsu global threshold method (Otsu 1979) to separate the image into foreground and background pixels. If the number of foreground pixels is less than 1.8% of the total number of pixels in the image, the image is marked as having too few distinguishable features. This filter rejected 375,107 galaxies that were identified as having too little foreground to allow identification. Together, these filters removed 405,327 images (∼19%) from the data set. The thresholds were determined experimentally by observing galaxy image samples. Table 1 shows examples of several objects that were filtered based on too few foreground pixels or too many saturated pixels; a code sketch of both filters follows the table.

Table 1. Examples of Images Filtered for Having Too Few Foreground Pixels or Too Many Saturated Pixels

Saturated Pixels (%)   Foreground Pixels (%)
6.1                    10.7
13.5                   21.7
30.9                   34.9
0.06                   1.4
0.16                   1.1

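Both filters reduce to simple pixel statistics. The following is a minimal sketch, assuming 8-bit grayscale cutouts as numpy arrays; the function names are illustrative, and the Otsu threshold is computed with scikit-image.

    import numpy as np
    from skimage.filters import threshold_otsu

    def too_saturated(img: np.ndarray, max_frac: float = 0.015) -> bool:
        """Reject images in which more than 1.5% of the pixels are fully saturated."""
        return np.mean(img == 255) > max_frac

    def too_little_foreground(img: np.ndarray, min_frac: float = 0.018) -> bool:
        """Reject images whose Otsu foreground covers less than 1.8% of the pixels."""
        foreground = img > threshold_otsu(img)
        return np.mean(foreground) < min_frac

    def passes_filters(img: np.ndarray) -> bool:
        return not (too_saturated(img) or too_little_foreground(img))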

3.1. Primary Classification

The classifier used for the purpose of annotating the galaxy images is a deep convolutional neural network based on the LeNet-5 architecture (LeCun et al. 1998). To adjust the model for input images of size 120 × 120 instead of 32 × 32, the kernel in the first convolutional layer was changed from 5 × 5 with stride 1 to 10 × 10 with stride 2, and the window in the first pooling layer was similarly changed from 2 × 2 with stride 2 to 4 × 4 with stride 4. Each of the following layers has the same hyperparameters as in LeNet-5, except for the output layer, where the number of classes is reduced from 10 to 2. The SoftMax output layer of the model provides a degree of certainty for the annotations, which allows controlling the size/accuracy trade-off of the catalog, as discussed in Section 4.
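A minimal Keras sketch of this modified LeNet-5 follows. The filter counts of the unchanged layers (6 and 16) and the sizes of the dense layers (120 and 84) are taken from the original LeNet-5 and are therefore assumptions about the exact implementation; the tanh activation, stochastic gradient descent optimizer, and categorical cross-entropy loss follow the training details given at the end of this subsection.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(120, 120, 1)),
        layers.Conv2D(6, kernel_size=10, strides=2, activation="tanh"),  # was 5x5, stride 1
        layers.AveragePooling2D(pool_size=4, strides=4),                 # was 2x2, stride 2
        layers.Conv2D(16, kernel_size=5, strides=1, activation="tanh"),
        layers.AveragePooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),
        layers.Dense(84, activation="tanh"),
        layers.Dense(2, activation="softmax"),                           # spiral vs. elliptical
    ])
    model.compile(optimizer=keras.optimizers.SGD(),
                  loss="categorical_crossentropy", metrics=["accuracy"])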

Training samples were obtained using the debiased "superclean" Galaxy Zoo annotations. "Superclean" objects are objects for which 95% or more of the annotators agreed on the morphology, with correction for the redshift bias (Lintott et al. 2011). That selection leads to a subset of very consistent annotations (Lintott et al. 2011), but it also filters out the vast majority of Galaxy Zoo annotations, which do not satisfy these requirements. The Galaxy Zoo crowdsourcing campaign annotated galaxies imaged by SDSS, which is a different instrument with a different image processing pipeline. Although it has been shown in the past that neural networks trained with data from one telescope can be used to classify data acquired by other telescopes (Domínguez Sánchez et al. 2019), it has also been shown that the accuracy of such networks is inferior to the accuracy of a neural network trained and tested with data from the same instrument (Domínguez Sánchez et al. 2019). Since a large number of galaxies annotated by Galaxy Zoo were also imaged by Pan-STARRS, the Pan-STARRS images of these galaxies can be fetched and used as the training data, so that the images used to train the neural network come from the same instrument that imaged the galaxies annotated by that network.

Due to the substantial overlap between the footprint of Pan-STARRS and SDSS, the idea of using SDSS data as labels to train machine-learning systems with Pan-STARRS data has been used in the past. For instance, Tarrío & Zarattini (2020) used spectroscopic data from SDSS as labels for training a machine-learning system that can determine the photometric redshift of Pan-STARRS galaxies.

In order to train the neural network with images from the same instrument that it is expected to annotate, the images of the galaxies annotated by Galaxy Zoo were retrieved from Pan-STARRS. Pan-STARRS has a different footprint than SDSS, so not all galaxies annotated by Galaxy Zoo are also imaged by Pan-STARRS. However, 22,456 Galaxy Zoo galaxies with "superclean" annotations were matched with galaxies in Pan-STARRS DR1 based on their R.A. and decl. (within a difference of 0.0001 degrees). These images were fetched from Pan-STARRS and were used for training the neural network.
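The following is a minimal sketch of such a positional cross-match using astropy. It matches by angular separation rather than by the per-coordinate difference quoted above, which is a close approximation at this tolerance; the input catalogs are assumed to be arrays of R.A. and decl. in degrees.

    import numpy as np
    import astropy.units as u
    from astropy.coordinates import SkyCoord

    def match_catalogs(ra1, dec1, ra2, dec2, tol_deg=1e-4):
        """Return index pairs of objects whose positions agree within tol_deg."""
        c1 = SkyCoord(ra=np.asarray(ra1) * u.deg, dec=np.asarray(dec1) * u.deg)
        c2 = SkyCoord(ra=np.asarray(ra2) * u.deg, dec=np.asarray(dec2) * u.deg)
        idx, sep, _ = c1.match_to_catalog_sky(c2)  # nearest neighbor in catalog 2
        good = sep < tol_deg * u.deg
        return np.nonzero(good)[0], idx[good]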

Figure 2 shows the distribution of the r exponential magnitude of the Galaxy Zoo galaxies whose annotations were used for the compilation of the training set. The magnitude distribution is somewhat different from the magnitude distribution of the Pan-STARRS galaxies shown in Figure 1, which can be explained by the limiting r Petrosian magnitude of 17.77 applied to the initial Galaxy Zoo sample (Lintott et al. 2008). As mentioned above, the SDSS images themselves were not used for the training.

Figure 2. Distribution of the r exponential magnitude of the galaxies in the Galaxy Zoo data set from which the annotations were taken.

Figures 3 and 4 show histograms of the redshift distribution of the galaxies in Pan-STARRS and SDSS, respectively. The number of Pan-STARRS galaxies with redshift is small because Pan-STARRS does not collect spectra; redshifts were only available for the SDSS galaxies that overlapped with the Pan-STARRS galaxies. The two graphs show that the distribution of the redshift is similar in both data sets.

Figure 3. Distribution of the redshift of the galaxies in the Pan-STARRS data set. The redshift values were taken from SDSS.

Figure 4. Distribution of the redshift of the galaxies in the Galaxy Zoo data set from which the annotations were taken.

Galaxy Zoo manual annotations have been shown in the past to be sensitive to the spin direction of the galaxies (Land et al. 2008). To eliminate the possible effect of spin patterns, the training set was augmented such that all galaxies were mirrored, and both the original and mirrored image of each galaxy were used in the training set. That resulted in a training set of 31,564 spiral images and 13,348 elliptical images. Mirroring the spiral galaxies ensures a symmetric data set that is not biased by certain preferences of the human volunteers who annotated the galaxies. That is, while mirroring the images in the training set is often used when training deep neural networks for augmenting the data and increasing the number of training samples, in this case it was also used to produce a symmetric unbiased data set. Mirroring of the elliptical galaxies was done to ensure consistency in the manner training data are handled, and avoid a situation in which different classes are handled differently.
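The following is a minimal sketch of this mirroring augmentation, assuming the training images are stacked in an array of shape (N, height, width) with a parallel array of class labels.

    import numpy as np

    def mirror_augment(images: np.ndarray, labels: np.ndarray):
        """Pair every image with its left-right mirror, doubling the set."""
        flipped = images[:, :, ::-1]  # horizontal mirror of each image
        return (np.concatenate([images, flipped]),
                np.concatenate([labels, labels]))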

The classifier is implemented in Python 3 using TensorFlow and Keras. The model was trained for 250 epochs on a 70% training subset, and reached 98.7% accuracy when evaluated against the remaining 30% testing subset. Figure 5 shows the confusion matrix and receiver operating characteristic (ROC) curve of the classification. The high accuracy shows that although the galaxy images were labeled with annotations made with SDSS images, the annotations were still consistent with the Pan-STARRS images. That consistency indicates that the two sky surveys are roughly equivalent in the information they provide about the morphology of the galaxies.

Figure 5. Confusion matrix and ROC curve of the classification of the 30% test samples.

The loss was computed using categorical cross entropy, and stochastic gradient descent was used as the optimizer. Various activation functions, including ReLU, were tested and gave comparable classification accuracy; the tanh activation used by LeNet-5 gave the best performance and was therefore used in the model. Classification of the total data set (excluding the objects removed by the filtering step) labeled 904,550 images as elliptical galaxies and 821,494 images as spiral galaxies.
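A minimal training sketch under the setup described above follows: a 70/30 split and 250 epochs. Here, model is the network sketched earlier in this subsection, and images and labels are assumed to be the augmented training arrays, with labels given as 0/1 class indices.

    from sklearn.model_selection import train_test_split
    from tensorflow.keras.utils import to_categorical

    # One-hot encode the 0/1 labels for the categorical cross-entropy loss.
    x_train, x_test, y_train, y_test = train_test_split(
        images, to_categorical(labels, num_classes=2),
        test_size=0.30, random_state=0)

    model.fit(x_train, y_train, epochs=250, validation_data=(x_test, y_test))
    print(model.evaluate(x_test, y_test))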

3.2. Secondary Classification

Following the classification described in Section 3.1, the set of images predicted as spiral was found to contain a significant number of "ghosts," or unclean images. The CNN classifier interpreted the unclean images as patterns of spiral features, while the elliptical predictions remained relatively clean.

To remove these ghosts, we constructed a second deep CNN to separate them from the true spirals. The architecture of this model is simpler than the first, using three convolutional layers with filter sizes of 7 × 7 × 8, 5 × 5 × 32, and 3 × 3 × 64, ReLU activation function, and a single SoftMax output layer. Between the convolutional layers there are maximum pooling layers that each reduce the input dimensions by half. The model uses the Adam optimizer and categorical cross entropy for loss.
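A minimal Keras sketch of this secondary classifier follows. The input size (120 × 120 grayscale, matching the primary network) and the use of "same" padding are assumptions not stated in the text.

    from tensorflow import keras
    from tensorflow.keras import layers

    ghost_model = keras.Sequential([
        layers.Input(shape=(120, 120, 1)),
        layers.Conv2D(8, kernel_size=7, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2),       # halve the spatial dimensions
        layers.Conv2D(32, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, activation="relu", padding="same"),
        layers.Flatten(),
        layers.Dense(2, activation="softmax"),  # ghost vs. true spiral
    ])
    ghost_model.compile(optimizer="adam", loss="categorical_crossentropy",
                        metrics=["accuracy"])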

For training, several hundred ghost images were initially selected from the set of galaxy images that were mistakenly predicted as spirals, and an equal number of spiral galaxy images were randomly selected from the original spiral training set. These images were divided into 70% training and 30% testing subsets, as before. The model converged during training, and the images originally labeled as spirals were further classified into true spirals and ghosts. This process was repeated several times by selecting additional training images from those labeled as "ghosts," until the size of the training set reached 4,000 images. Testing the neural network shows that it identifies "ghosts" with accuracy very close to 100%, and with almost no false positives. Figure 6 shows the confusion matrix and ROC curve when testing 1,200 images of spiral galaxies and ghosts. The final iteration of this classifier identified a total of 63,854 images as "ghosts" (∼7.8%), removing them from the set of spiral galaxies.

Figure 6. Confusion matrix and ROC curve of the classification of the 1,200 ghosts and spiral galaxies.

4. Results

The application of the methods described in Section 3 to the Pan-STARRS images described in Section 2 provided a catalog of 1,662,190 galaxies. The catalog is accessible as a simple CSV file that can be downloaded at doi:10.6084/m9.figshare.12081144.v1. Each row in the catalog is a galaxy, and includes the Pan-STARRS object ID of the galaxy, its R.A. and decl., and the probability that the galaxy is spiral or elliptical as estimated by the SoftMax layer of the CNN described in Section 3. Figure 7 shows the number of galaxies available after applying a threshold to the output of the SoftMax layer of the model.

Figure 7. Number of spiral and elliptical galaxies remaining when keeping only those at or above a certain model confidence.
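The following is a minimal sketch of reading the catalog and applying such a confidence cut with pandas, assuming the file has been downloaded as catalog.csv; the column names used here are illustrative, not the file's actual header.

    import pandas as pd

    catalog = pd.read_csv("catalog.csv")
    # Keep only annotations made with at least 90% model confidence.
    confidence = catalog[["p_spiral", "p_elliptical"]].max(axis=1)
    confident = catalog[confidence >= 0.90]
    spirals = confident[confident["p_spiral"] > confident["p_elliptical"]]
    print(len(confident), "confident galaxies,", len(spirals), "of them spiral")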

The catalog includes 904,550 galaxies identified as elliptical and 757,640 identified as spiral. It should be noted that the annotation of a galaxy as elliptical means that no spiral features were identified. However, the ability of an algorithm or a person to identify spiral features largely depends on the ability of the optics to provide a detailed image. Therefore, the identification of a galaxy as elliptical does not guarantee that the galaxy indeed has no spiral features, but rather that the optics could not resolve such features (Dojcsak & Shamir 2014). For instance, Table 2 shows examples of galaxies imaged by Pan-STARRS and SDSS, and the same galaxies imaged by the Hubble Space Telescope (HST). As the table shows, these galaxies do not have clearly visible spiral arms in the Earth-based telescopes, while the arms are seen clearly in the HST images.

Table 2. Galaxies Imaged by Pan-STARRS, SDSS, and HST

Coordinates (R.A., Decl.)
(150.165°, 1.588°)
(150.329°, 1.603°)
(149.951°, 1.966°)

Note. While the Pan-STARRS and SDSS images do not show clear spiral arms of the galaxies, HST shows that these galaxies are clearly spiral, and the arms can be identified.


4.1. Comparison to an Existing SDSS Catalog

In the absence of a large manually annotated catalog of the morphology of Pan-STARRS galaxies, the evaluation of the consistency of the annotations was done using annotations of SDSS galaxies that were also imaged by Pan-STARRS. The largest catalog of broad morphology of SDSS galaxies is Kuminski & Shamir (2016), with annotations of ∼3 × 10⁶ objects. Although SDSS is a different sky survey, its footprint overlaps with the footprint of Pan-STARRS. Since the Kuminski & Shamir (2016) catalog is large, it is expected that some of its galaxies will also be included in the catalog of Pan-STARRS galaxies described in this paper.

To evaluate the catalog, its annotations were compared to the annotations of SDSS galaxies in Kuminski & Shamir (2016) that have a high degree of annotation confidence. Since the images of Kuminski & Shamir (2016) were collected and processed by the SDSS pipeline, their object identifiers naturally do not match the identifiers of Pan-STARRS objects. Therefore, the objects were matched by their coordinates, with a tolerance of 0.0001° to account for subtle differences in measurements between the two telescopes. This produced 13,186 total matches, 1,961 of which have 90% or higher confidence in the Kuminski & Shamir (2016) catalog. Figure 8 shows the degree of agreement between the annotations of the galaxies in the catalog and the annotations of the galaxies in Kuminski & Shamir (2016) with a high confidence level. Table 3 shows examples of images that were classified correctly in Kuminski & Shamir (2016) but incorrectly by the model. Table 4 shows examples of images that were misclassified by Kuminski & Shamir (2016) but classified correctly by the model proposed in this paper.

Figure 8. The proportion of predicted labels that, when restricted to a minimum confidence threshold, agree with the annotations in Kuminski & Shamir (2016). For example, restricting the catalog to labels with 90% confidence or higher yields approximately 98% agreement with the annotations in Kuminski & Shamir (2016).
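The following is a minimal sketch of how such an agreement curve can be computed, assuming two equal-length arrays for the matched objects: pred_conf, the catalog's SoftMax confidence for each predicted label, and agree, a boolean flag that is True where the predicted label matches the Kuminski & Shamir (2016) annotation.

    import numpy as np

    def agreement_curve(pred_conf, agree, thresholds=np.arange(0.50, 1.00, 0.05)):
        """Fraction of matched labels that agree, restricted to confidence >= t."""
        pred_conf = np.asarray(pred_conf)
        agree = np.asarray(agree, dtype=bool)
        return {round(float(t), 2): float(agree[pred_conf >= t].mean())
                for t in thresholds}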

Table 3. Examples of Images that Were Misclassified by the Model

Misclassified as Spiral (Confidence)   Misclassified as Elliptical (Confidence)
0.5181                                 0.6017
0.5614                                 0.6093
0.5940                                 0.6455
0.7677                                 0.7493
0.9114                                 0.7647

Table 4. Examples of Images that Were Classified Correctly by the Model

Classified as Spiral (Confidence)   Classified as Elliptical (Confidence)
0.9999                              0.9999
0.9988                              0.8799
0.9718                              0.7568
0.7446                              0.6780
0.5163                              0.5637

When comparing the accuracy of this catalog to the accuracy of Kuminski & Shamir (2016), the algorithm of Kuminski & Shamir (2016) was more accurate in identifying spiral galaxies, while the algorithm used for this catalog was more accurate in identifying elliptical galaxies. The algorithm used in Kuminski & Shamir (2016) is a "shallow learning" algorithm (Shamir et al. 2008), which is a different paradigm of machine learning compared to the deep convolutional neural network used here. Shallow-learning features such as textures and fractals might better reflect spiral arms, and therefore increase the ability of the algorithm to detect spiral galaxies. Elliptical galaxies are more consistent in shape than spiral galaxies, which can increase the performance of deep convolutional neural networks, whose accuracy depends on the consistency of the images.

Figure 9 shows galaxies that were classified as elliptical in the Kuminski & Shamir (2016) catalog but as spiral in this catalog, and by visual inspection seem to be spiral galaxies. Figure 10 shows galaxies classified in Kuminski & Shamir (2016) as spiral but in this catalog as elliptical. Careful manual inspection of the images shows that the galaxies in Figure 9 are spiral galaxies, but in many cases the arms are dim. It is therefore possible that the shallow-learning algorithm used in Kuminski & Shamir (2016) failed to detect these spiral galaxies due to the weak presence of the spiral arms.

Figure 9. Galaxies imaged by Pan-STARRS that were classified incorrectly as elliptical in Kuminski & Shamir (2016) and as spiral in this catalog.

Figure 10. Galaxies imaged by Pan-STARRS that were classified as spiral in Kuminski & Shamir (2016) but as elliptical in this catalog.

Figure 10 shows that the galaxies classified as elliptical do not have clear spiral arms. However, given that the resolution of Pan-STARRS is limited, it is possible that these galaxies are spiral, as shown in Table 2, where spiral arms that are not visible in Pan-STARRS become clearly visible using a space-based instrument with higher resolution.

Figures 11 and 12 show the same galaxies as in Figures 9 and 10, respectively, as imaged by SDSS. Figure 11 shows that some of the galaxies are ring galaxies or interacting systems, while others are clear spiral galaxies that were misclassified by the algorithm as elliptical. Figure 12 shows the galaxies classified as spiral in the Kuminski & Shamir (2016) catalog but as elliptical in this catalog.

Figure 11. Galaxies imaged by SDSS that were classified incorrectly as elliptical in Kuminski & Shamir (2016).

Figure 12. Galaxies imaged by SDSS that were classified as spiral in Kuminski & Shamir (2016) but as elliptical in this catalog.

The comparison between the shallow-learning algorithm and the deep neural network shows that while the deep neural network leans toward elliptical galaxies, the shallow-learning algorithm is more sensitive to spiral galaxies. The shallow-learning algorithm used in Kuminski & Shamir (2016) is described in detail in Orlov et al. (2008) and Shamir et al. (2008, 2010). In summary, it computes 2883 numerical image content descriptors from each galaxy image. These image features include edges, textures, fractals, polynomial decomposition, the statistical distribution of pixel intensities, and more, providing a comprehensive numerical representation of the image. The features are filtered for the most informative ones, weighted by their informativeness, and then classified using an instance-based classifier. Instance-based classifiers have the advantage of effectively handling rare instances, imbalanced classes, and variations inside the classes (Li & Zhang 2011; Zhang et al. 2017; Mullick et al. 2018). Since the variability in the class of spiral galaxies is higher than the variability in the class of elliptical galaxies, it is possible that the instance-based classifier used in Kuminski & Shamir (2016) is more accurate in the identification of spiral galaxies.

Experiments by Walmsley et al. (2019) compared the accuracy of deep convolutional neural networks to the shallow-learning algorithm used in Kuminski & Shamir (2016) for the purpose of automatic morphological classification of galaxies. The results showed that the CNN provided better accuracy than the older shallow-learning algorithm, especially in cases of faint tidal features. That can explain some of the misclassified galaxies shown in Figure 9, in which the arms are visible but relatively dim. However, the experiments also showed that the shallow-learning algorithm used in Kuminski & Shamir (2016) was better able to handle the more complex cases, in which the CNNs struggled to make a clear classification (Walmsley et al. 2019). Since a collection of spiral galaxies is more likely to contain rare objects, and since the variability among spiral galaxies is higher, an instance-based classifier such as the one used in Kuminski & Shamir (2016) can be more effective in the identification of spiral galaxies than of elliptical galaxies.

5. Conclusions

While digital sky surveys are capable of collecting and generating extremely large databases, one of the obstacles to fully utilizing these data is automatic analysis. Image data, and in particular images of extended objects, are more challenging to analyze due to their complex nature. Here we created a catalog of Pan-STARRS galaxies classified by their broad morphology into elliptical and spiral galaxies. The annotation likelihoods provided by the SoftMax layer allow the selection of objects to produce a more consistent catalog, at the cost of sacrificing galaxies whose classification is less certain. The catalog is available in the form of a CSV file at doi:10.6084/m9.figshare.12081144.v1. The classification accuracy compares favorably with the ∼89% classification accuracy achieved when using the photometric features provided by the Pan-STARRS photometric pipeline (Baldeschi et al. 2020).

As space-based missions such as Euclid and ground-based missions such as the Rubin Observatory are expected to generate high volumes of astronomical image data, computational methods that can label and organize real-world astronomical images are expected to become increasingly pivotal in astronomy research. Such methods can provide usable data products, and are expected to become important for fully utilizing the power of these missions. While CNNs have demonstrated their ability to classify galaxies by their morphology, a practical solution needs to handle the noise, bad data, and inconsistencies that are typical of large real-world data sets. As shown in this paper, a deep neural network alone is not sufficient to provide clean data products; instead, a combination of several algorithms forming a full data analysis pipeline was needed. With the increasing robustness of such systems, it is also expected that protocols that combine multiple neural networks and filtering algorithms will be used to provide detailed morphological information. That information will become part of future data releases of digital sky surveys.

The processing was done by first downloading the galaxy images to a separate server, where the analysis of the data was performed. The reason for this practice is that the data analysis is based on solutions designed specifically for the task of galaxy annotation, and not on "standard" tasks provided by common services such as CasJobs (Li & Thakar 2008). Although the smaller JPG images were used, downloading all of the images still required a substantial amount of time. Using the more informative FITS images would have increased the download time by an order of magnitude, and analyzing the data of much larger digital sky surveys such as the Rubin Observatory would become impractical under this practice. Therefore, future surveys might provide users not merely with certain specific predesigned tasks, but might also allocate processing time for user-designed programs to access the raw data without the need to download it to third-party servers.

We would like to thank the anonymous reviewer for the comments that helped to improve the paper. This research was funded by NSF grant AST-1903823. This publication uses data generated via the Zooniverse.org platform, development of which is funded by generous support, including a Global Impact Award from Google, and by a grant from the Alfred P. Sloan Foundation. The Pan-STARRS1 Surveys (PS1) and the PS1 public science archive have been made possible through contributions by the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, the Johns Hopkins University, Durham University, the University of Edinburgh, the Queen's University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under grant No. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation grant No. AST-1238877, the University of Maryland, Eotvos Lorand University (ELTE), the Los Alamos National Laboratory, and the Gordon and Betty Moore Foundation.
