Active Optical Control with Machine Learning: A Proof of Concept for the Vera C. Rubin Observatory


Published 2021 April 8 © 2021. The American Astronomical Society. All rights reserved.
Citation: Jun E. Yin et al 2021 AJ 161 216. DOI: 10.3847/1538-3881/abe9b9


Abstract

The Active Optics System of the Vera C. Rubin Observatory (Rubin) uses information provided by four wave front sensors to determine deviations between the reconstructed wave front and the ideal wave front. The observed deviations are used to adjust the control parameters of the optical system to maintain image quality across the 3.5° field of view. The baseline approach from the project is to obtain amplitudes of the Zernike polynomials describing the distorted wave front from out-of-focus images collected by the wave front sensors. These Zernike amplitudes are related via an "influence matrix" to the control parameters necessary to correct the wave front. In this paper, we use deep-learning methods to extract the control parameters directly from the images captured by the wave front sensors. Our neural net model uses anti-aliasing pooling to boost performance, and a domain-specific loss function to aid learning and generalization. The accuracy of the control parameters derived from our model exceeds Rubin requirements even in the presence of full-moon background levels and mis-centering of reference stars. Although the training process is time consuming, model evaluation requires only a few milliseconds. This low latency should allow for the correction of the optical configuration during the readout and slew interval between successive exposures.


1. Introduction

The Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory (Rubin) will be the deepest optical/IR survey ever to cover the majority of the sky. It will provide new insights into the mysteries of dark matter and dark energy, as well as transient phenomena, solar system objects, and Milky Way structure. The telescope's optical system actively compensates for flexure via 50 control parameters that adjust the positions of the secondary mirror and the camera (using two hexapods with six actuators each) and actively deform the combined primary/tertiary mirror (M1M3) and the secondary mirror, using 156 and 72 actuators, respectively. In order to keep the images sharp, we seek an efficient and accurate way to estimate these control parameters.

The light arriving from a distant star at the top of the atmosphere may be thought of as a plane wave. Spatial variations in the index of refraction cause the wave front to distort as it passes through the atmosphere. After the wave front enters the telescope, the optical system introduces additional distortions. A typical value for the width of the point-spread function (PSF) due to atmospheric effects is 0.7''. The distortions from the optical system are approximately stable over an exposure, and must be determined and corrected from one exposure to the next.

When substantially out of focus, a star produces an annulus corresponding to the part of the primary mirror unobscured by the secondary. These annuli are colloquially known as "donut" images. The donuts reveal shifts in focus by their overall size, but contain far more information than that. The inner and outer boundaries of each donut deviate from their undistorted shape (in the ideal case, a circle), and the surface brightness of the donut is nonuniform. These perturbations carry sufficient information to enable extraction of the control parameters needed to correct the distortion. The current plan (Xin et al. 2015) for control parameter extraction is based on wave front determination via the iterative fast Fourier transformation (Roddier & Roddier 1993) and the series expansion method (Gureyev & Nugent 1996).

Meanwhile, machine-learning (ML) methods promise high image regression accuracy and short model evaluation time. This short evaluation time may be especially valuable for adaptive optics (AO) where corrections on millisecond timescales are required. For example, Angel et al. (1990), Wizinowich et al. (1991), and Vdovin (1995) studied using fully connected layers for wave front sensing for AO. Hoffmann & Elster (2020) studied using deep learning for wave front slope measurements. Liu et al. (2020) used deep learning and long short-term memory to predict wave front distortions for AO. Others have explored the possibility of wave front correction for extremely large telescopes by reconstructing the atmospheric turbulence profile with neural networks (Osborn et al. 2014; Yang et al. 2019).

In this work we address a related, but different, problem. Our task is to update the optical parameters of Rubin on a 30 s timescale using donut images as the input. A convolutional neural network (CNN) is a natural fit for a regression problem that takes in images and outputs a small number of parameters, as illustrated in Figure 1. We train a CNN on images generated for a realistic range of control parameters, and a range of sky background levels from dark sky to full Moon. The CNN architecture builds on the success of the ResNet18 (He et al. 2015) model, with some enhancements:

  • 1.  
    We use an approach called self-attention (see Section 5.2) to make the CNN nonlocal.
  • 2.  
    We use an anti-aliasing method to improve translation equivariance of neural networks.
  • 3.  
    We explicitly include a measure of PSF width in the loss function.

Our method is reliable and evaluates in milliseconds, faster than baseline methods.

Figure 1.

Figure 1. Schematics of the algorithm. Left: the input to the network is a set of eight images from the wave front sensors (treated as eight channels), each 64 × 64 pixels, i.e., 12.8'' across for 0.2'' pixels. Middle: the convolutional neural network structure is based on ResNet18 with some additional task-specific subnetwork architectures. Right: the output from the network is the prediction of the values of the 50 control parameters.


This paper is organized as follows. The main characteristics of Rubin and its active optics system (AOS) are discussed in Section 2. The control parameters that define the characteristics of the optical system, and the accuracy with which they must be determined, are discussed in Section 3. The simulated data are described in Section 4 and the machine-learning method in Section 5. The results are summarized in Section 6 and the conclusions are discussed in Section 7.

2. Vera C. Rubin Observatory

The Vera C. Rubin Observatory (Abell et al. 2009) is an 8.4 m telescope located on Cerro Pachón, Coquimbo Region, Chile. It has the world's largest digital camera, with 3.2 billion pixels covering a 3.5° field of view. The LSST survey, which will be in operation from 2022 to 2032, will take approximately 5 million exposures, probing the sky with a sensitivity and cadence superior to any previous wide-area survey.

The main scientific goals of Vera C. Rubin Observatory are to understand the mysterious dark matter and dark energy, to study the formation of the Milky Way galaxy, to map out objects in the solar system, and to observe a wide variety of transient phenomena. The study of dark matter uses weak gravitational lensing (WL). In WL, the shape of a distant galaxy is weakly distorted by the foreground masses and this shearing is a key observable. Because the shear effects are small (a few percent), the distortions in the images due to imperfections of the optical system must be minimized.

2.1. The Optical System

For a telescope of its size, Rubin has an exceptionally wide field of view giving it the ability to survey the entire sky every three to four nights. It uses a three-mirror design, as illustrated schematically in Figure 2. The optical system has 50 degrees of freedom. The default control parameters are set to correct for the as-built parameters of the telescope structure and mirror imperfections. In practice, changes in environmental conditions and gravity vector will cause changes in the shape and position of the components of the optical system. In this work we are considering the 50 ΔCP values, i.e., the deviations from the default control parameters.

Figure 2.

Figure 2. Schematic of the optical system of the Vera C. Rubin Observatory. Light is reflected by the M1 mirror, then by the M2 mirror, and finally by the M3 mirror on the way to the camera installed underneath M2. The elements controlled by the 50 control parameters are indicated. This figure is taken from Abell et al. (2009) and Ivezić et al. (2019).


The 50 control parameters used to control the shape and position of the optical system are illustrated schematically in Figure 2. Ten control parameters set the position and orientation of the M2 mirror and the camera, five for each: position in x, y, and z, and rotation about the x and y axes. The other 40 control parameters control the deformation of the mirrors, 20 for the shape of the M1/M3 mirror and 20 for the M2 mirror.

The AOS controls the position/shape of the primary and secondary mirrors and the position of the camera. The AOS measures the optical perturbations after the telescope's first 15 s exposure of each visit, and adjusts the control parameters of the telescope after the second 15 s exposure, before the next visit. The goal of this paper is to create a machine-learning model that improves AOS by providing fast, accurate updates to the control parameters.

2.2. The Wave Front Sensors

The Rubin LSST camera (Figure 3) consists of 189 science CCDs (blue), 8 guide sensors (yellow) and 4 wave front sensors (green). Each of the wave front sensors is split, with half of the surface located 1.5 mm in front of the focal plane and the other half located 1.5 mm behind it. The out-of-focus star images collected by these sensors exhibit the donut shape described above (Figure 3, right side). Our task is to derive the control parameters from such images.

Figure 3.

Figure 3. Left: the Vera C. Rubin Observatory focal plane, showing the sensors used for science (blue), guiding (yellow), and wave front sensing (green). Right: the schematic operation of the split wave front sensors, capturing images 1 mm in front of the focal plane (intra-focal) and behind the focal plane (extra-focal). Reprinted and adapted with permission from Xin et al. (2015). © The Optical Society.


2.3. The Image Quality Budget of the AOS

To achieve its science goals, the image quality obtained by Rubin should be limited only by atmospheric distortions ("seeing"), not the telescope or camera. The telescope site typically delivers a PSF FWHM of 0.5''–1.0'' with a median of about 0.7''. The overall system image quality budget (Thomas et al. 2016) is 0.4'' FWHM, of which 0.25'' is allocated to the telescope optics and 0.3'' to the camera, including sensor effects such as charge diffusion.

The image quality budget associated with the AOS is 0.079''. The part of this budget specifically allocated for distortions from incorrectly determined control parameters is 0.036''. These tolerances correspond to angular resolutions of 0.38 μrad (0.079'') and 0.17 μrad (0.036'').

3. Control Parameters

In this section, the influence of the control parameters on the measured wave fronts is discussed. Using the influence matrix, we estimate the maximum allowable uncertainty in each control parameter.

3.1. Correlation between Control Parameters and Zernike Amplitudes

The correlation between the control parameters (CP) and the Zernike 4 amplitudes (Z) of the wave front can be characterized using an influence matrix. The details of the influence matrix are described in Angeli et al. (2014). Each field sensor of the focal plane of the camera has a different influence matrix. The 50 control parameters cover the major components of the optical system. The first five control parameters, CP1–CP5, describe the translational and rotational degrees of freedom of M2. The next five control parameters, CP6–CP10, describe the translational and rotational degrees of freedom of the camera. The surface figure of M1M3 is controlled by control parameters CP11–CP30, and the surface figure of M2 is controlled by control parameters CP31–CP50. The effect of a change in the control parameters, ΔCP, on the Zernike coefficients is given by Equation (1):

Equation (1)

$\Delta Z_i = \sum_{j=1}^{50} A_{ij}\, \Delta \mathrm{CP}_j,$

where A is the influence matrix for the sensor in question.

Zernike coefficients 1–3 correspond to piston, tip, and tilt, and are not included. The influence matrix is defined assuming the control parameters for displacements and rotations are expressed in units of microns and arcseconds, respectively. The Zernike amplitudes obtained using Equation (1) are in units of microns.
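To make Equation (1) concrete, the sketch below applies an influence matrix to a vector of control-parameter deviations. The matrix entries and the number of retained Zernike terms are placeholders for illustration; the actual matrices are sensor specific (Angeli et al. 2014).

```python
import numpy as np

# Minimal sketch of Equation (1). The influence matrix A is sensor specific;
# its entries here are random placeholders, and the number of retained
# Zernike terms (n_zernike) is an assumed value for illustration only.
n_zernike, n_cp = 19, 50
A = np.random.randn(n_zernike, n_cp)   # microns of Zernike amplitude per unit CP change

delta_cp = np.zeros(n_cp)              # deviations from the default control parameters
delta_cp[9] = 50.0                     # e.g., CP10 (camera tilt about y), in arcsec

delta_z = A @ delta_cp                 # Zernike amplitude changes, in microns
print(delta_z.shape)                   # (19,)
```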

3.2. Impact of Control Parameters on Image Quality

The impact of variations in the CPs on image quality was explored by determining the 80% enclosed energy diameter of the wave front normal vectors as a function of the control parameters. In these studies, the xy plane is defined to be the plane of the wave front when all CPs are 0 (and all Zernike amplitudes are 0). The following steps were carried out for each control parameter (a numerical sketch of the key steps follows the list):

  • 1.  
    For each CP, calculate the corresponding Zernike amplitudes using the process discussed in Section 3.1 and construct the "distorted" wave front. An example of the distorted wave fronts when CP 10 (the camera tilt about the y-axis) is varied between 10'' and 100'' is shown in Figure 4.
  • 2.  
    Use the gradient of the "distorted" wave front to determine normal directions to the wave front across the pupil.
  • 3.  
    Calculate the angle between the normal to the distorted wave front and the z-axis (the normal direction to the undistorted wave front). An example of this angle across the pupil when CP 10 is varied between 10'' and 100'' is shown in Figure 5.
  • 4.  
    Define the angle θ80 as the angle that contains 80% of all normal vectors. This angle is equivalent to the 80% encircled energy diameter (EE80).
  • 5.  
    Check that the relation between θ80 and each CP is linear, as shown in Figure 6. Use this relation to determine the value of each CP corresponding to θ80 = 0.38 μrad (79 mas).
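The following is a minimal numerical sketch of steps 2–4, assuming the distorted wave front is sampled on a regular grid with NaN values outside the pupil; the function name and conventions are illustrative and not taken from the project code.

```python
import numpy as np

def theta_80(wavefront, dx):
    """80th-percentile tilt of the local wave front normals, in radians.

    wavefront : 2D array of wave front heights (same length unit as dx),
                sampled across the pupil, NaN outside the pupil.
    dx        : grid spacing.
    """
    dw_dy, dw_dx = np.gradient(wavefront, dx)     # local slopes (step 2)
    tilt = np.arctan(np.hypot(dw_dx, dw_dy))      # angle between normal and z-axis (step 3)
    return np.nanpercentile(tilt, 80.0)           # angle containing 80% of the normals (step 4)
```

Repeating this calculation over a range of values of a single CP and fitting a straight line gives the tolerance of step 5.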

Figure 4.

Figure 4. Reconstructed wavefronts at the edge of the focal plane for CP 10 values between 10'' and 100''. CP 10 is the camera tilt about the y-axis in arcseconds. The color scale shows deviation from the xy plane in units of μm.

Figure 5.

Figure 5. Angle between the normal of the distorted wave front and the z-axis when CP 10 varies between 10'' and 100''. The color scale shows angle in units of μrad.

Figure 6.

Figure 6. θ80 as a function of the value of CP 10. The observed linear relation is used to determine the value of CP 10 for which θ80 = 0.38 μrad.


These steps were followed for all 50 CPs in order to determine the sensitivity of the wave front to variations in the values of these CPs. The results of these calculations for each CP are shown in Table 1.

Table 1. Deviation for Each CP That Limits the Image Resolution to θ80 = 0.38 μrad

CP  ΔCP      | CP  ΔCP      | CP  ΔCP      | CP  ΔCP      | CP  ΔCP
 1  18 μm    | 11  0.15 μm  | 21  0.11 μm  | 31  0.23 μm  | 41  0.10 μm
 2  235 μm   | 12  0.15 μm  | 22  0.19 μm  | 32  0.23 μm  | 42  0.08 μm
 3  235 μm   | 13  0.23 μm  | 23  0.06 μm  | 33  0.16 μm  | 43  0.08 μm
 4  7''      | 14  0.11 μm  | 24  0.06 μm  | 34  0.16 μm  | 44  0.08 μm
 5  7''      | 15  0.11 μm  | 25  0.16 μm  | 35  0.19 μm  | 45  0.08 μm
 6  17 μm    | 16  0.15 μm  | 26  0.17 μm  | 36  0.10 μm  | 46  0.16 μm
 7  850 μm   | 17  0.15 μm  | 27  0.08 μm  | 37  0.10 μm  | 47  0.16 μm
 8  850 μm   | 18  0.07 μm  | 28  0.08 μm  | 38  0.14 μm  | 48  0.70 μm
 9  14''     | 19  0.07 μm  | 29  0.23 μm  | 39  0.14 μm  | 49  0.75 μm
10  14''     | 20  0.11 μm  | 30  0.27 μm  | 40  0.11 μm  | 50  0.12 μm

Note. The degrees of freedom associated with each CP are shown schematically in Figure 2.


3.3. Error Budget of Control Parameters

The error budget for the AOS is 79 mas, and is discussed in detail in Angeli et al. (2014). An uncertainty in a single CP with the value listed in Table 1 would result in an image spread of 79 mas. If the errors in all 50 CPs were statistically independent, each CP could have an error of 15% of the value listed in Table 1 and the total image spread due to the uncertainties in all CPs would be 79 mas. As a result, the goal of this work is to demonstrate that each CP can be reconstructed with an uncertainty less than 15% of the values listed in Table 1.
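As a short check of the 15% figure, assuming the image spreads from independent CP errors add in quadrature and that the spread scales linearly with each CP error:

$\sqrt{50}\,\epsilon \times 79\ \mathrm{mas} = 79\ \mathrm{mas} \quad\Longrightarrow\quad \epsilon = \frac{1}{\sqrt{50}} \approx 0.14 \approx 15\%.$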

4. Data

The training and test images were generated using the wave front analysis code developed by Roodman et al. (2014) for the Dark Energy Camera. We also validated the method on Rubin wave fronts simulated with GalSim (Rowe et al. 2015).

For each set of random control parameters, the corresponding Zernike amplitudes were calculated using the process discussed in Section 3.1. To simulate a focal-plane offset of ±1 mm, we used defocus values (Z4) of ±40 μm. Such intra- and extra-focal images were produced for each of the four wave front sensors, for a total of eight images. The 224 × 224 images were generated at a pixel scale of 0.20'' per pixel, and then zero-padded and binned down by a factor of 4 × 4, as sketched below. This rebinning speeds up model training and evaluation, and does not substantially affect the results.
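A minimal sketch of the zero-padding and rebinning step follows; the padded size of 256 × 256 is an assumption (the text specifies only the 224 × 224 input and the 4 × 4 binning factor).

```python
import numpy as np

def pad_and_bin(img, pad_to=256, factor=4):
    """Zero-pad a square donut image to pad_to x pad_to, then bin by factor x factor.

    The padded size is an assumed value; summing over each factor x factor
    block preserves the total counts in the image.
    """
    pad = (pad_to - img.shape[0]) // 2
    padded = np.pad(img, pad, mode="constant")
    out = pad_to // factor
    return padded.reshape(out, factor, out, factor).sum(axis=(1, 3))

binned = pad_and_bin(np.random.poisson(1000.0, size=(224, 224)).astype(float))
print(binned.shape)   # (64, 64)
```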

A total of 37,000 sets of random control parameters were used to generate 37,000 sets of eight intra- and extra-focal images. These images are used to train and test the performance of the neural network. Examples of images generated for various control parameters are shown in Figures 7 and 8.

Figure 7.

Figure 7. Left: simulated image in Sensor 32 with no perturbation in the optical system (all CPs are zero). Middle: simulated image in Sensor 32 when CP08 = 1700 μm (the limit for this CP as shown in Table 1) and all other CPs are zero. Right: simulated image in Sensor 32 when CP10 = 140'' (five times the limit for this CP as shown in Table 1) and all other CPs are zero. The color scale in each image shows the number of electrons per pixel.


It is essential that the CNN be robust to varying amounts of pixel noise and to small shifts in donut position. We therefore augment the training data by adding noise and random shifts (uncorrelated among the eight donut images).

Each donut image is generated with 10^8 simulated counts (photoelectrons), resulting in a signal-to-noise ratio (S/N) of about 600 per pixel. We treat these images as having effectively infinite S/N and then add noise as follows. We model the variance of each pixel as the sum of the background counts and the image counts for each (rescaled) donut, and then add to each pixel a number drawn from a mean-zero Gaussian with that variance.

Typical sky background in r band for a 15 s exposure is between 800 (no moon) and 5000 (full moon) counts. The sky background for training images is drawn from a uniform distribution between these two values.
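A minimal sketch of this noise model (Gaussian scatter whose per-pixel variance is the sum of the sky background and the donut counts) is shown below; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_noise(donut, sky_min=800.0, sky_max=5000.0):
    """Add background and shot-noise-like scatter to a noiseless donut image.

    donut : 2D array of expected counts (photoelectrons) per pixel.
    The sky level is drawn uniformly between the dark-sky and full-moon
    values quoted above for a 15 s r-band exposure.
    """
    sky = rng.uniform(sky_min, sky_max)
    variance = sky + donut                        # background counts + image counts
    return donut + rng.normal(0.0, np.sqrt(variance))
```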

5. Method

5.1. Training Algorithm

We use ResNet18 (He et al. 2015) as the backbone network. 5 Our data set includes 37,000 sets of images (30,000 for training, 7000 for testing), and each set consists of eight wave front images. The model predicts the 50 physical parameters for each set of eight images, as illustrated in Figure 1. We compare different combinations of self-attention (SelfAttn), anti-aliasing (AntiAlias), PSF loss (PSF), and scaled loss (Scaled), as discussed below. We run the same experiments on three versions of the data: without noise or random shifts, with noise but without random shifts, and with both noise and random shifts.

In each case the model is trained for 250 epochs 6 with a batch size of 64. During one epoch the NN makes use of every training example once; after each epoch the model is evaluated on the test data set to assess performance.

We train the NN with Adam (Kingma & Ba 2014), a gradient-descent-based optimization method, with an initial learning rate of 0.01, and divide the learning rate by 10 at epochs 75, 150, and 200. The training time is on the order of 6 GPU hours. We use the mean absolute error (MAE), root-mean-squared error (RMSE), and coefficient of determination (R2) as error metrics.
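A minimal sketch of this optimization setup, assuming a PyTorch implementation; the network is stood in for by a small placeholder model and the data by random tensors, so only the optimizer, learning-rate schedule, and epoch count reflect the text.

```python
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 64 * 64, 50))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[75, 150, 200], gamma=0.1)   # divide the learning rate by 10
criterion = torch.nn.MSELoss()                         # stand-in for the scaled/PSF loss

x = torch.randn(64, 8, 64, 64)   # one batch of eight-channel donut images (dummy data)
y = torch.randn(64, 50)          # corresponding control parameters (dummy data)

for epoch in range(250):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```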

Figure 8.

Figure 8. Examples of simulated images from eight wave front sensors. The color scale in each image shows the number of electrons per pixel. A perfectly corrected wave front would be rotationally symmetric, with a uniform distribution of electrons in the donut.


5.2. Self-attention

Following Wang et al. (2018) and Xie et al. (2018), we use self-attention modules to denoise feature maps (Goodfellow et al. 2016); here "feature maps" refers to both the input images and the outputs of subsequent CNN layers. Self-attention is a nonlocal operation: it calculates the response at a position as a weighted sum of the features at all positions in the input feature map (Wang et al. 2018), which helps the network capture long-range dependences. As shown in Figure 9, self-attention can be formulated as

Equation (2)

$o_i = \frac{1}{C_i} \sum_{j} F(x_i, x_j)\, h(x_j),$

where x is the input feature, o is the output feature, i is the index of an output position, C_i is a normalization factor, h is a function usually formulated as a linear projection, and F is a function that models the similarity of x_i and x_j. Usually,

Equation (3)

$F(x_i, x_j) = e^{f(x_i)^{\top} g(x_j)},$

where f(x) = W_f x, g(x) = W_g x, and W_f, W_g are learned weight matrices.

Figure 9.

Figure 9. Self-attention module. Figure credit: Self-Attention GAN (Zhang et al. 2019).

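A minimal sketch of a self-attention module of this kind, following the formulation of Equations (2) and (3) and the SAGAN design of Zhang et al. (2019), assuming a PyTorch implementation; the channel-reduction factor and the learned residual weight are common choices rather than values taken from our code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Nonlocal self-attention over the spatial positions of a feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # f(x) = W_f x
        self.g = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # g(x) = W_g x
        self.h = nn.Conv2d(channels, channels, kernel_size=1)               # h(x), linear projection
        self.gamma = nn.Parameter(torch.zeros(1))                           # learned residual weight

    def forward(self, x):
        b, c, height, width = x.shape
        n = height * width
        f = self.f(x).view(b, -1, n)                                 # (B, C', N)
        g = self.g(x).view(b, -1, n)                                 # (B, C', N)
        h = self.h(x).view(b, c, n)                                  # (B, C, N)
        # Softmax over j plays the role of F(x_i, x_j) / C_i in Equations (2)-(3).
        attn = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=-1)    # (B, N, N)
        o = torch.bmm(h, attn.transpose(1, 2)).view(b, c, height, width)
        return self.gamma * o + x                                    # residual connection

sa = SelfAttention2d(channels=64)
print(sa(torch.randn(1, 64, 16, 16)).shape)   # torch.Size([1, 64, 16, 16])
```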

5.3. Anti-aliasing in Feature Maps

A common property of CNNs is that the output depends sensitively on the position of the object in the input image. For example, when a pooling step (Goodfellow et al. 2016) is done sparsely and then subsampled, a single-pixel shift can have an unacceptably large impact. For our application we would like translational equivariance, 7 so we employ anti-aliasing max pooling (Zhang 2019). At each convolutional layer we first densely evaluate max pooling on the feature maps, then use a box filter to blur the feature maps in each channel, and finally subsample the feature maps with the required stride (Figure 10).

Figure 10.

Figure 10. Anti-aliasing max pooling. This is conceptually similar to smoothing an image before down-sampling to maintain a well-sampled PSF. Anti-aliasing pooling makes the output of the CNN nearly shift invariant. Figure credit: Zhang (2019), "Making Convolutional Networks Shift-Invariant Again."

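A minimal sketch of anti-aliased max pooling (dense max pooling, a channel-wise box blur, then subsampling), assuming a PyTorch implementation; the 3 × 3 box-filter size is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AntiAliasedMaxPool2d(nn.Module):
    """Max pool densely (stride 1), blur with a box filter, then subsample."""

    def __init__(self, channels, stride=2, blur_size=3):
        super().__init__()
        self.stride = stride
        self.channels = channels
        self.pad = blur_size // 2
        box = torch.full((channels, 1, blur_size, blur_size), 1.0 / blur_size ** 2)
        self.register_buffer("box", box)          # fixed (non-learned) blur kernel

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)              # dense max pooling
        x = F.conv2d(x, self.box, stride=self.stride,
                     padding=self.pad, groups=self.channels)      # blur, then subsample
        return x

pool = AntiAliasedMaxPool2d(channels=64)
print(pool(torch.randn(1, 64, 64, 64)).shape)   # torch.Size([1, 64, 32, 32])
```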

5.4. Loss Function

We use y to denote the predicted control parameter vector and y* to denote the ground truth. The loss function is formulated as follows:

Equation (4)

$L(y, y^*) = \sum_{j} \alpha_j\, L_2(y_j, y_j^*) + \beta\, f(y, y^*),$

where α_j is a scaling factor calculated from the error tolerance of control parameter j, and β is a scaling factor that controls the relative weight of the two terms. We denote $\sum_j \alpha_j L_2(y_j, y_j^*)$ as the scaled L2 loss and f(y, y*) as the PSF loss. In our experiments, β = 10 gives the best results, and the results are robust to the choice of β within the range [0.1, 100].

We define the prediction error as the difference between the predicted and ground-truth values. The error tolerance is derived in Section 3. The scaled L2 loss function uses each parameter's error tolerance to scale its L2 loss. The PSF loss term calculates the rms of the gradients of the wave front (θ80; see Section 3.2), averaged over the pupil. Different prediction errors can correspond to the same θ80 and thus have the same PSF loss; therefore, the PSF loss term allows degenerate solutions for the control parameter predictions. The code is available in a GitHub repository at https://github.com/JunYinDM/VCRO and on Zenodo (Jun E. Yin 2021).
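A minimal sketch of Equation (4), assuming a PyTorch implementation; psf_term is a hypothetical stand-in for the θ80-based PSF loss, which in practice requires reconstructing the wave front from the predicted control parameters.

```python
import torch

def scaled_l2_psf_loss(y_pred, y_true, alpha, psf_term=None, beta=10.0):
    """Scaled L2 loss plus a PSF term, as in Equation (4).

    alpha    : (50,) tensor of weights derived from the Table 1 error tolerances.
    psf_term : hypothetical callable returning a theta_80-like PSF width per
               sample; None reduces this to the scaled L2 loss alone.
    """
    scaled_l2 = (alpha * (y_pred - y_true) ** 2).sum(dim=-1).mean()
    if psf_term is None:
        return scaled_l2
    return scaled_l2 + beta * psf_term(y_pred, y_true).mean()
```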

6. Results

6.1. Data without Noise and Shift

Table 2 shows the results of experiments on data without noise and shifts. Compared to the baseline ResNet18, the scaled loss boosts performance. Self-attention and anti-aliasing also help, but combinations of the different techniques do not yield fully consistent results.

Table 2. Experiments on Data without Noise and Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.3095   1.158    0.2235
ResNet18+Scaled                           0.3841   1.691    0.7235
ResNet18+Scaled+SelfAttn                  0.405    1.671    0.4687
ResNet18+Scaled+AntiAlias                 0.3961   1.738    0.6155
ResNet18+Scaled+SelfAttn+AntiAlias        0.8882   4.326    0.1721
ResNet18+Scaled+PSF+AntiAlias             0.2494   0.9725   0.9563
ResNet18+Scaled+PSF+SelfAttn              0.2828   1.031    0.9583
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.281    1.034    0.9558

Note. The ↓ symbol means smaller values correspond to better performance, and ↑ means higher values correspond to better performance. The bold values correspond to the best performance.


6.2. Data with Noise but without Random Shift

Table 3 shows the results of experiments on data with noise but without random shifts. Compared to the results in Table 2, self-attention works much better in terms of R2, supporting our expectation that self-attention modules can help the model tolerate noisy inputs, as shown in Xie et al. (2018). For MAE and RMSE, adding the PSF loss is the most effective.

Table 3. Experiments on Data with Noise but without Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.6901   3.221    0.1553
ResNet18+Scaled                           0.8903   4.328    0.2506
ResNet18+Scaled+SelfAttn                  0.9053   4.343    0.3083
ResNet18+Scaled+AntiAlias                 0.9062   4.341    0.2315
ResNet18+Scaled+SelfAttn+AntiAlias        0.8932   4.334    0.2097
ResNet18+Scaled+PSF+SelfAttn              0.6096   2.97     0.6163
ResNet18+Scaled+PSF+AntiAlias             0.5945   2.789    0.579
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.5833   2.83     0.5679


6.3. Data with Noise and Random Shift

Table 4 and Figure 12 show the results of experiments on data with noise and random shifts. Compared to the results in Tables 2 and 3, the networks with anti-aliasing are more robust to noise and random translations. In short, anti-aliasing improves the translation equivariance/invariance of the neural networks, and self-attention improves their robustness to noise.

Table 4. Experiments on Data with Noise and Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.6849   3.201    0.1426
ResNet18+Scaled                           0.9178   4.364    0.2026
ResNet18+Scaled+SelfAttn                  0.9197   4.357    0.2303
ResNet18+Scaled+AntiAlias                 0.9146   4.36     0.2316
ResNet18+Scaled+SelfAttn+AntiAlias        0.9087   4.346    0.2296
ResNet18+Scaled+PSF+SelfAttn              0.6399   2.994    0.5675
ResNet18+Scaled+PSF+AntiAlias             0.5976   2.829    0.5805
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.6174   2.837    0.4885


7. Conclusion

Rubin has an AOS with 50 control parameters that must be derived from eight out-of-focus images (donuts) from the wave front sensors. The baseline approach derives Zernike coefficients from the donut images and then transforms these to control parameters via an "influence matrix." This head-on approach can be slow, and near-degeneracies in the influence matrix may require ad hoc regularization choices.

In this work, we use machine learning, constructing a CNN that maps the donut images to CPs. The CNN is trained on 37,000 sets of eight donut images generated for a random distribution of CPs. The CNN recovers CPs from the donuts within the required tolerance (Figure 11) and keeps the AOS contribution to the PSF below 0.079'', which is negligible when added (in quadrature) to a typical PSF FWHM of 0.65'' (Figure 12).

Figure 11.

Figure 11. Prediction RMSE divided by the error tolerance for each of the 50 control parameters. The fact that this ratio is less than one indicates that the RMSE of the neural-network predictions is within the error tolerances determined in Table 1.

Figure 12.

Figure 12. Selecting the 10% worst cases based on θ80, we construct the PSF for typical seeing of ∼0.65'', measure its FWHM, and find that the prediction error of the CPs makes only a small contribution to the FWHM.


Starting from a backbone model (ResNet18) we investigate the impact of four enhancements:

  • 1.  
    scaled L2 loss,
  • 2.  
    addition of a PSF term to the loss function,
  • 3.  
    anti-aliasing pooling, and
  • 4.  
    self-attention.

The scaled L2 loss function uses each parameter's error tolerance to scale its L2 loss. This is in contrast to the naive L2 loss, where the loss of every parameter is weighted equally, regardless of the required tolerance. Because some parameters have units of angle and some of distance, it is not meaningful to combine them without appropriate scaling. Using the scaled loss enhances the performance substantially (R2 increases going from row 1 to row 2 of Tables 2–4). The RMSE increases in some cases, not because the fit is worse, but because it is calculated without scaling by the tolerance, so a less important parameter may contribute a lot to the RMSE without contributing much to the (scaled) loss function.

The PSF loss function explicitly penalizes a poor PSF (large FWHM). There are many combinations of CPs that produce similar PSFs. We care less about achieving the exact CP values than about finding values that yield a good PSF. Adding a PSF term to the loss function enhances performance substantially (Tables 2–4).

Anti-aliasing pooling addresses the problem of shift equivariance. It may happen that a star is mis-centered (whether the star is simply found in the image, or drawn from an astrometric catalog of appropriate stars), and the resulting CPs must not depend on a shift of a few pixels. By including anti-aliasing pooling, and augmenting the training data to include randomly shifted donuts, the resulting model performance is insensitive to image shift (compare Tables 3 and 4).

Self-attention is a mechanism that links information from disparate parts of an image, and in some cases allows a neural network to be less sensitive to noisy inputs. Our experiments that included self-attention modules in the CNN led to modest changes in performance. In perhaps the most interesting case (noisy data with shifts, Table 4), the CNN performed slightly worse by all three performance metrics with self-attention than without, though it helped performance in some other cases. Overall, it is not clear that self-attention is a useful enhancement for this problem.

In summary, a straightforward CNN with anti-aliasing pooling and a problem-specific loss function performs well on simulated data, even in the presence of realistic noise and image shifts. As with most machine-learning problems, significant up-front computational expense is rewarded with fast evaluation. The training data require 4000 CPU hours to generate, and the CNN trains in 6 GPU hours; however, it evaluates in a few milliseconds. Indeed, this is the most notable performance difference from previously published approaches (Xin et al. 2015): both derive control parameters with adequate precision, but the neural-net evaluation is much faster. Rubin does not require control parameter updates on subsecond timescales, but future telescopes might benefit from this capability.

An operational system for Rubin would need to address the upstream problem of donut selection, tolerate overlaps of faint donuts, and deal with real-life issues like saturation, cosmic rays, and missing data. This proof of concept on simulated data makes it plausible that such a system could be built, and that neural networks can play a key role in keeping the Rubin images sharp.

It is a pleasure to acknowledge Chuck Claver, Sandrine Thomas, Aaron Roodman, and Bo Xin who provided insight into the VCRO optical system and helped with simulation software. Josh Meyers provided advice on the use of GalSim. Michelle Ntampaka provided feedback on the paper draft. Parts of this work were developed in a class at MIT taught by Phillip Isola.

J.E.Y. is partially supported by U.S. Department of Energy grant DE-SC0007881 and the Harvard Data Science Initiative. D.J.E. is supported by U.S. Department of Energy grant DE-SC0013718 and as a Simons Foundation Investigator. D.P.F. is partially supported by NSF grant AST-1614941. C.W.S. is supported by U.S. Department of Energy grant DE-SC0007881.

Footnotes

  • 4  

    Zernike polynomials are orthogonal on the unit disk, but not on the annulus corresponding to the Rubin pupil. Throughout this analysis we actually use Zernike-like polynomials that are orthogonal over the appropriate annulus.

  • 5  

    We also tested with AlexNet, ResNet34, ResNet50, and a custom-designed architecture, and obtained the best performance from ResNet18.

  • 6  

    Training proceeds via stochastic gradient descent, in which a gradient in NN parameter space is calculated for subsamples of the training data called batches. After each such gradient calculation, the NN parameters are updated by following the gradient by an amount proportional to the learning rate. A training epoch consists of many batches, and makes use of all training data. See Goodfellow et al. (2016) for a more detailed explanation.

  • 7  

    More precisely, we want translational invariance of the control parameters. Other parameters like the (x, y) position of the donut should be equivariant (i.e., varying smoothly and proportionally with the input shift), but those are nuisance parameters in our case.
