Active Optical Control with Machine Learning: A Proof of Concept for the Vera C. Rubin Observatory


Published 2021 April 8 © 2021. The American Astronomical Society. All rights reserved.
Citation: Jun E. Yin et al 2021 AJ 161 216. DOI: 10.3847/1538-3881/abe9b9


Abstract

The Active Optics System of the Vera C. Rubin Observatory (Rubin) uses information provided by four wave front sensors to determine deviations between the reconstructed wave front and the ideal wave front. The observed deviations are used to adjust the control parameters of the optical system to maintain image quality across the 3.5° field of view. The baseline approach from the project is to obtain amplitudes of the Zernike polynomials describing the distorted wave front from out-of-focus images collected by the wave front sensors. These Zernike amplitudes are related via an "influence matrix" to the control parameters necessary to correct the wave front. In this paper, we use deep-learning methods to extract the control parameters directly from the images captured by the wave front sensors. Our neural net model uses anti-aliasing pooling to boost performance, and a domain-specific loss function to aid learning and generalization. The accuracy of the control parameters derived from our model exceeds Rubin requirements even in the presence of full-moon background levels and mis-centering of reference stars. Although the training process is time consuming, model evaluation requires only a few milliseconds. This low latency should allow for the correction of the optical configuration during the readout and slew interval between successive exposures.


1. Introduction

The Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory (Rubin) will be the deepest optical/IR survey ever to cover the majority of the sky. It will provide new insights into the mysteries of dark matter and dark energy, as well as transient phenomena, solar system objects, and Milky Way structure. The telescope's optical system actively compensates for flexure via 50 control parameters that adjust the positions of the secondary mirror and the camera (using two hexapods with six actuators each) and actively deform the combined primary/tertiary mirror (M1M3) and the secondary mirror, using 156 and 72 actuators, respectively. In order to keep the images sharp, we seek an efficient and accurate way to estimate these control parameters.

The light arriving from a distant star at the top of the atmosphere may be thought of as a plane wave. Spatial variations in the index of refraction cause the wave front to distort as it passes through the atmosphere. After the wave front enters the telescope, the optical system introduces additional distortions. A typical value for the width of the point-spread function (PSF) due to atmospheric effects is 0.7''. The distortions from the optical system are approximately stable over an exposure, and must be determined and corrected from one exposure to the next.

When substantially out of focus, a star produces an annulus corresponding to the part of the primary mirror unobscured by the secondary. These annuli are colloquially known as "donut" images. The donuts reveal shifts in focus by their overall size, but contain far more information than that. The inner and outer boundaries of each donut deviate from their undistorted shape (in the ideal case, a circle), and the surface brightness of the donut is nonuniform. These perturbations carry sufficient information to enable extraction of the control parameters needed to correct the distortion. The current plan (Xin et al. 2015) for control parameter extraction is based on wave front determination via the iterative fast Fourier transformation (Roddier & Roddier 1993) and the series expansion method (Gureyev & Nugent 1996).

Meanwhile, machine-learning (ML) methods promise high image regression accuracy and short model evaluation time. This short evaluation time may be especially valuable for adaptive optics (AO) where corrections on millisecond timescales are required. For example, Angel et al. (1990), Wizinowich et al. (1991), and Vdovin (1995) studied using fully connected layers for wave front sensing for AO. Hoffmann & Elster (2020) studied using deep learning for wave front slope measurements. Liu et al. (2020) used deep learning and long short-term memory to predict wave front distortions for AO. Others have explored the possibility of wave front correction for extremely large telescopes by reconstructing the atmospheric turbulence profile with neural networks (Osborn et al. 2014; Yang et al. 2019).

In this work we address a related, but different, problem. Our task is to update the optical parameters of Rubin on a 30 s timescale using donut images as the input. A convolutional neural network (CNN) is a natural fit for a regression problem that takes in images and outputs a small number of parameters, as illustrated in Figure 1. We train a CNN on images generated for a realistic range of control parameters, and a range of sky background levels from dark sky to full Moon. The CNN architecture builds on the success of the ResNet18 (He et al. 2015) model, with some enhancements:

  • 1.  
    We use an approach called self-attention (see Section 5.2) to make the CNN nonlocal.
  • 2.  
    We use an anti-aliasing method to improve translation equivariance of neural networks.
  • 3.  
    We explicitly include a measure of PSF width in the loss function.

Our method is reliable and evaluates in milliseconds, faster than baseline methods.

Figure 1.

Figure 1. Schematics of the algorithm. Left: the input to the network is a set of eight images from the wave front sensors (treated as eight channels), each 64 × 64 pixels, i.e., 12.8'' across for 0.2'' pixels. Middle: the convolutional neural network structure is based on ResNet18 with some additional task-specific subnetwork architectures. Right: the output from the network is the prediction of the values of the 50 control parameters.


This paper is organized as follows. The main characteristics of Rubin and its active optics system (AOS) are discussed in Section 2. The control parameters that define the characteristics of the optical system, and the accuracy with which they must be determined, are discussed in Section 3. The simulated data are described in Section 4 and the machine-learning method in Section 5. The results are summarized in Section 6 and the conclusions are discussed in Section 7.

2. Vera C. Rubin Observatory

The Vera C. Rubin Observatory (Abell et al. 2009) is an 8.4 m telescope located on Cerro Pachón, Coquimbo Region, Chile. It has the world's largest digital camera, with 3.2 billion pixels covering a 3.5° field of view. The LSST survey, which will be in operation from 2022 to 2032, will take approximately 5 million exposures, probing the sky with a sensitivity and cadence superior to any previous wide-area survey.

The main scientific goals of Vera C. Rubin Observatory are to understand the mysterious dark matter and dark energy, to study the formation of the Milky Way galaxy, to map out objects in the solar system, and to observe a wide variety of transient phenomena. The study of dark matter uses weak gravitational lensing (WL). In WL, the shape of a distant galaxy is weakly distorted by the foreground masses and this shearing is a key observable. Because the shear effects are small (a few percent), the distortions in the images due to imperfections of the optical system must be minimized.

2.1. The Optical System

For a telescope of its size, Rubin has an exceptionally wide field of view giving it the ability to survey the entire sky every three to four nights. It uses a three-mirror design, as illustrated schematically in Figure 2. The optical system has 50 degrees of freedom. The default control parameters are set to correct for the as-built parameters of the telescope structure and mirror imperfections. In practice, changes in environmental conditions and gravity vector will cause changes in the shape and position of the components of the optical system. In this work we are considering the 50 ΔCP values, i.e., the deviations from the default control parameters.

Figure 2.

Figure 2. Schematic of the optical system of the Vera C. Rubin Observatory. Light is reflected by the M1 mirror, then by the M2 mirror, and finally by the M3 mirror on the way to the camera installed underneath M2. The elements controlled by the 50 control parameters are indicated. This figure is taken from Abell et al. (2009) and Ivezić et al. (2019).


The 50 control parameters used to control the shape and position of the optical system are illustrated schematically in Figure 2. Ten control parameters set the position and orientation of the M2 mirror and the camera, five for each: position in x, y, and z, and rotation about the x and y axes. The other 40 control parameters control the deformation of the mirrors, 20 for the shape of the M1/M3 mirror and 20 for the M2 mirror.

The AOS controls the position/shape of the primary and secondary mirrors and the position of the camera. The AOS measures the optical perturbations after the telescope's first 15 s exposure of each visit, and adjusts the control parameters of the telescope after the second 15 s exposure, before the next visit. The goal of this paper is to create a machine-learning model that improves AOS by providing fast, accurate updates to the control parameters.

2.2. The Wave Front Sensors

The Rubin LSST camera (Figure 3) consists of 189 science CCDs (blue), 8 guide sensors (yellow) and 4 wave front sensors (green). Each of the wave front sensors is split, with half of the surface located 1.5 mm in front of the focal plane and the other half located 1.5 mm behind it. The out-of-focus star images collected by these sensors exhibit the donut shape described above (Figure 3, right side). Our task is to derive the control parameters from such images.

Figure 3.

Figure 3. Left: the Vera C. Rubin Observatory focal plane, showing the sensors used for science (blue), guiding (yellow), and wave front sensing (green). Right: the schematic operation of the split wave front sensors, capturing images 1 mm in front of the focal plane (intra-focal) and behind the focal plane (extra-focal). Reprinted and adapted with permission from Xin et al. (2015). © The Optical Society.


2.3. The Image Quality Budget of the AOS

To achieve its science goals, the image quality obtained by Rubin should be limited only by atmospheric distortions ("seeing"), not the telescope or camera. The telescope site typically delivers a PSF FWHM of 0.5''–1.0'' with a median of about 0.7''. The overall system image quality budget (Thomas et al. 2016) is 0.4'' FWHM, of which 0.25'' is allocated to the telescope optics and 0.3'' to the camera, including sensor effects such as charge diffusion.

The image quality budget associated with the AOS is 0.079''. The part of this budget specifically allocated for distortions from incorrectly determined control parameters is 0.036''. These tolerances correspond to angular resolutions of 0.38 μrad (0.079'') and 0.17 μrad (0.036'').

3. Control Parameters

In this section, the influence of the control parameters on the measured wave fronts is discussed. Using the influence matrix, we estimate the maximum allowable uncertainty in each control parameter.

3.1. Correlation between Control Parameters and Zernike Amplitudes

The correlation between the control parameters (CP) and the Zernike 4 amplitudes (Z) of the wave front can be characterized using an influence matrix. The details of the influence matrix are described in Angeli et al. (2014). Each field sensor of the focal plane of the camera has a different influence matrix. The 50 control parameters cover the major components of the optical system. The first five control parameters, CP1–CP5, describe the translational and rotational degrees of freedom of M2. The next five control parameters, CP6–CP10, describe the translational and rotational degrees of freedom of the camera. The surface figure of M1M3 is controlled by control parameters CP11–CP30, and the surface figure of M2 is controlled by control parameters CP31–CP50. The effect of a change in the control parameters, ΔCP, on the Zernike coefficients is given by Equation (1):

Equation (1)

$\Delta Z_i = \sum_{j=1}^{50} A_{ij}\, \Delta \mathrm{CP}_j,$

where A is the influence matrix for the sensor in question.

Zernike coefficients 1–3 correspond to piston, tip, and tilt, and are not included. The influence matrix is defined assuming the control parameters for displacements and rotations are expressed in units of microns and arcseconds, respectively. The Zernike amplitudes obtained using Equation (1) are in units of microns.
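To make Equation (1) concrete, the sketch below applies an influence matrix to a vector of control-parameter deviations. The matrix entries and the number of retained Zernike terms are placeholders for illustration; the actual matrices are sensor specific (Angeli et al. 2014).

```python
import numpy as np

# Minimal sketch of Equation (1). The influence matrix A is sensor specific;
# its entries here are random placeholders, and the number of retained
# Zernike terms (n_zernike) is an assumed value for illustration only.
n_zernike, n_cp = 19, 50
A = np.random.randn(n_zernike, n_cp)   # microns of Zernike amplitude per unit CP change

delta_cp = np.zeros(n_cp)              # deviations from the default control parameters
delta_cp[9] = 50.0                     # e.g., CP10 (camera tilt about y), in arcsec

delta_z = A @ delta_cp                 # Zernike amplitude changes, in microns
print(delta_z.shape)                   # (19,)
```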

3.2. Impact of Control Parameters on Image Quality

The impact of variations in the CPs on image quality was explored by determining the 80% enclosed energy diameter of the wave front normal vectors as a function of the control parameters. In these studies, the xy plane is defined to be the plane of the wave front when all CPs are 0 (and all Zernike amplitudes are 0). The following steps were carried out for each control parameter (a numerical sketch of the key steps follows the list):

  • 1.  
    For each CP, calculate the corresponding Zernike amplitudes using the process discussed in Section 3.1 and construct the "distorted" wave front. An example of the distorted wave fronts when CP 10 (the camera tilt about the y-axis) is varied between 10'' and 100'' is shown in Figure 4.
  • 2.  
    Use the gradient of the "distorted" wave front to determine normal directions to the wave front across the pupil.
  • 3.  
    Calculate the angle between the normal to the distorted wave front and the z-axis (the normal direction to the undistorted wave front). An example of this angle across the pupil when CP 10 is varied between 10'' and 100'' is shown in Figure 5.
  • 4.  
    Define the angle θ80 as the angle that contains 80% of all normal vectors. This angle is equivalent to the 80% encircled energy diameter (EE80).
  • 5.  
    Check that the relation between θ80 and each CP is linear, as shown in Figure 6. Use this relation to determine the value of each CP corresponding to θ80 = 0.38 μrad (79 mas).
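The following is a minimal numerical sketch of steps 2–4, assuming the distorted wave front is sampled on a regular grid with NaN values outside the pupil; the function name and conventions are illustrative and not taken from the project code.

```python
import numpy as np

def theta_80(wavefront, dx):
    """80th-percentile tilt of the local wave front normals, in radians.

    wavefront : 2D array of wave front heights (same length unit as dx),
                sampled across the pupil, NaN outside the pupil.
    dx        : grid spacing.
    """
    dw_dy, dw_dx = np.gradient(wavefront, dx)     # local slopes (step 2)
    tilt = np.arctan(np.hypot(dw_dx, dw_dy))      # angle between normal and z-axis (step 3)
    return np.nanpercentile(tilt, 80.0)           # angle containing 80% of the normals (step 4)
```

Repeating this calculation over a range of values of a single CP and fitting a straight line gives the tolerance of step 5.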

Figure 4.

Figure 4. Reconstructed wavefronts at the edge of the focal plane for CP 10 values between 10'' and 100''. CP 10 is the camera tilt about the y-axis in arcseconds. The color scale shows deviation from the xy plane in units of μm.

Figure 5.

Figure 5. Angle between the normal of the distorted wave front and the z-axis when CP 10 varies between 10'' and 100''. The color scale shows angle in units of μrad.

Figure 6.

Figure 6. θ80 as a function of the value of CP 10. The observed linear relation is used to determine the value of CP 10 for which θ80 = 0.38 μrad.


These steps were followed for all 50 CPs in order to determine the sensitivity of the wave front to variations in the values of these CPs. The results of these calculations for each CP are shown in Table 1.

Table 1. Deviation for Each CP That Limits the Image Resolution to θ80 = 0.38 μrad

CP  ΔCP      | CP  ΔCP      | CP  ΔCP      | CP  ΔCP      | CP  ΔCP
 1  18 μm    | 11  0.15 μm  | 21  0.11 μm  | 31  0.23 μm  | 41  0.10 μm
 2  235 μm   | 12  0.15 μm  | 22  0.19 μm  | 32  0.23 μm  | 42  0.08 μm
 3  235 μm   | 13  0.23 μm  | 23  0.06 μm  | 33  0.16 μm  | 43  0.08 μm
 4  7''      | 14  0.11 μm  | 24  0.06 μm  | 34  0.16 μm  | 44  0.08 μm
 5  7''      | 15  0.11 μm  | 25  0.16 μm  | 35  0.19 μm  | 45  0.08 μm
 6  17 μm    | 16  0.15 μm  | 26  0.17 μm  | 36  0.10 μm  | 46  0.16 μm
 7  850 μm   | 17  0.15 μm  | 27  0.08 μm  | 37  0.10 μm  | 47  0.16 μm
 8  850 μm   | 18  0.07 μm  | 28  0.08 μm  | 38  0.14 μm  | 48  0.70 μm
 9  14''     | 19  0.07 μm  | 29  0.23 μm  | 39  0.14 μm  | 49  0.75 μm
10  14''     | 20  0.11 μm  | 30  0.27 μm  | 40  0.11 μm  | 50  0.12 μm

Note. The degrees of freedom associated with each CP are shown schematically in Figure 2.


3.3. Error Budget of Control Parameters

The error budget for the AOS is 79 mas, and is discussed in detail in Angeli et al. (2014). An uncertainty in a single CP with the value listed in Table 1 would result in an image spread of 79 mas. If the errors in all 50 CPs were statistically independent, each CP could have an error of 15% of the value listed in Table 1 and the total image spread due to the uncertainties in all CPs would be 79 mas. As a result, the goal of this work is to demonstrate that each CP can be reconstructed with an uncertainty less than 15% of the values listed in Table 1.
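As a short check of the 15% figure, assuming the image spreads from independent CP errors add in quadrature and that the spread scales linearly with each CP error:

$\sqrt{50}\,\epsilon \times 79\ \mathrm{mas} = 79\ \mathrm{mas} \quad\Longrightarrow\quad \epsilon = \frac{1}{\sqrt{50}} \approx 0.14 \approx 15\%.$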

4. Data

The training and test images were generated using the wave front analysis code developed by Roodman et al. (2014) for the Dark Energy Camera. We also validated the method on Rubin wave fronts simulated with GalSim (Rowe et al. 2015).

For each set of random control parameters, the corresponding Zernike amplitudes were calculated using the process discussed in Section 3.1. To simulate a focal-plane offset of ±1 mm, we used defocus values (Z4) of ±40 μm. Such intra- and extra-focal images were produced for each of the four wave front sensors, for a total of eight images. The 224 × 224 images were generated at a pixel scale of 0.20'' per pixel, and then zero-padded and binned down by a factor of 4 × 4, as sketched below. This rebinning speeds up model training and evaluation, and does not substantially affect the results.
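A minimal sketch of the zero-padding and rebinning step follows; the padded size of 256 × 256 is an assumption (the text specifies only the 224 × 224 input and the 4 × 4 binning factor).

```python
import numpy as np

def pad_and_bin(img, pad_to=256, factor=4):
    """Zero-pad a square donut image to pad_to x pad_to, then bin by factor x factor.

    The padded size is an assumed value; summing over each factor x factor
    block preserves the total counts in the image.
    """
    pad = (pad_to - img.shape[0]) // 2
    padded = np.pad(img, pad, mode="constant")
    out = pad_to // factor
    return padded.reshape(out, factor, out, factor).sum(axis=(1, 3))

binned = pad_and_bin(np.random.poisson(1000.0, size=(224, 224)).astype(float))
print(binned.shape)   # (64, 64)
```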

A total of 37,000 sets of random control parameters were used to generate 37,000 sets of eight intra- and extra-focal images. These images are used to train and test the performance of the neural network. Examples of images generated for various control parameters are shown in Figures 7 and 8.

Figure 7.

Figure 7. Left: simulated image in Sensor 32 with no perturbation in the optical system (all CPs are zero). Middle: simulated image in Sensor 32 when CP08 = 1700 μm (the limit for this CP as shown in Table 1) and all other CPs are zero. Right: simulated image in Sensor 32 when CP10 = 140'' (five times the limit for this CP as shown in Table 1) and all other CPs are zero. The color scale in each image shows the number of electrons per pixel.


It is essential that the CNN be robust to varying amounts of pixel noise and to small shifts in donut position. We therefore augment the training data by adding noise and random shifts (uncorrelated among the eight donut images).

Each donut image is generated with 10^8 simulated counts (photoelectrons), resulting in a signal-to-noise ratio (S/N) of about 600 per pixel. We treat these images as having effectively infinite S/N and then add noise as follows. We model the variance of each pixel as the sum of the background counts and the image counts for each (rescaled) donut, and then add to each pixel a number drawn from a mean-zero Gaussian with that variance.

Typical sky background in r band for a 15 s exposure is between 800 (no moon) and 5000 (full moon) counts. The sky background for training images is drawn from a uniform distribution between these two values.
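A minimal sketch of this noise model (Gaussian scatter whose per-pixel variance is the sum of the sky background and the donut counts) is shown below; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_noise(donut, sky_min=800.0, sky_max=5000.0):
    """Add background and shot-noise-like scatter to a noiseless donut image.

    donut : 2D array of expected counts (photoelectrons) per pixel.
    The sky level is drawn uniformly between the dark-sky and full-moon
    values quoted above for a 15 s r-band exposure.
    """
    sky = rng.uniform(sky_min, sky_max)
    variance = sky + donut                        # background counts + image counts
    return donut + rng.normal(0.0, np.sqrt(variance))
```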

5. Method

5.1. Training Algorithm

We use ResNet18 (He et al. 2015) as the backbone network. 5 Our data set includes 37,000 sets of images (30,000 for training, 7000 for testing), and each set consists of eight wave front images. The model predicts the 50 physical parameters for each set of eight images, as illustrated in Figure 1. We compare different combinations of self-attention (SelfAttn), anti-aliasing (AntiAlias), PSF loss (PSF), and scaled loss (Scaled), as discussed below. We run the same experiments on three versions of the data: without noise or random shifts, with noise but without random shifts, and with both noise and random shifts.

In each case the model is trained for 250 epochs 6 with a batch size of 64. During one epoch the NN makes use of every training example once; after each epoch the model is evaluated on the test data set to assess performance.

We train the NN with Adam (Kingma & Ba 2014), a gradient-descent-based optimization method, with an initial learning rate of 0.01, and divide the learning rate by 10 at epochs 75, 150, and 200. The training time is on the order of 6 GPU hours. We use the mean absolute error (MAE), root-mean-squared error (RMSE), and coefficient of determination (R2) as error metrics.
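A minimal sketch of this optimization setup, assuming a PyTorch implementation; the network is stood in for by a small placeholder model and the data by random tensors, so only the optimizer, learning-rate schedule, and epoch count reflect the text.

```python
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(8 * 64 * 64, 50))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[75, 150, 200], gamma=0.1)   # divide the learning rate by 10
criterion = torch.nn.MSELoss()                         # stand-in for the scaled/PSF loss

x = torch.randn(64, 8, 64, 64)   # one batch of eight-channel donut images (dummy data)
y = torch.randn(64, 50)          # corresponding control parameters (dummy data)

for epoch in range(250):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```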

Figure 8.

Figure 8. Examples of simulated images from eight wave front sensors. The color scale in each image shows the number of electrons per pixel. A perfectly corrected wave front would be rotationally symmetric, with a uniform distribution of electrons in the donut.


5.2. Self-attention

Following Wang et al. (2018) and Xie et al. (2018), we use self-attention modules to denoise feature maps (Goodfellow et al. 2016); here "feature maps" refers to both the input images and the outputs of subsequent CNN layers. Self-attention is a nonlocal operation: it calculates the response at a position as a weighted sum of the features at all positions in the input feature map (Wang et al. 2018), which helps the network capture long-range dependences. As shown in Figure 9, self-attention can be formulated as

Equation (2)

$o_i = \frac{1}{C_i} \sum_{j} F(x_i, x_j)\, h(x_j),$

where x is the input feature, o is the output feature, i is the index of an output position, C_i is a normalization factor, h is a function usually formulated as a linear projection, and F is a function that models the similarity of x_i and x_j. Usually,

Equation (3)

$F(x_i, x_j) = e^{f(x_i)^{\top} g(x_j)},$

where f(x) = W_f x, g(x) = W_g x, and W_f, W_g are learned weight matrices.

Figure 9.

Figure 9. Self-attention module. Figure credit: Self-Attention GAN (Zhang et al. 2019).

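A minimal sketch of a self-attention module of this kind, following the formulation of Equations (2) and (3) and the SAGAN design of Zhang et al. (2019), assuming a PyTorch implementation; the channel-reduction factor and the learned residual weight are common choices rather than values taken from our code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Nonlocal self-attention over the spatial positions of a feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # f(x) = W_f x
        self.g = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # g(x) = W_g x
        self.h = nn.Conv2d(channels, channels, kernel_size=1)               # h(x), linear projection
        self.gamma = nn.Parameter(torch.zeros(1))                           # learned residual weight

    def forward(self, x):
        b, c, height, width = x.shape
        n = height * width
        f = self.f(x).view(b, -1, n)                                 # (B, C', N)
        g = self.g(x).view(b, -1, n)                                 # (B, C', N)
        h = self.h(x).view(b, c, n)                                  # (B, C, N)
        # Softmax over j plays the role of F(x_i, x_j) / C_i in Equations (2)-(3).
        attn = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=-1)    # (B, N, N)
        o = torch.bmm(h, attn.transpose(1, 2)).view(b, c, height, width)
        return self.gamma * o + x                                    # residual connection

sa = SelfAttention2d(channels=64)
print(sa(torch.randn(1, 64, 16, 16)).shape)   # torch.Size([1, 64, 16, 16])
```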

5.3. Anti-aliasing in Feature Maps

A common property of CNNs is that the output depends sensitively on the position of the object in the input image. For example, when a pooling step (Goodfellow et al. 2016) is done sparsely and then subsampled, a single-pixel shift can have an unacceptably large impact. For our application we would like translational equivariance, 7 so we employ anti-aliasing max pooling (Zhang 2019). At each convolutional layer we first densely evaluate max pooling on the feature maps, then use a box filter to blur the feature maps in each channel, and finally subsample the feature maps with the required stride (Figure 10).

Figure 10.

Figure 10. Anti-aliasing max pooling. This is conceptually similar to smoothing an image before down-sampling to maintain a well-sampled PSF. Anti-aliasing pooling makes the output of the CNN nearly shift invariant. Figure credit: Zhang (2019), "Making Convolutional Networks Shift-Invariant Again."

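A minimal sketch of anti-aliased max pooling (dense max pooling, a channel-wise box blur, then subsampling), assuming a PyTorch implementation; the 3 × 3 box-filter size is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AntiAliasedMaxPool2d(nn.Module):
    """Max pool densely (stride 1), blur with a box filter, then subsample."""

    def __init__(self, channels, stride=2, blur_size=3):
        super().__init__()
        self.stride = stride
        self.channels = channels
        self.pad = blur_size // 2
        box = torch.full((channels, 1, blur_size, blur_size), 1.0 / blur_size ** 2)
        self.register_buffer("box", box)          # fixed (non-learned) blur kernel

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)              # dense max pooling
        x = F.conv2d(x, self.box, stride=self.stride,
                     padding=self.pad, groups=self.channels)      # blur, then subsample
        return x

pool = AntiAliasedMaxPool2d(channels=64)
print(pool(torch.randn(1, 64, 64, 64)).shape)   # torch.Size([1, 64, 32, 32])
```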

5.4. Loss Function

We use y to denote the predicted control parameter vector and y* to denote the ground truth. The loss function is formulated as follows:

Equation (4)

$L(y, y^*) = \sum_{j} \alpha_j\, L_2(y_j, y_j^*) + \beta\, f(y, y^*),$

where α_j is a scaling factor calculated from the error tolerance of control parameter j, and β is a scaling factor that controls the relative weight of the two terms. We denote $\sum_j \alpha_j L_2(y_j, y_j^*)$ as the scaled L2 loss and f(y, y*) as the PSF loss. In our experiments, β = 10 gives the best results, and the results are robust to the choice of β within the range [0.1, 100].

We define the prediction error as the difference between the predicted and ground-truth values. The error tolerance is derived in Section 3. The scaled L2 loss function uses each parameter's error tolerance to scale its L2 loss. The PSF loss term calculates the rms of the gradients of the wave front (θ80; see Section 3.2), averaged over the pupil. Different prediction errors can correspond to the same θ80 and thus have the same PSF loss; therefore, the PSF loss term allows degenerate solutions for the control parameter predictions. The code is available in a GitHub repository at https://github.com/JunYinDM/VCRO and on Zenodo (Jun E. Yin 2021).
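A minimal sketch of Equation (4), assuming a PyTorch implementation; psf_term is a hypothetical stand-in for the θ80-based PSF loss, which in practice requires reconstructing the wave front from the predicted control parameters.

```python
import torch

def scaled_l2_psf_loss(y_pred, y_true, alpha, psf_term=None, beta=10.0):
    """Scaled L2 loss plus a PSF term, as in Equation (4).

    alpha    : (50,) tensor of weights derived from the Table 1 error tolerances.
    psf_term : hypothetical callable returning a theta_80-like PSF width per
               sample; None reduces this to the scaled L2 loss alone.
    """
    scaled_l2 = (alpha * (y_pred - y_true) ** 2).sum(dim=-1).mean()
    if psf_term is None:
        return scaled_l2
    return scaled_l2 + beta * psf_term(y_pred, y_true).mean()
```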

6. Results

6.1. Data without Noise and Shift

Table 2 shows the results of experiments on data without noise and shifts. Compared to the baseline ResNet18, the scaled loss boosts performance. Self-attention and anti-aliasing also help, but combinations of the different techniques do not yield fully consistent results.

Table 2. Experiments on Data without Noise and Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.3095   1.158    0.2235
ResNet18+Scaled                           0.3841   1.691    0.7235
ResNet18+Scaled+SelfAttn                  0.405    1.671    0.4687
ResNet18+Scaled+AntiAlias                 0.3961   1.738    0.6155
ResNet18+Scaled+SelfAttn+AntiAlias        0.8882   4.326    0.1721
ResNet18+Scaled+PSF+AntiAlias             0.2494   0.9725   0.9563
ResNet18+Scaled+PSF+SelfAttn              0.2828   1.031    0.9583
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.281    1.034    0.9558

Note. The ↓ symbol means smaller values correspond to better performance, and ↑ means higher values correspond to better performance. The bold values correspond to the best performance.


6.2. Data with Noise but without Random Shift

Table 3 shows the results of experiments on data with noise but without random shifts. Compared to the results in Table 2, self-attention works much better in terms of R2, supporting our expectation that self-attention modules can help the model tolerate noisy inputs, as shown in Xie et al. (2018). For MAE and RMSE, adding the PSF loss is the most effective.

Table 3. Experiments on Data with Noise but without Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.6901   3.221    0.1553
ResNet18+Scaled                           0.8903   4.328    0.2506
ResNet18+Scaled+SelfAttn                  0.9053   4.343    0.3083
ResNet18+Scaled+AntiAlias                 0.9062   4.341    0.2315
ResNet18+Scaled+SelfAttn+AntiAlias        0.8932   4.334    0.2097
ResNet18+Scaled+PSF+SelfAttn              0.6096   2.97     0.6163
ResNet18+Scaled+PSF+AntiAlias             0.5945   2.789    0.579
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.5833   2.83     0.5679


6.3. Data with Noise and Random Shift

Table 4 and Figure 12 show the results of experiments on data with noise and random shifts. Compared to the results in Tables 2 and 3, the networks with anti-aliasing are more robust to noise and random translations. In short, anti-aliasing improves the translation equivariance/invariance of the neural networks, and self-attention improves their robustness to noise.

Table 4. Experiments on Data with Noise and Shift

Model                                     MAE ↓    RMSE ↓   R2 ↑
ResNet18                                  0.6849   3.201    0.1426
ResNet18+Scaled                           0.9178   4.364    0.2026
ResNet18+Scaled+SelfAttn                  0.9197   4.357    0.2303
ResNet18+Scaled+AntiAlias                 0.9146   4.36     0.2316
ResNet18+Scaled+SelfAttn+AntiAlias        0.9087   4.346    0.2296
ResNet18+Scaled+PSF+SelfAttn              0.6399   2.994    0.5675
ResNet18+Scaled+PSF+AntiAlias             0.5976   2.829    0.5805
ResNet18+Scaled+PSF+SelfAttn+AntiAlias    0.6174   2.837    0.4885


7. Conclusion

Rubin has an AOS with 50 control parameters that must be derived from eight out-of-focus images (donuts) from the wave front sensors. The baseline approach derives Zernike coefficients from the donut images and then transforms these to control parameters via an "influence matrix." This head-on approach can be slow, and near-degeneracies in the influence matrix may require ad hoc regularization choices.

In this work, we use machine learning, constructing a CNN that maps the donut images to CPs. The CNN is trained on 37,000 sets of eight donut images generated for a random distribution of CPs. The CNN recovers CPs from the donuts within the required tolerance (Figure 11) and keeps the AOS contribution to the PSF below 0.079'', which is negligible when added (in quadrature) to a typical PSF FWHM of 0.65'' (Figure 12).

Figure 11.

Figure 11. Prediction RMSE divided by the error tolerance for each of the 50 control parameters. The fact that this ratio is less than one indicates that the RMSE of the neural-network predictions is within the error tolerances determined in Table 1.

Figure 12.

Figure 12. Selecting the 10% worst cases based on θ80, we construct the PSF for typical seeing of ∼0.65'', measure its FWHM, and find that the prediction error of the CPs makes only a small contribution to the FWHM.


Starting from a backbone model (ResNet18) we investigate the impact of four enhancements:

  • 1.  
    scaled L2 loss,
  • 2.  
    addition of a PSF term to the loss function,
  • 3.  
    anti-aliasing pooling, and
  • 4.  
    self-attention.

The scaled L2 loss function uses each parameter's error tolerance to scale its L2 loss. This is in contrast to the naive L2 loss, where the loss of every parameter is weighted equally, regardless of the required tolerance. Because some parameters have units of angle and some of distance, it is not meaningful to combine them without appropriate scaling. Using the scaled loss enhances the performance substantially (R2 increases going from row 1 to row 2 of Tables 2–4). The RMSE increases in some cases, not because the fit is worse, but because it is calculated without scaling by the tolerance, so a less important parameter may contribute a lot to the RMSE without contributing much to the (scaled) loss function.

The PSF loss function explicitly penalizes a poor PSF (large FWHM). There are many combinations of CPs that produce similar PSFs. We care less about achieving the exact CP values than about finding values that yield a good PSF. Adding a PSF term to the loss function enhances performance substantially (Tables 2–4).

Anti-aliasing pooling addresses the problem of shift equivariance. It may happen that a star is mis-centered (whether the star is simply found in the image, or drawn from an astrometric catalog of appropriate stars), and the resulting CPs must not depend on a shift of a few pixels. By including anti-aliasing pooling, and augmenting the training data to include randomly shifted donuts, the resulting model performance is insensitive to image shift (compare Tables 3 and 4).

Self-attention is a mechanism that links information from disparate parts of an image, and in some cases allows a neural network to be less sensitive to noisy inputs. Our experiments that included self-attention modules in the CNN led to modest changes in performance. In perhaps the most interesting case (noisy data with shifts, Table 4), the CNN performed slightly worse by all three performance metrics with self-attention than without, though it helped performance in some other cases. Overall, it is not clear that self-attention is a useful enhancement for this problem.

In summary, a straightforward CNN with anti-aliasing pooling and a problem-specific loss function performs well on simulated data, even in the presence of realistic noise and image shifts. As with most machine-learning problems, significant up-front computational expense is rewarded with fast evaluation. The training data require 4000 CPU hours to generate, and the CNN trains in 6 GPU hours; however, it evaluates in a few milliseconds. Indeed, this is the most notable performance difference from previously published approaches (Xin et al. 2015): both derive control parameters with adequate precision, but the neural-net evaluation is much faster. Rubin does not require control parameter updates on subsecond timescales, but future telescopes might benefit from this capability.

An operational system for Rubin would need to address the upstream problem of donut selection, tolerate overlaps of faint donuts, and deal with real-life issues like saturation, cosmic rays, and missing data. This proof of concept on simulated data makes it plausible that such a system could be built, and that neural networks can play a key role in keeping the Rubin images sharp.

It is a pleasure to acknowledge Chuck Claver, Sandrine Thomas, Aaron Roodman, and Bo Xin who provided insight into the VCRO optical system and helped with simulation software. Josh Meyers provided advice on the use of GalSim. Michelle Ntampaka provided feedback on the paper draft. Parts of this work were developed in a class at MIT taught by Phillip Isola.

J.E.Y. is partially supported by U.S. Department of Energy grant DE-SC0007881 and the Harvard Data Science Initiative. D.J.E. is supported by U.S. Department of Energy grant DE-SC0013718 and as a Simons Foundation Investigator. D.P.F. is partially supported by NSF grant AST-1614941. C.W.S. is supported by U.S. Department of Energy grant DE-SC0007881.

Footnotes

  • 4  

    Zernike polynomials are orthogonal on the unit disk, but not on the annulus corresponding to the Rubin pupil. Throughout this analysis we actually use Zernike-like polynomials that are orthogonal over the appropriate annulus.

  • 5  

    We also tested with AlexNet, ResNet34, ResNet50, and a custom-designed architecture, and obtained the best performance from ResNet18.

  • 6  

    Training proceeds via stochastic gradient descent, in which a gradient in NN parameter space is calculated for subsamples of the training data called batches. After each such gradient calculation, the NN parameters are updated by following the gradient by an amount proportional to the learning rate. A training epoch consists of many batches, and makes use of all training data. See Goodfellow et al. (2016) for a more detailed explanation.

  • 7  

    More precisely, we want translational invariance of the control parameters. Other parameters like the (x, y) position of the donut should be equivariant (i.e., varying smoothly and proportionally with the input shift), but those are nuisance parameters in our case.
