Introduction

Particle accelerators have enabled ground-breaking discoveries in the fields of chemistry1, biology2, and physics3,4. They are also increasingly deployed for societal applications, such as in the medical5 and industrial6 fields. During operation, accelerator parameters need to be tuned to produce beams with specific characteristics that match the needs of front-end applications. Measuring these beam properties as a function of one or more input parameters, using limited diagnostics and time-consuming measurements, is a necessary part of operations, experimental planning, and tolerance determination. This comes at the expense of reduced accelerator availability for experimenters. These challenges are shared by many scientific fields that seek to characterize complex, highly nonlinear, correlated systems using difficult-to-execute measurements and complicated diagnostics7,8.

Due to the complex and time-consuming nature of accelerator measurements, characterization of the beam response to input parameters is often limited to simple, uniformly spaced, grid-like parameter scans in one or two dimensions. This limitation results from the poor scaling of grid-like scans to higher-dimensional spaces, where the number of samples grows exponentially with the number of input parameters. Furthermore, despite their apparent simplicity, it is often difficult to determine the properties of a parametric scan that will result in successful and efficient sampling. A predefined grid spacing ultimately limits the ability to resolve fine features while potentially oversampling slowly varying regions of the measured quantity. As a result, specifying the scan parameters a priori requires prior knowledge of the measurement's functional dependence on each parameter. This slows down frequent routine studies and makes the characterization of novel or poorly understood systems difficult to execute successfully.

Tight constraints in input space, which determine whether measurements are viable, further complicate this process. Upper and lower input parameter limits are often determined by practical constraints of conducting measurements. For example, transverse beam size measurements on diagnostic screens are limited by the screen size (the available field of view), which, in turn, imposes limits on the strength of upstream focusing magnets. Simulation studies or extra measurements are needed beforehand to determine these limits. Even when simulations are available, they do not necessarily represent realistic machine behavior. Measurements, on the other hand, may become inaccurate due to time-dependent changes in the accelerator, further complicating this problem. Furthermore, while these limits can easily be determined experimentally for a single parameter, it becomes practically infeasible to determine them efficiently in higher-dimensional input spaces, as the limits are often correlated across multiple parameters. Limitations such as these are shared among many types of scientific experiments9.

Finally, it is desirable to prevent rapid changes in accelerator input parameters during operation. In some cases it is temporally expensive to change parameters, such as when mechanical actuators are used to adjust the phase of accelerating cavities. In other cases, fast feedback algorithms used in accelerator subsystems rely on adiabatic changes in external parameters to maintain system stability. Large jumps in parameter space can delay convergence of these feedback systems or, worse, cause them to fail entirely. Practical considerations such as these must be taken into account when automated sampling algorithms are used, striking a balance between the costs associated with changing input parameters and the expected information gained by candidate samples.

In this work, we construct an algorithm that replaces simple parametric scans for characterizing a target function of interest, while flexibly satisfying measurement requirements that may be poorly understood a priori. The algorithm is “turn-key”, requiring as little prior information about the target function and measurement constraints as possible. As a result, our algorithm reduces beamline tuning time during normal operations, while enabling efficient exploration of novel or poorly characterized systems. Our algorithm, coined “Constrained Proximal Bayesian Exploration” (CPBE), starts from an initial valid observation and sequentially samples points in input space that maximize the information gain about the target function at each step. It dynamically adapts its sampling strategy to the measured functional behavior with respect to each parameter and respects the constraints necessary for successful measurements. Finally, the algorithm is biased towards making small jumps in input space, balancing the trade-off between exploration and the costs associated with changing input parameters. While our focus here is the application of this method to parametric accelerator exploration, the algorithm can be applied in any experimental scenario that requires parametric scans.

Results

Our algorithm is an adaptation of Bayesian optimization (BO)10,11, applied to maximizing information gain12, an approach often referred to as active learning or uncertainty sampling13,14,15,16. Bayesian optimization generally consists of two components. The first component of BO is a probabilistic surrogate model that predicts the probability distribution of the target function value f as a function of the input parameter vector x. Gaussian Processes (GPs)17 are a popular model choice for Bayesian optimization. The behavior of a GP model is determined by a kernel function, which describes the correlation between function values based on the separation of their locations in input space. The kernel function itself can depend on a collection of hyperparameters, which often includes a length scale hyperparameter. The length scale describes the effective smoothness of the model, where small and large values correspond to rapidly and slowly varying functional behavior, respectively. An independent length scale hyperparameter can be specified for each input parameter, in a process known as automatic relevance determination (ARD)18. These hyperparameters are determined by maximizing the marginal log likelihood of the GP model17, conditioned on experimental measurements, which balances model accuracy and complexity.
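As an illustrative sketch (our own, not the authors' implementation), the snippet below fits a GP with an independent length scale per input dimension using scikit-learn; the training data are synthetic stand-ins for measured (input, target) pairs, and the fitted length scales play the role of the ARD hyperparameters discussed above.

```python
# Minimal sketch: fit a GP with per-dimension (ARD) length scales.
# The data below are synthetic placeholders, not accelerator measurements.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))                 # normalized inputs, e.g. (x1, x2)
y = np.sin(8.0 * X[:, 0]) + 0.1 * X[:, 1]     # stand-in target: fast along x1, slow along x2

# One length scale per input dimension (automatic relevance determination).
kernel = ConstantKernel() * Matern(length_scale=[1.0, 1.0], nu=1.5) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)                                  # maximizes the log marginal likelihood

# Fitted length scales: short values flag rapidly varying directions.
print(gp.kernel_.k1.k2.length_scale)
```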

The second component of BO is an acquisition function, which characterizes the value gained by observing a particular point in input space. Bayesian optimization selects the next experimental sample by finding the point in input space which maximizes the acquisition function. Several different acquisition functions that are commonly used for optimization are Probability of Improvement19, Expected Improvement20, and Upper Confidence Bound (UCB)12. The UCB acquisition function is defined as

$$\alpha(\mathbf{x}) = \mu(\mathbf{x}) + \sqrt{\beta}\,\sigma(\mathbf{x})$$
(1)

where μ(x) is the predicted mean of the function value and σ(x) is the predicted uncertainty, both determined by the GP model. This acquisition function is of particular interest as it allows the user to specify the optimization parameter β, which represents the trade-off between exploitation (sampling to take advantage of predicted extrema) and exploration (sampling to reduce prediction uncertainty). An optimizer with a small value of β prefers exploitation, while a large value of β prioritizes exploration. It has been shown that maximum information gain of the GP model occurs when the acquisition function consists only of the uncertainty term, α(x) = σ(x)12, effectively choosing β → ∞. The uncertainty predicted by the GP model depends only on the locations of previously sampled points in input space (see Eq. (7) in the Methods section for details).

Figure 1 shows the effect of kernel length scales on Bayesian optimization when we use this exploratory acquisition function. If the GP kernel has a single length scale, then the acquisition function is maximized at points that are the largest distance away from all previous observations. The sampling algorithm will choose points that form a grid-like pattern in input space, dependent on where initial samples were taken. However, if the length scale is different along each input axis, the exploration algorithm will be biased towards sampling points along the axis associated with the shortest length scale. This leads to an efficient sampling strategy, as more samples are needed to resolve rapidly changing features of the function, while fewer samples are used when the function changes slowly. We use this behavior as a starting point for creating the CPBE acquisition function.
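A minimal sketch of this exploration loop is shown below (our own illustration, not the authors' code): the GP is refit after each observation and the next point is chosen by maximizing the predicted uncertainty over a set of random candidates. Here `gp` is a scikit-learn GaussianProcessRegressor (e.g. with the ARD kernel above) and `measure` stands for a user-supplied, hypothetical measurement routine.

```python
# Sketch of pure-exploration sampling with alpha(x) = sigma(x).
# `bounds` is a (dim, 2) array of lower/upper parameter limits.
import numpy as np

def explore(gp, X_obs, y_obs, bounds, measure, n_steps=20, n_candidates=2048, seed=1):
    rng = np.random.default_rng(seed)
    X_obs, y_obs = list(X_obs), list(y_obs)
    for _ in range(n_steps):
        gp.fit(np.array(X_obs), np.array(y_obs))        # re-learn hyperparameters each step
        cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                           size=(n_candidates, bounds.shape[0]))
        _, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(sigma)]                  # maximize predicted uncertainty
        X_obs.append(x_next)
        y_obs.append(measure(x_next))                    # hypothetical measurement call
    return np.array(X_obs), np.array(y_obs)
```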

Fig. 1: Plots showing Bayesian optimization (BO) sampling patterns depending on radial basis function (RBF) kernel length scales and using α(x) = σ(x) as the acquisition function.

Blue points are initial samples and orange points are sampled during Bayesian optimization. a Length scales for both [x1, x2] set to 1. b Length scales for variables [x1, x2] set to [0.25, 1].

We define the CPBE acquisition function to be

$$\alpha(\mathbf{x}, \mathbf{x}_0) = \sigma(\mathbf{x})\,\Psi(\mathbf{x}, \mathbf{x}_0)\prod_{i=1}^{N} P_i\left[g_i(\mathbf{x}) \ge h_i\right]$$
(2)
$$\Psi(\mathbf{x}, \mathbf{x}_0) = \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{x}_0)\right)$$
(3)

where the two additional factors have the following effects.

The factor Ψ(x, x0) in Eq. (2) is a proximal biasing factor, where x0 is the most recently sampled point in input space. The covariance matrix Σ specifies the length scale over which points are biased; smaller matrix elements lead to stronger biasing towards the most recently observed point. By including the proximal term, we bias the acquisition function towards points in close proximity to the most recently observed point, while still allowing exploration in cases where large jumps in input space would result in highly valued observations. This factor is included for two reasons. First, changing parameters in a real accelerator generally incurs a temporal cost, proportional to the change in the parameter. Second, changing accelerator parameters quickly can disrupt fast feedback systems used to maintain supporting accelerator subsystems (such as those which stabilize the phase and amplitude of radio-frequency fields in accelerating cavities).

The last factor, \(\prod_{i=1}^{N} P_i[g_i(\mathbf{x}) \ge h_i]\), represents the product of the probabilities that each of N operational constraints is satisfied, following21. This factor weights the CPBE acquisition function by the probability that each constraining function gi(x) is greater than or equal to a predetermined scalar value hi. As a result, it biases our acquisition function against sampling regions of input space that are unlikely to satisfy the constraints. The probability is calculated from the GP model predictions μi(x) and σi(x) of the individual constraining functions gi(x), giving

$$P_i\left[g_i(\mathbf{x}) \ge h_i\right] = 1 - \Phi\left(\frac{h_i - \mu_i(\mathbf{x})}{\sigma_i(\mathbf{x})}\right)$$
(4)

where Φ(x) is the Gaussian cumulative distribution function.
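The full acquisition function of Eqs. (2)-(4) translates directly into the following sketch (ours, not the authors' code), shown here for a single constraint (N = 1); additional constraints would simply multiply further probability factors. It assumes scikit-learn-style GP models gp_f (target) and gp_con (constraint) exposing predict(..., return_std=True); the names and interfaces are our own.

```python
# Sketch of the CPBE acquisition function, Eqs. (2)-(4).
import numpy as np
from scipy.stats import norm

def cpbe_acquisition(x, x_last, gp_f, gp_con, Sigma, h):
    x = np.atleast_2d(x)
    _, sigma_f = gp_f.predict(x, return_std=True)            # sigma(x) in Eq. (2)
    d = x - x_last
    proximal = np.exp(-0.5 * np.einsum("ij,jk,ik->i",
                                       d, np.linalg.inv(Sigma), d))  # Psi(x, x0), Eq. (3)
    mu_c, sigma_c = gp_con.predict(x, return_std=True)
    p_valid = 1.0 - norm.cdf((h - mu_c) / sigma_c)            # P[g(x) >= h], Eq. (4)
    return sigma_f * proximal * p_valid
```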

Experimental demonstration

We conducted an experiment at the Argonne Wakefield Accelerator (AWA) to demonstrate the application of Bayesian exploration to beamline characterization with a single-shot emittance measurement. The AWA beamline accelerates electrons to ~42 MeV using a photoinjector electron source combined with a normal-conducting linear accelerator22. The transverse phase-space area, or “emittance”, of beams produced by the accelerator is an important figure of merit that must be minimized to ensure ideal transport of the beam through the accelerator and to meet specific experimental criteria. The emittance ultimately sets the beam brightness, a critical parameter in accelerator-based light sources23 and colliders24. The emittance is sensitive to several beamline parameters summarized in Fig. 2, including the magnetic field strength of a pair of solenoidal lenses surrounding the photoinjector (referred to here as the “focusing solenoid” and characterized by the scaling parameter K0) and a solenoid between the photoinjector and the first accelerating cavity (referred to here as the “matching solenoid” and controlled via the scaling parameter K1). Our goal is to explore the emittance response to these solenoids, as well as to two quadrupole magnets located downstream of the accelerating structures (DQ4, DQ5). We use a single-shot multislit diagnostic25, also shown in Fig. 2, to measure the beam emittance. The principal challenge of this measurement is that the beamline elements, while affecting the beam emittance, also modify the beam size and divergence at the diagnostic. Unfortunately, emittance measurements with this multislit diagnostic have a finite dynamic range, limited to a narrow range of beam sizes and divergences. Such a limited dynamic range is a common problem shared by many types of diagnostics across accelerator facilities.

Fig. 2: Cartoon depicting the emittance exploration experiment at the Argonne Wakefield Accelerator facility.

A conceptual view of the emittance diagnostic is shown on the right. The beam travels from left to right.

As a baseline for comparison, we conducted a two-dimensional scan of the solenoid parameters (K0, K1). A 10 × 10 point grid, shown in Fig. 3, was created with upper and lower bounds determined from prior experimentation. For each set of parameters, five observations of the beam emittance were made while windowing the fluctuating bunch charge between 5.0 and 6.0 nC. In some cases, a point in input space produces both valid and invalid measurements due to accelerator noise. We use these observations to train GP models of the vertical beam emittance εy and of a single constraining function g(x) that models the measurement validity (see Methods section), set to one for a successful measurement and zero otherwise.

Fig. 3: Comparison between uniform grid sampling and CPBE.

a A 10 × 10 grid is sampled over an operational subdomain identified by prior experimentation. The color mesh represents the posterior predicted mean of a Gaussian process trained on the valid measurements. Stacked histograms show the sampling density of measurements along each axis. b 2D projected samples from Bayesian exploration after 66 iterations with 4 parameters (K0, K1, DQ4, DQ5). In both plots, the shaded region with a dashed outline denotes the invalid region of the subdomain at 50% confidence, calculated from the uniform grid samples.

From Fig. 3(a), we observe that the invalid region of input space (denoted by shading), where P[g(x) ≥ 0.5] falls below 50%, covers roughly half the input domain. As a result, approximately half of our samples are wasted, as they provide no information about the beam emittance. We observe that in the valid region the emittance depends strongly on K1 relative to K0. If we normalize the input domain to the unit cube, the length scales of the GP kernel for these parameters are as listed in Table 1. The length scales inferred from the data are consistent with what is seen in Fig. 3, where the emittance changes slowly as a function of the focusing solenoid strength (K0) and quickly with respect to the matching solenoid strength (K1).

Table 1 Trained hyperparameter length scales (normalized).

We then used CPBE to explore a similar input space, varying the two solenoid-magnet strengths (K0, K1) and two quadrupole-magnet strengths (DQ4, DQ5). We initialized the algorithm with two valid measurements, randomly generated from within the valid subdomain determined from earlier experimentation. The covariance matrix for the proximal factor (given in normalized coordinate space) is set to Σ = 0.01I, where I is the identity matrix. This was chosen to reduce the acquisition function by 1/e over roughly 10% of the input domain, which was found to work well in practice. As in the 2D uniform scan, we use a single constraining inequality g(x) ≥ 0.5, where the constraint function g(x) = 1 if the emittance measurement is valid and zero otherwise.
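For illustration, the corresponding setup in normalized coordinates might look like the following sketch, reusing the cpbe_acquisition function sketched in the Results section above; all names and the outer loop are our own.

```python
# Hypothetical configuration mirroring the experiment described above.
import numpy as np

dim = 4                                   # (K0, K1, DQ4, DQ5), normalized to the unit cube
Sigma = 0.01 * np.eye(dim)                # proximal biasing covariance, Eq. (3)
h = 0.5                                   # constraint threshold: require g(x) >= 0.5
bounds = np.tile([0.0, 1.0], (dim, 1))    # normalized parameter limits

# At each iteration: refit gp_f (emittance) and gp_con (validity) on all data so far,
# evaluate cpbe_acquisition(x, x_last, gp_f, gp_con, Sigma, h) over candidates within
# `bounds`, move the machine to the maximizer, and record the new measurement.
```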

The results from this exploration, projected onto the 2D subspace where DQ4 = 0 and DQ5 = 0 (focusing magnets off), appear in Fig. 3(b). We again plot the posterior predictive mean of the emittance GP model in this subspace, trained on the sampled data. Estimated length scales from this model are shown in Table 1. As in the uniform grid model, the length scale for K0 is significantly longer than the length scale for K1, implying that the function varies faster along K1, consistent with what is observed in Fig. 3. Further, the length scales for K0 and K1 are similar when comparing the two models. The two models' K1 length scales differ slightly more, which could be explained by the larger average separation between points during grid sampling relative to CPBE sampling: the spacing of uniform samples limits the fastest variation of the target function that can be resolved, and thus the smallest length scale that can be fitted. Finally, the length scales for DQ4 and DQ5 are larger than the normalized domain, implying that the emittance is only weakly correlated with quadrupole-magnet strength, as expected from first-order beam dynamics26. We also observe that the length scale for DQ5 is significantly larger than that for DQ4, possibly because DQ5 is located closer to the emittance diagnostic, which further reduces the impact of potential higher-order effects of the magnet on the emittance.

From the projected histograms of K0 samples, we see that several regions of parameter space are avoided by the algorithm. This is due to the long length scale associated with K0, which keeps model uncertainty low between previous measurements. The algorithm skips sampling these regions, as observations there would not significantly reduce the uncertainty of our model. On the other hand, samples along the K1 axis, which has a much shorter length scale, are continuously distributed in the valid region. This results in a better characterization of the functional dependence on K1, which shows rapidly changing behavior.

As a result, our sampling algorithm in 4D space significantly speeds up target function characterization relative to a hypothetical grid parameter scan. Extending the 10-point-per-axis grid scan to the 4D case would require the number of samples to grow exponentially, to 10^4 samples. However, after only 66 samples taken by CPBE, we were able to produce a qualitatively similar model of the target function when compared to the previous 2D scan, representing a roughly 150× reduction in the number of samples. Furthermore, a larger portion of the samples taken by the exploration algorithm are valid compared to uniform grid sampling (77% vs. 52%), providing more information about the emittance per sample on average. Note that invalid samples which appear inside the valid region shown in Fig. 3(b) are invalid due to correlations with the unplotted variables DQ4 and DQ5.

The difference in sampling behavior is also observed in Fig. 4 where we plot the normalized trace of each parameter value during exploration. As a result of the short length scale associated with K1, the algorithm effectively scans back and forth across the valid input region for this parameter while other parameters are slowly modified. The maximum and minimum values of the scan are dynamically determined when the algorithm encounters an invalid region. This is important, as we clearly see that the valid region for K1 changes as a function of the other variables. Finally, we observe that for variables with long length scales (K0, DQ4, DQ5), CPBE ignores intermediate regions that do not significantly reduce model uncertainty, reducing redundant measurements as it learns the associated length scale of the target function along these axes.

Fig. 4: Tracing exploration path through input parameter space.

Normalized parameter values as a function of sample index during Bayesian exploration for all four input parameters. Corresponding sample density histograms are shown on the right.

Discussion

Here we have described and demonstrated an algorithm for autonomously and efficiently exploring input parameter spaces, requiring little prior information about the target function or measurement constraints and adapting to measurements that are potentially invalid or unsuccessful. We experimentally demonstrated our algorithm's ability to automatically and efficiently explore the functional dependence of the beam emittance on four accelerator parameters, while navigating a measurement that is practically difficult to execute. The Bayesian exploration algorithm achieved model prediction accuracy similar to a grid-based scan with significantly fewer samples, despite handling two extra input parameters. The advantages this algorithm confers over traditional methods of characterizing particle accelerator responses are substantial, and we expect it to have broad impact across accelerator-based science facilities. While we demonstrated this algorithm's effectiveness in the context of particle accelerators, CPBE is a flexible, lightweight, turn-key algorithm that can replace grid-type parameter scans in any field. In particular, the addition of a proximal biasing factor in our sampling strategy has advantages over previously described active learning algorithms when exploring spatially dependent systems (robotics, surveying, etc.) that incur costs associated with travel through input space.

While our demonstration here involved a relatively low-dimensional input space, a variety of methods can be used to scale this algorithm to higher-dimensional parameter spaces. The main challenge facing this method is the computational cost of training and making predictions with the Bayesian model when the number of training points is large17, as is often the case when operating in high-dimensional input spaces. While the exact-inference implementation of our algorithm is limited to lower-dimensional spaces, several methods have been proposed to overcome this limitation. Of particular interest is the LOVE algorithm27, which can compute the covariances used in our algorithm up to 2000 times faster than existing methods without sacrificing accuracy. If approximate modeling can be tolerated, stochastic methods can be used with variational models that use a limited number of inducing points to approximate the posterior model distribution28. Finally, if samples can be evaluated in parallel, as in simulated exploration contexts, batched methods could be used to generate multiple samples at each optimization step, similar to the acquisition functions implemented in ref. 29.

Methods

Beam emittance diagnostic

The geometric emittance of a beam can be determined experimentally using a multislit emittance diagnostic. This diagnostic consists of a transverse mask with a number of horizontal slits that divides the beam into multiple “beamlets”. A downstream transverse diagnostic screen is used to image the beamlets, and the center of mass, size, and integrated intensity of each beamlet are measured. This information is used to calculate aspects of the beam envelope, and thus the geometric beam emittance. To simplify our analysis, we used a slight modification of the emittance formulas derived in ref. 25: instead of calculating the emittance using the correlated divergence, we used the uncorrelated divergence. While this prevents us from assessing the position-divergence correlation of the beam, it still provides an accurate emittance measurement.
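As a rough illustration of the uncorrelated-divergence approach (a simplified stand-in under our own assumptions, not the exact formulas of ref. 25), such an estimate can be assembled from the beamlet data as sketched below; all variable names are ours.

```python
# Illustrative multislit emittance estimate using only the uncorrelated divergence.
# y_slit: vertical slit positions [m]; sigma_beamlet: rms beamlet sizes at the screen [m];
# n: integrated beamlet intensities; L: mask-to-screen drift length [m].
import numpy as np

def emittance_uncorrelated(y_slit, sigma_beamlet, n, L):
    w = n / n.sum()                                  # intensity weights
    y_mean = np.sum(w * y_slit)
    y2 = np.sum(w * (y_slit - y_mean) ** 2)          # beam size squared at the mask
    yp2 = np.sum(w * (sigma_beamlet / L) ** 2)       # uncorrelated divergence squared
    return np.sqrt(y2 * yp2)                         # geometric emittance estimate [m rad]
```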

Our experiment used a laser-etched stainless-steel mask with 25 slits. The slits had a separation of 2 mm, a vertical width of 50 μm, and a horizontal length of 40 mm. The circular beam imaging screen, which had a diameter of 50 mm, was located 2.84 m downstream of the transverse mask. The screen was imaged using an optical camera with a spatial resolution of 46 μm per pixel.

Emittance measurements were only considered valid if the measurement satisfied three conditions: we required that (i) at least five beamlets were produced at the observation screen, (ii) all the beamlets were contained within a predefined region of interest on the screen to prevent biasing due to potential clipping, and (iii) the projection of each beamlet onto the vertical axis did not overlap with any other beamlet projections, in order to properly measure the size of each beamlet. If any of these requirements were not met, we assigned the constraining function at that location in input space a value of zero, tagging the measurement as “invalid”.
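A sketch of the resulting validity flag, used as the constraining function g(x), might look as follows (our own illustration; `beamlets` is assumed to carry the vertical extent of each imaged beamlet and `roi` the allowed vertical range on the screen).

```python
# Sketch of the measurement-validity flag implementing conditions (i)-(iii).
def measurement_is_valid(beamlets, roi, min_beamlets=5):
    # (i) enough beamlets observed on the screen
    if len(beamlets) < min_beamlets:
        return 0.0
    # (ii) every beamlet fully contained in the region of interest (no clipping)
    if not all(roi[0] <= b.y_min and b.y_max <= roi[1] for b in beamlets):
        return 0.0
    # (iii) vertical projections of neighboring beamlets must not overlap
    spans = sorted((b.y_min, b.y_max) for b in beamlets)
    for (_, upper), (lower, _) in zip(spans, spans[1:]):
        if lower < upper:
            return 0.0
    return 1.0
```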

Gaussian process model creation

Nonparametric Gaussian process surrogate models are used to predict the value of a target function f(x) using Bayesian statistics17. These models are specified by a covariance function \(k(\mathbf{x}, \mathbf{x}'; \boldsymbol{\phi})\) with hyperparameters ϕ and a constant mean function C, such that we can write \(f(\mathbf{x}) \sim \mathcal{GP}(C, k(\mathbf{x}, \mathbf{x}'))\). In an experimental setting, an observation y is the target function corrupted by noise: y = f(x) + ϵ, where we assume that \(\epsilon \sim \mathcal{N}(0, \sigma_{\mathrm{noise}}^2)\). Given N previous measurements \(\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\), the predictive probability distribution of the function value f = f(x) is given by

$$p(f \mid \mathcal{D}, \mathbf{x}) = \mathcal{N}\left(\mu(\mathbf{x}), \sigma^2(\mathbf{x})\right)$$
(5)

where

$$\mu(\mathbf{x}) = \mathbf{k}^T\left[K + \sigma_{\mathrm{noise}}^2 I\right]^{-1}(\mathbf{y} - C) + C$$
(6)
$$\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T\left[K + \sigma_{\mathrm{noise}}^2 I\right]^{-1}\mathbf{k}$$
(7)
$$\mathbf{k} = \left[k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_N)\right]^T$$
(8)
$$K = \left[\begin{array}{ccc} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N)\\ \vdots & \ddots & \vdots\\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N)\end{array}\right].$$
(9)

The model hyperparameters θ = {ϕ, σnoise, C} are determined by maximizing the log marginal likelihood, \(\boldsymbol{\theta} = \mathrm{argmax}_{\boldsymbol{\theta}} \log p(\mathcal{D}; \boldsymbol{\theta})\), which balances model accuracy and complexity when choosing hyperparameters. A Matérn kernel (ν = 3/2) was used for the GP models in this paper. During experimentation the hyperparameters were retrained after each observation.
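For completeness, Eqs. (6)-(9) translate directly into code; the sketch below (ours, not the authors' implementation) evaluates the exact GP posterior for a Matérn (ν = 3/2) kernel, with the per-dimension length scales supplied as a NumPy array.

```python
# Direct transcription of Eqs. (6)-(9) for exact GP inference with a Matern 3/2 kernel.
import numpy as np

def matern32(x1, x2, lengthscale):
    # Scaled pairwise distances; lengthscale is a scalar or per-dimension array.
    r = np.sqrt((((x1[:, None, :] - x2[None, :, :]) / lengthscale) ** 2).sum(-1))
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def gp_posterior(x, X_train, y_train, lengthscale, noise, C=0.0):
    X_star = np.atleast_2d(x)
    K = matern32(X_train, X_train, lengthscale) + noise ** 2 * np.eye(len(X_train))  # Eq. (9) + noise
    k = matern32(X_train, X_star, lengthscale)                                       # Eq. (8)
    mu = k.T @ np.linalg.solve(K, y_train - C) + C                                   # Eq. (6)
    cov = matern32(X_star, X_star, lengthscale) - k.T @ np.linalg.solve(K, k)        # Eq. (7)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))                             # predictive std
```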