1 Overview

In high energy physics (HEP) experiments, detectors act as imaging devices, allowing physicists to take snapshots of final state particles from collision “events”. Calorimeters are key components of such detectors. When a high-energy primary particle travels through dense calorimeter material, it deposits its energy and produces a shower of secondary particles. Detector “cells” within the calorimeter then capture these energy depositions, forming a set of voxelized images which are characteristic of the type and energy of the primary particle.

The starting point of any physics analysis is the identification of the types of particles produced in each collision and the measurement of the momentum carried by each of these particles. These tasks have traditionally relied on manually designed algorithms that measure physical features such as the shower width and the rate of energy loss of particles traversing calorimeter layers. In the last few years, researchers have increasingly recognized that machine learning (ML) techniques are well suited to such tasks, e.g. using boosted decision trees (BDTs) on calculated features for particle identification and energy regression. Indeed, ML has long been applied to various tasks in HEP [1,2,3], but has recently seen much wider application [4,5,6,7,8,9], including the 2012 discovery of the Higgs boson [10, 11] at the ATLAS [12] and CMS [13] experiments at the Large Hadron Collider (LHC).

In the next decade, the planned High Luminosity Large Hadron Collider (HL-LHC) upgrade [14] will enhance the experimental sensitivity to rare phenomena by increasing the number of collected proton–proton collisions by a factor of ten. In addition, many next-generation detector components, such as the sampling calorimeters proposed for the ILC [15], CLIC [16], and CMS [17] detectors, will improve physicists’ ability to identify and measure particles by using much finer 3D arrays of voxels. These and future accelerator upgrades will lead to higher data volumes and pose a variety of technological and computational challenges in tasks such as real-time particle reconstruction.

In addition to actual collision data, physics analyses typically require extremely detailed and precise simulations of collisions, generated using software packages such as GEANT4 [18]. These simulated data are used to develop and test analysis techniques. The simulations involve the physics governing the interaction of particles with matter in the calorimeters, and are generally very CPU intensive. In some cases, such as the ATLAS experiment, simulation currently requires roughly half of the experiment’s computing resources [19]. This fraction is expected to increase significantly for the HL-LHC. These challenges call for novel computational and algorithmic techniques, prompting recent efforts in HEP to apply modern ML to calorimetry [20,21,22,23].

With this work, we aim to demonstrate the applicability of neural-network based approaches to reconstruction and simulation tasks, looking at a real use case. To do this, we use fully simulated calorimeter data for a typical collider detector to train two models: (i) a network for end-to-end particle reconstruction, receiving as input a calorimeter shower from a single isolated particle and acting both as a particle identification algorithm and as a regression algorithm for the particle’s energy; (ii) a generative adversarial network (GAN) [24] for simulating particle showers, designed to return calorimeter-cell voxelized images like those generated by GEANT4. Both models aim to preserve the accuracy of more traditional approaches while drastically reducing the required computing resources and time, thanks partly to a built-in portability to heterogeneous CPU+GPU computing environments.

This paper is a legacy document summarizing two years of work. It builds upon initial simulation, classification, and regression results which we presented at the 2017 Workshop on Deep Learning for Physical Sciences at the NeurIPS conference. Those results were derived using simplified problem formulations [25]. For instance, we only used particles of a single fixed energy for classification, and had only considered showers produced by particles traveling perpendicularly to the calorimeter surface. The results presented in this paper deal with a more realistic use case and supersede the results in Ref. [25].

For the studies presented in this paper, we used two computing resources: a cluster at the University of Texas at Arlington (UTA), and the Blue Waters supercomputer at the University of Illinois at Urbana-Champaign (UIUC). The UTA cluster has 10 NVIDIA GTX Titan GPUs with 6 GB of memory each. Blue Waters uses NVIDIA Kepler GPUs, also with 6 GB of memory each.

GAN models were implemented and trained using Keras [26] and TensorFlow [27]. Reconstruction models were implemented and trained using PyTorch [28]. The sample generation [29] and training [30] frameworks were both written in Python.

This document is structured as follows: In Sect. 2, we describe how we created and prepared the data used in these studies. Section 3 introduces the two physics problems, particle simulation and reconstruction. Sections 4 and 5 describe the corresponding models, how they were trained, and the performances they reached. In particular, Sect. 5 compares our results to those of more traditional approaches, and also extends those comparisons to simulated performances on detector geometries similar to those of the ATLAS and CMS calorimeters. Conclusions are given in Sect. 6.

2 Dataset

This study is based on simulated data produced with GEANT4 [18], using the geometric layout of the proposed Linear Collider Detector (LCD) for the CLIC accelerator [31]. We limit the study to the central region (barrel) of the LCD detector, where the electromagnetic calorimeter (ECAL) consists of a cylinder with inner radius of 1.5 m, structured as a set of 25 silicon sensor planes, segmented in \(5.1~\times ~5.1\) mm\(^2\) square cells, alternated with tungsten absorber planes. In the barrel region, the hadronic calorimeter (HCAL) sits behind the ECAL, at an inner radius of 1.7 m. The HCAL consists of 60 layers of polystyrene scintillators, segmented in cells with \(3~\times ~3\) cm\(^2\) area and alternated with layers of steel absorbers.

Fig. 1 3D image of a photon (left) and neutral pion (right) shower in ECAL (bottom) and HCAL (top)

The event simulation considers the full detector layout, including the material in front of the calorimeter and the effect of the solenoidal magnetic field. The inner tracker is included in simulation, which allows particles to interact before hitting the calorimeter, but in our studies we focus only on calorimeter data. From the full data for each event we take slices centered around the barycenter of each ECAL energy deposit and we represent the ECAL and HCAL slices as 3D arrays of energy deposits in the cells.

We consider four kinds of particles (electrons e, photons \(\gamma \), charged pions \(\pi \), and neutral pions \(\pi ^0\)) with energies uniformly distributed between 2 and 500 GeV, and with incident angles uniformly distributed in polar angle \(\theta \) between 1.047 and 2.094 radians with respect to the beam direction (equivalently, in pseudorapidity \(\eta \) between \(-0.549\) and 0.549).

We get the barycenter of a shower by taking the 2D projection of its energy deposit on the ECAL inner surface. This projection is taken along the z direction, which runs perpendicular to the calorimeter surface. Then, knowing the point of origin of the incoming particle, we use the barycenter to estimate the particle’s polar and azimuthal angles \(\theta \) and \(\phi \). The estimated pseudorapidity \(\eta \) is then computed as \(\eta =-\log [\tan \left( \frac{\theta }{2}\right) ]\). Each single-shower event is prepared by taking a slice of the ECAL in a window around the shower barycenter, as well as the corresponding HCAL slice behind. Depending on the task (generation or reconstruction), we take:

  • GEN dataset: A \(51 \times 51 \times 25\) cell window in the ECAL, for electrons in the energy range \(100-200\) GeV. Used in the shower generation task.

  • REC dataset: A \(25 \times 25 \times 25\) cell slice of the ECAL and a corresponding \(11 \times 11 \times 60\) cell slice of the HCAL, for \(e,~\gamma ,~\pi ,\) or \(\pi ^0\) in the energy range \(2-500\) GeV and with \(\eta \) from \(-0.524\) to 0.524. Used in the particle reconstruction task.
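
As an illustration of the barycenter-based angle estimation described above, the following minimal numpy sketch (our own, not the analysis code) computes the 2D barycenter of a shower and the angles of the line from the collision point to that barycenter; the mapping from cell indices to global coordinates is detector-specific and omitted here:

```python
import numpy as np

def shower_barycenter(ecal):
    """2D barycenter (in cell units) of an ECAL deposit projected along depth.

    ecal: 3D array of cell energies indexed as [x, y, z], with z the depth axis.
    """
    proj = ecal.sum(axis=2)                      # project onto the inner surface
    ix, iy = np.indices(proj.shape)
    total = proj.sum()
    return (proj * ix).sum() / total, (proj * iy).sum() / total

def angles_from_barycenter(x, y, z):
    """Polar angle, azimuth, and pseudorapidity of the line from the collision
    point (taken as the origin) to the barycenter position, given here in
    global Cartesian coordinates with z along the beam axis."""
    r = np.hypot(x, y)                           # transverse distance
    theta = np.arctan2(r, z)                     # polar angle w.r.t. the beam
    phi = np.arctan2(y, x)
    eta = -np.log(np.tan(theta / 2.0))
    return theta, phi, eta
```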

Examples of a photon shower and a neutral-pion shower can be seen in Fig. 1. The incoming particles enter from the bottom (\(z=0\)), at the center of the (xy) transverse plane (\(x=y=25\)). Both events have energies of around 35 GeV. Two sub-tracks are visible in the neutral-pion event, due to its decay into two photons.

The window size for the GEN dataset was chosen to contain as much of the shower information as practically possible. Motivated by the need to reduce the memory footprint of some of the classification models, we used a smaller window size for the REC dataset. When training classification models on these data, only a negligible accuracy increase was observed when moving to larger windows, as described in Appendix A.

We apply a task-dependent filtering of the REC dataset, in order to select the subset of examples for which the task at hand is not trivial. For instance, distinguishing a charged pion from an electron is generally an easy task, which can be accomplished with high accuracy by looking at the HCAL/ECAL energy ratio. On the other hand, a pion with a small HCAL/ECAL ratio leaves most of its energy in the ECAL due to charge conversion processes, and as such would be difficult to distinguish from an electron of equal momentum. Thus, we ignore charged-pion showers with a large HCAL/ECAL energy ratio. To be more specific, we see in Fig. 2 that the ratio of total HCAL energy to total ECAL energy is very different for electrons and charged pions, with the heavier charged pions tending to leave little energy in the ECAL. In order to make the particle-identification task more challenging, we only consider showers passing an HCAL/ECAL < 0.1 cut. The effect is shown in Fig. 3, which gives the fraction of events from 2 to 500 GeV that pass this selection. The selection favors mostly low-energy charged pions, which tend to leave less energy in the HCAL if they manage to make it through the ECAL at all. Discriminating accurately between electrons and charged pions in this range is thus crucial for physics analyses where we are interested in decay products with low energy.
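
A minimal sketch of this selection, assuming the per-event total ECAL and HCAL energies have already been summed into 1D numpy arrays (array names are illustrative):

```python
import numpy as np

def hcal_ecal_selection(ecal_total, hcal_total, max_ratio=0.1):
    """Boolean mask selecting events with HCAL/ECAL < max_ratio.

    ecal_total, hcal_total: per-event summed calorimeter energies (1D arrays).
    """
    ecal_total = np.asarray(ecal_total, dtype=float)
    hcal_total = np.asarray(hcal_total, dtype=float)
    ratio = np.full_like(ecal_total, np.inf)
    nonzero = ecal_total > 0                 # guard against events with no ECAL hits
    ratio[nonzero] = hcal_total[nonzero] / ecal_total[nonzero]
    return ratio < max_ratio
```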

Fig. 2 HCAL/ECAL energy ratios for electrons and charged pions, plotted on a log scale. The last bin is an overflow bin

Fig. 3 Fractions of electrons and charged pions that pass an HCAL/ECAL < 0.1 cut at various particle energies (top). Mean charged pion energy as a function of HCAL/ECAL energy ratio (bottom). If a pion makes it into the HCAL, we tend to see a positive relation between particle energy and the HCAL/ECAL ratio. About 1 out of 5000 events leaves no hits in the calorimeter window at all; these events form the bump in the HCAL/ECAL = 0 bin

Fig. 4 Opening angle distribution for neutral pions decaying into two photons, plotted on a log scale with an overflow bin. The plot is zoomed in to show opening angles < 0.01; the equivalent number of ECAL cells is shown on the top axis. This plot was generated using pions from the full 2–500 GeV energy range

Fig. 5 Fraction of neutral pions passing an opening angle < 0.01 radian selection at various particle energies (top). Mean neutral pion energy as a function of opening angle (bottom)

Photons and neutral pions are more difficult to distinguish. This is because neutral pions decay preferentially into two photons, with a branching ratio of almost \(99\%\). The Lorentz boost due to the motion of the pion collimates the photons, to the point where they are separated by only a small angle. If the pion has a low energy, the opening angle between the two photons is larger and the shower is easily identified as originating from a neutral pion. High-energy neutral pions produce more collimated photon pairs, which are more easily mistaken for a single high-energy photon. The opening-angle distribution for neutral pions is shown in Fig. 4. In order to limit the study to the most challenging case, we filter the neutral-pion dataset by requiring the opening angle between the two photons to be smaller than 0.01 radian. The effect of this requirement on the otherwise uniform energy distribution is shown in Fig. 5. As expected, the selection mostly removes low-energy neutral pions.
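
The opening-angle selection can be sketched as follows, assuming the generator-level 3-momenta of the two decay photons are available as (N, 3) arrays (a hypothetical layout):

```python
import numpy as np

def opening_angle(p1, p2):
    """Opening angle (radians) between pairs of photon 3-momenta.

    p1, p2: arrays of shape (N, 3) with Cartesian momentum components.
    """
    dot = np.einsum('ij,ij->i', p1, p2)
    norm = np.linalg.norm(p1, axis=1) * np.linalg.norm(p2, axis=1)
    # clip guards against round-off pushing the cosine outside [-1, 1]
    return np.arccos(np.clip(dot / norm, -1.0, 1.0))

# keep only the hardest-to-classify neutral pions:
# mask = opening_angle(photon1_p, photon2_p) < 0.01
```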

The ECAL and HCAL 3D arrays are passed directly to our neural networks. We also compute a set of expert features, as described in Ref. [25]. These features are used to train alternative benchmark algorithms (see Appendices C and D), representing currently used ML algorithms in HEP.

3 Benchmark tasks

In this section, we introduce the two benchmark tasks that we aim to solve with ML algorithms:

  • Particle reconstruction: starting from raw detector hits, determine the nature of a particle and its momentum.

  • Particle simulation: starting from the generator-level information of an incoming particle, generate the detector response (raw detector hits), using random numbers to model the stochastic nature of the process.

This paper extends previous ML investigations in ATLAS. Some prior classification studies on ATLAS data can be found in [32], and work involving the generation of electron showers at ATLAS can be found in [33, 34]. Since the CLIC datasets we use here are much more granular than ATLAS data, we were able to examine more complex neural architectures. Furthermore, we demonstrate the use of a single tool which performs multiple aspects of particle reconstruction simultaneously, starting simply from a calorimeter image.

3.1 Simulation

It is common in HEP to generate large amounts of detailed synthetic data from Monte Carlo simulations. This simulated data allows physicists to determine the expected outcome of a given experiment based on known physics. Having this prior expectation, one can reveal the presence of new phenomena by observing an otherwise inexplicable difference between real and simulated data. An accurate simulation of a detector response is a computationally heavy task, currently taking a significant fraction of the overall computing resources in a typical HEP analysis. Thus we also investigate the use of ML algorithms to speed up the event simulation process. In particular, we build a generative model to simulate detector showers, similar to those on which we train the end-to-end reconstruction algorithm. Such a generator could drastically reduce Monte Carlo simulation time, and turn event generation into an on-demand task.

In order to create realistic calorimetric shower data, we train a generative adversarial network (GAN) on the GEN dataset defined in Sect. 2. Due to training time constraints, we have restricted the current study to ECAL showers for incoming electrons with energy between 100 and 200 GeV. However, we have performed initial studies on an expanded energy range from 2 to 500 GeV, and will expand on these results in future publications. The task is to create a model that can take an electron’s energy and flight direction as inputs and generate a full ECAL shower, represented as a \(51 \times 51 \times 25\) array of energy deposits along the trajectory of the incoming electron. The advantage of using a GAN is that it is much faster and less computationally intensive than traditional Monte Carlo simulation, and the results may more accurately reproduce physical behavior if the GAN is trained on real data.

3.2 Reconstruction

At particle-collider experiments, data consist of sparse sets of hits recorded by various detector components at beam collision points. A typical analysis begins with a complex reconstruction algorithm that processes these raw data to produce a set of physics objects (jets, electrons, muons, etc.), which are then used further down the line. Traditionally, the reconstruction software consists of a set of rule-based algorithms that are designed based on physics knowledge of the specific problem at hand (e.g., the bending of particles in a solenoidal magnetic field, due to the Lorentz force). Over the past decade or so, machine-learning algorithms have been integrated into certain aspects of particle reconstruction. One example is the identification of electrons and photons via a BDT, taking as input for each event a set of high-level features quantifying the shape of the energy cluster deposited in a calorimeter shower [35].

Event reconstruction is a challenging task, and is a crucial part of any particle physics analysis. In order to improve reconstruction performance beyond conventional techniques, one could imagine using deep learning to extract information directly from calorimetric cell-level data, without first computing high-level features. Following this idea, we investigate here an end-to-end ML model based on computer vision techniques, treating the calorimeter input as a 3D image. Using a combined architecture, the model is designed to simultaneously perform particle identification and energy measurement.

When dealing with particle reconstruction, one is interested in identifying a particle’s type (electron, photon, etc.) and its momentum. An end-to-end application aiming to provide a full reconstruction of a given particle should thus be able to simultaneously solve a multi-class classification problem and a regression problem. In our study, we filter the REC dataset to make the classification task non-trivial, as described in Sect. 2. Since differentiating charged and uncharged particles is trivial, we judge the classification performance of our models on their ability to distinguish electrons from charged pions, and photons from neutral pions.

Our reconstruction networks were thus given the following three tasks:

  • Identify electrons over a background of charged pions: Charged pions are the most abundant particles produced in LHC collisions. They are typically located in jets, which are collimated sprays resulting from the showering and hadronization processes of quarks and gluons. On the other hand, electrons are rarely produced, and their presence is typically an indication of an interesting event occurring in the collision. A good electron identification algorithm should aim at misidentifying at most 1 in 10,000 pions as an electron. In order to increase the difficulty of our ML problem and to approach the kind of task that one faces at the LHC, we apply the HCAL/ECAL energy ratio cut as described in Sect. 2.

  • Identify photons over a background of neutral pions: At particle colliders, the main background to photon identification comes from neutral pions decaying to photon pairs. In general, a generic \(\gamma /\pi ^0\) classification task is relatively easy, since the presence of two nearby clusters is a clear signature of \(\pi ^0\). Thus, we focus on events with high \(\pi ^0\) momentum, using the opening angle selection described in Sect. 2.

  • Energy measurement: Once the particle is identified, it is very important to accurately determine its energy (and by extension its momentum), since this allows physicists to calculate all the relevant high-level features, such as the invariant mass of any new particles that decayed into the detected particles. In this study, we address this problem on the same dataset used for the classification tasks, restricting the focus to the range of energies from 2 to 500 GeV and to various incident angles (\(\eta \)). Regression results using various neural network architectures were compared with results from linear regression, comparing both resolution and bias. The models we consider are designed to return the full momentum (energy, \(\eta \), and \(\phi \)) of the incoming particle. At this stage, this functionality is not fully exploited and only the energy determination is considered. An extension of our work to include the determination of \(\eta \) and \(\phi \) could be the subject of future studies.

Fig. 6 3DGAN generator and discriminator network architectures

4 Generative model

Generative Adversarial Networks are composed of two networks, a discriminator and a generator. Our model, 3DGAN, implements an architecture inspired by the auxiliary classifier GAN [36]. The generator takes as input a specific particle type, flight direction, and energy, and generates the 3D image of an energy deposit using an auxiliary input vector of random quantities (latent vector). The output has the same format as the 3D array of ECAL hits in the GEN sample (see Sect. 2). The discriminator network receives as input an ECAL 3D array and classifies it as real (coming from the GEANT4-generated GEN dataset) or fake (produced by the generator).

Our initial 3DGAN prototype [25] successfully simulated detector outputs for electrons which were orthogonally incident to the calorimeter surface. In addition, the discriminator performed an auxiliary regression task on the input particle energy. This task was used to cross check the quality of the generation process.

In this study, we consider a more complex dataset, notably because of the variable incident angle of the incoming electron on the inner ECAL surface. To handle this additional complexity, we include more components in the loss function, related to the regression of the particle direction and to the pixel intensity distribution (energy deposition in cells). This is described in more detail below.

Before training our GAN, we pre-processed the GEN dataset by replacing each cell energy content E with \(E^\alpha \), where \(\alpha <1\) is a fixed hyperparameter. This pre-processing compensates for the large energy range (about 7 orders of magnitude) covered by individual cell energies, and mitigates some performance degradation we previously observed at low energies. After testing for different values of \(\alpha \), we observed optimal performance for \(\alpha =0.85\).
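
This is a simple element-wise power transform; a minimal numpy sketch (the inverse transform applied to generated showers is our assumption of how the output would be mapped back to the original energy scale):

```python
import numpy as np

ALPHA = 0.85

def preprocess(ecal):
    """Compress the dynamic range of cell energies before GAN training."""
    return np.power(ecal, ALPHA)

def postprocess(generated):
    """Map generated images back to the original energy scale."""
    return np.power(generated, 1.0 / ALPHA)
```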

4.1 GAN architecture

The 3DGAN architecture is based on 3-dimensional convolutional layers [37], as shown in Fig. 6. The generator takes as input a vector with the desired particle energy and angle, concatenated with a latent vector of 254 normally distributed random numbers. This input goes through a set of alternating upsampling and convolutional layers. The first convolutional layer has 64 filters with \(6 \times 6 \times 8\) kernels. The next two convolutional layers have 6 filters of \(5 \times 8 \times 8\) and \(3 \times 5 \times 8\) kernels, respectively. The last convolutional layer has a single filter with a \(2 \times 2 \times 2\) kernel. The first three layers are activated by leaky ReLU functions [38], while ReLU functions [39] are used for the last layer. Batch normalization [40] and upscaling layers are added after the first and second convolutional layers.
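
A simplified Keras sketch of a generator along these lines is shown below. The filter counts, kernel sizes, activations, and the placement of batch normalization and upsampling follow the description above, but the initial dense projection, padding choices, upsampling factors, and final cropping to \(51 \times 51 \times 25\) are our own assumptions, made only so that the layer sequence runs end to end; the actual 3DGAN layout differs in these details:

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 254

def build_generator():
    # conditioning inputs: desired primary-particle energy and incident angle
    cond_in = layers.Input(shape=(2,), name='energy_and_angle')
    noise_in = layers.Input(shape=(LATENT_DIM,), name='latent')
    x = layers.Concatenate()([cond_in, noise_in])              # 256 values

    # project to a small seed volume (13 x 13 x 13 x 8 is an assumption)
    x = layers.Dense(13 * 13 * 13 * 8)(x)
    x = layers.Reshape((13, 13, 13, 8))(x)

    x = layers.Conv3D(64, (6, 6, 8), padding='same')(x)        # 1st conv: 64 filters
    x = layers.LeakyReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling3D((2, 2, 1))(x)                      # -> 26 x 26 x 13

    x = layers.Conv3D(6, (5, 8, 8), padding='same')(x)         # 2nd conv: 6 filters
    x = layers.LeakyReLU()(x)
    x = layers.BatchNormalization()(x)
    x = layers.UpSampling3D((2, 2, 2))(x)                      # -> 52 x 52 x 26

    x = layers.Conv3D(6, (3, 5, 8), padding='same')(x)         # 3rd conv: 6 filters
    x = layers.LeakyReLU()(x)

    x = layers.Conv3D(1, (2, 2, 2), padding='same', activation='relu')(x)
    x = layers.Cropping3D(((1, 0), (1, 0), (1, 0)))(x)         # -> 51 x 51 x 25 x 1
    return keras.Model([cond_in, noise_in], x, name='generator')
```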

The discriminator takes as input a \(51 \times 51 \times 25\) array and consists of four 3D convolutional layers. The first layer has 16 filters with \(5 \times 6 \times 6\) kernels. The second, third, and fourth convolutional layers each have 8 filters with \(5 \times 6 \times 6\) kernels. Each convolutional layer is activated by a leaky ReLU function. Batch normalization and dropout [41] layers are added after the second, third, and fourth convolutional layers. The output of the final convolutional layer is flattened and connected to two output nodes: a classification node, activated by a sigmoid and returning the probability that a given input is real or fake; and a regression node, activated by a linear function and returning the input particle energy. The 3DGAN model is implemented in Keras [26] and TensorFlow [27].
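
A matching sketch of the discriminator; the padding scheme and the dropout rate are assumptions, and the additional quantities entering the loss (incident angle, total deposited energy, binned pixel intensities, see Sect. 4.2) are left out for brevity:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_discriminator():
    img_in = layers.Input(shape=(51, 51, 25, 1), name='ecal')

    x = layers.Conv3D(16, (5, 6, 6), padding='same')(img_in)   # 1st conv: 16 filters
    x = layers.LeakyReLU()(x)

    for _ in range(3):                                          # 2nd-4th conv layers
        x = layers.Conv3D(8, (5, 6, 6), padding='same')(x)
        x = layers.LeakyReLU()(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.2)(x)                              # rate is an assumption

    x = layers.Flatten()(x)
    real_fake = layers.Dense(1, activation='sigmoid', name='real_fake')(x)
    energy = layers.Dense(1, activation='linear', name='primary_energy')(x)
    return keras.Model(img_in, [real_fake, energy], name='discriminator')
```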

Aside from the architecture shown here, we also tested the use of a Wasserstein GAN [42], but found no practical advantage in terms of computational speed-up or training performance.

4.2 Training and results

The 3DGAN loss function

$$\begin{aligned} L_{Tot} = W_{G}L_{G} + W_{P}L_{P} + W_{A}L_{A} + W_{E}L_{E} + W_{B}L_{B} \end{aligned}$$
(1)

is built as a weighted sum of several terms: a binary cross entropy (\(L_{G}\)) on the real/fake probability returned by the discriminator; mean absolute percentage error (MAPE) terms related to the regression of the primary-particle energy (\(L_{P}\)), the total deposited energy (\(L_{E}\)), and the binned pixel intensity distribution (\(L_{B}\)); and a mean absolute error (MAE) for the incident-angle measurement (\(L_{A}\)). The binary cross entropy term, the percentage error terms, and the absolute error term are weighted by 3.0, 0.1, and 25, respectively. The weights W are tuned to balance the relative importance of each contribution. The predicted energy and incident angle provide feedback on the conditioning of the image. The binned pixel intensity distribution loss compares the counts in different bins of pixel intensities.
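
A sketch of how such a weighted sum could be assembled from standard TensorFlow loss primitives; the weights follow the values quoted above, while the way each target and prediction is obtained (in particular the binned pixel-intensity counts) is simplified:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
mape = tf.keras.losses.MeanAbsolutePercentageError()
mae = tf.keras.losses.MeanAbsoluteError()

# weights quoted in the text: 3.0 for the cross entropy, 0.1 for the
# percentage-error terms, 25 for the absolute-error (angle) term
W_G, W_P, W_E, W_B, W_A = 3.0, 0.1, 0.1, 0.1, 25.0

def total_loss(y, p):
    """y, p: dicts of target / predicted tensors for each loss component."""
    L_G = bce(y['real'], p['real'])                      # real/fake probability
    L_P = mape(y['primary_energy'], p['primary_energy'])
    L_E = mape(y['total_energy'], p['total_energy'])     # total deposited energy
    L_B = mape(y['pixel_bins'], p['pixel_bins'])         # binned pixel intensities
    L_A = mae(y['angle'], p['angle'])                    # incident angle
    return W_G * L_G + W_P * L_P + W_A * L_A + W_E * L_E + W_B * L_B
```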

The model training is done using the RMSprop [43] optimiser. We alternately train the discriminator on a batch of real images and a batch of generated images, applying label switching. We then train the generator while freezing the discriminator weights.
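
A minimal Keras-style sketch of this alternating scheme. Here `combined` is assumed to be the stacked model `discriminator(generator(...))` built after setting `discriminator.trainable = False`, both models are assumed to be compiled with losses for their two outputs, `real_cond` is assumed to hold the primary energy and angle per event, and the label-switching fraction is illustrative:

```python
import numpy as np

def train_step(generator, discriminator, combined, real_images, real_cond,
               latent_dim=254, switch_frac=0.1):
    """One alternating update: discriminator on real, then fake, then generator."""
    n = len(real_images)

    # discriminator on real images, with a fraction of labels switched
    y_real = np.ones((n, 1))
    y_real[np.random.rand(n) < switch_frac] = 0.0
    discriminator.train_on_batch(real_images, [y_real, real_cond[:, :1]])

    # discriminator on generated images
    noise = np.random.normal(size=(n, latent_dim))
    fake_images = generator.predict([real_cond, noise], verbose=0)
    discriminator.train_on_batch(fake_images, [np.zeros((n, 1)), real_cond[:, :1]])

    # generator update through `combined`, with discriminator weights frozen
    combined.train_on_batch([real_cond, noise],
                            [np.ones((n, 1)), real_cond[:, :1]])
```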

Figure 7 shows a few events from the GEN data set. The events were selected to cover both ends of the primary-particle energy and angle spectrum. Figure 8 presents the corresponding generated events with the same primary particle energy and angle as the GEN events in Fig. 7. Initial visual inspection shows no obvious difference between the original and GAN generated images. A detailed validation based on several energy-shape related features confirms these results. We discuss a few examples below.

Fig. 7 GEN sample: electrons with different primary particle energies and angles

Fig. 8 GAN-generated electrons with primary energies and angles corresponding to the electrons shown in Fig. 7

Fig. 9 GEANT4 vs. GAN comparison for sampling fraction, number of hits, and shower shapes along the x, y, z axes, for different angle bins with 100–200 GeV primary particle energies

The top row in Fig. 9 shows the ratio between the total energy deposited in the calorimeter and the primary particle energy, as a function of the primary particle energy (we refer to this as the “sampling fraction”), for different angle values. 3DGAN nicely reproduces the expected behaviour over the whole energy spectrum. The second row in Fig. 9 shows the number of hits above a \(3 \times 10^{-4}\) MeV threshold: the GAN prediction is slightly broader than the Monte Carlo, consistent with the slight overestimation in the shower shape distributions (Fig. 10). Figure 9 also shows the calorimeter shower shapes projected onto the x, y, and z axes. Here, z is the axis pointing into the calorimeter, perpendicular to its surface. The agreement is very good, and in particular 3DGAN is able to mimic the way the energy distribution changes with incident angle. Figure 10 shows some additional features aimed at describing the shape of the deposited-energy distribution. In particular, the second moments along the x, y, and z axes, shown in the first column, measure the width of the deposited-energy distribution along those axes. The second column shows the way the energy is deposited along the depth of the calorimeter, obtained by splitting the calorimeter into three parts along the longitudinal direction and measuring the ratio between the energy deposited in each third and the total deposited energy. Finally, the third column in Fig. 10 highlights the tails of the “energy shapes”. It can be seen that, while the core of the distribution is perfectly described by 3DGAN, the network tends to overestimate the amount of energy deposited at the edges of the volume. It should be noted, however, that energy depositions in those cells are very sparse.
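
The validation quantities discussed here are straightforward to compute from the 3D arrays; a minimal numpy sketch, assuming cell energies are stored in MeV and z is the depth axis:

```python
import numpy as np

def sampling_fraction(ecal, primary_energy):
    """Ratio of total deposited energy to the primary particle energy."""
    return ecal.sum() / primary_energy

def n_hits(ecal, threshold=3e-4):
    """Number of cells above threshold (3e-4 MeV, as quoted in the text)."""
    return int((ecal > threshold).sum())

def shower_shapes(ecal):
    """1D energy profiles along x, y, and z (z pointing into the calorimeter)."""
    return ecal.sum(axis=(1, 2)), ecal.sum(axis=(0, 2)), ecal.sum(axis=(0, 1))

def second_moment(profile):
    """Energy-weighted second moment (width) of a 1D profile, in cell units."""
    idx = np.arange(len(profile))
    mean = (profile * idx).sum() / profile.sum()
    return (profile * (idx - mean) ** 2).sum() / profile.sum()

def longitudinal_thirds(ecal):
    """Fraction of the total energy deposited in each third of the depth."""
    return [part.sum() / ecal.sum() for part in np.array_split(ecal, 3, axis=2)]
```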

Fig. 10 GEANT4 vs. GAN comparison for shower width (second moment) in x, y, z, the ratio of energy deposited in each longitudinal third of the calorimeter to the total energy, and shower shapes along the x, y, z axes in log scale, for 100–200 GeV primary particle energies and \(60^{\circ }\)–\(120^{\circ }\) \(\theta \)

The 3DGAN training takes around 1.5 h per epoch on a single NVIDIA GeForce GTX 1080 card, for 60 epochs. The simulation time is about 13 ms/particle on an Intel Xeon 8180 and drops to about 4 ms/particle on an NVIDIA GeForce GTX 1080. For comparison, GEANT4 simulation takes about 17 s per particle on an Intel Xeon 8180 (currently it is not possible to run a full GEANT4-based simulation on GPUs). Our GAN thus represents a potential simulation speedup of over 4000 times for this specific aspect of the event simulation.

When given as input to the particle reconstruction and regression model (see Sect. 5), the GAN-generated dataset produces the same output as the original GEANT4 sample, as described in Appendix B.

5 End-to-end particle reconstruction

This section describes the use of a deep neural network to accomplish an end-to-end particle reconstruction task. The model consists of a neural architecture which simultaneously performs both particle classification and energy regression. This combined network is trained using the ECAL and HCAL cell arrays as well as the total ECAL energy and total HCAL energy as inputs. The training loss function is written as the sum of a binary cross entropy for particle identification and a mean-square error loss for energy regression. Through experimentation, we found that multiplying the energy component of the loss function by a factor of 200 gave the best results, as it was easier to quickly achieve low loss values for energy regression.
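
A minimal PyTorch sketch of this combined objective, assuming the network returns two class scores and one energy value per event; the two-class cross entropy used here is equivalent to the binary cross entropy over softmax outputs described in the text, and the factor of 200 is the weight quoted above:

```python
import torch.nn.functional as F

ENERGY_WEIGHT = 200.0

def reco_loss(class_logits, energy_pred, class_target, energy_target):
    """Combined particle-ID / energy-regression loss.

    class_logits  : (N, 2) raw scores for the two-class problem
    energy_pred   : (N,)   predicted particle energy
    class_target  : (N,)   integer class labels
    energy_target : (N,)   true particle energy
    """
    id_loss = F.cross_entropy(class_logits, class_target)
    e_loss = F.mse_loss(energy_pred, energy_target)
    return id_loss + ENERGY_WEIGHT * e_loss
```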

We compare three different architectures for our reconstruction model, each trained using calorimeter cell-level information as inputs:

  • A dense (i.e., fully connected) neural network (DNN).

  • A 3D convolutional network (CNN).

  • A network based on GoogLeNet (GN) [44], using layers of inception modules.

In order to compare the model performance to a typical state-of-the-art particle reconstruction algorithm, we also consider the following alternatives:

  • A feature-based BDT (see Appendix C) for the classification task.

  • A linear regression for the regression task.

  • A BDT for the regression task (for more info on regression baselines see Appendix D).

In a previous study [25], we compared the classification accuracy obtained with a neural model taking the energy cells as input, a feature-based neural model, and a feature-based BDT. In that context, we demonstrated that feature-based BDTs and neural networks perform equally well, and are both capable of correctly classifying particles from a small set of calculated features. We do not consider feature-based neural networks in this paper, and use feature-based BDTs to represent the current state-of-the-art classification algorithms.

5.1 Deep network models

The three ML models take as input the ECAL and HCAL 3D energy arrays of the REC dataset (see Sect. 2), together with the total energies recorded in the ECAL and in the HCAL (i.e., the sums of the values stored in the 3D arrays), as well as the estimated \(\phi \) and \(\eta \) angles of the incoming particle, calculated from the collision origin and the barycenter of the event. The architecture of each model is defined with a number of floating parameters (e.g. the number of hidden layers), which are refined through a hyperparameter optimization, as described in Sect. 5.2. Each model returns three numbers. After applying a softmax activation, two of these outputs are interpreted as the classification probabilities of the current two-class problem. The third output is interpreted as the energy of the particle.

Here we describe the three model architectures in detail; a simplified sketch of the combined design, using the CNN variant as an example, is given after the list:

  • In the DNN model we first flatten our ECAL and HCAL inputs into 1D arrays. We then concatenate these arrays along with the total ECAL energy, total HCAL energy, estimated \(\phi \), and estimated \(\eta \), for an array of total size \(25 \times 25 \times 25 + 11 \times 11 \times 60 + 4 = 22889\) inputs. This array is fed to the first layer of the DNN, which is followed by a number of hidden layers, each with a ReLU activation function and a dropout layer. The number of neurons per hidden layer and the dropout probability are identical for each relevant layer. The number of hidden layers, the number of neurons per layer, and the dropout rate are hyperparameters, tuned as described in the next section. Finally, we take the output from the last dropout layer, append the total energies and estimated angles again, and feed the concatenated array into a final hidden layer, which produces a three-element output.

  • The CNN architecture consists of one 3D convolutional layer for each of the ECAL and HCAL inputs, each followed by a ReLU activation function and a max pooling layer of kernel size \(2 \times 2 \times 2\). The number of filters and the kernel size in the ECAL convolutional layer are treated as hyperparameters to be optimized (see the next section). The HCAL layer is fixed at 3 filters with a kernel size of \(2 \times 2 \times 6\). The two outputs are then flattened and concatenated along with the total ECAL and HCAL energies, as well as the estimated \(\phi \) and \(\eta \) coordinates of the incoming particle. The resulting 1D array is passed to a sequence of dense layers, each followed by a ReLU activation function and a dropout layer, as in the DNN model. The number of hidden layers and the number of neurons in each layer are treated as hyperparameters to be optimized. The output layer consists of three numbers, as for the DNN model. We found that adding convolutional layers beyond the first had little impact on performance. This may be because a single layer is already able to capture the important information about the localized shower structure, and reduces the dimensionality of the event enough that a densely connected network is able to do the rest.

  • The third model uses elements of the GoogLeNet [44] architecture. This network processes the ECAL input array with a 3D convolutional layer with 192 filters, a kernel size of 3 in all directions, and a stride of 1. The result is batch-normalized and passed through a ReLU activation function. This is followed by a series of inception and MaxPool [37] layers of various sizes, with the full architecture described in Appendix E. The output of this sequence is concatenated with the total ECAL energy, the total HCAL energy, and the estimated \(\phi \) and \(\eta \) coordinates, and passed to a series of dense layers as in the DNN architecture, to return the final three outputs. The number of neurons in the final dense hidden layer is the only architecture-related hyperparameter for the GN model. Due to practical limitations imposed by memory constraints, this model does not take the HCAL 3D array as input. This limitation has a small impact on model performance, since the ECAL array carries the majority of the relevant information for the problems at hand (see Appendix F).
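
The sketch below illustrates the combined design using the CNN variant as an example, in PyTorch. The ECAL/HCAL convolution settings and the default hidden-layer parameters follow the values quoted in the text and in Sect. 5.2, but the exact layer ordering, the placement of the skip connection, and other implementation details are simplified assumptions:

```python
import torch
import torch.nn as nn

class RecoCNN(nn.Module):
    """Simplified combined classification + energy-regression model."""

    def __init__(self, hidden=512, n_hidden=4, dropout=0.12):
        super().__init__()
        # one 3D conv + max-pool block per calorimeter
        self.ecal_conv = nn.Sequential(nn.Conv3d(1, 6, 6), nn.ReLU(), nn.MaxPool3d(2))
        self.hcal_conv = nn.Sequential(nn.Conv3d(1, 3, (2, 2, 6)), nn.ReLU(), nn.MaxPool3d(2))

        # flattened conv outputs for 25x25x25 ECAL / 11x11x60 HCAL windows,
        # plus 4 scalars (total ECAL energy, total HCAL energy, phi, eta)
        conv_out = 6 * 10 * 10 * 10 + 3 * 5 * 5 * 27
        blocks, width = [], conv_out + 4
        for _ in range(n_hidden):
            blocks += [nn.Linear(width, hidden), nn.ReLU(), nn.Dropout(dropout)]
            width = hidden
        self.dense = nn.Sequential(*blocks)

        # skip connection: the scalar inputs are appended again before the output
        self.out = nn.Linear(hidden + 4, 3)      # 2 class scores + 1 energy value

    def forward(self, ecal, hcal, scalars):
        x = torch.cat([self.ecal_conv(ecal).flatten(1),
                       self.hcal_conv(hcal).flatten(1),
                       scalars], dim=1)
        return self.out(torch.cat([self.dense(x), scalars], dim=1))
```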

Fig. 11 Selected hyperparameter scan results for DNN (top), CNN (center), and the GoogLeNet-based architecture (bottom). In each figure, the classification accuracy is displayed as a function of the hyperparameters reported on the two axes

In all models, the regression task is facilitated by using skip connections to directly append the input total ECAL and HCAL energies to the last layer. The impact of this architecture choice on regression performance is described in Appendix G. In addition to using total energies, we also tested the possibility of using 2D projections of the input energy arrays, summed along the z dimension (detector depth). This choice resulted in worse performance (see Appendix H) and was discarded.

5.2 Hyperparameter scans

In order to determine the best architectures for the end-to-end reconstruction models, we scanned over a hyperparameter space for each architecture. Learning rate and decay rate were additional hyperparameters for each architecture. For simplicity, we used classification accuracy for the \(\gamma \) vs. \(\pi ^0\) problem as a metric to determine the overall best hyperparameter set for each architecture. This is because a model optimized for this task was found to generate good results for the other three tasks as well, and because \(\gamma \) vs. \(\pi ^0\) classification was found to be the most difficult problem.

Training was performed at each hyperparameter point ten times, in order to obtain an estimate of the uncertainty associated with each quoted performance value. For each scan point, the DNN and CNN architectures trained on 400,000 events, using another sample of 400,000 events for testing. DNN and CNN scan points trained for three epochs each, taking about seven hours each. GN trained on 100,000 events and tested on another 100,000. Due to a higher training time, each GN scan point only trained for a single epoch, taking about twenty hours.

For CNN and DNN training, we used batches of 1000 events. However, due to GPU memory limitations, we could not do the same with GN. Instead, we split each batch into 100 minibatches of ten events each. A single minibatch was loaded on the GPU at a time, and gradients were added up after back-propagation. Network weights were updated only after each full batch had been processed, using the combined gradients.
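
A PyTorch sketch of this gradient-accumulation scheme; the model, optimizer, and loss function are assumed to exist with compatible signatures, and the gradient scaling by the minibatch fraction is our own choice so that the accumulated update matches a full-batch pass:

```python
def train_batch(model, optimizer, loss_fn, batch_inputs, batch_targets,
                minibatch_size=10):
    """Accumulate gradients over minibatches, then apply one weight update."""
    model.train()
    optimizer.zero_grad()
    n = len(batch_targets)
    for start in range(0, n, minibatch_size):
        sl = slice(start, start + minibatch_size)
        inputs = [x[sl].cuda() for x in batch_inputs]      # one minibatch on GPU
        targets = batch_targets[sl].cuda()
        loss = loss_fn(model(*inputs), targets)
        (loss * (targets.shape[0] / n)).backward()         # accumulate gradients
    optimizer.step()                                       # one update per batch
```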

The best settings were found to be as follows:

  • For DNN, 4 hidden layers, 512 neurons per hidden layer, a learning rate of 0.0002, decay rate of 0, and a dropout probability of 0.04.

  • For CNN, 4 hidden layers and 512 neurons per hidden layer, a learning rate of 0.0004, decay rate of 0, a dropout probability of 0.12, and 6 ECAL filters with a kernel size of \(6 \times 6 \times 6\).

  • For GN, 1024 neurons in the hidden layer, 0.0001 learning rate, and 0.01 decay rate.

Fig. 12 Training curves for the best DNN (top), CNN (middle), and GoogLeNet (bottom) hyperparameters, trained on variable-angle \(\gamma \)/\(\pi ^0\) samples. The DNN over-trains quickly and saturates at a relatively low accuracy, the CNN takes longer to over-train and reaches a higher accuracy, and GoogLeNet performs best of all. Every 400 batches correspond to a single epoch

Fig. 13 ROC curve comparisons for \(\gamma \) vs. \(\pi ^0\) (top) and e vs. \(\pi ^\pm \) (bottom) classification using DNN, CNN, BDT, and GoogLeNet (GN). Samples include particle energies from 2 to 500 GeV and an inclusive \(\eta \) range

The DNN, CNN, and GN-based models had 9823774 (\(\sim \)10M), 3003692 (\(\sim \)3M), and 14956286 (\(\sim \)15M) trainable parameters respectively after the hyperparameter scans.

Selected hyperparameter scan slices are shown in Fig. 11. These 2D scans were obtained by fixing all hyperparameters besides the two under consideration (i.e., those on the axes) at default values: a dropout rate of 0.08, a learning rate of 0.0004, a decay rate of 0.04, three dense layers for CNN and DNN, and 512 neurons per hidden layer. For GN, the default number of ECAL filters was 3, with a kernel size of 4.

Fig. 14 Classification accuracy of the best performing network for \(\gamma \) vs. \(\pi ^0\) (top) and e vs. \(\pi ^\pm \) (bottom), in bins of energy (left) and \(\eta \) (right)

Fig. 15 Regression bias (left) and resolution (right) as a function of true energy for energy predictions on the variable-angle REC dataset. From top to bottom: electrons, charged pions, photons, and neutral pions. Algorithms compared are linear regression, XGBoost (BDT), DNN, CNN, and GoogLeNet (GN)

After performing the hyperparameter scan, we trained each architecture using its optimal hyperparameters for a greater number of epochs. The evolution of the training and validation accuracy as a function of the batch number for these extended trainings is shown in Fig. 12.

5.3 Results

We apply the best architectures described in the previous section separately to our electron vs. charged pion and photon vs. neutral pion reconstruction problems.

5.3.1 Classification performance

Figure 13 shows ROC curve comparisons for the two classification tasks. As expected, the electron vs. charged pion classification problem was found to be a simple task, resulting in an area under the curve (AUC) close to \(100\%\). For a baseline comparison, the curve obtained for a BDT (see Appendix C) is also shown. This BDT was optimized using the scikit-optimize package [45], and was trained using high-level features computed from the raw 3D arrays. It represents the performance of current (non-deep-learning) ML approaches on these problems.

Our deep learning models outperform the BDT, with the GN reaching the best classification performance on both problems. Figure 14 shows the best-model performance as a function of the energy and \(\eta \) of the incoming particle, for the photon vs. neutral pion and the electron vs. charged pion problems. These figures show that classification accuracy is maintained over a wide range of particle energies and angles. The models perform slightly worse at higher energies in the photon vs. neutral pion case, because the \(\pi ^0 \rightarrow \gamma \gamma \) decay becomes increasingly collimated as the energy increases. Similarly, the performance is slightly worse when particles impact the detector perpendicularly than when they enter at a wide angle, because the shower cross section on the calorimeter inner surface is reduced at \(90^{\circ }\), making it harder to distinguish shower features.

5.3.2 Regression performance

Figure 15 shows the energy regression performance for each particle type, obtained from the end-to-end reconstruction architectures. In this case, we compare against a linear regression algorithm and a BDT (labelled as “XGBoost”) representing the current state-of-the-art, as described in Appendix D.

Since the energy regression problem is not as complex as the classification problem, the three architectures (DNN, CNN, GN) perform fairly similarly, with the exception of the GN performance on \(\pi ^\pm \), which is a bit worse. The performance is overall worse for \(\pi ^\pm \), both with the networks and with the benchmark baselines (linear regression and XGBoost).

Fig. 16 Example of the resampling procedure used to emulate CLIC data on a different detector geometry (the example shown here is simply a larger grid). First, we extrapolate hit information from one geometry to another (top). Next, we extrapolate back to the original geometry (bottom). This allows us to emulate the rougher granularity of the second geometry, while keeping data array sizes constant and enabling us to use the models we have already developed for the CLIC dataset. Note that some information is lost at the edges

A closer look at the performance boost given by each network can be obtained by examining the case of particles entering the calorimeter inner surface at \(90^{\circ }\), i.e. with \(\eta =0\). In this case, the problem is more constrained and both the networks and the baseline algorithms are able to perform accurately. The results for fixed-angle samples are shown in Appendix I.

We have also tested the result of training on one class of particle and performing regression on another. These results can be seen in Appendix J. In addition, we have looked at the effect on energy regression of increasing the ECAL and HCAL window sizes. This can be seen in Appendix K.

Table 1 Detailed description of the three detector geometries used in this study: the baseline CLIC ECAL [46] and the ATLAS [12] and CMS [13] ECALs
Fig. 17 ROC curve comparisons for variable-angle \(\gamma \)/\(\pi ^0\) classification on data resampled to ATLAS-like (top) and CMS-like (bottom) geometries. Algorithms compared are DNN, CNN, GoogLeNet (GN), and BDT

5.4 Resampling to ATLAS and CMS geometries

In addition to the results presented so far, we show in this section how the end-to-end reconstruction would perform on calorimeters with granularity and geometry similar to those of the ATLAS and CMS calorimeters. Since the REC dataset (see Sect. 2) is generated using the geometry of the proposed LCD detector, it has a much higher granularity than the current-generation ATLAS and CMS detectors. To visualize how our calorimeter data would look with a coarser detector, we linearly extrapolate the contents of each event to a different calorimeter geometry, using a process we have termed “resampling”. To keep the resampling procedure simple, we discard the HCAL information and consider only the ECAL 3D array.

Fig. 18 Bias (left) and resolution (right) as a function of true energy for photon energy predictions, on variable-angle samples resampled to ATLAS-like (top) and CMS-like (bottom) geometries

Fig. 19 Bias (left) and resolution (right) as a function of true energy for \(\pi ^0\) energy predictions, on variable-angle samples resampled to ATLAS-like (top) and CMS-like (bottom) geometries

A not-to-scale example of the full procedure is shown in Fig. 16. In this example, we resample the input to a regular square grid with lower granularity than the input data. The operation is simplified in the figure, in order to make the explanation easy to visualize. The actual ATLAS and CMS calorimeter geometries are more complex than a regular array, as described in Table 1.

In the resampling process, we first extrapolate each energy value from the grid of CLIC cells to a different geometry. To do so, we scale the content of each CLIC cell by the fraction of overlap area between the CLIC cell and the cell of the target geometry. When computing the overlap fraction, we take into account the fact that different materials have different properties (Moliere radius, interaction length, and radiation length). For instance, CLIC is more fine-grained than the CMS or ATLAS detectors, but the Moliere radius of the CLIC ECAL is also much smaller than in either of those detectors. This difference determines an offset in the fine binning. Thus, when applying our resampling procedure we normalize the cell size by the detector properties: the Moliere radius is used for the x and y re-binning, and the radiation length for the z direction. At this point we have a good approximation of how the event would look in a calorimeter with the target geometry.

To complete the resampling process, we invert the procedure to go back to our original high-granularity geometry. This last step allows us to keep using the model architectures that we have already optimized; it adds no information that is not present in the low-granularity geometry. The up-sampling also allows us to deal with the irregular geometry of the ATLAS calorimeter by turning it into a regular grid; without it, applying the CNN and GN models would not be possible. The procedure was validated by comparing total energies before and after resampling, and by visually comparing resampled grids. The energies did not match exactly, due to losses at the edge of the resampling grid, and the showers became much less granular after resampling, but overall the energies and distributions matched before and after the procedure was applied.
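
The core re-binning step can be illustrated with a one-dimensional numpy sketch: each source cell's energy is shared among target cells in proportion to their overlap, after cell sizes have been normalized by the relevant material property (Moliere radius transversely, radiation length in depth). The full 3D procedure, the inverse mapping back to the fine grid, and the actual ATLAS and CMS cell layouts are more involved than this illustration:

```python
import numpy as np

def rebin_1d(energies, src_size, tgt_size, n_target):
    """Share each source cell's energy among target cells by overlap fraction.

    energies : 1D array of energies in the source (fine) cells
    src_size : source cell size, already normalized by the material property
    tgt_size : target cell size, normalized the same way
    n_target : number of target cells (energy beyond them is lost at the edge)
    """
    out = np.zeros(n_target)
    for i, e in enumerate(energies):
        lo, hi = i * src_size, (i + 1) * src_size            # source cell extent
        first = int(lo // tgt_size)
        last = int(np.ceil(hi / tgt_size))
        for j in range(first, min(last, n_target)):
            overlap = min(hi, (j + 1) * tgt_size) - max(lo, j * tgt_size)
            out[j] += e * max(overlap, 0.0) / src_size       # overlap fraction
    return out
```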

The resampling procedure comes with a substantial simplification of the underlying physics process. First of all, the information at the edge of the grid is imperfectly translated during the resampling process, leading to worse performance than what could theoretically be achieved with the actual CMS and ATLAS detectors. In addition, this simple geometrical rescaling does not capture many other detector characteristics. For example, the CMS ECAL has no depth information, but, being homogeneous, it provides a very precise energy measurement; our resampling method only captures geometric effects and cannot model this improvement in energy resolution. Furthermore, we are unable to include second-order effects such as gaps in the detector geometries. Despite these limitations, one can still extract useful information from the resampled datasets by comparing the classification and regression performances of the end-to-end models defined in Sects. 5.3.1 and 5.3.2 on different detector geometries.

Comparisons of classification ROC curves between network architectures and our BDT baseline are shown in Fig. 17 for ATLAS-like and CMS-like geometries. Here we can see that the previously observed performance ranking still holds true. The GN model performs best, followed by the CNN, then the DNN. All three networks outperform the BDT baseline. The effect is less pronounced after the CMS-like resampling, due to the low granularity and the single detector layer in the z direction.

Regression results are shown in Figs. 18 and 19, for photons and neutral pions (we did not train on electrons or charged pions for this comparison). Here we have included the regression baselines, the DNN networks, and the CNN networks, but not GN (which we did not train on resampled data). The results obtained for the ATLAS-like resampling match those on the REC dataset, with DNN and CNN matching the BDT outcome in terms of bias and surpassing it in resolution. With the CMS-like resampling, the neural networks match but do not improve on the BDT energy regression resolution. Once again, this is due to the low spatial resolution of the CMS-like geometry, in particular the lack of z segmentation. We are unable to model the improved energy resolution of the actual CMS detector, so these energy regression results are based on geometry only.

6 Conclusion and future work

This paper shows how deep learning techniques could outperform traditional, resource-consuming techniques in tasks typical of physics experiments at particle colliders, such as particle shower simulation and reconstruction in a calorimeter. We consider several model architectures, notably 3D convolutional neural networks, and show competitive performance combined with short execution times. In addition, this strategy comes with a GPU-friendly computing solution and fits the current trend in particle physics towards heterogeneous computing platforms.

We confirm the findings of previous studies of this kind, but do so using a fully accurate detector simulation, based on a complete GEANT4 simulation of a full particle detector, including several detector components, the magnetic field, etc. In addition, we design the network so that different tasks are performed by a single architecture, optimized through a hyperparameter scan.

We look forward to the development of similar solutions for current and future particle detectors, for which this kind of end-to-end solution could be extremely helpful.