Introduction

Artificial Neural Networks (NNs) are a well-established tool for applications in Machine Learning and they are of increasing interest in both research and industry1,2,3,4,5,6. Inspired by biological NNs, they are able to recognise patterns while processing huge amounts of data. In a nutshell, a NN describes a functional mapping containing many variational parameters, which are optimised during the training procedure. Recently, deep connections between Machine Learning and quantum physics have been identified and continue to be uncovered7. On the one hand, NNs have been applied to describe the behaviour of complex quantum many-body systems8,9,10 while, on the other hand, quantum-inspired technologies and algorithms are being explored to solve Machine Learning tasks11,12,13.

One particular class of numerical methods originating from quantum physics that has increasingly been compared to NNs is Tensor Networks (TNs)14,15,16. TNs have been developed to investigate quantum many-body systems on classical computers by efficiently representing the exponentially large quantum wavefunction \(\left|\psi \right\rangle\) in a compact form, and they have proven to be an essential tool for a broad range of applications17,18,19,20,21,22,23,24,25,26. The accuracy of the TN approximation can be controlled with the so-called bond-dimension χ, an auxiliary dimension for the indices of the connected local tensors. Recently, it has been shown that TN methods can also be applied to solve Machine Learning (ML) tasks very effectively13,16,27,28,29,30,31. Indeed, even though NNs have been highly developed in recent decades by industry and research, the first TN-based ML approaches already yield comparable results when applied to standard datasets13,27,32. Due to their original development focusing on quantum systems, TNs allow one to easily compute quantities such as quantum correlations or the entanglement entropy, thereby granting access to insights into the learned data from a distinct point of view for ML applications16,30. Hereafter, we demonstrate the effectiveness of the approach and, more importantly, that it allows introducing algorithms to simplify and explain the learning process, unveiling a pathway to an explainable Artificial Intelligence. As a potential application of this approach, we present a TN-based supervised learning task: identifying the charge of b-quarks (i.e. b or \(\bar{b}\)) produced in high-energy proton–proton collisions at the Large Hadron Collider (LHC) accelerator at CERN.

In what follows, we first describe the quantum-inspired Tree Tensor Network (TTN) and introduce different quantities that can be extracted from the TTN classifier which are not easily accessible for biologically-inspired Deep NNs (DNNs), such as correlation functions and the entanglement entropy. These quantities can be used to explain the learning process and subsequent classifications, paving the way to an efficient and transparent ML tool. In this regard, we introduce the Quantum-Information Post-learning feature Selection (QuIPS), a protocol that reduces the complexity of the ML model based on the information the single features provide for the classification problem. We then briefly describe the LHCb experiment and its simulation framework, the main observables related to b-jet physics, and the relevant quantities for this analysis together with the underlying LHCb data33,34. We further compare the performance obtained by the DNN and the TTN, before presenting the analytical insights into the TTN which, among others, can be exploited to improve future data analysis of high-energy problems for a deeper physical understanding of the LHCb data. Moreover, we introduce the Quantum-Information Adaptive Network Optimisation (QIANO), which adapts the TN representation by reducing the number of free parameters based on the information captured within the TN while aiming to maintain the highest accuracy possible. Therewith, we can optimise the trained TN classifier for a targeted prediction speed without the necessity to relearn a new model from scratch.

TNs are not only a well-established way to represent a quantum wavefunction \(\left|\psi \right\rangle\), but, more generally, an efficient representation of information as such. In a mathematical context, a TN approximates a high-order tensor by a set of low-order tensors that are contracted in a particular underlying geometry; TNs share common roots with other decompositions, such as the Singular Value Decomposition (SVD) or the Tucker decomposition35. Among others, some of the most successful TN representations are the Matrix Product States—or Tensor Trains18,27,36,37, the TTN—or Hierarchical Tucker decomposition30,38,39, and the Projected Entangled Pair States40,41.

For a supervised learning problem, a TN can be used as the weight tensor W13,27,30, a high-order tensor which acts as a classifier for the input data {x}: each sample x is encoded by a feature map Φ(x) and subsequently classified by the weight tensor W. The final confidence of the classifier for a certain class labelled by l is given by the probability

$${{\mathcal{P}}}_{l}({\bf{x}})\,=\,{W}_{l}\,\cdot\, {{\Phi }}({\bf{x}})\,.$$
(1)

In the following, we use a TTN Ψ to represent W (see Fig. 1, bottom right), which can be described as a contraction of its N hierarchically connected local tensors \({T}^{[\eta ]}\)

$${{\Psi }}\,=\,\mathop{\sum}\limits_{\{\chi \}}{T}_{l,{\chi }_{1},{\chi }_{2}}^{[1]}\mathop{\prod }\limits_{\eta =2}^{N}{T}_{{\chi }_{\eta },{\chi }_{2\eta },{\chi }_{2\eta \,+\,1}}^{[\eta ]}$$
(2)

where η ∈ [1, N]. Therefore, we can interpret the TTN classifier Ψ as a set of quantum many-body wavefunctions \(\left|{\psi }_{l}\right\rangle\)—one for each of the class labels l (see Supplementary Methods). For the classification, we represent each sample x by a product state Φ(x). Therefore, we map each feature xi ∈ x into a quantum spin by choosing the feature map Φ(x) as a Kronecker product of N + 1 local feature maps

$${{{\Phi }}}^{[i]}({x}_{i})\,=\,\left[\cos \left(\frac{\pi {x}_{i}^{\prime}}{2}\right),\sin \left(\frac{\pi {x}_{i}^{\prime}}{2}\right)\right]$$
(3)

where \({x}_{i}^{\prime}\,\equiv\, {x}_{i}/{x}_{i,\text{max}}\,\in\, [0,1]\) is the re-scaled value with respect to the maximum xi,max within all samples of the training set.
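To make the encoding concrete, the following sketch (in Python with NumPy; a minimal illustration, not the implementation used in this work) builds the product state Φ(x) from the local feature maps of Eq. (3). Forming the full Kronecker product is only feasible for very few features; in practice, the TTN contracts the local maps directly without ever building the exponentially large vector.

```python
import numpy as np

def local_feature_map(x_i, x_i_max):
    # Eq. (3): map the re-scaled feature x'_i = x_i / x_i,max into a 2-component "spin"
    xp = x_i / x_i_max
    return np.array([np.cos(np.pi * xp / 2.0), np.sin(np.pi * xp / 2.0)])

def product_state(sample, maxima):
    # Kronecker product of the local feature maps (2^N components, illustration only)
    phi = np.array([1.0])
    for x_i, x_max in zip(sample, maxima):
        phi = np.kron(phi, local_feature_map(x_i, x_max))
    return phi
```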

Fig. 1: Data flow of the Machine Learning analysis for the b-jet classification of the LHCb experiment at CERN.

After proton–proton collisions, b- and \(\bar{b}\)-quarks are created, which subsequently fragment into particle jets (left). The different particles within the jets are tracked by the LHCb detector. Selected features of the detected particle data are used as input for the Machine Learning analysis by NNs and TNs in order to determine the charge of the initial quark (right).

Accordingly, we classify a sample x by computing the overlap \(\langle {{\Phi }}({\bf{x}})| {\psi }_{l}\rangle\) for all labels l with the product state Φ(x), resulting in the weighted probabilities

$${{\mathcal{P}}}_{l}\,=\,\frac{| \langle {{\Phi }}(x)| {\psi }_{l}\rangle {| }^{2}}{{\sum }_{l^{\prime} }| \langle {{\Phi }}(x)| {\psi }_{l^{\prime} }\rangle {| }^{2}}$$
(4)

for each class. We point out that the input data can also be encoded with different non-linear feature maps (see Supplementary Notes).
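As a sketch of Eq. (4), assuming (for illustration only) that the class wavefunctions are available as dense vectors of the same dimension as Φ(x); an actual TTN evaluates these overlaps by efficient tensor contractions:

```python
import numpy as np

def classify(phi, psi_by_label):
    # phi: encoded sample; psi_by_label: dict mapping label l -> state vector |psi_l>
    overlaps = {l: abs(np.vdot(psi, phi)) ** 2 for l, psi in psi_by_label.items()}
    norm = sum(overlaps.values())
    return {l: o / norm for l, o in overlaps.items()}  # Eq. (4)
```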

One of the major benefits of TNs in quantum mechanics is the accessibility of information within the network. They allow one to efficiently measure information quantities such as the entanglement entropy and correlations. Based on these quantum-inspired measurements, we here introduce the QuIPS protocol for the TN application in ML, which exploits the information encoded and accessible in the TN in order to rank the input features according to their importance for the classification.

In information theory, entropy as such is a measure of the information content inherent in the possible outcomes of variables, such as a classification42,43,44. In TNs, such information content can be assessed by means of the entanglement entropy S, which describes the shared information between TN bipartitions. The entanglement S is measured via the Schmidt decomposition, that is, decomposing \(\left|\psi \right\rangle\) into two bipartitions \(\left|{\psi }_{\alpha }^{A}\right\rangle\) and \(\left|{\psi }_{\alpha }^{B}\right\rangle\)44 such that

$${{\Psi }}\,=\,\mathop{\sum }\limits_{\alpha }^{\chi }{\lambda }_{\alpha }\left|{\psi }_{\alpha }^{A}\right\rangle \,\otimes\, \left|{\psi }_{\alpha }^{B}\right\rangle ,$$
(5)

where λα are the Schmidt-coefficients (non-zero, normalised singular values of the decomposition). The entanglement entropy is then defined as \(S\,=\,-{\sum }_{\alpha }{\lambda }_{\alpha }^{2}{\mathrm{ln}}\,{\lambda }_{\alpha }^{2}\). Consequently, the minimal entropy S = 0 is obtained only if we have one single non-zero singular value λ1 = 1. In this case, we can completely separate the two bipartitions as they share no information. On the contrary, higher S means that information is shared among the bipartitions.
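For any state available as an explicit vector, S can be computed with a single SVD, as in the following sketch (a minimal illustration, assuming the sites of bipartition A are grouped into the leading axis of dimension dim_a):

```python
import numpy as np

def entanglement_entropy(psi, dim_a):
    # Schmidt decomposition (Eq. (5)) via SVD of the matricised state
    m = np.asarray(psi).reshape(dim_a, -1)
    s = np.linalg.svd(m, compute_uv=False)
    lam2 = (s / np.linalg.norm(s)) ** 2   # squared, normalised Schmidt coefficients
    lam2 = lam2[lam2 > 1e-12]             # drop numerical zeros
    return float(-np.sum(lam2 * np.log(lam2)))  # S = -sum_a lambda_a^2 ln lambda_a^2
```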

In the ML context, the entropy can be interpreted as follows: if the features in one bipartition provide no valuable information for the classification task, the entropy is zero. On the contrary, S increases the more information shared between the two bipartitions is exploited. This analysis can be used to optimise the learning procedure: whenever S = 0 for a single-feature bipartition, that feature can be discarded with no loss of information for the classification. Thereby, a second model with fewer features and fewer tensors can be introduced. This second, more efficient model results in the same predictions in less time. On the contrary, a high bipartition entropy highlights which features—or combinations of features—are important for correct predictions.

The second set of measurements we take into account is the correlation functions

$${C}_{i,j}^{l}\,=\,\frac{\langle {\psi }_{l}| {\sigma }_{i}^{z}{\sigma }_{j}^{z}| {\psi }_{l}\rangle }{\langle {\psi }_{l}| {\psi }_{l}\rangle }$$
(6)

for each pair of features (located at sites i and j) and for each class l. The correlations offer an insight into the possible relation among the information that the two features provide. In case of maximum correlation or anti-correlation among them for all classes l, the information of one of the features can be obtained from the other one (and vice versa), thus one of them can be neglected. In case of no correlation among them, the two features may provide fundamentally different information for the classification. The correlation analysis allows pinpointing whether two features give independent information. However, the correlation itself—in contrast to the entropy—does not tell whether this information is important for the classification.
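Since σz is diagonal in the computational basis, Eq. (6) reduces to a diagonal expectation value for a dense state vector, as sketched below (again an illustration; a TTN measures such correlators by local contractions):

```python
import numpy as np

def zz_correlation(psi, i, j, n_sites):
    # <psi| sigma^z_i sigma^z_j |psi> / <psi|psi>, Eq. (6), for an n_sites spin-1/2 state
    diag = np.ones(1)
    for k in range(n_sites):
        local = np.array([1.0, -1.0]) if k in (i, j) else np.ones(2)
        diag = np.kron(diag, local)
    return float(np.vdot(psi, diag * psi).real / np.vdot(psi, psi).real)
```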

In conclusion, based on the previous insights, namely: (i) a low entropy of a feature bipartition signals that one of the two bipartitions can be discarded with negligible loss of information, and (ii) if two features are completely (anti-)correlated we can neglect at least one of them, the QuIPS enables us to filter out the most valuable features for the classification.
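The text does not spell out the exact selection rule, but a minimal reading of criteria (i) and (ii) could look like the following sketch, where entropies and corr are the measured single-feature entropies and the correlation matrix of Eq. (6), and the thresholds s_min and c_max are hypothetical tuning parameters:

```python
def quips_select(entropies, corr, s_min=0.01, c_max=0.99):
    # (i) drop features whose single-feature bipartition entropy is ~0
    candidates = [i for i, s in enumerate(entropies) if s > s_min]
    # (ii) of each fully (anti-)correlated pair, keep only the higher-entropy feature
    selected = []
    for i in sorted(candidates, key=lambda k: -entropies[k]):
        if all(abs(corr[i][j]) < c_max for j in selected):
            selected.append(i)
    return sorted(selected)
```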

Nowadays, in particle physics, ML is widely used for the classification of jets, i.e. streams of particles produced by the fragmentation of quarks and gluons. The jet substructure can be exploited to solve such classification problems45. ML techniques have been proposed to identify boosted, hadronically decaying top quarks46, or to identify the jet charge47. The ATLAS and CMS collaborations developed ML algorithms in order to identify jets generated by the fragmentation of b-quarks48,49,50: a comprehensive review of ML techniques at the LHC can be found in ref. 51.

The LHCb experiment in particular is, among others, dedicated to the study of the physics of b- and c-quarks produced in proton–proton collisions. Here, ML methods have been introduced recently for the discrimination between b- and c-jets by using Boosted Decision Tree classifiers52. However, a crucial topic for the LHCb experiment, not yet addressed with ML, is the identification of the charge of a b-quark, i.e. discriminating between a b and a \(\bar{b}\). Such identification can be used in many physics measurements, and it is the core of the determination of the charge asymmetry in b-pair production, a quantity sensitive to physics beyond the Standard Model53. Whenever produced in a scattering event, b-quarks have a short lifetime as free particles; indeed, they manifest themselves as bound states (hadrons) or as narrow cones of particles produced by the hadronization (jets). In the case of the LHCb experiment, the b-jets are detected by the apparatus located in the forward region of proton–proton collisions (see Fig. 1, left)54. The LHCb detector includes a particle identification system that distinguishes different types of charged particles within the jet, and a high-precision tracking system able to measure the momentum of each particle55. Still, the separation between b- and \(\bar{b}\)-jets is a highly difficult task because the b-quark fragmentation produces dozens of particles via non-perturbative Quantum Chromodynamics processes, resulting in non-trivial correlations between the jet particles and the original quark.

The algorithms used to identify the charge of the b-quarks based on information on the jets are called tagging methods. The tagging algorithm performance is typically quantified with the tagging power ϵtag, representing the effective fraction of jets that contribute to the statistical uncertainty in an asymmetry measurement56,57. In particular, the tagging power ϵtag takes into account the efficiency ϵeff (the fraction of jets for which the classifier takes a decision) and the prediction accuracy a (the fraction of correctly classified jets among them) as follows:

$${\epsilon }_{tag}\,=\,{\epsilon }_{eff}\,\cdot\, {(2a\,-\,1)}^{2}\,.$$
(7)
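In code, Eq. (7) is a one-liner; as a consistency check against the numbers quoted in the Results below, the TTN values ϵeff = 0.545 and a = 0.7056 give ϵtag ≈ 0.092:

```python
def tagging_power(eff, acc):
    # Eq. (7): effective fraction of jets contributing to the statistical power
    return eff * (2.0 * acc - 1.0) ** 2

print(tagging_power(0.545, 0.7056))  # ~0.092
```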

To date, the muon tagging method gives the best performance on the b- vs. \(\bar{b}\)-jet discrimination using the dataset collected in the LHC Run I58: here, the muon with the highest momentum in the jet is selected, and its electric charge is used to decide on the b-quark charge.

For the ML application, we now formulate the identification of the b-quark charge in terms of a supervised learning problem. As described above, we implemented a TTN as a classifier and applied it to the LHCb problem, analysing its performance. Alongside, a DNN analysis is performed to the best of our capabilities, and both algorithms are compared with the muon tagging approach. Both the TTN and the DNN use as input for the supervised learning 16 features of the jet substructure from the official simulation data released by the LHCb collaboration33,34. The 16 features are determined as follows: the muon with the highest pT among all detected muons in the jet is selected, and the same is done for the highest-pT kaon, pion, electron, and proton, resulting in 5 selected particles. For each particle, three observables are considered: (i) the momentum relative to the jet axis (\({p}_{{\rm{T}}}^{rel}\)), (ii) the particle charge (q), and (iii) the distance between the particle and the jet axis (ΔR), for a total of 5 × 3 observables. If a particle type is not found in a jet, the related features are set to 0. The 16th feature is the total jet charge Q, defined as the weighted average of the particle charges qi inside the jet, using the particles' \({p}_{{\rm{T}}}^{rel}\) as weights:

$$Q\,=\,\frac{{\sum }_{i}{({p}_{{\rm{T}}}^{rel})}_{i}{q}_{i}}{{\sum }_{i}{({p}_{{\rm{T}}}^{rel})}_{i}}\,.$$
(8)
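A direct transcription of Eq. (8) (a sketch; pt_rel and q are arrays over the particles found in the jet):

```python
import numpy as np

def jet_charge(pt_rel, q):
    # Eq. (8): p_T^rel-weighted average of the particle charges q_i inside the jet
    pt_rel = np.asarray(pt_rel, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(pt_rel * q) / np.sum(pt_rel))
```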

Results

Analysis framework

In the following, we present the jet classification performance for the TTN and the DNN applied to the LHCb dataset, also comparing both ML techniques with the muon tagging approach. For the DNN we use an optimised network with three hidden layers of 96 nodes (see Supplementary Methods for details). Hereafter, we aim to compare the best possible performance of both approaches; therefore, we optimised the hyperparameters of both methods in order to obtain the best possible results from each of them, TTN and DNN. We split the dataset of about 700k events (samples) into two sub-sets: 60% of the samples are used in the training process, while the remaining 40% are used as a test set to evaluate and compare the different methods. For each event prediction after the training procedure, both ML models output the probability \({{\mathcal{P}}}_{b}\) to classify the event as a jet generated by a b- or a \(\bar{b}\)-quark. A threshold Δ around \({{\mathcal{P}}}_{b}\,=\,0.5\) is then defined, within which we classify the quark as unknown, in order to optimise the overall tagging power ϵtag.

Jet classification performance

Applying both ML approaches to the test data after the training procedure, we obtain similar performances in terms of raw prediction accuracy: the TTN takes a decision on the charge of the quark in \({\epsilon }_{eff}^{\,\text{TTN}\,}\,=\,54.5 \%\) of the cases with an overall accuracy of aTTN = 70.56%, while the DNN decides in \({\epsilon }_{eff}^{\,\text{DNN}\,}\,=\,55.3 \%\) of the samples with aDNN = 70.49%. We checked both approaches for biases in physical quantities to ensure that both methods are able to properly capture the physical process behind the problem and thus that they can be used as valid tagging methods for LHCb events (see Supplementary Methods).

In Fig. 2a we present the tagging power of the different approaches as a function of the jet transverse momentum pT. Evidently, both ML methods perform significantly better than the muon tagging approach for the complete range of jet transverse momentum pT, while the TTN and DNN display comparable performances within the statistical uncertainties.

Fig. 2: Comparison of the DNN and TTN analysis.

a Tagging power for the DNN (green), TTN (blue) and the muon tagging (red), (b) ROC curves for the DNN (green) and the TTN (blue, but completely covered by DNN), compared with the line of no-discrimination (dotted navy-blue line), (c) probability distribution for the DNN and (d) for the TTN. In the two distributions (c, d), the correctly classified events (green) are shown in the total distribution (light blue). Below, in black all samples where a muon was detected in the jet.

In Fig. 2c, d we present the histograms of the confidences for predicting a b-flavoured jet for all samples in the test dataset for the DNN and the TTN, respectively. Interestingly, even though both approaches give similar performances in terms of overall precision and tagging power, the prediction confidences are fundamentally different. For the DNN, we see a Gaussian-like distribution with, in general, not very high confidence for each prediction. Thus, we obtain fewer correct predictions with high confidence, but at the same time, fewer wrong predictions with high confidence compared to the TTN predictions. On the other hand, the TTN displays a flatter distribution including more predictions—correct and incorrect—with higher confidence. Remarkably though, we can see peaks for extremely confident predictions (around 0 and around 1) for the TTN. These peaks can be traced back to the presence of the muon, whose charge is a well-defined predictor for a jet generated by a b-quark. The DNN lacks these confident predictions exploiting the muon charge. Further, we mention that using different cost functions for the DNN, i.e. the cross-entropy loss function and the Mean Squared Error, leads to similar results (see Supplementary Methods).

Finally, in Fig. 2b we present the Receiver Operating Characteristic (ROC) curves for the TTN and the DNN together with the line of no-discrimination, which represents a randomly guessing classifier: the two ROC curves for TTN and DNN are perfectly coincident, and the Area Under the Curve (AUC) for the two classifiers is almost the same (AUCTTN = 0.689 and AUCDNN = 0.690). The graph illustrates the similarity in the outputs between TTN and DNN despite the different confidence distributions. This is further confirmed by a Pearson correlation factor of r = 0.97 between the outputs of the two classifiers.

In conclusion, the two different approaches result in similar outcomes in terms of prediction performances. However, the underlying information used by the two discriminators is inherently different. For instance, the DNN predicts more conservatively, in the sense that the confidences for each prediction tend to be lower compared with the TTN. Additionally, the DNN does not exploit the presence of the muon as strongly as the TTN, even though the muon is a good predictor for the classification.

Exploiting insights into the data with TTN

As previously mentioned, the TTN analysis allows us to efficiently measure the captured correlations and the entanglement within the classifier. These measurements give insight into the learned data and can be exploited via QuIPS to identify the most important features used for the classification.

In Fig. 3a we present the correlation analysis, allowing us to pinpoint whether two features give independent information. For both labels (\(l\,=\,b,\bar{b}\)) the results are very similar, thus in Fig. 3a we present only l = b. We see, among other things, that the momentum \({p}_{T}^{rel}\) and distance ΔR are correlated for all particles except the kaon. Thus, the kaon provides information for the classification which seems to be independent of the information gained from the other particles. However, the correlation itself does not tell whether this information is important for the classification. Thus, we compute the entanglement entropy S of each feature, as reported in Fig. 3b. Here, we conclude that the features with the highest information content are the total charge and the \({p}_{T}^{rel}\) and distance ΔR of the kaon. Driven by these insights, we employ the QuIPS to discard half of the features by selecting the eight most important ones: i.–iii. charge, momentum and distance of the muon, iv.–vi. charge, momentum and distance of the kaon, vii. charge of the pion and viii. total detected charge. To test the QuIPS performance, we compared it with an independent but more time-expensive analysis of the importance of the different particle types: the two approaches matched perfectly. Further, we studied two new models, one composed of the eight most important features proposed by the QuIPS and, for comparison, another with the eight discarded features. In Fig. 3c we show the tagging power for the different analyses with the complete 16 sites (model M16), the best 8 (B8), the worst 8 (W8) and the muon tagging. Remarkably, we see that the models M16 and B8 give comparable results, while the model W8 performs even worse than the classical approach. These performances are confirmed by the prediction accuracy of the different models: while less than 1% of accuracy is lost from M16 to B8, the accuracy of the model W8 drastically drops to around 52%—that is, almost random predictions. Finally, in this particular run, the model B8 has been trained 4.7 times faster with respect to model M16 and predicts 5.5 times faster as well (the actual speed-up depends on the bond-dimension and other hyperparameters).

Fig. 3: Exploiting the information provided by the learned TTN classifier.

a Correlations between the 16 input features (blue for anti-correlated, white for uncorrelated, red for correlated). The numbers indicate q, \({p}_{T}^{rel}\), ΔR of the muon (1–3), kaon (4–6), pion (7–9), electron (10–12), proton (13–15) and the jet charge Q (16). b Entropy of each feature as a measure of the information provided for the classification. c Tagging power for learning on all features (blue), the best eight proposed by QuIPS exploiting insights from (a, b) (magenta), the worst eight (yellow) and the muon tagging (red). d Tagging power for decreasing bond-dimension truncated after training: the complete model (blue shades for χ = 100, χ = 50, χ = 5), using the QuIPS best 8 features only (violet shades for χ = 16, χ = 5), and the muon tagging (red).

A critical point of interest in real-time ML applications is the prediction time. For example, in the LHCb Run 2 data-taking, the high-level software trigger takes a decision approximately every 1 μs55, and shorter latencies are expected in future Runs. Consequently, with the aid of the QuIPS protocol, we can efficiently reduce the prediction time while maintaining a comparably high prediction power. However, with TTNs, we can take a further step to reduce the prediction time by reducing the bond-dimension χ after the training procedure. Here, we introduce the QIANO, which performs this truncation by means of the well-established SVD for TNs18,23,25 in a way that ensures the least possible infidelity is introduced. In other words, QIANO can adjust the bond-dimension χ to achieve a targeted prediction time while keeping the prediction accuracy reasonably high. We stress that this can be done without relearning a new model, as would be the case with a NN.
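The elementary operation behind this kind of truncation is the standard SVD compression used throughout TN algorithms, sketched below for a single bond; here theta is assumed to be the matricisation of two neighbouring tensors along the bond to be truncated (how the truncation is swept across the full TTN is beyond this sketch):

```python
import numpy as np

def truncate_bond(theta, chi_new):
    # Keep the chi_new largest singular values: the optimal rank-chi_new
    # approximation, i.e. the least infidelity for the given bond dimension
    u, s, vh = np.linalg.svd(theta, full_matrices=False)
    chi = min(chi_new, s.size)
    s_kept = s[:chi] / np.linalg.norm(s[:chi])  # re-normalise the truncated state
    return u[:, :chi] * s_kept, vh[:chi, :]     # the two truncated tensors
```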

Finally, we apply QuIPS and QIANO to reduce the information in the TTN in an optimal way for a targeted balance between prediction time and accuracy. In Fig. 3d we show the tagging power obtained by taking the original TTN and truncating it to different bond-dimensions χ. We can see that, even though we compress quite heavily, the overall tagging power does not change significantly. In fact, we only drop about 0.03% in the overall prediction accuracy, while at the same time improving the average prediction time from 345 to 37 μs (see Table 1). Applying the same idea to the model B8, we can reduce the average prediction time effectively down to 19 μs on our machines, a performance compatible with current real-time classification rates.

Table 1 TTN prediction time.

Discussion

We analysed an LHCb dataset for the classification of b- and \(\bar{b}\)-jets with two different ML approaches, a DNN and a TTN. We showed that with both techniques we obtained a tagging power about one order of magnitude higher than the classical muon tagging approach, which to date is the best published result for this classification problem. We pointed out that, even though both approaches result in similar tagging power, they treat the data very differently. In particular, the TTN effectively recognises the importance of the presence of the muon as a strong predictor for the jet classification. Here, we point out that we only used a conjugate gradient descent for the optimisation of our TTN classifier. Deploying more sophisticated optimisation procedures which have already been proven to work for Tensor Trains, such as stochastic gradient descent59 or Riemannian optimisation28, may further improve the performance (in both time and accuracy) in future applications.

We further explained the crucial benefits of the TTN approach over DNNs, namely (i) the ability to efficiently measure correlations and the entanglement entropy, and (ii) the power of compressing the network while keeping a high amount of information (to some extent even lossless compression). We showed how the former quantum-inspired measurements help to set up a more efficient ML model: in particular, by introducing an information-based heuristic technique, we can establish the importance of single features based on the information captured within the trained TTN classifier alone. Using this insight, we introduced the QuIPS, which can significantly reduce the model complexity by discarding the least-important features while maintaining high prediction accuracy. This selection of features based on their informational importance for the trained classifier is one major advantage of TNs, effectively decreasing training and prediction time. Regarding the latter benefit of the TTN, we introduced the QIANO, which allows us to decrease the TTN prediction time by optimally reducing its representative power based on information from the quantum entropy, introducing the least possible infidelity. In contrast to DNNs, with the QIANO we do not need to set up a new model and train it from scratch; instead, we can optimise the network post-learning, adapting it to the specific conditions, e.g. the CPU used or the required prediction time of the final application.

Finally, we showed that using QuIPS and QIANO we can effectively compress the trained TTN to target a given prediction time. In particular, we decreased our prediction times from 345 to 19 μs. We stress that, while we only used one CPU for the predictions, in future applications we might obtain a speed-up of 10 to 100 times by parallelising the tensor contractions on GPUs60. Thus, we are confident that it is possible to reach a MHz prediction rate while still obtaining results significantly better than the classical muon tagging approach. Here, we also point out that, to use this algorithm on the LHCb real-time data acquisition system, it would be necessary to develop custom electronics such as FPGAs, or GPUs with an optimised architecture. Such solutions should be explored in the future.

Given the competitive performance of the presented TTN method in its application to high-energy physics, we envisage a multitude of possible future applications in high-energy experiments at CERN and in other fields of science. Future applications of our approach in the LHCb experiment may include the discrimination between b-jets, c-jets and light-flavour jets52. A fast and efficient real-time identification of b- and c-jets can be the key point for several studies in high-energy physics, ranging from the search for the rare Higgs boson decay into two c-quarks, up to the search for new particles decaying into a pair of heavy-flavour quarks (\(b\bar{b}\) or \(c\bar{c}\)).

Methods

LHCb particle detection

LHCb is fully instrumented in the phase space region of proton–proton collisions defined by the pseudo-rapidity (η) range [2, 5], with η defined as

$$\eta \,=\,-{\rm{log}}\left[\tan \left(\frac{\theta }{2}\right)\right]\,,$$
(9)

where θ is the angle between the particle momentum and the beam axis (see Fig. 4). The direction of a particle's momentum can be fully identified by η and by the azimuthal angle ϕ, defined as the angle in the plane transverse to the beam axis. The projection of the momentum in this plane is called the transverse momentum (pT). The energy of charged and neutral particles is measured by electromagnetic and hadronic calorimeters. In the following, we work with physics natural units.
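Eq. (9) in code (θ in radians, measured from the beam axis):

```python
import numpy as np

def pseudo_rapidity(theta):
    # Eq. (9): eta = -log(tan(theta / 2))
    return -np.log(np.tan(theta / 2.0))
```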

Fig. 4: Illustrative sketch showing an LHCb experiment and the two possible tagging algorithms.

A single particle tagging algorithm, exploiting information coming from one single particle (muon), and the inclusive tagging algorithm which exploits the information on all the jet constituents.

At LHCb, jets are reconstructed using a Particle Flow algorithm61 for charged and neutral particle selection and using the anti-kt algorithm62 for clusterization. The jet momentum is defined as the sum of the momenta of the particles that form the jet, while the jet axis is defined as the direction of the jet momentum. Most of the particles that form the jet are contained in a cone of radius \({{\Delta }}R\,=\,\sqrt{{({{\Delta }}\eta )}^{2}\,+\,{({{\Delta }}\phi )}^{2}}\,=\,0.5\), where Δη and Δϕ are respectively the pseudo-rapidity difference and the azimuthal angle difference between the particle momenta and the jet axis. For each particle inside the jet cone, the momentum relative to the jet axis (\({p}_{{\rm{T}}}^{rel}\)) is defined as the projection of the particle momentum onto the plane transverse to the jet axis.
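The cone distance can be computed as in the sketch below; wrapping Δϕ into [−π, π) is the usual convention, although the text does not state it explicitly:

```python
import numpy as np

def delta_r(eta, phi, eta_jet, phi_jet):
    # Delta R = sqrt((Delta eta)^2 + (Delta phi)^2) with respect to the jet axis
    dphi = (phi - phi_jet + np.pi) % (2.0 * np.pi) - np.pi  # wrap to [-pi, pi)
    return float(np.hypot(eta - eta_jet, dphi))
```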

LHCb dataset

Unlike other ML performance analyses, the dataset used in this paper has been prepared specifically for this LHCb classification problem; therefore, baseline ML models and benchmarks for it do not exist. In particle physics, features strongly depend on the detector considered (i.e. different experiments may have a different response to the same physical object), and for this reason the training has been performed on a dataset that reproduces the LHCb experimental conditions, in order to obtain the optimal performance with this experiment.

The LHCb simulation datasets used for our analysis are produced with a Monte Carlo technique using the framework GAUSS63, which makes use of PYTHIA 864 to generate proton–proton interactions and jet fragmentation and uses EvtGen65 to simulate b-hadron decays. The GEANT4 software66,67 is used to simulate the detector response, and the signals are digitised and reconstructed using the LHCb analysis framework.

The dataset used contains b- and \(\bar{b}\)-jets produced in proton–proton collisions at a centre-of-mass energy of 13 TeV33,34. Pairs of b-jets and \(\bar{b}\)-jets are selected by requiring a jet pT greater than 20 GeV and η in the range [2.2, 4.2] for both jets.

Muon tagging

LHCb measured the \(b\bar{b}\) forward-central asymmetry using the dataset collected in the LHC Run I58 using the muon tagging approach: in this method, the muon with the highest momentum in the jet cone is selected, and its electric charge is used to decide on the b-quark charge. In fact, if this muon is produced in the original semi-leptonic decay of the b-hadron, its charge is totally correlated with the b-quark charge. To date, the muon tagging method gives the best performance on the b- vs. \(\bar{b}\)-jet discrimination. Although this method can distinguish between b- and \(\bar{b}\)-quarks with good accuracy, its efficiency is low, as it is only applicable to jets where a muon is found and it is intrinsically limited by the b-hadron branching ratio into semi-leptonic decays. Additionally, the muon tagging may fail in some scenarios where the selected muon is produced not by the decay of the b-hadron but in other decay processes. In these cases, the muon charge may not be completely correlated with the b-quark charge.

Machine learning approaches

We train the TTN and analyse the data with different bond dimensions χ. The auxiliary dimension χ controls the number of free parameters within the variational TTN ansatz. While the TTN is able to capture more information from the training data with increasing bond-dimension χ, choosing χ too large may lead to overfitting and thus can worsen the results in the test set. For the DNN we use an optimised network with three hidden layers of 96 nodes (see Supplementary Methods for details).

For each event prediction, both methods give as output the probability \({{\mathcal{P}}}_{b}\) to classify a jet as generated by a b- or a \(\bar{b}\)-quark. This probability (i.e. the confidence of the classifier) is normalised in the following way: for values of probability \({{\mathcal{P}}}_{b}\,> \,0.5\) (\({{\mathcal{P}}}_{b}\,<\,0.5\)) a jet is classified as generated by a b-quark (\(\bar{b}\)-quark), with increasing confidence going to \({{\mathcal{P}}}_{b}\,=\,1\) (\({{\mathcal{P}}}_{b}\,=\,0\)). Therefore, a completely confident classifier returns a probability distribution peaked at \({{\mathcal{P}}}_{b}\,=\,1\) and \({{\mathcal{P}}}_{b}\,=\,0\) for jets classified as generated by b- and \(\bar{b}\)-quarks, respectively.

We introduce a threshold Δ symmetrically around the prediction confidence of \({{\mathcal{P}}}_{b}\,=\,0.5\) in which we classify the event as unknown. We optimise the cut on the predictions of the classifiers (i.e. their confidences) to maximise the tagging power for each method based on the training samples. In the following analysis we find ΔTTN = 0.40 (ΔDNN = 0.20) for the TTN (DNN). Thereby, we predict for the TTN (DNN) a b-quark with confidences \({{\mathcal{P}}}_{b}\,> \,{C}^{\text{TTN}}\,=\,0.70\) (\({{\mathcal{P}}}_{b}\,> \,{C}^{\text{DNN}}\,=\,0.60\)), a \(\bar{b}\)-quark with confidences \({{\mathcal{P}}}_{b}\,<\,0.30\) (\({{\mathcal{P}}}_{b}\,<\,0.40\)) and no prediction for the range in between (see Fig. 2c, d).
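Putting the thresholds together, the decision rule reads as in the sketch below; e.g. Δ = 0.40 reproduces the TTN window [0.30, 0.70] quoted above:

```python
def tag_decision(p_b, delta):
    # Symmetric window of width delta around P_b = 0.5 -> no decision ("unknown")
    if p_b > 0.5 + delta / 2.0:
        return "b"
    if p_b < 0.5 - delta / 2.0:
        return "bbar"
    return "unknown"
```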