Abstract
Classification, the discrimination between examples belonging to different classes, is a fundamental aspect of most scientific and engineering activities. Machine Learning (ML) tools have proved to perform very well in this task, in the sense that they can achieve very high success rates. However, both the “realism” and the interpretability of their models are low, leading to modest increases in knowledge and limited applicability, particularly in applications related to nonlinear and complex systems. This paper describes a methodology which, by applying ML tools directly to the data, allows the formulation of new scientific models that describe the actual “physics” determining the boundary between the classes. The proposed technique consists of a stack of different ML tools, each applied to a specific subtask of the scientific analysis; together they form a system that combines all the major strands of machine learning, from rule-based classifiers and Bayesian statistics to genetic programming and symbolic manipulation. To take into account the error bars of the measurements generating the data, an essential aspect of scientific inference, the novel concept of the Geodesic Distance on Gaussian manifolds is adopted. The properties of the methodology have been investigated with a series of systematic numerical tests for different types of classification problems. The potential of the approach to handle real data has been tested with various experimental databases, built using measurements collected in the investigation of complex systems. The results indicate that the proposed method can find physically meaningful mathematical equations, which reflect the actual phenomena under study. The developed techniques therefore constitute a very useful information processing system for bridging the gap between data, machine learning models and scientific theories.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.
Appendices
Appendix 1 Numerical Tests for SR via GP to obtain realistic boundary equations
The procedure described in Sect. 6 has been subjected to a systematic series of numerical tests. A significant set of these tests is reported in this Appendix after a detailed description of the numerical method implemented.
The main technique to produce synthetic data and to test the methodology consists of the following 6 steps:
1. Definition of an initial function for the boundary
2. Generating samples of the two classes from the function
3. Training the SVM for classification
4. Building an appropriate mesh on the domain
5. Determining a sufficient number of points on the hyper-surface identified by the SVM
6. Deploying symbolic regression to identify the equation of the hypersurface from the points previously obtained
In the rest of this Appendix, more details about this procedure are provided, with the discussion restricted to the case of binary classification with a traditional SVM.
In the first step, an initial function is defined as a combination of arithmetic, trigonometric, and exponential operators acting on the independent variables xi. In general, this function can be written as follows:
In the second step, an adequate number of random points are generated in the valid range of the variables. Then, a positive offset and some random values are added to y for half of the data to produce the first class; a negative offset and some random values are added to y for the other half to produce the second class. The equations for producing the two classes can be summarized as follows:
where y1 and y2 are the values for the first and second class, respectively.
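The first two steps can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the boundary function `boundary`, the variable ranges, the `offset` and `noise` values, and the choice of a random term that always pushes a point further away from the boundary are hypothetical and are not taken from the paper.

```python
import math
import random

def boundary(x1, x2):
    """Hypothetical boundary function, chosen only for illustration."""
    return x1 * x2 + math.sin(x1)

def make_classes(n, offset=0.5, noise=0.1, seed=0):
    """Return n samples (x1, x2, y, label), half in each class."""
    rng = random.Random(seed)
    samples = []
    for k in range(n):
        x1 = rng.uniform(1.0, 2.0)        # assumed valid range of x1
        x2 = rng.uniform(3.0, 5.0)        # assumed valid range of x2
        jitter = rng.uniform(0.0, noise)  # assumed non-negative random term
        if k < n // 2:
            # First class: boundary value plus a positive offset and noise.
            y, label = boundary(x1, x2) + offset + jitter, +1
        else:
            # Second class: boundary value plus a negative offset and noise.
            y, label = boundary(x1, x2) - offset - jitter, -1
        samples.append((x1, x2, y, label))
    return samples

samples = make_classes(200)
```

With this choice of random term the two classes are perfectly separable; allowing the random term to take either sign would instead produce some overlap between the classes.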
In the third step, an SVM with a Gaussian Radial Basis Function kernel is trained. The separating hyperplane is found with the Sequential Minimal Optimization method. Depending on the level of random noise, different success rates can be obtained. For the numerical tests presented in the following, the classification success rate of the SVM is always very close to 100%.
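For reference, the standard textbook form of the Gaussian RBF kernel and of the resulting SVM decision function is recalled below. The notation is an assumption of this note rather than the paper's own: \(\sigma\) is the kernel width, \(\alpha_{i}\) are the multipliers of the support vectors \(\mathbf{x}_{i}\), \(t_{i} = \pm 1\) are their class labels, and \(b\) is the bias.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}{2\sigma^{2}} \right),
\qquad
D(\mathbf{x}) = \sum_{i \in \mathrm{SV}} \alpha_{i}\, t_{i}\, K(\mathbf{x}_{i}, \mathbf{x}) + b
```

The boundary between the two classes is the set of points where \(D(\mathbf{x}) = 0\). Sequential Minimal Optimization determines the multipliers \(\alpha_{i}\) by repeatedly optimising two of them at a time while keeping the others fixed.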
In the fourth step, a mesh on the domain has to be built in order to identify points sufficiently close to the hypersurface.
The fifth step consists of the identification of the points sufficiently close to the hypersurface, with the algorithm described in Sect. 6.
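Steps four and five can be sketched as follows. The algorithm of Sect. 6 is not reproduced here; as a stand-in, this sketch scans a regular mesh and keeps the points where a decision function changes sign between neighbouring nodes. The function name `boundary_points`, the mesh sizes, and the toy decision function are all hypothetical.

```python
def boundary_points(decision, x_range, y_range, nx=50, ny=200):
    """Locate points where `decision` changes sign along y on a regular mesh.

    `decision(x, y)` is any callable returning a signed score; it stands in
    for a trained SVM decision function (an assumption of this sketch).
    """
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    points = []
    for i in range(nx):
        x = x_lo + (x_hi - x_lo) * i / (nx - 1)
        prev_y = y_lo
        prev_d = decision(x, prev_y)
        for j in range(1, ny):
            y = y_lo + (y_hi - y_lo) * j / (ny - 1)
            d = decision(x, y)
            if prev_d * d < 0:
                # Linear interpolation of the zero crossing between two nodes.
                t = prev_d / (prev_d - d)
                points.append((x, prev_y + t * (y - prev_y)))
            prev_y, prev_d = y, d
    return points

# Toy decision function whose zero level set is y = x**2 (illustration only).
pts = boundary_points(lambda x, y: y - x * x, (0.0, 1.0), (-0.5, 1.5))
```

A finer mesh gives more boundary points and a smaller interpolation error, at the cost of more evaluations of the decision function.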
In the sixth step, the selected hypersurface points are used as inputs to the symbolic regression code, to find the appropriate formula for describing the hypersurface. The settings adopted to run the GP implementing the SR are reported in Table 2.
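The idea behind the sixth step can be illustrated with a deliberately simplified stand-in. The sketch below is not genetic programming: it exhaustively enumerates a tiny expression grammar and selects the candidate with the lowest mean squared error plus a size penalty. The grammar, the `penalty` value, and all function names are hypothetical; the actual GP settings are those of Table 2 and are not reproduced here.

```python
import itertools
import math

# Operators of the toy grammar (an assumption of this sketch).
BINARY = [("+", lambda a, b: a + b),
          ("-", lambda a, b: a - b),
          ("*", lambda a, b: a * b)]
UNARY = [("sin", math.sin), ("exp", math.exp)]

def candidates():
    """Yield (text, function, size) for every expression in the toy grammar."""
    leaves = [("x1", lambda x1, x2: x1, 1), ("x2", lambda x1, x2: x2, 1)]
    terms = list(leaves)
    for name, f in UNARY:
        for text, g, size in leaves:
            terms.append((f"{name}({text})",
                          lambda x1, x2, f=f, g=g: f(g(x1, x2)),
                          size + 1))
    yield from terms
    for (sym, op), (ta, fa, sa), (tb, fb, sb) in itertools.product(BINARY, terms, terms):
        yield (f"({ta} {sym} {tb})",
               lambda x1, x2, op=op, fa=fa, fb=fb: op(fa(x1, x2), fb(x1, x2)),
               sa + sb + 1)

def fitness(func, size, points, penalty=1e-3):
    """Mean squared error on the boundary points plus a parsimony penalty."""
    mse = sum((func(x1, x2) - y) ** 2 for x1, x2, y in points) / len(points)
    return mse + penalty * size

def best_expression(points):
    """Return the text of the candidate with the lowest penalised error."""
    return min(candidates(), key=lambda c: fitness(c[1], c[2], points))[0]

# Toy boundary points lying exactly on y = x1 + sin(x2) (illustration only).
points = []
for i in range(1, 6):
    for j in range(1, 6):
        x1, x2 = 0.5 * i, 0.4 * j
        points.append((x1, x2, x1 + math.sin(x2)))

print(best_expression(points))
```

The size penalty plays the role of a parsimony pressure: among expressions that fit the points equally well, the shortest one is preferred.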
1.1 Examples for two independent variables
Example 1
As a first test, a purely arithmetic function has been tested. The function and ranges of the variables are:
After carrying out the six-step procedure described in Sect. 6, the following expression has been obtained:
SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes. This is particularly notable because such a good approximation has been obtained without the non-linear fitting that is normally the last step of the SR method.
Example 2
As a second test, a more complex function comprising exponential, arithmetic, and power operators has been assumed for the boundary between the two classes. The function and ranges of the variables are:
After carrying out the six-step procedure in Sect. 6, the following expression has been obtained:
Again SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes, even without recourse to the non-linear fitting step.
Example 3
As the third test, a more complex function comprising trigonometric and arithmetic operators has been defined and 4% classification noise was added to the database. The function and ranges for the variables are:
After carrying out the six-step procedure in Sect. 6, the following expression has been obtained:
Again SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes, even without recourse to the non-linear fitting step. Figure 11 presents the results of this example in pictorial form.
1.2 Examples for three independent variables
Some examples considering equations with three independent variables are reported in this section.
Example 1
As a first test, a function comprising only arithmetic operators has been defined. The function and ranges for the variables are:
Initial Defined Function: \(y = x_{1} - x_{2} + x_{3}\)
Range of Variables: \(1 < x_{1} < 2;\quad 3 < x_{2} < 5;\quad 0 < x_{3} < 1\).
The final function obtained from the hypersurface points is:
Example 2
As a second test, a function comprising trigonometric and arithmetic operators has been defined. The function and ranges for the variables are:
Initial Defined Function: \(y = x_{1} + \sin\left( x_{2} \cdot x_{3} \right)\).
Range of Variables: \(1 < x_{1} < 2;\quad 3 < x_{2} < 5;\quad 0 < x_{3} < 1\).
The final function obtained from the hypersurface points is:
These results again confirm the great potential of the approach. The original function is recovered almost exactly already at the SR stage. With additional rounding of the results or the application of non-linear fitting, the original function can easily be recovered exactly.
1.3 Example for four independent variables
In this subsection, we describe the results of the application of the SVM-GP methodology to a more complex and noisy database. A five-dimensional synthetic database has been generated with the characteristics described in Table 6.
The procedure for finding the best sigma for the SVM has been applied; the best sigma for the classification is 0.6. The final classification accuracies for the training and test data are presented in Table 7.
After generating the grid and finding the hyper-surface points, SR via GP has been applied and the following expression for the hyper-surface has been obtained:
The obtained equation is in good agreement with the initial function. The quality of this estimate can be confirmed by comparing the success rate of the SVM with that of the equation found by SR via GP. The classification success rate of the equation found with SR is reported in Table 8 (to be compared with the SVM results reported in Table 7).
The comparison of the accuracies obtained with the SVM and with the proposed technique shows that the SVM-GP approach performs very well in interpreting the SVM hyper-plane as a hyper-surface equation, even for more complex databases and in higher dimensions.
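The success rate of a boundary equation can be computed as in the following minimal sketch, which turns an equation y = f(x) into a classifier by checking on which side of the surface each example lies. The helper names and the four labelled points are made up for illustration; only the form of the function, taken from the first three-variable example, comes from the text.

```python
def success_rate(predict, samples):
    """Fraction of (features, y, label) samples whose label is reproduced."""
    hits = sum(1 for x, y, label in samples if predict(x, y) == label)
    return hits / len(samples)

def equation_classifier(f):
    """Turn a boundary equation y = f(x) into a classifier: +1 above, -1 below."""
    return lambda x, y: 1 if y > f(x) else -1

# Boundary of the first three-variable example: y = x1 - x2 + x3.
recovered = lambda x: x[0] - x[1] + x[2]

# Toy labelled points, invented only to exercise the functions.
samples = [((1.5, 4.0, 0.5), -1.0, +1),
           ((1.2, 3.5, 0.2), -3.0, -1),
           ((1.8, 4.5, 0.9), -1.0, +1),
           ((1.1, 3.2, 0.1), -2.5, -1)]

rate = success_rate(equation_classifier(recovered), samples)
```

The same `success_rate` function can be applied to the predictions of the SVM, so that the two classifiers are scored on identical data.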
Appendix 2 Database of JET with a metallic wall
All experiments in JET campaigns C29 to C31 have been considered. After proper cleaning and validation of the database, 187 disruptive and 1020 non-disruptive shots are included overall, unless otherwise specified. The JET database with the ITER-like wall (ILW) has been used to implement the methodology described in this paper. In building the database, intentional disruptions have been excluded from the training. Only time slices in which the plasma current exceeds 750 kA have been considered; no other general selection has been applied. All the signals have been resampled at a frequency of 1 kHz. Alarms launched 10 ms or less before the beginning of the current quench are considered tardy, since 10 ms is the minimum time required on JET to undertake mitigation action. Alarms triggered more than 2.5 s before the beginning of the current quench are considered early.
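The alarm timing criteria can be expressed as a small function. This is a sketch under stated assumptions: times are taken as integer milliseconds (one sample per millisecond at 1 kHz), and the function name and the label "valid" for alarms that are neither tardy nor early are hypothetical.

```python
TARDY_LIMIT_MS = 10    # 10 ms or less before the current quench: tardy
EARLY_LIMIT_MS = 2500  # more than 2.5 s before the current quench: early

def classify_alarm(alarm_ms, quench_ms):
    """Classify an alarm by its lead time before the current quench.

    Both arguments are times in integer milliseconds. Alarms raised after
    the start of the current quench have a negative lead time and are
    therefore also counted as tardy.
    """
    lead_ms = quench_ms - alarm_ms
    if lead_ms <= TARDY_LIMIT_MS:
        return "tardy"
    if lead_ms > EARLY_LIMIT_MS:
        return "early"
    return "valid"
```

For example, `classify_alarm(9500, 10000)` corresponds to an alarm raised 500 ms before the current quench and falls in the "valid" window.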
Cite this article
Murari, A., Gelfusa, M., Lungaroni, M. et al. A systemic approach to classification for knowledge discovery with applications to the identification of boundary equations in complex systems. Artif Intell Rev 55, 255–289 (2022). https://doi.org/10.1007/s10462-021-10032-0
DOI: https://doi.org/10.1007/s10462-021-10032-0