A systemic approach to classification for knowledge discovery with applications to the identification of boundary equations in complex systems

Artificial Intelligence Review

Abstract

Classification, the discrimination between examples belonging to different classes, is a fundamental aspect of most scientific and engineering activities. Machine Learning (ML) tools have proved to perform very well in this task, in the sense that they can achieve very high success rates. However, both the “realism” and the interpretability of their models are low, leading to modest increases in knowledge and limited applicability, particularly in applications related to nonlinear and complex systems. In this paper, a methodology is described which, by applying ML tools directly to the data, allows new scientific models to be formulated that describe the actual “physics” determining the boundary between the classes. The proposed technique consists of a stack of different ML tools, each one applied to a specific subtask of the scientific analysis; together they form a system that combines all the major strands of machine learning, from rule-based classifiers and Bayesian statistics to genetic programming and symbolic manipulation. To take into account the error bars of the measurements generating the data, an essential aspect of scientific inference, the novel concept of the Geodesic Distance on Gaussian manifolds is adopted. The properties of the methodology have been investigated with a series of systematic numerical tests for different types of classification problems. The potential of the approach to handle real data has been tested with various experimental databases, built using measurements collected in the investigation of complex systems. The results obtained indicate that the proposed method can find physically meaningful mathematical equations, which reflect the actual phenomena under study. The developed techniques therefore constitute a very useful information processing system to bridge the gap between data, machine learning models and scientific theories.

Author information

Corresponding author

Correspondence to M. Lungaroni.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1 Numerical Tests for SR via GP to obtain realistic boundary equations

The procedure described in Sect. 6 has been subjected to a systematic series of numerical tests. A significant set of these tests is reported in this Appendix after a detailed description of the numerical method implemented.

The main technique to produce synthetic data and to test the methodology consists of the following six steps:

  1. Definition of an initial function for the boundary

  2. Generating samples of the two classes from the function

  3. Training the SVM for classification

  4. Building an appropriate mesh on the domain

  5. Determining a sufficient number of points on the hyper-surface identified by the SVM

  6. Deploying symbolic regression to identify the equation of the hyper-surface from the points previously obtained

In the rest of this Appendix, more details about this procedure are provided with the discussion particularised for the case of binary classification and traditional SVM.

In the first step, an initial function is defined as a combination of arithmetic, trigonometric, and exponential operators acting on the independent variables \(x_{i}\). In general, this function can be written as follows:

$$y = f\left( x_{1}, x_{2}, \ldots \right), \quad a_{1} < x_{1} < b_{1}, \quad a_{2} < x_{2} < b_{2}, \quad \text{etc.}$$

In the second step, an adequate number of random points in the valid range of the variables are generated. Then, a positive offset and some random values are added to y for half of the data to produce the first class; a negative offset and some random values are added to y for the other half to produce the second class. The equations for producing the two classes can be summarized as follows:

$$\begin{gathered} y_{1} = y + \text{noise of standard deviations} + \text{offset} \hfill \\ y_{2} = y + \text{noise of standard deviations} - \text{offset} \hfill \\ \end{gathered}$$

where y1 and y2 are the values for the first and second class, respectively.
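The first two steps can be sketched in a few lines of Python. This is only an illustrative sketch: the boundary function is that of Example 1 in Sect. 1.1, while the offset, the noise standard deviation, the sample size and the seed are assumed values that are not given in the text.

```python
import random

def boundary(x1, x2):
    # Boundary function of Example 1 in Sect. 1.1: y = x1 + x2 - x1*x2
    return x1 + x2 - x1 * x2

def make_two_classes(n, offset=0.5, sigma=0.05, seed=0):
    """Return a list of (x1, x2, y, label) samples.

    offset, sigma and seed are illustrative assumptions, not paper values.
    """
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        x1 = rng.uniform(-1.0, 1.0)   # -1 < x1 < 1
        x2 = rng.uniform(1.0, 2.0)    #  1 < x2 < 2
        y = boundary(x1, x2)
        # First half of the data: class +1 (positive offset);
        # second half: class -1 (negative offset).
        label = 1 if i < n // 2 else -1
        y_shifted = y + rng.gauss(0.0, sigma) + label * offset
        samples.append((x1, x2, y_shifted, label))
    return samples

data = make_two_classes(1000)
```

With an offset much larger than the noise, as assumed here, the two classes are cleanly separated by the initial function, which is consistent with the near 100% success rates quoted for the SVM in the next step.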

In the third step, an SVM with a Gaussian Radial Basis Function kernel is trained. The method used to find the separating hyperplane is Sequential Minimal Optimization. Depending on the level of random noise, different success rates can be obtained. For the numerical tests presented in the following, the classification success rate of the SVM is always very close to 100%.
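For reference, the textbook form of the decision function of an SVM with a Gaussian Radial Basis Function kernel is recalled below. The notation is an assumption introduced here and is not taken from the paper: \(\mathbf{x}_{i}\) are the support vectors, \(t_{i} = \pm 1\) their class labels, \(\alpha_{i}\) the multipliers found by the optimizer, \(b\) the bias, and \(\sigma\) the kernel width.

$$f\left( \mathbf{x} \right) = \sum\limits_{i = 1}^{N_{sv}} \alpha_{i} \, t_{i} \exp \left( - \frac{\left\| \mathbf{x} - \mathbf{x}_{i} \right\|^{2}}{2\sigma^{2}} \right) + b$$

In this notation, the separating hyper-surface sought in the following steps is the set of points where \(f\left( \mathbf{x} \right) = 0\).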

In the fourth step, a mesh on the domain has to be built in order to identify points sufficiently close to the hypersurface.

The fifth step consists of the identification of the points sufficiently close to the hypersurface, with the algorithm described in Sect. 6.
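The fourth and fifth steps can be illustrated with the following sketch. It does not reproduce the algorithm of Sect. 6: as an assumption, it builds a regular mesh on the (x1, x2) domain and, above each node, locates the zero of a decision function along y by bisection. The function `decision` is a hypothetical stand-in for the trained SVM (here the exact boundary of Example 1 in Sect. 1.1), and the mesh size and tolerance are arbitrary.

```python
def decision(x1, x2, y):
    # Stand-in for the trained SVM decision function (assumption):
    # positive on the first-class side, negative on the second-class side.
    return y - (x1 + x2 - x1 * x2)

def linspace(a, b, n):
    # n equally spaced values from a to b (both ends included)
    step = (b - a) / (n - 1)
    return [a + i * step for i in range(n)]

def boundary_points(f, x1_range, x2_range, y_range, n=11, tol=1e-6):
    """Locate one point with f close to 0 above each mesh node (x1, x2)."""
    points = []
    for x1 in linspace(*x1_range, n):
        for x2 in linspace(*x2_range, n):
            lo, hi = y_range
            if f(x1, x2, lo) * f(x1, x2, hi) > 0:
                continue  # no sign change: the surface is not crossed here
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                # keep the half-interval that still brackets the sign change
                if f(x1, x2, lo) * f(x1, x2, mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            points.append((x1, x2, 0.5 * (lo + hi)))
    return points

pts = boundary_points(decision, (-1.0, 1.0), (1.0, 2.0), (-2.0, 4.0))
```

The resulting list of (x1, x2, y) triples plays the role of the hyper-surface points passed to symbolic regression in the sixth step.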

In the sixth step, the selected hypersurface points are used as inputs to the symbolic regression code, to find the appropriate formula for describing the hypersurface. The settings adopted to run the GP implementing the SR are reported in Table 2.

1.1 Examples for two independent variables

Example 1

As a first test, a purely arithmetic function has been tested. The function and ranges of the variables are:

$$y = x_{1} + x_{2} - x_{1} \cdot x_{2}, \quad \text{where } -1 < x_{1} < 1 \text{ and } 1 < x_{2} < 2$$

After carrying out the six-step procedure described in Sect. 6, the following expression has been obtained:

$$y~ = ~1.011~\left( {~x_{1} ~ + ~x_{2} ~ - ~x_{1} ~ \cdot ~x_{2} ~} \right)$$

SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes. This is particularly notable since such a good approximation has been obtained without the non-linear fitting, normally the last step of the SR method.

Example 2

As a second test, a more complex function comprising exponential, arithmetic, and power operators has been assumed for the boundary between the two classes. The function and ranges of the variables are:

$$y = e^{\left( x_{1} \cdot x_{2} \right)^{0.5}}, \quad \text{where } 0 < x_{1} < 1 \text{ and } 1 < x_{2} < 3$$

After carrying out the six-step procedure in Sect. 6, the following expression has been obtained:

$$y~ = ~0.974~e^{{\left( {~x_{1} ~ \cdot ~x_{2} } \right)^{{0.5}} }}$$

Again, SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes, even without resorting to the non-linear fitting step.

Example 3

As the third test, a more complex function comprising trigonometric and arithmetic operators has been defined and 4% classification noise was added to the database. The function and ranges for the variables are:

$$y = ~\sin \left( {x_{1} } \right) + ~x_{2} ~\;{\text{where}}~ - 3~ < ~x_{1} ~ < 3;\;~ - 2~ < ~x_{2} ~ < 2$$

After carrying out the six-step procedure in Sect. 6, the following expression has been obtained:

$$y = ~0.985~\left( {~\sin \left( {x_{1} } \right)~ + ~x_{2} ~} \right)$$

Again, SR via GP converges on a final expression that is in excellent agreement with the initial function describing the boundary between the two classes, even without resorting to the non-linear fitting step. Figure 11 presents the results of this example in pictorial form.

Fig. 11 Points and surfaces of example 3 with two independent variables. The green rectangles are points generated from the initial function, the blue points are those belonging to the first class, the red points are those belonging to the second class, and the orange surface identifies the hyper-surface obtained with SR via GP

1.2 Examples for three independent variables

Some examples considering equations with three independent variables are reported in this section.

Example 1

As a first test, a function comprising only arithmetic operators has been defined. The function and ranges for the variables are:

Initial Defined Function: \(y~ = x_{1} ~~ - ~x_{2} ~~ + ~x_{3} ~\)

Range of Variables: \(1 < x_{1} < 2;\quad 3 < x_{2} < 5;\quad 0 < x_{3} < 1\).

The final function obtained from the hypersurface points is:

$$y~ = ~1.002~\left( {x_{1} - x_{2} + x_{3} } \right)$$

Example 2

As a second test, a function comprising trigonometric and arithmetic operators has been defined. The function and ranges for the variables are:

Initial Defined Function: \(y~ = x_{1} ~ + ~\sin ~\left( {~x_{2} \cdot ~x_{3} } \right)\).

Range of Variables: \(1 < x_{1} < 2;\quad 3 < x_{2} < 5;\quad 0 < x_{3} < 1\).

The final function obtained from the hypersurface points is:

$$y = ~0.98~\left( {~x_{1} ~ + ~\sin ~\left( {~x_{2} \cdot ~x_{3} } \right)~} \right)$$

Again, these results confirm the great potential of the approach. The original function is recovered almost exactly already at the SR stage. With additional rounding of the coefficients or the application of non-linear fitting, the original function can easily be recovered exactly.
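The effect of a final fitting step can be illustrated on Example 2 of this subsection. The sketch below keeps the functional form found by SR fixed and refits only the multiplicative constant by least squares; since a single scale factor enters linearly, the closed-form estimate stands in here for the non-linear fitting of the full method. The boundary points are synthetic, and their scatter (0.01), their number and the seed are assumptions.

```python
import math
import random

def g(x1, x2, x3):
    # Functional form returned by SR in Example 2 of Sect. 1.2
    return x1 + math.sin(x2 * x3)

def fit_scale(points):
    """Least-squares estimate of c in y = c * g(x1, x2, x3)."""
    num = sum(y * g(x1, x2, x3) for x1, x2, x3, y in points)
    den = sum(g(x1, x2, x3) ** 2 for x1, x2, x3, _ in points)
    return num / den

rng = random.Random(1)
points = []
for _ in range(500):
    # Ranges of Example 2: 1 < x1 < 2, 3 < x2 < 5, 0 < x3 < 1
    x1, x2, x3 = rng.uniform(1, 2), rng.uniform(3, 5), rng.uniform(0, 1)
    # Hypothetical hyper-surface points: true surface plus small scatter
    points.append((x1, x2, x3, g(x1, x2, x3) + rng.gauss(0.0, 0.01)))

c = fit_scale(points)
```

Under these assumptions the fitted constant lies very close to 1, which is the sense in which the original function can be recovered from an SR result such as the 0.98 prefactor above.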

1.3 Example for four independent variables

In this subsection, we describe the results of the application of the SVM-GP methodology to a more complex and noisy database. A five-dimensional synthetic database has been generated with the characteristics described in Table 6.

Table 6 Settings for testing SVM-GP on a five-dimensional synthetic database

The procedure for finding the best sigma for the SVM has been applied and the best sigma for the classification is equal to 0.6. The final accuracies of classification for the train and test data are presented in Table 7.

Table 7 The success rates of the SVM for the train and test data in the classification of the synthetic database with the best sigma, equal to 0.6

After generating the grid and finding the hyper-surface points, SR via GP has been applied and the following expression for the hyper-surface has been obtained:

$$y~ = ~0.9334~\sin ~\left( {0.9190~\left( {~x_{1} ~ + ~x_{2} ~} \right)} \right) - ~0.5010~x_{3} \cdot x_{4}$$

The equation obtained is in good agreement with the initial function. The quality of this estimate can be confirmed by comparing the success rate of the SVM with that of the equation found by SR via GP. The classification success rate of the equation found with SR is reported in Table 8 (to be compared with the SVM results reported in Table 7).

Table 8 The success rates obtained for the train and test data for the classification of the synthetic database with the expression obtained via SR

The comparison of the accuracies obtained via SVM and with the proposed technique shows that the SVM-GP approach performs very well in interpreting the SVM hyper-plane as a hyper-surface equation, even for more complex databases and in higher dimensions.
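The way an equation can be scored as a classifier is sketched below on Example 3 of Sect. 1.1, whose function, ranges and SR result are stated in full in the text. A sample is assigned to the first class if its y value lies above the surface and to the second class otherwise. The offset, the noise level, the sample size and the reading of the 4% classification noise as randomly flipped labels are all assumptions made for illustration.

```python
import math
import random

def true_boundary(x1, x2):
    return math.sin(x1) + x2            # initial function of Example 3

def sr_boundary(x1, x2):
    return 0.985 * (math.sin(x1) + x2)  # expression returned by SR via GP

def success_rate(boundary, samples):
    """Fraction of samples whose side of the surface matches their label."""
    hits = 0
    for x1, x2, y, label in samples:
        predicted = 1 if y > boundary(x1, x2) else -1
        if predicted == label:
            hits += 1
    return hits / len(samples)

rng = random.Random(0)
samples = []
for i in range(2000):
    x1, x2 = rng.uniform(-3, 3), rng.uniform(-2, 2)
    label = 1 if i % 2 == 0 else -1
    # offset 0.5 and noise 0.05 are illustrative, not values from the paper
    y = true_boundary(x1, x2) + rng.gauss(0.0, 0.05) + label * 0.5
    if rng.random() < 0.04:             # assumed form of the 4% label noise
        label = -label
    samples.append((x1, x2, y, label))

rate_true = success_rate(true_boundary, samples)
rate_sr = success_rate(sr_boundary, samples)
```

In this setting the two success rates are expected to be nearly identical, since a 1.5% error in the prefactor moves the surface by far less than the assumed offset between the classes.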

Appendix 2 Database of JET with a metallic wall

All experiments in JET campaigns C29 to C31 have been considered. After proper cleaning and validation of the database, 187 disruptive and 1020 non-disruptive shots are included overall, unless otherwise specified. The JET database with the ILW has been used to implement the methodology described in this paper. In building the database, intentional disruptions have been excluded from the training. Only time slices in which the plasma current exceeds 750 kA have been considered; no other general selection has been implemented. All the signals have been resampled at a frequency of 1 kHz. Alarms launched 10 ms or less before the beginning of the current quench are considered tardy, since 10 ms is the minimum time required on JET to undertake mitigation action. Alarms triggered more than 2.5 s before the beginning of the current quench are considered early.

Cite this article

Murari, A., Gelfusa, M., Lungaroni, M. et al. A systemic approach to classification for knowledge discovery with applications to the identification of boundary equations in complex systems. Artif Intell Rev 55, 255–289 (2022). https://doi.org/10.1007/s10462-021-10032-0

  • DOI: https://doi.org/10.1007/s10462-021-10032-0
