1 Introduction

A qualitative analysis of multidimensional data is an effective and increasingly used tool for exploring various properties of real data. A variety of multidimensional visualization methods are applied for this purpose. Thanks to them, we can observe the searched data properties in the way most natural to us: through the sense of sight. In general, such a visualization consists in transforming the multidimensional space containing the data into a two-dimensional space representing the screen, in such a way that the data properties significant from the point of view of the conducted analysis are not lost. Multidimensional visualization, and the qualitative analysis utilizing it, attracts growing interest [1,2,3]. Various methods are used for such a visualization. A relatively new one is the method of perspective-based observational tunnels [4]; intuitively, it consists in a parallel (orthogonal) projection combined with the local use of perspective. The PCA method [5, 6] uses the orthogonal projection onto the two eigenvectors of the data set covariance matrix corresponding to the two eigenvalues with the largest absolute values. Multidimensional scaling [7, 8] constructs a mapping of the multidimensional space into a two-dimensional space in such a way that, for each pair of points, their mutual distance in the target space is as close as possible to their distance in the source space. The method of relevance maps [9, 10] transforms the multidimensional space into a two-dimensional image in such a way that the distance of the image of a given data vector from each of the specially created points \(P_i\) is as close as possible to the ith coordinate of the data vector to which the image refers. In the method of parallel coordinates [11, 12], n coordinate axes are placed parallel to each other, and each point is represented by a polyline crossing each axis at the position corresponding to the appropriate coordinate. Star graphs [13, 14] constitute a similar method, except that the n axes radiate outward from one point. Visualizations of multidimensional data are also performed using t-SNE and its variations [15, 16]. The main advantage of this method is that points which are close to each other in the multidimensional space remain close to each other after the reduction of the number of dimensions.

Neural networks are also used for the visualization of multidimensional data. In general, the initial data processing and the choice of parameters significantly affect the operation of neural networks. Therefore, apart from the basic models of neural networks, many strategies of designing and improving these networks have been developed. Many approaches to learning divide the data set into two parts: a training set and a validation set. By comparing the accuracy on the validation set with the accuracy on the training set, it can be determined whether the network requires further learning or is already over-learned [17]. The manual selection of network parameters is complicated and requires a great deal of experience, knowledge and effort; therefore, guidebooks devoted to this subject have been written [18, 19]. The automatic selection of network parameters is also possible [20]; however, it is computationally complex. Thus, from the practical point of view, the possibility of obtaining visual information presenting how the network operates inside [21] is valuable when designing the network. This makes it possible to improve the network operation, and even to alter its structure [22] and analyze the role of individual neurons [23]. The analysis through visualization is also carried out on convolutional neural networks (CNNs) [24, 25]. The visualization of the network’s hidden layers is also performed using PCA [26]. However, this type of visualization is different from the one performed in this paper, because its main goal is the optimization of the network operation, not the data analysis. For the visualization of multidimensional data, self-organizing neural networks (Kohonen maps, SOM), in which learning occurs without a teacher, are usually applied [27, 28]. In their case, the weights are modified during learning in such a manner as to additionally enhance the response of the winning neuron and its neighbors, the winning neuron being the one whose response to a given sample is the strongest. Autoassociative neural networks, in which one of the interlayers comprises two neurons serving for the visualization, are also applied [28,29,30]. Researchers are still improving the approaches utilizing networks for the visualization of multidimensional data [31,32,33,34]. There are also attempts to significantly accelerate the learning process of such networks [35]. A relatively new approach is the use of restricted radius basis function networks [36, 37], in which one part of the network is trained without a teacher and the other with a teacher. The significance and popularity of neural network studies result from the fact that the general idea of networks, and mechanisms similar to those in neural networks, including their different variations, are used for solving many real-world problems. One example is the search for free parking spaces, in which wireless sensor networks (WSNs) were used together with a gradient ascent method operating on a gradient field defined over the parking space structure [38]. Another example is the use of a self-learning interval type-2 fuzzy neural network (SLIT2FNN) for determining robot trajectories [39].

In this paper, learning criteria are applied which can be used in any artificial neural network (or modification thereof), provided that the network can comprise many layers and can operate both as a network trained with a teacher and, when applying the criterion comparing the network’s outputs with its inputs, as an autoassociative neural network. In each such network, all the analyzed criteria can therefore be applied, and an analogous comparison of the criteria analyzed in the paper can be made. The paper is thus focused on comparing the effectiveness of one adopted type of network combined with different criteria. Only the criterion of the autoassociative neural network constitutes the classic use of neural networks to visualize multidimensional data; the remaining criteria constitute a new approach to the use of neural networks for this purpose. From among the criteria presented in the paper, only the criterion of the autoassociative neural network was used in the author’s previous papers. Studies based on the remaining criteria are new, and the idea to utilize them results from the experience gained in the course of numerous studies. The effectiveness of methods utilizing the above criteria is compared here for the first time, using the criterion evaluating the readability of the qualitative analysis introduced in earlier papers.

2 Learning criteria

In the paper, neural networks applying different learning criteria were used for the qualitative analysis of multidimensional data through their visualization. For the analyzed n-dimensional data, the networks had n inputs, and one of the interlayers consisted of two neurons whose outputs were used directly for the visualization. All networks consisted of six layers in total: three layers transformed the n inputs into the two outputs of the interlayer used for the visualization, and three further layers transformed these two outputs into the appropriate number of network outputs, which depends on the assumed criterion. After the completion of learning, the data visualization consisted in providing each n-dimensional data vector to the n network inputs and displaying on the screen a point whose coordinates are equal to the values obtained from the two outputs of the interlayer used for the visualization. All networks were trained by the method of error back-propagation and differed only in the assumed learning criterion and the number of outputs resulting from it. The method of error back-propagation has been used successfully for many years, and many publications describe both its theoretical background and its practical application [40, 41].
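To make this architecture concrete, the following minimal C++ sketch shows one possible representation of such a network. The `Layer` structure and `buildNetwork` function are illustrative assumptions; in particular, the paper does not report the widths of the hidden layers other than the two-neuron interlayer, so the values used here are placeholders.

```cpp
#include <vector>

// One fully connected layer: weights[j][k] is the weight of the k-th input
// of neuron j, where index k = 0 holds the constant (bias) component w_{i,j,0}.
struct Layer {
    std::vector<std::vector<double>> weights; // [neurons][1 + inputs of the layer]
    std::vector<double> outputs;              // y_{i,j}, filled by the forward pass
};

// Six layers in total: the first three reduce the n inputs to the two-neuron
// interlayer (its width of 2 is fixed by the method); the last three expand
// the two interlayer signals to numOutputs, which depends on the criterion.
std::vector<Layer> buildNetwork(int n, int numOutputs) {
    const std::vector<int> widths = {10, 10, 2, 10, 10, numOutputs}; // hidden widths assumed
    std::vector<Layer> net(widths.size());
    int prev = n; // "layer 0" is the n-dimensional input vector itself
    for (std::size_t i = 0; i < widths.size(); ++i) {
        net[i].weights.assign(widths[i], std::vector<double>(prev + 1, 0.0));
        net[i].outputs.assign(widths[i], 0.0);
        prev = widths[i];
    }
    return net;
}
```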

2.1 The criterion of the autoassociative neural network

This is the classic example of an autoassociative neural network used for the visualization of multidimensional data, in which the number of outputs is equal to the number of inputs. Accordingly, for the analyzed data this network had n outputs. It is an example of self-organizing neural networks, whose learning occurs in an unsupervised way. The learning criterion is that the same value as the one provided at the ith input should appear at the ith output. If the network trains successfully, this means that the whole information from the n inputs has been compressed to the two outputs of the interlayer and then decompressed to the n network outputs. This means that, by utilizing the interlayer consisting of two outputs, it is possible to present the whole information concerning the analyzed data on a two-dimensional computer screen. A diagram of such a network is presented in Fig. 1.

Fig. 1

A diagram of the autoassociative neural network used for the visualization of multidimensional data. Three layers transform the n inputs into the two outputs of the interlayer used for the visualization; three further layers transform the two interlayer outputs into the n network outputs. The signals obtained from the interlayer consisting of two neurons constitute the screen coordinates of the image of the analyzed data vector

2.2 The criterion in which the network has outputs representing separate classes

With this criterion, the network has as many outputs as there are classes, and the learning criterion is such that the values at the outputs represent the separate classes. It is realized in such a way that one output is attributed to each class. If a sample belongs to a given class, then the value 1 should appear at the output attributed to this class and 0 at the remaining outputs. If the network trains successfully, this means that the whole information concerning the division into classes has been compressed to the two outputs of the interlayer and then processed to the appropriate number of network outputs. Such a network is an example of neural networks whose learning occurs in a supervised way. A diagram of such a network differs from the one presented in Fig. 1 only in that the last layer of the network comprises as many neurons as there are classes.

2.3 The criterion in which the network has one output whose values represent separate classes

With this criterion, the network has one output, and the learning criterion is such that the values of this output represent the separate classes. It is realized in such a way that a consecutive index is attributed to each class occurring in the data. If a sample belongs to a given class, then the value equal to the index attributed to this class divided by the number of classes should appear at the output. If the network trains successfully, this means that the whole information concerning the division into classes has been compressed to the two outputs of the interlayer and then processed to one network output. Such a network is an example of neural networks whose learning occurs in a supervised way. A diagram of such a network differs from the one presented in Fig. 1 only in that the last layer of the network comprises one neuron.

2.4 The criterion in which the network has one output representing a random value attributed to a given sample

The learning criterion is such that the value at the output is to be equal to a random value attributed to a given sample. Random values are attributed to the samples before learning starts. During the training of such a network, the whole information about the dependence between samples and the random values attributed to them is compressed to the two outputs of the interlayer and then processed to one network output. A diagram of such a network differs from the one presented in Fig. 1 only in that the last layer of the network comprises one neuron.
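The four criteria of Sects. 2.1–2.4 differ only in how the target values for the network outputs are constructed. The following C++ sketch summarizes this construction; the function name, the 1-based class indexing and the range of the random targets are illustrative assumptions.

```cpp
#include <random>
#include <vector>

enum class Criterion { Autoassociative, OneOutputPerClass, SingleClassOutput, RandomValue };

// Target vector z_m for one sample under each learning criterion.
// 'input' is the already-scaled data vector; 'classIndex' counts from 1.
std::vector<double> makeTarget(Criterion c, const std::vector<double>& input,
                               int classIndex, int numClasses, std::mt19937& rng) {
    switch (c) {
    case Criterion::Autoassociative:
        // Sect. 2.1: each output should reproduce the corresponding input.
        return input;
    case Criterion::OneOutputPerClass: {
        // Sect. 2.2: 1 at the output attributed to the sample's class, 0 elsewhere.
        std::vector<double> z(numClasses, 0.0);
        z[classIndex - 1] = 1.0;
        return z;
    }
    case Criterion::SingleClassOutput:
        // Sect. 2.3: one output equal to the class index divided by the class count.
        return {static_cast<double>(classIndex) / numClasses};
    case Criterion::RandomValue: {
        // Sect. 2.4: one output equal to a random value drawn once per sample
        // before learning starts (call once per sample and store the result).
        std::uniform_real_distribution<double> u(-0.9, 0.9); // range assumed
        return {u(rng)};
    }
    }
    return {};
}
```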

3 The algorithm

Training the network is the first step required to make the visualization of multidimensional data possible. At this stage, the weights of all neurons are computed. Before learning starts, the input data must be scaled so that they are contained within the range of the neuron output values. Because the hyperbolic tangent function was assumed at the output of each neuron, the output values are contained within the range \((-1, 1)\). The values of each data vector coordinate were therefore scaled to the range \((-0.9, 0.9)\). Before learning starts, the initial values of all weights of all neurons must also be determined. Each weight was attributed a random value from the range \((-0.5, 0.5)\). The network’s learning procedure, described in the form of pseudocode, is presented as Algorithm 1.

Algorithm 1 The network’s learning procedure
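The two preparatory steps (scaling and weight initialization) can be sketched as follows; the `Layer` structure comes from the sketch in Sect. 2, and the per-coordinate min-max scaling formula is an assumption, as the paper only states the target range.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Scale each coordinate of the data set linearly into (-0.9, 0.9), since the
// tanh activation bounds the neuron outputs to (-1, 1).
void scaleData(std::vector<std::vector<double>>& data) {
    if (data.empty()) return;
    for (std::size_t k = 0; k < data[0].size(); ++k) {
        double lo = data[0][k], hi = data[0][k];
        for (const auto& v : data) { lo = std::min(lo, v[k]); hi = std::max(hi, v[k]); }
        const double span = (hi > lo) ? hi - lo : 1.0;
        for (auto& v : data)
            v[k] = -0.9 + 1.8 * (v[k] - lo) / span; // maps [lo, hi] onto [-0.9, 0.9]
    }
}

// Draw every initial weight uniformly from (-0.5, 0.5).
void initWeights(std::vector<Layer>& net, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(-0.5, 0.5);
    for (auto& layer : net)
        for (auto& neuron : layer.weights)
            for (auto& w : neuron) w = u(rng);
}
```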

This whole procedure can be repeated many times in order to better train the neural network. As can be observed in Algorithm 1, at first, for the mth data vector, we calculate the output value of all neurons in the first layer:

$$\begin{aligned} y_{1,j}=g\left( w_{1,j,0}+ \sum _{k=1}^{n}w_{1,j,k}x_{k,m}\right) \end{aligned}$$
(1)

Here g denotes the assumed nonlinear function (the hyperbolic tangent was used in the conducted experiments), n the number of network inputs, and \(y_{1,j}\) the output value of the neuron placed in the first network layer at the jth position. Similarly, \(x_{k,m}\) denotes the kth coordinate of the mth input data set vector, \(w_{1,j,k}\) denotes the weight of the kth input of the neuron placed in the first network layer at the jth position, and the weight \(w_{1,j,0}\), for input number 0, denotes the additional constant component. This weight plays an important role in increasing the capabilities of each neuron, as it allows the neuron’s output to be calculated on the basis of not only the variables but also a constant. It plays a role analogous to, e.g., the coefficient c in the function \(y=ax^2+bx+c\).

Then, we calculate the output values of all neurons located in the remaining network layers. The outputs of a given layer can be calculated only after the output values of the previous layer’s neurons have been computed:

$$\begin{aligned} y_{i,j}=g\left( w_{i,j,0}+ \sum _{k=1}^{size(i-1)}w_{i,j,k}y_{i-1,k}\right) \end{aligned}$$
(2)

\(size(i-1)\) denotes the number of neurons in layer number \(i-1\), \(y_{i,j}\) denotes the output value of the neuron placed in the ith network layer at the jth position, \(w_{i,j,k}\) denotes the weight of the kth input of the neuron placed in the ith network layer at the jth position, the weight \(w_{i,j,0}\) has the same meaning as in formula (1), and g is the same nonlinear function as in formula (1).
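Since the input vector plays the role of "layer 0" (the convention \(y_{0,k}=x_{k,m}\) used later in formula (5)), formula (1) is simply the \(i=1\) case of formula (2), and the whole forward pass can be sketched in a single loop. The code below reuses the `Layer` structure assumed in Sect. 2.

```cpp
#include <cmath>
#include <vector>

// Forward pass implementing formulas (1) and (2): each neuron computes
// g(w_{i,j,0} + sum_k w_{i,j,k} * y_{i-1,k}) with g = tanh, where the
// "previous layer" of the first layer is the input vector x itself.
void forward(std::vector<Layer>& net, const std::vector<double>& x) {
    const std::vector<double>* prev = &x; // y_{0,k} = x_{k,m}
    for (auto& layer : net) {
        for (std::size_t j = 0; j < layer.outputs.size(); ++j) {
            double s = layer.weights[j][0]; // constant component w_{i,j,0}
            for (std::size_t k = 0; k < prev->size(); ++k)
                s += layer.weights[j][k + 1] * (*prev)[k];
            layer.outputs[j] = std::tanh(s);
        }
        prev = &layer.outputs;
    }
}
```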

In the next part of the algorithm, we calculate the errors of the network output. For this purpose, we calculate the difference between the values obtained at the outputs of the last network layer’s neurons and the values we should obtain. The values we should obtain depend directly on the assumed criterion:

  • for the autoassociative neural network, for each vector, they will be directly its coordinates, that is, the values provided to the network inputs.

  • for the criterion in which the network has outputs representing separate classes, they will be the combinations of zeros and ones resulting from the vector’s membership in a given class.

  • for the criterion in which the network has one output whose values represent separate classes, for each vector, it will be the value equal to the index of the class to which the vector belongs divided by the number of classes.

  • for the criterion in which the network has one output representing a random value attributed to a given sample, for each vector, this will be a random value attributed to this vector before the network’s learning starts.

We multiply the difference, obtained according to the selected criterion, by the derivative of the assumed function g, that is, by the derivative of the hyperbolic tangent function, and obtain:

$$\begin{aligned} \delta _{i,j}=\left( 1-y_{i,j}^2\right) \left( z_{j,m}-y_{i,j}\right) \end{aligned}$$
(3)

\(\delta _{i,j}\) denotes the value of the calculated error of the output of the neuron placed in the ith network layer at the jth position (in this formula, i denotes the number of the last network layer), \(y_{i,j}\) denotes the value of the output of the jth neuron from the ith layer, and \(z_{j,m}\) denotes the value, resulting from the criterion, related to the jth output for the mth input data set vector. Proceeding from the penultimate layer back to the first layer, we then calculate the errors of the outputs of the neurons in the remaining network layers:
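A direct transcription of formula (3), reusing the earlier `Layer` sketch; the factor \(1-y^2\) is the derivative of the hyperbolic tangent expressed through its own output value.

```cpp
#include <vector>

// Output-layer errors, formula (3): delta = (1 - y^2) * (z - y).
std::vector<double> outputDeltas(const Layer& last, const std::vector<double>& target) {
    std::vector<double> delta(last.outputs.size());
    for (std::size_t j = 0; j < delta.size(); ++j) {
        const double y = last.outputs[j];
        delta[j] = (1.0 - y * y) * (target[j] - y); // target[j] = z_{j,m}
    }
    return delta;
}
```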

$$\begin{aligned} \delta _{i,j}=\left( 1-y_{i,j}^2\right) \sum _{k=1}^{size(i+1)}\left( \delta _{i+1,k}w_{i+1,k,j}\right) \end{aligned}$$
(4)

where \(\delta _{i,j}\)—the value of the calculated error of the output of a neuron placed in the ith network layer at the jth position, \(w_{i+1,k,j}\)—the weight of the jth input of the kth neuron from layer \(i+1\), \(size(i+1)\)—the number of neurons in layer number \(i+1\), and \(y_{i,j}\)—the output value of a neuron placed in the ith network layer at the jth position. Based on the previously calculated errors, we modify weights of all network neurons:
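Formula (4) can be sketched in the same way; note that the weight index is shifted by one because index 0 stores the constant component, which receives no back-propagated error (an implementation detail assumed here, consistent with formula (4) summing only over neuron outputs).

```cpp
#include <vector>

// Hidden-layer errors, formula (4): each neuron's error is the tanh
// derivative times the sum of next-layer errors weighted by the weights
// w_{i+1,k,j} connecting this neuron's output to the next layer.
std::vector<double> hiddenDeltas(const Layer& layer, const Layer& next,
                                 const std::vector<double>& nextDelta) {
    std::vector<double> delta(layer.outputs.size());
    for (std::size_t j = 0; j < delta.size(); ++j) {
        double s = 0.0;
        for (std::size_t k = 0; k < nextDelta.size(); ++k)
            s += nextDelta[k] * next.weights[k][j + 1]; // w_{i+1,k,j}
        const double y = layer.outputs[j];
        delta[j] = (1.0 - y * y) * s;
    }
    return delta;
}
```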

$$\begin{aligned} \widetilde{w}_{i,j,k}=w_{i,j,k}+\eta \delta _{i,j}y_{i-1,k} \end{aligned}$$
(5)

where \(w_{i,j,k}\) denotes the weight of the kth input of the jth neuron from the ith layer, \(\delta _{i,j}\) denotes the value of the error of the output of the jth neuron from the ith layer, \(y_{i-1,k}\) denotes the output value of the kth neuron from layer \(i-1\), \(y_{0,k}\) denotes the kth coordinate of mth input data set vector \(x_{k,m}\), and \(\eta \) denotes the parameter specifying the learning rate.
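Finally, formula (5) updates every weight. In this sketch, the constant component (index 0) is updated with an input of 1, which is the usual convention and an assumption here, and for the first layer `prevOutputs` is the input vector itself, following \(y_{0,k}=x_{k,m}\).

```cpp
#include <vector>

// Weight update, formula (5): w <- w + eta * delta_{i,j} * y_{i-1,k}.
void updateWeights(Layer& layer, const std::vector<double>& delta,
                   const std::vector<double>& prevOutputs, double eta) {
    for (std::size_t j = 0; j < delta.size(); ++j) {
        layer.weights[j][0] += eta * delta[j]; // constant component, input = 1
        for (std::size_t k = 0; k < prevOutputs.size(); ++k)
            layer.weights[j][k + 1] += eta * delta[j] * prevOutputs[k];
    }
}
```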

After the completion of learning, we may proceed to obtain the view of the multidimensional data set. For each mth data vector, we calculate the values of the outputs of the subsequent neuron layers using formulas (1) and (2). We conduct these calculations up to the moment we obtain the values of the two outputs of the neurons belonging to the interlayer used for the visualization. These two values directly constitute the screen coordinates at which the image of the mth data vector should be drawn. In this way, we can draw on the screen the images representing all vectors belonging to the multidimensional data set. It can be observed that the presented algorithm, and the equations occurring in it, can be very easily parallelized, because a whole network layer can be calculated in parallel. In order to accelerate the operation of visualization of multidimensional spaces, the author has repeatedly applied both parallel programming, writing programs using CUDA (Compute Unified Device Architecture) on GPUs (graphics processing units), and hardware-based execution of algorithms (parallel stream processors specialized for visualization), creating hardware structures in VHDL (Very High Speed Integrated Circuits Hardware Description Language) and executing them on FPGAs (field-programmable gate arrays). However, in the case of the networks described in this paper and used for the visualization, no situation arose in the course of the conducted studies in which accelerating their operation would be worthwhile. The learning parameters were always selected in such a manner that the time needed for the sufficient training of the network was acceptable.
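The visualization step described above can be sketched as follows, with the two-neuron interlayer being the third layer (index 2 in 0-based indexing, matching the signals \((y_{3,1}, y_{3,2})\) referred to in Sect. 4); `forward` and `Layer` come from the earlier sketches.

```cpp
#include <vector>

struct Point2D { double x, y; };

// Project every (already scaled) data vector onto the screen by running the
// forward pass and reading the two interlayer outputs as coordinates.
std::vector<Point2D> project(std::vector<Layer>& net,
                             const std::vector<std::vector<double>>& data) {
    std::vector<Point2D> points;
    points.reserve(data.size());
    for (const auto& sample : data) {
        forward(net, sample);                                     // formulas (1) and (2)
        points.push_back({net[2].outputs[0], net[2].outputs[1]}); // (y_{3,1}, y_{3,2})
    }
    return points;
}
```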

4 Experimental results

Applying the above-mentioned criteria for neural network learning, seven-dimensional real data describing different energy classes of coal were first presented on the computer screen. These data had already been used multiple times before, including during the examination of numerous methods of qualitative analysis of multidimensional data through their visualization [42]. The complete data set was published earlier in [43]. According to the Polish classification, the data set consisted of 205 samples, of which 72 samples represented coal class 31 (energetic coal), 61 samples represented coal class 34.2 (semi-coking coal) and 72 samples represented coal class 35 (coking coal). Each sample was described by seven features: density, mass, combustion heat, ash content, sulfur content, volatile matter content and analytical moisture. Therefore, these data can be interpreted as a set placed in a seven-dimensional space of features. A special system was created to obtain the results presented in the paper; it was implemented on the basis of the algorithm presented in Sect. 3, in the C++ programming language. The purpose of the qualitative analysis of the presented data was to state whether samples belonging to different coal classes occupy separate subareas of the multidimensional space of features. This makes it possible to state whether the selected features are sufficient for the correct recognition of the coal class.

Figures 2, 3, 4, 5, 6 and 7 present the obtained views of the analyzed seven-dimensional data using different criteria for the neural network’s learning. In each of these figures, signals representing samples of coal class 31 are marked with a square (\(\square \)), samples of coal class 34.2 with a plus (+), and samples of coal class 35 with a circle (o). Figure 2 presents the result obtained using the learning criterion of the autoassociative neural network. The parameter ITER denotes the number of repetitions of the network’s learning. It can be observed in the figure that the signals \((y_{3,1}, y_{3,2})\) of the interlayer used for the visualization, being the responses to data representing samples of coal of a given class, accumulate in aggregations. It can be seen that these aggregations can be separated from each other. Classes 34.2 and 35 each formed two aggregations of points, while class 31 formed two aggregations of points and a third subarea of the figure occupied by two points. As a result, by using the analyzed criterion, we can indicate the possibility to divide the space of features into subareas occupied by different classes. It must be noted that in the case of autoassociative neural networks, information on the membership of data vectors in specific classes is not used during learning. In this situation, the way in which the signals going through the layer consisting of two neurons, representing a given class, are grouped depends only on certain properties of these data observed by the network.

Fig. 2

The view of the analyzed coal data obtained using the criterion for the autoassociative neural network’s learning with parameter ITER = 21004

Figure 3 presents the result obtained using the learning criterion in which the network has three outputs representing separate coal classes. It can be observed in the figure that the signals being the responses to data representing samples of coal of a given class accumulate in aggregations. Furthermore, it can be seen that these aggregations can be separated from each other, which makes the view significantly more readable than with the previous criterion. Each of the three classes formed one aggregation. As a result, by using the analyzed criterion, we can indicate the possibility to divide the space of features into subareas occupied by different classes in a way more readable than with the previous criterion. In the case of the used criterion, information on the membership of data vectors in specific classes is used during learning. We may thus determine whether the network has correctly learned to recognize the classes or not. Why, then, do we need to conduct the additional analysis concerning the possibility to divide the space of features into subareas occupied by different classes? It must be noted that in the case of such networks, the conclusions drawn from the obtained visualization can be independent of whether the learning of the network was successful or not. It may turn out that despite the lack of success in the network’s learning, the view from the interlayer will make it possible to indicate the possibility to easily separate the areas containing different classes. On the other hand, even when the neural network learns to correctly recognize the learning sequence, this does not mean that an easy separation of the areas containing different classes is possible; even then they can overlap. As a result, the fact of successful learning does not by itself allow one to state that the selected features are sufficient for correctly and easily distinguishing the areas occupied by different classes.

Fig. 3

The view of the analyzed coal data with the learning criterion in which the network has three outputs whose values are to represent separate classes of coal, with parameter ITER = 1000

Figure 4 presents the result obtained using the learning criterion in which the network has one output whose values are to represent separate coal classes. It can be observed in the figure that the signals being the responses to data representing samples of coal of a given class accumulate in aggregations. It can be seen that these aggregations can be separated from each other in a way more readable than with the autoassociative criterion. Each of the three classes formed one aggregation. As a result, by using the analyzed criterion, we can indicate the possibility to divide the space of features into subareas occupied by different classes in a way equally readable as with the previous criterion. The described criterion differs from the previous one in the number of network outputs with which the network learns to answer the question of which class it is dealing with. As can be seen, this does not cause any change in the readability of the results.

Fig. 4

The view of the analyzed coal data with the learning criterion in which the network has one output at which the appropriate values are to represent separate classes of coal, with parameter ITER = 4000

Figure 5 presents the result of the operation of the same network as in Fig. 4, but with different randomly drawn initial values of the network weights. Despite a different distribution of the point aggregations, the obtained view makes it possible to indicate the possibility to divide the space of features into subareas occupied by different classes in an equally readable way.

Fig. 5

The view of the analyzed coal data with the learning criterion and number of learning repetitions the same as in Fig. 4, but with different randomly drawn initial values of the network weights

The idea for the last criterion appeared as a result of experiments related to the visualization of multidimensional data. There were situations in which, in the course of the network’s learning, the obtained view became more readable and then less readable, and as a result of further learning became more readable again. Thus, the question appeared of what effect the degree of the network’s learning has on the readability of the visualization results. It turned out that we do not have to aim at the best possible training of the network to obtain correct conclusions from the qualitative analysis. This conclusion is inconsistent with the general approach to using neural networks, where the indicator of the correct network operation is usually the degree of its learning. However, in that general case, we want the network to correctly recognize objects belonging to the input space; therefore, in such a situation, the degree of the network’s learning (an indicator specifying the correctness of responses) is significant. The qualitative analysis using the visualization of multidimensional data is a completely different case. Here, the degree of the network’s learning does not matter at all; what matters is the indication of the possibility to divide the input space with respect to class membership. If at a given moment we can indicate this based on the obtained view, this means that a representation allowing such a division of the space exists, and the representation realized by the part of the network used for the visualization is precisely such a representation. The existence of this representation is thus independent of the degree of the network’s learning. It is possible to imagine a situation in which a network without any learning makes it possible to indicate the possibility to divide the space of features; an example of such a situation is presented later in the paper. In this situation, the criterion can only be a “pretext” that pushes the network in a specific direction, not necessarily one consistent with improving the readability of the visualization results. For the purposes of the qualitative analysis, it is sufficient that at some moment along the way the readability of the visualization results is satisfactory. For the same reason, the control of over-learning is of no significance. It can also happen that the network outputs are over-learned and yet the visualization layer still shows readable views. This may be caused by the fact that only the layers located after the visualization layer were subjected to over-learning, so their over-learning does not affect the readability of the results of the visualization layer. Therefore, in this type of application of the network, over-learning should be redefined as a situation in which the readability of the results of the visualization layer keeps deteriorating.

Figure 6 presents the view obtained using the learning criterion in which the network has one output whose value is to be equal to a random value attributed to a given sample. Random values are attributed to the samples before learning starts. It can be observed in the figure that the signals being the responses to data representing samples of coal of a given class accumulate in aggregations. These aggregations partially overlap; thus, based on the obtained view, we cannot indicate the possibility to separate them from each other. However, the mere fact that, with a criterion forcing random values at the network output, clear aggregations of points representing the same classes form in the picture is very interesting. How is it possible that with such a criterion we obtain any ordering at all? Let us analyze the results obtained in the next figures to explain this.

Fig. 6

The view of the analyzed coal data with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 3990

Figure 7 presents the view obtained as a result of the further learning of the network. It can be seen that the readability of the results deteriorated completely: the signals being the responses to data representing samples of coal of all classes are mixed up with each other. Based on the obtained view, it is not possible to observe any grouping of the signals being the responses to data representing samples of coal of the same class. Better results were not obtained as a result of further learning. Evidently, the network with the assumed criterion aims to distribute the points randomly in the space of signals. This is consistent with intuition: if the criterion is based on random values, then the space of signals will also tend toward random values. But why, during learning, before the network was trained more accurately, did we obtain views in which clearly ordered aggregations formed?

Fig. 7

The view of the analyzed coal data with the same learning criterion as in Fig. 6 obtained as a result of the further network’s learning with parameter ITER = 7000

Artificially generated seven-dimensional data were used to explain the raised questions. These data were prepared in such a way that points belonging to two classes were placed in separate subareas of the seven-dimensional space. Each of these subareas had the shape of a seven-dimensional cube. The criterion for the network’s learning was the same as in Figs. 6 and 7; that is, for each data vector, the value that is to appear at the network output as a result of learning was drawn at random. Figure 8 presents the view obtained for the artificially generated data before the network’s learning started. Consequently, at the moment of obtaining this view, the weights still had their random initial values. It can be seen that the view obtained using the network without any learning makes it possible to indicate the possibility to divide the space into subareas occupied by different classes. Figure 9 shows that the view becomes even more readable as a result of the network’s learning: the subareas occupied by different classes move away from each other. However, as a result of further learning, the situation deteriorates and remains so.

Fig. 8

The view of artificially generated seven-dimensional data with the same learning criterion as in Figs. 6 and 7 obtained before the network’s learning started, that is with parameter ITER = 0

Fig. 9

The view of the same artificially generated seven-dimensional data as in Fig. 8 with the same learning criterion, obtained as a result of the network’s learning, with parameter ITER = 100

Figure 10 presents the view obtained as a result of the further learning of the network. It can be observed that the signals being the responses to data representing different classes are mixed up with each other. Based on the obtained view, it is no longer possible to indicate the possibility to divide the space into subareas occupied by different classes. Better results were not obtained as a result of further learning. Let us repeat the question raised earlier: why, during the network’s learning, and even for the untrained network, did we obtain views in which clearly ordered aggregations formed? The answer follows directly from the network operation principle presented in Sect. 3. Let us take as an example the initial situation before the network’s learning, when the weights assume random values. It is sufficient that the sum of the coordinate values of the vectors of a given class differs significantly from the corresponding sum for the vectors of the remaining classes. Then, after these coordinates are provided to the inputs, multiplied by the initial random weights and summed, the results will also statistically differ from those for the remaining classes. Thus, values which for vectors of a given class statistically differ from those of other classes will appear at the outputs of the neurons of the first layer. These differences will be statistically maintained on the way through the next layers of neurons with random weights. Therefore, before the network’s learning starts, in exceptional cases, it is possible to obtain readable results of the qualitative analysis of multidimensional data. The described criterion is also proof that more accurate training of the network, even leaving aside the phenomenon of over-learning, does not have to increase the readability of the qualitative analysis results.

Fig. 10

The view of the same artificially generated seven-dimensional data as in Figs. 8 and 9 with the same learning criterion, obtained as a result of the network’s learning, with parameter ITER = 2000

Summing up the results of the analysis of the seven-dimensional data describing different classes of coal, it can be stated that the readability of the obtained results strongly depends on the applied criterion for the neural network’s learning. Up to this point in the paper, the readability of the obtained results was assessed intuitively. It can be observed that the separation of the areas occupied by signals belonging to different classes is significantly easier, e.g., in Fig. 3 than, e.g., in Fig. 2. It must be noted that the visualization of multidimensional data is used in this work solely to determine a specific property of these data, that is, to determine whether the given features make it possible to separate the space of features into subareas occupied by different classes. The possibility to answer this question is thus what matters from the point of view of the analysis conducted in the paper. It is not important, for example, how much of the information contained in the multidimensional data is retained in the obtained view. Moreover, it can be assumed that the more general information is retained, the less readable the result of the specific, targeted analysis can be. The example of the autoassociative neural network confirms such an assumption. After the successful training of the network, the complete information from the network inputs appears at its outputs. For this to happen, this complete information must pass through every layer of the network, and therefore through the layer used for the visualization (consisting of two neurons). However, as can be observed in Fig. 2, the view obtained using such a network is by no means the most readable among the obtained views. Therefore, to evaluate the effectiveness of the qualitative analysis, it is best to use a criterion evaluating solely the readability of this analysis. An example is the criterion, introduced in [42], which consists in drawing a curve separating the images of points belonging to different classes. The more complicated this curve is, the less readable is the view allowing to indicate the possibility to separate the subsets of points from each other. It was assumed that the curve consists of arcs, and the more complicated it is, the more inflection points it has; inflection points are the points joining arcs turning in different directions. This criterion was formulated in order to create a ranking of different methods of multidimensional visualization used for the qualitative analysis [42]. Assuming such a criterion, it can easily be noticed that in Figs. 3, 4, 5, 8 and 9, each class can be separated from the remaining ones by means of a curve without inflection points. Such views constitute the most readable result of the qualitative analysis that can possibly be obtained. This means that the use of any other method will not bring more readable views; thus, every other method can be at most as good as the above-mentioned one in the sense of the assumed criterion. It can also be easily noticed that in Fig. 2, a curve separating a given class from the remaining ones will have inflection points; thus, this view is less readable from the point of view of the assumed criterion. In the remaining figures, the classes cannot be separated at all; thus, they fall outside the criterion.

Fig. 11

The view of five-dimensional data obtained as a result of the printed text recognition, obtained using the criterion for the autoassociative neural network learning, with parameter ITER = 49

Five-dimensional real data obtained as a result of printed text recognition were also presented using the previously described criteria for neural networks’ learning. The way these data were obtained has already been described [44]. Each sample was created as a result of extracting five features; therefore, these data can be interpreted as a set placed in a five-dimensional space of features. This set consisted of 4810 samples representing all alphabet characters. An exemplary goal of the conducted analysis was to determine whether the samples representing the character “m” occupy a separate subarea of the multidimensional space of features. This makes it possible to state whether the five analyzed features are sufficient for the correct recognition of this character.

Figures 11, 12, 13 and 14 present the obtained views of the analyzed five-dimensional data using different criteria for the neural network’s learning. In each of these figures, signals representing the character “m” are marked with a circle (o) and signals representing the remaining characters with a square (\(\square \)). In Figs. 12 and 13, the signals representing the character “m” can be separated from the signals representing the remaining characters by a curve without inflection points. In Fig. 11, they can be separated by a curve with one inflection point. In Fig. 14, the signals representing the character “m” cannot be separated from the signals representing the remaining characters.

Figure 11 presents the result obtained using the learning criterion of the autoassociative neural network. It can be observed that the signals of the interlayer used for the visualization, being the responses to samples representing the character “m,” accumulated in one aggregation. It can be noticed that this aggregation can easily be separated from the signals representing the remaining characters. As a result, by using the analyzed criterion, we can state the possibility to separate a subarea of the space of features occupied solely by samples of the character “m.”

Similarly, Fig. 12 presents the result obtained using the learning criterion in which the network has two outputs. The value 1 obtained at the first output represents the occurrence of the character “m,” and the value 1 obtained at the second output represents the occurrence of the remaining characters. It can be observed in this figure, in a more readable way than in the previous one, that the signals being the responses to samples representing the character “m” can easily be separated from the signals representing the remaining characters. As a result, by using the analyzed criterion, we can state the possibility to separate a subarea of the space of features occupied solely by samples of the character “m” in a more readable way than by using the previous criterion.

Fig. 12

The view of five-dimensional data obtained as a result of the printed text recognition with the learning criterion in which the network has two outputs, the value of the first output representing the occurrence of character “m” and the value of the second output the occurrence of the remaining characters, with parameter ITER = 190

Figure 13 presents the result obtained using the learning criterion in which the network has one output, at which the value 1 represents the occurrence of the character “m” and the value 0 represents the occurrence of the remaining characters. It can be observed in this figure, in an equally readable way as in the previous one, that the signals being the responses to samples representing the character “m” can easily be separated from the signals representing the remaining characters. As a result, by using the analyzed criterion, we can state the possibility to separate a subarea of the space of features occupied solely by samples of the character “m” in an equally readable way as by using the previous criterion.

Fig. 13

The view of five-dimensional data obtained as a result of the printed text recognition with the learning criterion in which the network has one output at which value 1 represents character “m” and value 0 represents the remaining characters, with parameter ITER = 500

Figure 14 presents the view obtained using the learning criterion in which the network has one output whose value is to be equal to a random value attributed to each sample. It can be observed that the signals being the responses to samples representing the character “m” accumulated in an aggregation. This aggregation partially overlaps the area occupied by the signals representing the remaining characters. As a result, by using the analyzed criterion, we cannot state the possibility to separate a subarea of the space of features occupied solely by samples of the character “m.”

Fig. 14

The view of five-dimensional data obtained as a result of the printed text recognition with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 300

The additional set presented using the previously described criteria for neural networks’ learning consisted of 216-dimensional data. These data were created as a result of extracting 216 features (profile correlations) from images of hand-written digits and are publicly available [45] as part of the data set called “Multiple Features”. This set consists of 2000 samples representing all digits, with 200 samples per digit. An exemplary goal of the conducted analysis was to determine whether the samples representing digit 1 occupy a separate subarea of the multidimensional space of features. This in turn makes it possible to state whether the 216 analyzed features are sufficient for the correct recognition of this digit.

Figures 15, 16, 17 and 18 present the obtained views of the analyzed 216-dimensional data using different criteria for the neural network’s learning. In each of these figures, signals representing digit 1 are marked with a circle (o) and signals representing the remaining digits with a square (\(\square \)). In Figs. 16 and 17, the signals representing digit 1 can be separated from the signals representing the remaining digits by a curve without inflection points. In turn, in Figs. 15 and 18, the signals representing digit 1 cannot be separated from the signals representing the remaining digits.

Figure 15 presents the result obtained using the learning criterion of the autoassociative neural network. It can be observed that the signals being the responses to samples representing digit 1 partly accumulate in aggregations. In many places, these aggregations overlap the signals representing the remaining digits. As a result, the use of the analyzed criterion does not make it possible to state the possibility to separate a subarea of the space of features occupied solely by samples of digit 1.

Fig. 15

The view of 216-dimensional data representing hand-written digits, obtained using the criterion for the autoassociative neural network learning, with parameter ITER = 1000

Fig. 16

The view of 216-dimensional data representing hand-written digits, obtained with the learning criterion in which the network has two outputs, the value of the first output representing the occurrence of digit 1 and the value of the second output the occurrence of the remaining digits, with parameter ITER = 45

Figure 16 presents the result obtained using the learning criterion in which the network has two outputs. The value 1 obtained at the first output represents the occurrence of digit 1, and the value 1 obtained at the second output represents the occurrence of the remaining digits. It can be observed in the figure that the signals being the responses to samples representing digit 1 can easily be separated from the signals representing the remaining digits. As a result, by using the analyzed criterion, we can state the possibility to separate a subarea of the space of features occupied solely by samples of digit 1.

Figure 17 presents the result obtained using the learning criterion in which the network has one output, at which the value 1 represents the occurrence of digit 1 and the value 0 represents the occurrence of the remaining digits. It can be observed in this figure, in an equally readable way as in the previous one, that the signals being the responses to samples representing digit 1 can easily be separated from the signals representing the remaining digits. As a result, by using the analyzed criterion, we can state the possibility to separate a subarea of the space of features occupied solely by samples of digit 1 in an equally readable way as by using the previous criterion.

Fig. 17

The view of 216-dimensional data representing hand-written digits, obtained with the learning criterion in which the network has one output at which value 1 represents digit 1 and value 0 represents the remaining digits, with parameter ITER = 100

Figure 18 presents the view obtained using the learning criterion in which the network has one output whose value is to be equal to a random value attributed to each sample. It can be observed that the signals being the responses to samples representing digit 1 mostly overlap the area occupied by the signals representing the remaining digits. As a result, by using the analyzed criterion, we cannot state the possibility to separate a subarea of the space of features occupied solely by samples of digit 1.

Fig. 18

The view of 216-dimensional data representing hand-written digits, obtained with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 10

It must be noted that the methods compared in the paper can be used on any data set. The number of network inputs is equal to the number of dimensions of the analyzed set, and the length of the learning sequence is equal to the number of samples in the analyzed data set. When the analyzed data set changes, the number of network inputs and the length of the learning sequence change accordingly.

The greatest advantage of the qualitative analysis of multidimensional data through their visualization is the fact that some data properties can be plainly seen without any additional analysis. However, these are not always the properties we would like to learn about. Therefore, it was decided to verify the effectiveness of the criteria described in the paper in searching for assumed information. For this purpose, 20 sets of multidimensional data obtained from one of the publicly available repositories were randomly selected. Applying to each of these data sets the same analysis as the one conducted and described above, it was decided to verify the possibility to separate a subarea of the space of features occupied solely by samples of a randomly selected class. Of course, if the data contained information on only two classes, there was no need to select a class. If one of the dimensions in the data was an ordinal number, such a dimension was removed, because it frequently allowed the class to be recognized by itself.

Table 1 Results of the qualitative analysis of randomly selected 20 sets of multidimensional data obtained from one of publicly available repositories using different learning criteria

The obtained results are presented in Table 1. As can be seen, the data sets varied greatly. If there is ‘yes’ in the column related to the effectiveness of a given learning criterion, this means that, with its use, the visualization made it possible to state the possibility to separate a subarea of the space of features occupied solely by samples of a randomly selected class. If ‘no’ appears in a given place, this means that it was not possible to obtain a view allowing such a conclusion.

Fig. 19

The view of 15-dimensional Leaf data set, obtained with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 97

Fig. 20

The view of six-dimensional Acute Inflammations data set, obtained with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 181

As can be observed, the use of the learning criterion of the autoassociative neural network made it possible to obtain information on the possibility to divide the space of features in the case of six data sets. This means that this criterion provided the searched information for 30% of the sets. The best results were obtained in the case of the two learning criteria in which, as a result of learning, the information on the membership of a sample in a given class should appear at one output or at two outputs. With both of these criteria, information on the possibility to divide the space of features was obtained in the case of 12 data sets, which constitutes 60% of the analyzed sets. This means that each of these two criteria provided the searched information twice as often as the autoassociative neural network, which is an excellent and quite surprising result. The result obtained using the criterion in which the network has one output whose value is to be equal to the random value attributed to each sample is even more surprising. With this criterion, information on the possibility to divide the space of features was obtained in the case of three data sets, which constitutes 15% of the analyzed sets. This means that using such a seemingly absurd criterion provided the searched information only half as often as the autoassociative neural network. It is a very interesting result; therefore, Figs. 19, 20 and 21 present the views obtained using this criterion on those sets for which the information on the possibility to divide the space of features was obtained.

The best results were obtained with the use of the criteria in which, as a result of learning, the information on the membership of a sample in a given class should appear at one output or at several outputs. The views obtained with them turned out to be more readable than those obtained with the autoassociative neural network. However, in the previous paper, in which a ranking of selected methods of qualitative data analysis through visualization was developed, autoassociative neural networks occupied the first position [42]. Combining the above facts, it can be stated that the method using the criterion which turned out to be the best in this paper simultaneously becomes the best among all methods compared in that earlier ranking [42].

Fig. 21

The view of nine-dimensional Glass Identification data set, obtained with the learning criterion in which the network has one output at which the value is to be equal to a random value attributed to a given sample, with parameter ITER = 1196

5 Conclusions

With the use of the learning criterion of the autoassociative neural network, the signals being the responses to data representing samples of coal of a given class accumulated in aggregations. These aggregations could easily be separated from each other. Classes 34.2 and 35 each formed two aggregations of points, while class 31 formed two aggregations of points and a third subarea of the figure occupied by two points. As a result, by using the analyzed criterion, we can indicate the possibility to divide the space of features into subareas occupied by different classes. When the same criterion was used on the five-dimensional real data obtained as a result of printed text recognition, the possibility to divide the space of features into subareas occupied by different classes could also be stated. However, when the same criterion was used on the 216-dimensional real data representing hand-written digits, this possibility could not be stated.

The most readable views were obtained using the criteria with which, as a result of learning, information on the membership of a given sample in a given class should appear at one output or at several outputs. With these criteria, views were obtained in which each of the classes formed one aggregation. These aggregations could be separated from each other in a way significantly more readable than with the remaining criteria. As a result, by using these criteria, we can indicate the possibility to divide the space of features into subareas occupied by different classes in a way more readable than with the remaining criteria.

Even such a seemingly nonsensical learning criterion as the one in which the output value is to be equal to a random value attributed to a given sample can lead to obtaining information about the data.

More accurate training of the network, even leaving aside the phenomenon of over-learning, does not have to increase the readability of the qualitative analysis results; it may even deteriorate it. Therefore, in this kind of analysis, attention should not be paid to the degree of the network’s learning.

In exceptional cases, even a network without any learning (before the learning process starts) makes it possible to obtain readable results of the qualitative analysis of multidimensional data through their visualization.

From the point of view of the analysis conducted in the paper, it is not important how much of the information contained in the multidimensional data is retained in the obtained view. Moreover, it can be assumed that the more general information is retained, the less readable the result of the specific, targeted analysis can be. Therefore, to evaluate the effectiveness of the qualitative analysis, it is best to use a criterion evaluating solely the readability of this analysis.

During the analysis of 20 randomly selected sets of multidimensional data obtained from one of the publicly available repositories, interesting results were obtained. The use of the learning criterion of the autoassociative neural network made it possible to obtain information on the possibility to divide the space of features in the case of six data sets (30%). The same information was obtained in the case of 12 data sets (60%) with the two learning criteria in which, as a result of learning, the information on the membership of a sample in a given class should appear at one output or at two outputs. The most surprising result was given by the criterion in which the network has one output whose value is to be equal to the random value attributed to each sample: with this criterion, information on the possibility to divide the space of features was obtained in the case of three data sets (15%). This means that such a seemingly absurd criterion provided the searched information only half as often as the autoassociative neural network.