1 Introduction

In this paper we propose a new classification algorithm, called multi-dimensional F-transform classification (for short, MFC), in which the direct and inverse multi-dimensional fuzzy transforms are used to classify instances.

Our goal is to build a robust classification model based on the multi-dimensional fuzzy transform that integrates a K-fold cross-validation resampling method to overcome data over-fitting, which represents one of the major problems of classification models.

Classification tasks (Aggarwal 2014; Duda et al. 2001; Han et al. 2012; Hastie et al. 2013; Johnson and Wichern 1992; Mitchell 1997; Mitra et al. 2002; Nirmalraj and Nagarajan 2020) consist of assigning patterns to exactly one of a set of predefined categories; this general problem encompasses many diverse applications. Many techniques are used for data classification, such as probabilistic and decision-tree algorithms, rule-based and instance-based methods, support vector machines and neural network algorithms (Aggarwal 2014; Witten et al. 2016).

Classification algorithms require a training phase, in which the knowledge to be embedded in the related model is acquired so that patterns can be classified. A testing phase then follows, in which the performance of the algorithm is measured.

A good machine learning classification algorithm must be robust with respect to data under-fitting and over-fitting. Under-fitting occurs when the algorithm fails to fit the data sufficiently, producing high errors on both the training and the test data; its presence can be detected by measuring a performance index after the learning process. Over-fitting is the opposite problem: it occurs when the algorithm captures noise in the data, so that the training data are fitted very closely but the model generalizes poorly to unseen data. Detecting over-fitting is more complex. There are two main techniques that we can use to limit over-fitting in machine learning algorithms: measuring the accuracy on a separate validation dataset, or adopting a resampling technique based on random subsets of the data.

After selecting and tuning the machine learning algorithm on the training dataset, we can evaluate the learned models on the validation dataset and hence estimate how they would perform on unseen data.
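For illustration, the first technique amounts to comparing the misclassification rates on the training data and on a held-out validation set. The snippet below is a generic sketch of this comparison, not part of the MFC algorithm; the function name and arguments are ours.

```python
import numpy as np

def fit_diagnosis(y_train, y_train_pred, y_val, y_val_pred):
    """Compare training and validation misclassification rates:
    high errors on both suggest under-fitting, while a low training error
    together with a much higher validation error suggests over-fitting."""
    train_err = np.mean(np.asarray(y_train_pred) != np.asarray(y_train))
    val_err = np.mean(np.asarray(y_val_pred) != np.asarray(y_val))
    return train_err, val_err
```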

In Sect. 2 we give the preliminary concepts. In Sect. 3 we recall the multi-dimensional F-transform, and in Sect. 4 the MFC algorithm is presented. In Sect. 5 we present the results of our experiments: we apply the MFC algorithm to many datasets known in the literature, comparing it with other classification methods. Conclusions are given in Sect. 6.

2 The MFC algorithm

Now we model the input data as a collection of instances. Each instance is characterized by a pair (X, Y), where X is a set of numerical attributes (X1,…,Xs) and Y is a special attribute, designated as the class, which has C categories. We apply the fuzzy transform (for short, F-transform) algorithm (Perfilieva 2006) to find a relation between the attributes in the form:

$$Y = f\left( x_{1}, \ldots, x_{s} \right)$$
(1)

where f is a discrete function f: [a1,b1] × [a2,b2] × … × [as,bs] → Y \(\in\) {1,2,…,C} and x1 \(\in\) [a1,b1],…,xs \(\in\) [as,bs]. The F-transform algorithm is a technique used to approximate an original function f of which a priori only a finite number of values is known, while the expression of the function itself is unknown. The F-transform concept has been used in image processing (Di Martino and Sessa 2007, 2019a, b, 2020; Di Martino et al. 2008), data analysis (Di Martino et al. 2010, 2011; Di Martino and Sessa 2017a, b, 2019a, b, 2020; Novak et al. 2014; Formato et al. 2014; Perfilieva et al. 2008), time series analysis (Di Martino et al. 2011; Di Martino and Sessa 2017a, b; Johnson and Wichern 1992), forecasting problems (Di Martino et al. 2011; Di Martino and Sessa 2017a, b) and fuzzy approximation methods (Tanaka et al. 1982; Khastan et al. 2015, 2017).

In Lee and Yen (2004), a classification based on the dependency between attributes in datasets is proposed. Indeed, in the paper (Di Martino and Sessa 2017a, b) an attribute dependency based on the inverse multi-dimensional F-transform is presented as a regression function and in the work (Di Martino et al. 2011) as a prediction function in time series. In the algorithms based on F-transforms, a set of fuzzy partitions of the input data domains is created. An assigned family of ni fuzzy sets \(A_{{i1}} , \ldots ,A_{{in_{i} }}\):[ai, bi] → [0,1] is considered for each input variable. We say that the training dataset is sufficiently dense w.r.t. the fuzzy partitions if, for each combination of fuzzy sets of the input variables, there exists at least one instance with a positive membership degree. We illustrate an example in the case of two attributes defined in the universe of discourse [a1,b1] × [a2,b2] via two fuzzy partitions \(\left\{ {A_{{i1}} ,A_{{i2}} , \ldots ,A_{{in_{i} }} } \right\}\), i = 1,2. If xi1 = ai < xi2 < … < \(x_{{in_{i} }}\) = bi is a subdivision of [ai,bi], then the fuzzy sets Ai1,…,\(A_{{in_{i} }}\):[ai,bi] → [0,1], i = 1,2, must satisfy suitable properties in each interval [xi1,xi2],…,[\(x_{{i(n_{i} - 1)}}\),\(x_{{in_{i} }}\)], respectively (cf. Sect. 3); the instances of the training dataset are represented by the red points in the example of Fig. 1. In this example the data are not sufficiently dense w.r.t. the chosen fuzzy partitions: indeed, in the grey zone there are no instances p1i and p2j such that x1,i − 1 < p1i < x1,i + 1 and x2,j − 1 < p2j < x2,j + 1.

Fig. 1 Example of a training dataset not sufficiently dense w.r.t. the fuzzy partitions

The case of Fig. 1 is due to a too fine-grained fuzzy partition of the input variable domains (here, of the family of fuzzy sets A1j of the first attribute). The model fails to fit the output for input data lying in the grey zone of Fig. 1.

The F-transform attribute dependency algorithm (for short, FAD), presented in Di Martino and Sessa (2007), controls the presence of under-fitting by verifying whether the dataset is sufficiently dense with respect to the fuzzy partitions of the input variable domains. The FAD algorithm is an iterative process in which a uniform fuzzy partition with n fuzzy sets is created for each input variable. If the data are not sufficiently dense w.r.t. the fuzzy partitions, the process stops and the functional dependency cannot be found; otherwise the direct and inverse multi-dimensional F-transforms are calculated. Afterwards an index of determinacy, called MADMEAN (Di Martino and Sessa 2017a, b), is calculated; this index measures how well the functional model fits the data and ranges in [0,1]. The closer the MADMEAN value is to 1, the better the model fits the data. If MADMEAN is greater than a threshold value α, the process stops and the inverse multi-dimensional F-transform is used to approximate the functional dependency; otherwise a finer fuzzy partition is set with n := n + 1. In Fig. 2 we illustrate the FAD algorithm.

Fig. 2 FAD algorithm

However, even if the training data are sufficiently dense w.r.t. the fuzzy partitions, the presence of over-fitting cannot be prevented. Hence we propose a new iterative classification algorithm based on multi-dimensional F-transforms in which we apply the K-fold (K being a positive integer) cross-validation resampling algorithm to control the presence of over-fitting.

K-fold cross-validation is the most popular resampling technique: it allows the model to be trained and tested K times on different subsets of the training dataset and, moreover, it yields an estimate of the behaviour of a machine learning model on unseen data.

A performance analysis of this resampling technique is presented in the works (Wong 2015, 2017). In the paper (Wen et al. 2017), a new SVM K-fold cross-validation method is proposed to increase the efficiency of this approach.

A K-fold cross-validation algorithm works over K iterations. The sample dataset is partitioned into K folds (i.e., subsets). At each iteration one fold is used as the validation subset, while the remaining folds together constitute the training set, and the classification process is then performed. The main advantage of this method w.r.t. other resampling techniques is that each fold is used exactly once as the validation set. In Fig. 3 we schematize this technique for K = 4.

Fig. 3 Schema of a fourfold cross-validation technique
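As a concrete illustration (not taken from the paper), the folds can be built by shuffling the instance indices and splitting them into K blocks; each block is used once as the testing (validation) subset.

```python
import numpy as np

def kfold_indices(n_instances, K, seed=0):
    """Yield (train_idx, test_idx) index pairs for a K-fold cross-validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_instances)
    folds = np.array_split(idx, K)
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train_idx, test_idx

# The fourfold scheme of Fig. 3 on a toy dataset of 100 instances
for train_idx, test_idx in kfold_indices(100, K=4):
    print(len(train_idx), len(test_idx))   # 75 25 in each of the 4 iterations
```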

As in the FAD algorithm (Di Martino and Sessa 2017a, b), by using a K-fold cross-validation technique we can calculate two indices, CV1 (resp., CV2), for evaluating the performance of our algorithm. These indices are given by the average of the percentages of misclassified instances in the training data \({\text{CV}}_{{\text{1}}}^{{\text{k}}}\) (resp., testing data \({\text{CV}}_{{\text{2}}}^{{\text{k}}}\)) of the kth fold, k = 1,…,K:

$$CV_{1} = \frac{1}{K}\sum\limits_{{k = 1}}^{K} {{\text{CV}}_{{\text{1}}}^{{\text{k}}} }$$
(2)
$$CV_{2} = \frac{1}{K}\sum\limits_{{k = 1}}^{K} {{\text{CV}}_{{\text{2}}}^{{\text{k}}} }$$
(3)

We set a threshold α (resp., β) for CV1 (resp., CV2); then the dataset is partitioned into K folds, each formed by a training and a testing subset, and a set of fuzzy partitions is created for the input attributes, verifying that each training subset is sufficiently dense w.r.t. the set of fuzzy partitions. Then the direct F-transforms are calculated in each fold, and an instance is classified by means of the average of the inverse F-transforms calculated over the K folds. Strictly speaking, if \(f_{{n_{1} n_{2} \ldots n_{s} }}^{{F_{k} }} (p_{1} ,p_{2} , \ldots ,p_{s} )\) is the inverse F-transform of the kth fold evaluated at x1 = p1,…,xs = ps, where n1,…,ns are the numbers of fuzzy sets of the corresponding partitions, the variable Y of an instance of the kth training (resp., testing) subset, used to compute \({\text{CV}}_{{\text{1}}}^{{\text{k}}}\) (resp., \({\text{CV}}_{{\text{2}}}^{{\text{k}}}\)), is classified as

$$Y = f_{{n_{1} n_{2} \ldots n_{s} }} (p_{1} ,p_{2} , \ldots ,p_{s} ) = \frac{1}{K}\sum\limits_{{k = 1}}^{K} {f_{{n_{1} n_{2} \ldots n_{s} }}^{{F_{k} }} (p_{1} ,p_{2} , \ldots ,p_{s} )}$$
(4)

The class label assigned to the output variable is given by

$$y = \mathop {\arg \min }\limits_{{c = 1, \ldots ,C}} \left| {f_{{n_{1} n_{2} \ldots n_{s} }} (p_{1} ,p_{2} , \ldots ,p_{s} ) - c} \right|$$
(5)

that is, y is the class label in {1,…,C} nearest to the value of the averaged inverse F-transform. Then we calculate the two indices CV1 and CV2. If CV1 ≥ α or CV2 ≥ β, the process is iterated by considering a finer set of fuzzy partitions, otherwise the process stops. Summarizing, the MFC algorithm consists of the following steps:

  1. The number of folds K and the thresholds α and β for the two indices CV1 and CV2 are defined.

  2. The patterns formed by the input attributes X1,…,Xs and the output Y are extracted from the dataset. The input attributes are numerical and the output attribute contains the class label assigned to each instance.

  3. The set of instances is randomly partitioned into K folds of equal size, following the schema of Fig. 3.

  4. For each input attribute Xi, i = 1,…,s, a uniform fuzzy partition is created with the basic functions \(\left\{{A}_{i1},{A}_{i2}, \ldots ,{A}_{i{n}_{i}}\right\}\).

  5. For each fold we verify that the corresponding training subset is sufficiently dense w.r.t. the set of fuzzy partitions. If the sufficient density is not satisfied for at least one fold, the process stops and the classification is not found.

  6. For each fold the direct F-transform is calculated.

  7. For each fold the index \(CV_{1}^{k}\), k = 1,…,K, is calculated as the percentage of misclassified patterns in the training subset, where each pattern is classified by using the average (4) and the class assignment (5) (see the sketch after this list).

  8. For each fold the index \(CV_{2}^{k}\), k = 1,…,K, is calculated as the percentage of misclassified patterns in the testing subset, with the same classification rule.

  9. The indices CV1 and CV2 are calculated via Eqs. (2) and (3), respectively. If CV1 ≥ α or CV2 ≥ β, a finer fuzzy partition set is considered and the process returns to step 3.

  10. The classification is validated.
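For illustration, the classification rule (4)–(5) used in steps 7 and 8 can be written as follows. This is a minimal Python sketch; the function name and the rounding-based implementation of (5) are ours, not the paper's.

```python
import numpy as np

def classify_pattern(inv_ft_values, C):
    """Assign the class label nearest to the fold-averaged inverse F-transform
    value; inv_ft_values holds the K values f^{F_k}_{n1...ns}(p1,...,ps), one per fold."""
    f_avg = np.mean(inv_ft_values)              # Eq. (4): average over the K folds
    return int(np.clip(np.rint(f_avg), 1, C))   # Eq. (5): nearest label in {1,...,C}

# Example with K = 4 folds and C = 3 classes
print(classify_pattern([1.8, 2.2, 2.1, 1.9], C=3))   # prints 2
```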

In Fig. 4 we schematize the MFC algorithm.

Fig. 4 Schema of the MFC algorithm

3 Multi-dimensional F-transforms

Following the definitions and notations of Perfilieva (2006), let n ≥ 2 and x1, x2,…, xn be points of [a,b], called nodes, such that x1 = a < x2 < … < xn = b. The assigned family of fuzzy sets A1,…,An: [a,b] → [0,1], called basic functions, is a fuzzy partition of [a,b] if the following properties hold:

  (1) Ai(xi) = 1 for every i = 1,2,…,n;

  (2) Ai(x) = 0 if x \(\notin\) (xi−1, xi+1) for i = 2,…,n;

  (3) Ai(x) is a continuous function on [a,b];

  (4) Ai(x) strictly increases on [xi−1, xi] for i = 2,…, n and strictly decreases on [xi, xi+1] for i = 1,…, n − 1;

  (5) \(\sum\limits_{{i = 1}}^{n} {A_{i} (x) = 1}\) for every x \(\in\) [a,b].

We say that the fuzzy sets {A1(x),…,An(x)} form a uniform fuzzy partition if

  (6) n ≥ 3 and xi = a + h∙(i − 1), where h = (b − a)/(n − 1) and i = 1, 2,…, n (that is, the nodes are equidistant);

  (7) Ai(xi − x) = Ai(xi + x) for every x \(\in\) [0,h] and i = 2,…, n − 1;

  (8) Ai+1(x) = Ai(x − h) for every x \(\in\) [xi, xi+1] and i = 1,2,…, n − 1.

Avoiding the case of a continuous function studied in the paper (Perfilieva 2006) and considering the discrete case, which is applicable in many real situations, we assume that the function f:[a,b] → [0,1] takes assigned values at the points of P = {p1,…,pm}\(\subset\) [a,b]. If the set P = {p1,…,pm} is sufficiently dense w.r.t. the fixed partition {A1, A2,…, An}, that is, for each i = 1,…,n there exists an index v \(\in\) {1,…,m} such that Ai(pv) > 0, we define the n-tuple \((F_{1} ,F_{2} , \ldots ,F_{n} )\) as the direct F-transform of f w.r.t. {A1,A2, …, An}, where each Fi is given by

$$F_{{\text{i}}} = \frac{{\sum\limits_{{j = 1}}^{m} {f(p_{j} )A_{i} (p_{j} )} }}{{\sum\limits_{{j = 1}}^{m} {A_{i} (p_{j} )} }}$$
(6)

for i = 1,…,n. Similarly, we define the inverse F-transform of f w.r.t. {A1, A2,…, An} by setting

$$f_{{F,n}} (p_{j} ) = \sum\limits_{{i = 1}}^{n} {F_{i} } A_{i} (p_{j} )$$
(7)

for every j \(\in\){1,…,m}. We have the following theorem (Perfilieva 2006):

Theorem 1

Let f(x) be a function assigned on a set P = {p1,…, pm}\(\subset\) [a,b] and assuming values in [0,1]. Then for every ε > 0, there exists an integer n(ε) and a related fuzzy partition {A1, A2, …, An(ε)} of [a,b] w.r.t. which P is sufficiently dense. Moreover, for every pj \(\in\) [a,b], j = 1,…,m, the following inequality holds:

$$\left| {f(p_{j} ) - f_{{F,n(\varepsilon )}} \left( {p_{j} } \right)} \right| < \varepsilon$$
(8)
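For concreteness, the discrete direct and inverse F-transforms of Eqs. (6) and (7) can be computed as in the following sketch, which uses triangular basic functions for brevity (the raised-cosine functions of Eq. (13) in Sect. 4 would work equally well); the function names and the synthetic data are ours.

```python
import numpy as np

def triangular_partition(a, b, n):
    """Uniform fuzzy partition of [a, b] with n triangular basic functions;
    returns a function A(i, x) giving the membership of x in A_i."""
    nodes = np.linspace(a, b, n)
    h = (b - a) / (n - 1)
    def A(i, x):
        return np.maximum(0.0, 1.0 - np.abs(np.asarray(x, dtype=float) - nodes[i]) / h)
    return A

def direct_ft_1d(p, f_vals, A, n):
    """Direct F-transform components F_1,...,F_n of Eq. (6)."""
    return np.array([np.sum(f_vals * A(i, p)) / np.sum(A(i, p)) for i in range(n)])

def inverse_ft_1d(p, F, A, n):
    """Inverse F-transform of Eq. (7) evaluated at the points p."""
    return sum(F[i] * A(i, p) for i in range(n))

# Toy example: 50 sample points of a function with values in [0, 1]
p = np.linspace(0.0, 1.0, 50)
f_vals = 0.5 + 0.5 * np.sin(2 * np.pi * p)
A = triangular_partition(0.0, 1.0, 7)
F = direct_ft_1d(p, f_vals, A, 7)
approx = inverse_ft_1d(p, F, A, 7)
print(np.max(np.abs(f_vals - approx)))   # the error shrinks as n grows (Theorem 1)
```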

The above concepts can be extended to functions of s (≥ 2) variables. Let the Cartesian product [a1,b1] × [a2,b2] × … × [as,bs] of the s intervals [ai,bi], i = 1,…, s, be the universe of discourse. In the discrete case, the function \(f\left( {x_{1} , \ldots ,x_{s} } \right)\) assumes assigned values at m points (pj1,pj2,…,pjs) \(\in\) [a1,b1] × [a2,b2] × … × [as,bs], j = 1,…,m. We say that the set P = {(p11, p12,…,p1s), (p21, p22,…, p2s),…,(pm1, pm2,…,pms)} is sufficiently dense w.r.t. the chosen partitions \(\left\{ {A_{{11}} ,A_{{12}} , \ldots ,A_{{1n_{1} }} } \right\}\),…,\(\left\{ {A_{{s1}} ,A_{{s2}} , \ldots ,A_{{sn_{s} }} } \right\}\) if for any (h1,…,hs) \(\in\) {1,…,n1} × … × {1,…,ns}, there is some (pv1,pv2,…,pvs) \(\in\) P, v \(\in\) {1,…,m}, such that \(A_{{1h_{1} }} (p_{{v1}} ) \cdot A_{{2h_{2} }} (p_{{v2}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{vs}} ) > 0\). So we can define the (h1,h2,…,hs)th component \(F_{{h_{1} h_{2} \ldots h_{s} }}\) of the direct F-transform of f w.r.t. the basic functions \(\left\{ {A_{{11}} ,A_{{12}} , \ldots ,A_{{1n_{1} }} } \right\}\), …,\(\left\{ {A_{{s1}} ,A_{{s2}} , \ldots ,A_{{sn_{s} }} } \right\}\) as

$$F_{{h_{1} h_{2} \ldots h_{s} }} = \frac{{\sum\limits_{{j = 1}}^{m} {f(p_{{j1}} ,p_{{j2}} , \ldots p_{{js}} ) \cdot A_{{1h_{1} }} (p_{{j1}} ) \cdot A_{{2h_{2} }} (p_{{j2}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} )} }}{{\sum\limits_{{j = 1}}^{m} {A_{{1h_{1} }} (p_{{j1}} ) \cdot A_{{2h_{2} }} (p_{{j2}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} )} }}$$
(9)

Now we define the inverse F-transform of f w.r.t. the basic functions \(\left\{ {A_{{11}} ,A_{{12}} , \ldots ,A_{{1n_{1} }} } \right\}\),…,\(\left\{ {A_{{s1}} ,A_{{s2}} , \ldots ,A_{{sn_{s} }} } \right\}\) to be the following function by setting for each point (pj1, pj2,…,pjs) \(\in\) [a1,b1] × … × [as,bs]:

$$f_{{n_{1} n_{2} \ldots n_{s} }}^{F} (p_{{j1}} ,p_{{j2}} , \ldots ,p_{{js}} ) = \sum\limits_{{h_{1} = 1}}^{{n_{1} }} {\sum\limits_{{h_{2} = 1}}^{{n_{2} }} { \ldots \sum\limits_{{h_{s} = 1}}^{{n_{s} }} {F_{{h_{1} h_{2} \ldots h_{s} }} \cdot A_{{1h_{1} }} (p_{{j1}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} )} } }$$
(10)

for j = 1,…,m. The following theorem, which is an extension of Theorem 1, holds:

Theorem 2

Let \(f\left( {x_{1} , \ldots ,x_{s} } \right)\) be a function assigned on the set of points P = {(p11,p12, …,p1s),(p21, p22, …, p2s),…,(pm1, pm2, …,pms)} \(\subset\) [a1,b1] × [a2,b2] × … × [as,bs] assuming values in [0,1]. Then for every ε > 0, there exist integers n1(ε),…, ns(ε) and related fuzzy partitions

$$\left\{ {A_{{11}} ,A_{{12}} , \ldots ,A_{{1n_{1} (\varepsilon )}} } \right\}, \ldots ,\left\{ {A_{{s1}} ,A_{{s2}} , \ldots ,A_{{sn_{s} \left( \varepsilon \right)}} } \right\}$$
(11)

such that the set P is sufficiently dense w.r.t. (11). Moreover, for every (pj1, pj2,…,pjs) \(\in\) P, j = 1,…,m, the following inequality holds:

$$\left| {f(p_{{j1}} ,p_{{j2}} , \ldots ,p_{{js}} ) - f_{{n_{1} (\varepsilon ) n_{2} (\varepsilon ) \ldots n_{s} (\varepsilon )}}^{F} (p_{{j1}} ,p_{{j2}} , \ldots ,p_{{js}} )} \right| < \varepsilon$$
(12)
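The multi-dimensional case only changes the weights: each component is indexed by a tuple (h1,…,hs) and the membership degree of a data point is the product of its one-dimensional memberships, as in Eqs. (9) and (10). A minimal sketch follows; the basic functions are passed in as callables and the names are ours, not the paper's.

```python
import itertools
import numpy as np

def direct_ft_md(P, f_vals, A_list, n_list):
    """Direct multi-dimensional F-transform (Eq. 9).
    P: (m, s) NumPy array of data points; A_list[i](h, x): hth basic function of
    attribute i. Assumes P is sufficiently dense, so every denominator is positive."""
    F = {}
    for H in itertools.product(*[range(n) for n in n_list]):
        w = np.ones(len(P))
        for i, h in enumerate(H):
            w *= A_list[i](h, P[:, i])        # product of one-dimensional memberships
        F[H] = np.sum(f_vals * w) / np.sum(w)
    return F

def inverse_ft_md(P, F, A_list, n_list):
    """Inverse multi-dimensional F-transform (Eq. 10) evaluated at the points P."""
    out = np.zeros(len(P))
    for H in itertools.product(*[range(n) for n in n_list]):
        w = np.ones(len(P))
        for i, h in enumerate(H):
            w *= A_list[i](h, P[:, i])
        out += F[H] * w
    return out
```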

4 Proposed algorithm

Maintaining the notation introduced in Sect. 2, we consider a dataset formed by s attributes X1 = [a1,b1],…,Xi = [ai,bi],…,Xs = [as,bs], and a class attribute Y \(\in\) {1,…,C}. We try to find a dependency between the attributes in the form (1), where f is a discrete function f: [a1,b1] × [a2,b2] × … × [as,bs] → {1,…,C}.

In the MFC algorithm we apply the K-fold cross-validation algorithm by randomly partitioning the dataset into K folds having the same number of instances. Each fold is formed by a training subset with m = \(\frac{{K - 1}}{K} \cdot M\) instances and by a testing subset with \(\frac{M}{K}\) instances, where M is the number of instances in the original dataset. We schematize the training subset of a fold, with m instances O1,…,Oj,…,Om, in the following tabular form (Di Martino and Sessa 2017a, b):

 

|     | X1  | ⋯ | Xs  | Y  |
|-----|-----|---|-----|----|
| O1  | p11 | ⋯ | p1s | y1 |
| ⋮   | ⋮   |   | ⋮   | ⋮  |
| Oj  | pj1 | ⋯ | pjs | yj |
| ⋮   | ⋮   |   | ⋮   | ⋮  |
| Om  | pm1 | ⋯ | pms | ym |

where pji is the value of the attribute Xi for the instance Oj. Each attribute Xi can be considered as a numerical variable assuming values in the domain [ai,bi], where ai = min{p1i,…,pmi} and bi = max{p1i,…,pmi}. The parameters to be fixed are the number K of folds and the threshold α (resp., β) for CV1 (resp., CV2). We first construct a uniform fuzzy partition of the domain [ai,bi] of each input attribute, constituted by the following ni basic functions:

$$\begin{aligned} A_{i1}(x) &= \begin{cases} 0.5 \cdot \left( 1 + \cos \frac{\pi}{h_{i}}(x - x_{i1}) \right) & \text{if } x \in [x_{i1}, x_{i2}] \\ 0 & \text{otherwise} \end{cases} \\ A_{ij}(x) &= \begin{cases} 0.5 \cdot \left( 1 + \cos \frac{\pi}{h_{i}}(x - x_{ij}) \right) & \text{if } x \in [x_{i(j-1)}, x_{i(j+1)}] \\ 0 & \text{otherwise} \end{cases} \\ A_{in_{i}}(x) &= \begin{cases} 0.5 \cdot \left( 1 + \cos \frac{\pi}{h_{i}}(x - x_{in_{i}}) \right) & \text{if } x \in [x_{i(n_{i}-1)}, x_{in_{i}}] \\ 0 & \text{otherwise} \end{cases} \end{aligned}$$
(13)

where hi = (bi − ai)/(ni − 1), xij = ai + hi·(j − 1), i = 1,2,…,s and j = 1,2,…,ni. Initially we consider the coarsest-grained uniform fuzzy partition by setting n1 = n2 = … = ns = 3. Afterwards, in each fold we check that the data are sufficiently dense w.r.t. the fuzzy partitions, that is, we verify that for any combination of basic functions \(A_{{1h_{1} }} , \ldots ,A_{{sh_{s} }}\) there exists at least one instance Oj such that \(A_{{1h_{1} }} (p_{{j1}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} ) > 0\). If this condition is not satisfied for some fold, the process stops and the classification is not found; otherwise the direct F-transform is calculated in each fold.
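A possible implementation of the basic functions (13) and of the sufficient-density check is sketched below; the function names are ours, and the helpers plug directly into the direct_ft_md and inverse_ft_md sketches given in Sect. 3.

```python
import itertools
import numpy as np

def cosine_partition(a, b, n):
    """Uniform fuzzy partition of [a, b] with the n raised-cosine basic
    functions of Eq. (13); returns A(j, x) for j = 0,...,n-1."""
    h = (b - a) / (n - 1)
    nodes = a + h * np.arange(n)
    def A(j, x):
        x = np.asarray(x, dtype=float)
        inside = np.abs(x - nodes[j]) <= h   # support [x_{j-1}, x_{j+1}], clipped at a and b
        return np.where(inside, 0.5 * (1.0 + np.cos(np.pi * (x - nodes[j]) / h)), 0.0)
    return A

def is_sufficiently_dense(P, A_list, n_list):
    """True if for every combination of basic functions there is at least one
    instance of P (an (m, s) array) with a positive product of memberships."""
    for H in itertools.product(*[range(n) for n in n_list]):
        w = np.ones(len(P))
        for i, h in enumerate(H):
            w *= A_list[i](h, P[:, i])
        if not np.any(w > 0):
            return False
    return True
```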

By setting f(pj1,pj2,…,pjs) = yj for j = 1,2,…,m, the direct F-transform of f is constructed by using Eq. (9) in the following form:

$$F_{{h_{1} h_{2} \ldots h_{s} }}^{k} = \frac{{\sum\limits_{{j = 1}}^{m} {y_{j} \cdot A_{{1h_{1} }} (p_{{j1}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} )} }}{{\sum\limits_{{j = 1}}^{m} {A_{{1h_{1} }} (p_{{j1}} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{{js}} )} }}$$
(14)

where k = 1,…,K. Given a set of values p1,…,pi,…,ps for the input variables X1,…,Xi,…,Xs, we calculate the inverse F-transform \(f_{{n_{1} n_{2} \ldots n_{s} }}^{{F^{k} }}\) for each fold, defined as

$$f_{{n_{1} n_{2} \ldots n_{s} }}^{{F^{k} }} (p_{1} ,p_{2} , \ldots ,p_{s} ) = \sum\limits_{{h_{1} = 1}}^{{n_{1} }} {\sum\limits_{{h_{2} = 1}}^{{n_{2} }} { \ldots \sum\limits_{{h_{s} = 1}}^{{n_{s} }} {F_{{h_{1} h_{2} \ldots h_{s} }}^{k} \cdot A_{{1h_{1} }} (p_{1} ) \cdot \ldots \cdot A_{{sh_{s} }} (p_{s} )} } }$$
(15)

The algorithm is schematized below in the form of pseudocode. The function Classify() returns the class yf assigned to a pattern with input values p = (p1,…,ps) by using the inverse F-transform. The MFC algorithm compares the resulting class with the value y of the attribute Y assigned in the dataset in order to calculate the numbers of misclassified patterns \({\text{CV}}_{{\text{1}}}^{{\text{k}}}\) and \({\text{CV}}_{2}^{{\text{k}}}\) for the kth training and testing subsets, respectively, and hence the two indices CV1 and CV2.

Algorithm MFC

Arguments: α, β, K

Return value: Boolean (TRUE = "classification found", FALSE = "classification not found")

Input: dataset composed of M instances with input attributes X1, X2,…,Xs and class attribute Y

Output: direct F-transform components

    status := FALSE
    n1 := 3, …, ns := 3
    WHILE (status == FALSE)
        Partition the dataset randomly into K subsets, creating the K folds
        FOR i = 1 TO s
            Create a uniform fuzzy partition \(\{{A}_{i1},{A}_{i2},\dots ,{A}_{i{n}_{i}}\}\) of the interval [ai,bi] by using the basic functions (13)
        NEXT i
        FOR k = 1 TO K
            IF the kth training subset is not sufficiently dense w.r.t. the fuzzy partition product THEN
                RETURN status                        // "classification not found"
            ELSE
                Calculate the direct F-transform components by using Eq. (14)
            END IF
        NEXT k
        CV1 := 0
        CV2 := 0
        FOR k = 1 TO K
            \(CV_{1}^{k}\) := 0
            FOR each (p, y) in the kth training subset
                yf := Classify(C, p)                 // class assigned to the pattern p
                IF (yf != y) THEN
                    \(CV_{1}^{k}\) := \(CV_{1}^{k}\) + 1
                END IF
            NEXT (p, y)
            CV1 := CV1 + \(CV_{1}^{k}\) / mk         // mk = size of the kth training subset
            \(CV_{2}^{k}\) := 0
            FOR each (p, y) in the kth testing subset
                yf := Classify(C, p)                 // class assigned to the pattern p
                IF (yf != y) THEN
                    \(CV_{2}^{k}\) := \(CV_{2}^{k}\) + 1
                END IF
            NEXT (p, y)
            CV2 := CV2 + \(CV_{2}^{k}\) / tk         // tk = size of the kth testing subset
        NEXT k
        CV1 := CV1 / K
        CV2 := CV2 / K
        IF (CV1 ≥ α) OR (CV2 ≥ β) THEN
            Set greater values for n1,…,ns           // a finer set of fuzzy partitions is selected
        ELSE
            status := TRUE
        END IF
    END WHILE
    RETURN status                                    // "classification found"
    END

Function Classify

Arguments: C, p1,…,ps

Return value: class value calculated for the attribute Y

Input: direct F-transform components

    y := 1
    f := 0
    FOR k = 1 TO K
        Calculate the inverse F-transform \(f_{{n_{1} n_{2} \ldots n_{s} }}^{{F^{k} }} (p_{1} ,p_{2} , \ldots ,p_{s} )\) by using Eq. (15)
        f := f + \(f_{{n_{1} n_{2} \ldots n_{s} }}^{{F^{k} }} (p_{1} ,p_{2} , \ldots ,p_{s} )\)
    NEXT k
    f := f / K
    w := |f − 1|
    FOR c = 2 TO C
        IF |f − c| < w THEN
            y := c
            w := |f − c|                             // keep track of the smallest distance found so far
        END IF
    NEXT c
    RETURN y
    END
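To tie the pieces together, the pseudocode above can be rendered in Python roughly as follows. This is a hedged sketch, not the authors' implementation: it assumes the helper sketches given earlier (kfold_indices, classify_pattern, direct_ft_md, inverse_ft_md, cosine_partition, is_sufficiently_dense) are in scope, it treats CV1 and CV2 as misclassification rates, and it caps the refinement at an arbitrary n_max to guarantee termination.

```python
import numpy as np

def mfc_fit(X, y, C, K=10, alpha=0.02, beta=0.04, n_max=20):
    """Sketch of the MFC loop; X is an (M, s) NumPy array, y holds class labels
    in {1,...,C}. Returns the fold models and a predictor, or None if the
    classification is not found."""
    m, s = X.shape
    bounds = [(X[:, i].min(), X[:, i].max()) for i in range(s)]
    n = 3                                            # coarsest uniform partitions
    while n <= n_max:
        n_list = [n] * s
        A_list = [cosine_partition(a, b, n) for (a, b) in bounds]
        folds = list(kfold_indices(m, K))
        if not all(is_sufficiently_dense(X[tr], A_list, n_list) for tr, _ in folds):
            return None                              # classification not found
        models = [direct_ft_md(X[tr], y[tr], A_list, n_list) for tr, _ in folds]

        def predict(P):
            # fold-averaged inverse F-transform, then nearest class (Eqs. 4-5)
            vals = np.array([inverse_ft_md(P, F, A_list, n_list) for F in models])
            return np.array([classify_pattern(vals[:, j], C) for j in range(len(P))])

        cv1 = np.mean([np.mean(predict(X[tr]) != y[tr]) for tr, _ in folds])
        cv2 = np.mean([np.mean(predict(X[te]) != y[te]) for _, te in folds])
        if cv1 < alpha and cv2 < beta:
            return models, A_list, predict           # classification validated
        n += 1                                       # finer fuzzy partitions
    return None
```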

5 Tests and results

We now present our experiments on a sample of over 100 known datasets extracted from the UCI Machine Learning repository (http://archive.ics.uci.edu/ml/datasets.html) and from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository (http://sci2s.ugr.es/keel/category.php?cat=clas).

In each experiment we randomly select a set of instances to be stored in a testing dataset. The other instances are randomly partitioned into K folds. We set K = 10, α = 2%, β = 4%. After applying the MFC algorithm, we use the classifier on the testing dataset, obtaining the error as the percentage of misclassified instances.
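In terms of the hypothetical mfc_fit sketch given at the end of Sect. 4, this setup corresponds to a call such as the following (the variable names are illustrative only):

```python
# K = 10, alpha = 2%, beta = 4%; X_train, y_train are the randomly selected
# instances and num_classes is the number of classes of the dataset at hand.
model = mfc_fit(X_train, y_train, C=num_classes, K=10, alpha=0.02, beta=0.04)
```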

For brevity, we present the detailed results obtained only for four datasets: IRIS, balance-scale, banana and wine.

We now present the results for the IRIS dataset, which is formed by 150 instances and 5 attributes. The dataset consists of 50 samples from each of three species of iris plant: Iris setosa, Iris versicolor and Iris virginica. For each instance the length and the width of the sepals and petals are measured in cm. Based on the combination of these four features, the biologist R. Fisher (https://en.wikipedia.org/wiki/Iris_flower_data_set) developed a linear discriminant model to distinguish the species from each other. It is well known that only Iris setosa is well linearly separable from the other two. In Table 1 we show the attributes in the IRIS dataset.

Table 1 Attributes in the IRIS dataset

In the training (resp., testing) phase, each fold has m = 135 (resp., 15) instances. In Table 2, we show CV1 and CV2 obtained for each fold.

Table 2 CV1 and CV2 indices obtained for each fold and their average values

We use a testing dataset of 50 instances downloaded from the IRIS WEB page project (http://netcologne.dl.sourceforge.net/project/irisdss/IRIS%20Test%20data.txt). Table 3 shows that the MFC algorithm misclassifies only one instance.

Table 3 Test results obtained for a sample of 50 instances

Now we present the results obtained by applying the MFC algorithm on the balance-scale dataset: the attributes are the left weight, the left distance, the right weight and the right distance (Table 4).

Table 4 Attributes in the balance scale dataset

The dataset is composed of 625 instances. We randomly select 600 instances, storing the other 25 instances for testing the classification results. We set K = 10, so each fold contains 540 (resp., 60) training (resp., testing) instances. In Table 5, we show the results obtained for each fold.

Table 5 CV1 and CV2 indices obtained for each fold and their average values

We apply the classifier to the testing dataset of 25 instances. In Table 6, we show the results obtained: only one instance is misclassified.

Table 6 Test results obtained for a sample of 25 instances

In a third experiment we apply the MFC algorithm to the KEEL Banana dataset provided by the Fraunhofer Intelligent Data Analysis Group (https://www.iais.fraunhofer.de/). Each instance has three attributes: the first two represent the X and Y coordinates of the banana-shaped clusters, and the last attribute takes the values −1 and 1, representing the two banana shapes (Table 7).

Table 7 Attributes in the banana dataset

The dataset contains 5300 instances. We randomly select 300 instances for the final testing phase and randomly partition the other 5000 instances into ten folds. Each fold consists of 4500 (resp., 500) training (resp., testing) instances. In Table 8, we show CV1 and CV2 obtained for each fold and their averages. In Table 9, we show the results obtained on the testing dataset.

Table 8 CV1 and CV2 indices obtained for each fold and their average values
Table 9 Test results obtained for a sample of 300 instances

Finally, we show the results of the tests performed on the UCI machine learning Wine dataset. This dataset consists of 178 instances with 13 numerical features characterizing the chemical composition of Italian wines. Each wine comes from one of three different cultivars; the dataset is accordingly partitioned into three classes, identified as 1, 2 and 3, containing 59, 71 and 48 instances, respectively.

We randomly select 150 instances, storing the other 28 instances for testing the classification results, and construct K = ten folds from the selected instances. In Table 10, we show the values of CV1 and CV2 obtained for each fold and their averages; in Table 11, we show the results obtained on the testing dataset of 28 instances.

Table 10 CV1 and CV2 indices obtained for each fold and their average values
Table 11 Test results obtained for a sample of 28 instances

For completeness, we present a comparison of the MFC algorithm with classification algorithms implemented in the mining tool WEKA (Waikato Environment for Knowledge Analysis, https://www.cs.waikato.ac.nz/ml/weka/; Witten et al. 2016). For each algorithm, we set K = 10. In Table 12, we show the mean running time (in ms) and the testing errors obtained by applying the MFC algorithm and the classification algorithms Decision tree-based J48 (Bhargawa et al. 2013; Mitchell 1997), Multi-Layer Perceptron (Pal and Mitra 1992, 2004; Chaudhuri and Bhattacharya 2007), naive Bayes (Dimitoglou et al. 2012; Panda and Matra 2007) and Lazy K-Nearest Neighbor IBK (Aha 1997; Bhargawa et al. 2013; Jiang et al. 2007). In Table 13, we show the mean running time and the mean and standard deviation of the testing percentage errors obtained by applying the five classification algorithms to all the datasets used in our tests.

Table 12 Running time, misclassified instances and testing percentage errors obtained for the datasets IRIS, balance scale, banana and wine by using different classification algorithms
Table 13 Mean running time and mean and standard deviation testing percentage error obtained by using the 5 classification algorithms

These results show that the performance of the MFC algorithm, obtained by setting α = 2% and β = 4%, is comparable with that obtained by applying the Decision tree J48 and the Multilayer Perceptron algorithms. Moreover, MFC performs better than the Lazy IBK and naive Bayes algorithms, although with slightly higher running times.

In Table 14, we show the mean accuracy, precision and recall classification measures obtained by running the five algorithms on all the test datasets.

Table 14 Mean accuracy, precision and recall with 5 classification algorithms

These results show that MFC, in addition to producing accurate results, also has high precision and high sensitivity, comparable to those obtained using the Decision tree J48 and Multilayer Perceptron algorithms.

6 Conclusions

In this paper we propose a new classification algorithm based on direct and inverse F-transforms for machine learning data. We apply the K-fold cross-validation resampling algorithm to control the presence of over-fitting in the data. The MFC algorithm is an iterative algorithm in which the dataset is randomly partitioned into K subsets; the algorithm then calculates the performance indices \({\text{CV}}_{1}^{{\text{k}}}\) (resp., \({\text{CV}}_{{\text{2}}}^{{\text{k}}}\)) for each fold, corresponding to the percentage of misclassified instances in the training (resp., testing) subsets. Initially a coarse-grained fuzzy partition of the data domain is set: if one of the two averaged indices is greater than or equal to its threshold, a finer fuzzy partition of the data domain is set, otherwise the algorithm stops.

We compare the MFC algorithm with other known classification algorithms implemented in the mining tool WEKA. The results obtained for over 100 classification datasets show that the performance of the MFC algorithm is better than that obtained by using the naive Bayes and Lazy IBK algorithms and comparable with that obtained by the Decision tree J48 and the Multilayer Perceptron algorithms.

The MFC algorithm has a high computational complexity when used on large and high-dimensional datasets, and this is a shortcoming. In the future we intend to improve the algorithm to manage massive and multi-dimensional datasets. The inclusion of feature selection techniques in the preprocessing phase and the use of data compression methods would allow us to reduce the computational complexity of the search for the optimal fuzzy partition and of the calculation of the components of the multi-dimensional direct F-transforms.