1 Introduction

The classification of an object is a decision-making process whose outcome is the assignment of a specific class membership to the object under observation. Medical diagnosis (Mangasarian et al. 1995), chemistry (Jurs 1986), cybersecurity (Astorino et al. 2017), image processing (Khalaf et al. 2017) are only some of the possible application areas of classification.

Each object (sample) is characterized by a finite number of (quantitative and/or qualitative) attributes, usually referred to as the features.

Construction of a classifier is a supervised learning activity, where a dataset of samples whose class membership is known in advance is given as the input. The objective is to obtain a mathematical model capable of correctly classifying newly incoming samples whose class membership is, instead, unknown.

Classification deals, mainly, with the separation of sets of samples in the feature space, which is assumed to be \({\mathbb {R}}^n\). Whenever there are two classes, we are faced with a binary classification problem. In this paper, in fact, the training dataset is partitioned into two subsets, say \(\mathcal {A}\) and \(\mathcal {B}\), and the problem consists in finding a separation surface, if any, between them.

Most of the models rely on setting an appropriate optimization problem whose output is either a separating surface or a nearly separating one, resulting in the minimization of some measure of the classification error.

Starting from the pioneering works by Bennett and Mangasarian (1992) and Vapnik (1995), the hyperplane has been considered as the separation surface of choice, although the use of nonlinear separation surfaces has also been pursued by Rosen (1965), Plastria et al. (2014) and Astorino and Gaudioso (2005, 2009).

The literature on classification is huge. We cite Vapnik (1995), Cristianini and Shawe-Taylor (2000), Schölkopf et al. (1999) and Sra et al. (2011) as basic references in the support vector machine (SVM) framework and Thongsuwan et al. (2020) as a recent approach in deep learning.

It is well known (see, e.g., Cristianini and Shawe-Taylor (2000)) that if the convex hulls of the two sets \(\mathcal {A}\) and \(\mathcal {B}\) do not intersect, there exists a separating hyperplane such that the set \(\mathcal {A}\) is on one side of such hyperplane and \(\mathcal {B}\) is on the other side. It can be calculated by linear programming (Bennett and Mangasarian 1992), and the two sets are referred to as linearly separable. On the other hand, if \(\mathrm{conv}(\mathcal {A}) \cap \mathrm{conv}(\mathcal {B}) \ne \emptyset \), a number of algorithms can be adopted to determine a quasi-separating hyperplane such that the related error functions are minimized (see, for example, Mangasarian (1999)).

In this paper, we deal with binary classification based on the use of a polyhedral surface. The concept of polyhedral separability was introduced in Megiddo (1988) and applied within the classification framework in Astorino and Gaudioso (2002) and Astorino and Fuduli (2015).

Whenever, in fact, the two sets \(\mathcal {A}\) and \(\mathcal {B}\) are not linearly separable, it is possible to resort to polyhedral separation, that is, to determine \(h>1\) hyperplanes such that \(\mathcal {A}\) is contained in the convex polyhedron given by the intersection of h half-spaces and \(\mathcal {B}\) lies outside such polyhedron.

In Astorino and Gaudioso (2002), an optimization model was proposed to calculate a set of h hyperplanes generating a polyhedral separation, whenever possible, of the sets \(\mathcal {A}\) and \(\mathcal {B}\). The model consists, as usual in classification, in minimizing an error function to cope with the case when the two sets are not h-polyhedrally separable. In parallel with SVM, the model was extended in Astorino and Fuduli (2015) to accommodate margin maximization.

The error function adopted in Astorino and Gaudioso (2002) is neither convex nor concave, and it was dealt with by means of successive linearizations.

In this paper, we focus on the numerical treatment of the optimization problem to be solved in order to get a polyhedral separation surface. In particular, we fully exploit the DC (difference of convex) nature of the objective function (Hiriart-Urruty 1986), and thus, differently from Astorino and Gaudioso (2002), we adopt an algorithm designed to treat DC functions. The literature nowadays provides a wide set of efficient algorithms in this area.

In Pham Dinh (2005), Pham Dinh and Le Thi Hoai (2014), an iterative algorithm was introduced to minimize functions of the form \(f = f_1-f_2\), with \(f_1\) and \(f_2\) convex functions. The algorithm, called DCA, considers at iteration t the linearization \(f_2^{(t)}\) of the function \(f_2\) at the point \(x^t\) and determines the next iterate \(x^{t+1}\) as an optimal solution of the convex problem

$$\begin{aligned} \min _x f_1(x)-f_2^{(t)}(x) \end{aligned}$$
(1)
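
To fix ideas, the following toy sketch (in Python, a choice of ours, not anything taken from the cited works) applies the DCA scheme (1) to a one-dimensional DC function whose convex subproblem has a closed-form solution; the function and the starting point are purely illustrative assumptions.

```python
import numpy as np

# Minimal DCA sketch on a toy DC function (an illustrative assumption):
# minimize f(x) = f1(x) - f2(x) with f1(x) = x^2 and f2(x) = |x - 1|,
# both convex.  At each iteration f2 is replaced by its linearization
# at x_t (i.e., by a subgradient g_t of f2), and the convex subproblem
# min_x f1(x) - g_t * x is solved in closed form (x = g_t / 2).

def f1(x): return x ** 2
def f2(x): return abs(x - 1.0)

def subgrad_f2(x):
    # a subgradient of |x - 1|
    return 1.0 if x > 1.0 else -1.0

x = 3.0                      # starting point (arbitrary)
for t in range(50):
    g = subgrad_f2(x)        # linearize f2 at the current iterate
    x_new = g / 2.0          # argmin of x^2 - g*x
    if abs((f1(x_new) - f2(x_new)) - (f1(x) - f2(x))) < 1e-10:
        break                # no descent: critical point reached
    x = x_new

print("DCA critical point:", x, "objective:", f1(x) - f2(x))
```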

DCA has proven to be an efficient method to tackle DC problems, even non-smooth ones, and several artificial intelligence problems have been approached by means of the DCA (Astorino et al. 2010, 2012, 2014; Astorino 2014; Khalaf et al. 2017).

Several other methods have been proposed more recently in the literature (see Gaudioso et al. (2018) and Joki et al. (2017)), which also allow large-scale DC programs to be solved.

In this paper, we build on the model of Astorino and Gaudioso (2002) and adopt a decomposition of the error function for the h-polyhedral separability problem as the difference of two convex functions. We then apply the DCA and carry out extensive experimentation on several classes of benchmark instances.

The paper is organized as follows. In Sect. 2, we describe the h-polyhedral classification model and its reformulation as a difference of convex optimization problem. In Sect. 3, we describe how the DCA has been adapted to the DC reformulations. In Sect. 4, we present the results of our implementation on a number of benchmark classification problems. Some conclusions are drawn in Sect. 5.

2 The polyhedral separability model

Let \(\mathcal {A}=\{a_1, \dots , a_m\}\) and \(\mathcal {B}=\{b_1, \dots , b_k\}\) be two finite sets of \({\mathbb {R}}^n\).

Definition 1

The sets \(\mathcal {A}\) and \(\mathcal {B}\) are h-polyhedrally separable if there exists a set of h hyperplanes \(\{(v^j, \eta ^j)\}\), \(v^j \in {\mathbb {R}}^n\), \(\eta ^j \in {\mathbb {R}}\), \(j=1,\ldots ,h\), such that

$$\begin{aligned} \begin{aligned} a_i^Tv^j&< \eta ^j \quad \forall i=1, \dots , m, j=1, \dots , h\\ b_l^Tv^j&> \eta ^j \quad \forall l=1, \dots , k, \hbox { and at least one}\ j \in \{1, \dots , h\} \end{aligned} \end{aligned}$$
(2)

The following proposition gives an equivalent characterization of h-polyhedral separability:

Proposition 1

The sets \(\mathcal {A}\) and \(\mathcal {B}\) are h-polyhedrally separable if and only if there exist h hyperplanes \(\{(w^j, \gamma ^j)\}\) such that

$$\begin{aligned} \begin{aligned} a_i^Tw^j&\le \gamma ^j-1 \quad \forall i=1, \dots , m, j=1, \dots , h\\ b_l^Tw^j&\ge \gamma ^j +1\quad \forall l=1, \dots , k, \hbox { and at least one}\ j \in \{1, \dots , h\}. \end{aligned} \end{aligned}$$
(3)

Proof

[Astorino and Gaudioso (2002), Proposition 2.1] \(\square \)

Moreover, in Astorino and Gaudioso (2002) [Proposition 2.2] it is proven that a necessary and sufficient condition for the sets \(\mathcal {A}\) and \(\mathcal {B}\) to be h-polyhedrally separable (for some \(h \le |\mathcal {B}|\)) is given by

$$\begin{aligned} \mathrm{conv}(\mathcal {A}) \cap \mathcal {B} = \emptyset . \end{aligned}$$
(4)

Remark 1

The roles of \(\mathcal {A}\) and \(\mathcal {B}\) in (4) are not symmetric.
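
Condition (4) lends itself to a direct computational check: each point of \(\mathcal {B}\) can be tested for membership in \(\mathrm{conv}(\mathcal {A})\) by solving a linear feasibility problem. The sketch below, which relies on scipy.optimize.linprog and is not taken from the paper, illustrates the idea on a toy instance.

```python
import numpy as np
from scipy.optimize import linprog

def in_conv_hull(A, b):
    """Check whether point b lies in conv(rows of A) via an LP
    feasibility problem: find lambda >= 0, sum(lambda) = 1,
    A.T @ lambda = b (a standard test, not the paper's code)."""
    m, n = A.shape
    A_eq = np.vstack([A.T, np.ones((1, m))])          # n+1 equality rows
    b_eq = np.concatenate([b, [1.0]])
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.status == 0                            # feasible => b is in conv(A)

def polyhedrally_separable(A, B):
    """Condition (4): conv(A) must contain no point of B."""
    return not any(in_conv_hull(A, b) for b in B)

# toy example in R^2
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[2.0, 2.0], [0.2, 0.2]])                # second point lies inside conv(A)
print(polyhedrally_separable(A, B))                   # expected: False
```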

According to Proposition 1, a point \(a_i \in \mathcal {A}\) is well classified by the set of hyperplanes \(\{w^j, \gamma ^j\}\) if \(a_i^Tw^j - \gamma ^j + 1 \le 0\) for all \(j=1, \dots , h\). Therefore, we can compute the classification error of the point \(a_i\) with respect to \(\{w^j, \gamma ^j\}\) as

$$\begin{aligned} \max _{1 \le j \le h}\{0, a_i^Tw^j - \gamma ^j + 1\}. \end{aligned}$$
(5)

Analogously, a point \(b_l \in \mathcal {B}\) is well classified by the set \(\{w^j, \gamma ^j\}\) if \(\min _{1\le j \le h}\{-b_l^Tw^j + \gamma ^j +1\} \le 0\). Thus, the classification error of the point \(b_l\) is

$$\begin{aligned} \max \left\{ 0, \min _{1\le j \le h}\{-b_l^Tw^j + \gamma ^j +1\}\right\} . \end{aligned}$$
(6)

Given a set of h hyperplanes \(\{w^j, \gamma ^j\}\), we denote with \(W = [w^1: \cdots : w^h]\) the matrix whose jth column is the vector \(w^j\) and with \(\varGamma = (\gamma ^1, \dots , \gamma ^h)\) the vector whose components are the \(\gamma ^j\)’s. The classification error function for the h-polyhedral separability problem for the sets \(\mathcal {A}\) and \(\mathcal {B}\), with respect to the hyperplanes \(\{w^j, \gamma ^j\}\), is then given by

$$\begin{aligned} e(W, \varGamma ) := e_1(W, \varGamma )+e_2(W, \varGamma ), \end{aligned}$$
(7)

where

$$\begin{aligned} e_1(W, \varGamma ) := \frac{1}{m}\sum _{i = 1}^m \max _{1\le j \le h}\{\max \{0, a_i^Tw^j - \gamma ^j + 1\}\} \end{aligned}$$
(8)

and

$$\begin{aligned} e_2(W, \varGamma ) := \frac{1}{k}\sum _{l=1}^k\max \left\{ 0, \min _{1\le j\le h}\{-b_l^Tw^j + \gamma ^j + 1\}\right\} \end{aligned}$$
(9)

represent the errors for points of \(\mathcal {A}\) and \(\mathcal {B}\), respectively. Function \(e(W, \varGamma )\) is nonnegative and piecewise affine; \(e_1(W, \varGamma )\) is convex, and \(e_2(W, \varGamma )\) is quasi-concave; moreover, in Astorino and Gaudioso (2002) it has also been proven that the sets \(\mathcal {A}\) and \(\mathcal {B}\) are h-polyhedrally separable if and only if there exists a set of h hyperplanes \((W^*, \varGamma ^*)\) such that \(e(W^*, \varGamma ^*) = 0\) and, in that case, \(w^j=0\) for all \(j=1, \dots , h\) cannot be the optimal solution.
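
As an illustration, the error functions (7)–(9) can be evaluated with a few lines of vectorized code. The following Python sketch is only an illustrative helper, not the authors' implementation; it assumes that W is stored as an \(n\times h\) matrix whose columns are the \(w^j\) and that \(\varGamma\) is an h-vector.

```python
import numpy as np

def classification_error(W, Gamma, A, B):
    """Error function (7)-(9): W is n x h (column j = w^j), Gamma has h
    components; A is m x n, B is k x n.  A sketch, not the authors' code."""
    # (8): for each a_i, the worst violation over the h hyperplanes
    slack_A = A @ W - Gamma + 1.0            # m x h, entries a_i^T w^j - gamma^j + 1
    e1 = np.mean(np.max(np.maximum(slack_A, 0.0), axis=1))
    # (9): for each b_l, the best (smallest) violation over the h hyperplanes
    slack_B = -B @ W + Gamma + 1.0           # k x h, entries -b_l^T w^j + gamma^j + 1
    e2 = np.mean(np.maximum(np.min(slack_B, axis=1), 0.0))
    return e1 + e2, e1, e2
```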

In Astorino and Gaudioso (2002), the problem of minimizing the error function (7) is tackled by solving, at each iteration, a linear program providing a descent direction. Here, instead, we rewrite \(e(W, \varGamma )\) as difference of convex functions, and then, we address its minimization through ad hoc DC techniques.

To obtain such a reformulation, consider the following identity, which is valid for any set of h affine functions \(z^j(x)\), \(j=1, \dots , h\):

$$\begin{aligned} \begin{aligned} \max (0, \min _j z^j(x))&= \max (0, -\max _j(-z^j(x)))\\&= \max (0, \max _j(-z^j(x))) - \max _j(-z^j(x)). \end{aligned} \end{aligned}$$
(10)

By applying (10) to \(e_2(W, \varGamma )\), we obtain the following DC decomposition of \(e(W, \varGamma )\):

$$\begin{aligned} e(W, \varGamma ) = {\hat{e}}_1(W, \varGamma ) - {\hat{e}}_2(W, \varGamma ), \end{aligned}$$
(11)

where both

$$\begin{aligned} {\hat{e}}_1(W, \varGamma ) = e_1(W, \varGamma ) + \frac{1}{k}\sum _{l=1}^k \max \left\{ 0, \max _{1\le j \le h}(b_l^Tw^j - \gamma ^j-1)\right\} \end{aligned}$$
(12)

and

$$\begin{aligned} {\hat{e}}_2(W, \varGamma ) = \frac{1}{k}\sum _{l=1}^k \max _{1\le j \le h}(b_l^Tw^j - \gamma ^j - 1) \end{aligned}$$
(13)

are convex.
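
The identity \(e = {\hat{e}}_1 - {\hat{e}}_2\) can be verified numerically. The short check below, on randomly generated data and with the same matrix convention as above, is only a sanity-check sketch of ours.

```python
import numpy as np

# Numerical sanity check of the DC decomposition (11)-(13) on random
# data (W has the w^j as columns).  A verification sketch, not part of
# the authors' code.

rng = np.random.default_rng(0)
n, h, m, k = 4, 2, 30, 25
W, Gamma = rng.normal(size=(n, h)), rng.normal(size=h)
A, B = rng.normal(size=(m, n)), rng.normal(size=(k, n))

# original error (7): e = e1 + e2
e1 = np.mean(np.max(np.maximum(A @ W - Gamma + 1.0, 0.0), axis=1))
e2 = np.mean(np.maximum(np.min(-B @ W + Gamma + 1.0, axis=1), 0.0))

# DC components (12)-(13)
t = np.max(B @ W - Gamma - 1.0, axis=1)      # max_j (b_l^T w^j - gamma^j - 1)
e1_hat = e1 + np.mean(np.maximum(t, 0.0))    # (12), convex
e2_hat = np.mean(t)                          # (13), convex

print(np.isclose(e1 + e2, e1_hat - e2_hat))  # expected: True
```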

The DC decomposition (11) has already been discussed in Strekalovsky et al. (2015), where the authors suggested an algorithm that combines a local and a global search in order to find a global minimum of the error function. In the numerical experience we are going to discuss in the next sections, we confine ourselves to finding local minima of the error functions involved.

3 Exploiting the function structure in the DCA implementation

We have applied DCA to the minimization of the error function (11). Before discussing our experimental setting, we describe how DCA has been adapted to deal with polyhedral separation, which we then apply to a number of datasets from the classification literature.

At iteration t, given the current configuration \((W_t, \varGamma _t)\) of the h hyperplanes, we calculate for each \(l=1,\ldots ,k\) the index \(j_l\) at which the maximum in (13) is achieved:

$$\begin{aligned} j_l= \arg \max _{1\le j\le h}(b_l^Tw_{(t)}^{j}-\gamma _{(t)}^{j}- 1), \quad l=1, \dots , k \end{aligned}$$
(14)

and we define consequently the linearization of function \(\hat{e}_2\) at iteration t:

$$\begin{aligned} \hat{e}_2^t(W, \varGamma ) := \frac{1}{k}\sum _{l=1}^k(b_l^Tw^{j_l} - \gamma ^{j_l} - 1) \end{aligned}$$
(15)

which satisfies \({\hat{e}}_2^t(W, \varGamma ) \le \hat{e}_2(W, \varGamma )\). We observe that the choice of the index (14) amounts to selecting a subgradient \(g_t\in \partial {\hat{e}}_2(W_t, \varGamma _t)\), which yields the construction of (15), the linearization of \({\hat{e}}_2\) at \((W_t, \varGamma _t)\), according to the DCA scheme.
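
A possible implementation of steps (14)–(15) is sketched below; the helper name and the storage convention for W (columns equal to the \(w^j\)) are assumptions of ours.

```python
import numpy as np

# Sketch of steps (14)-(15): pick, for every b_l, the hyperplane index
# attaining the max in (13), then build the resulting affine minorant
# of e2_hat, which can be evaluated at any (W, Gamma).

def linearize_e2_hat(W_t, Gamma_t, B):
    j_l = np.argmax(B @ W_t - Gamma_t - 1.0, axis=1)      # (14), one index per b_l
    def e2_hat_t(W, Gamma):                               # (15), affine in (W, Gamma)
        rows = np.arange(B.shape[0])
        return np.mean((B @ W)[rows, j_l] - Gamma[j_l] - 1.0)
    return j_l, e2_hat_t
```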

Then, we consider the convexification of the original DC function:

$$\begin{aligned} e^t(W, \varGamma ) = \hat{e}_1(W, \varGamma ) - \hat{e}_2^t(W, \varGamma ), \end{aligned}$$
(16)

so that the next configuration \((W_{t+1}, \varGamma _{t+1})\) is obtained by solving the convex program

$$\begin{aligned} \displaystyle \min _{W, \varGamma } e^t(W, \varGamma ),\end{aligned}$$

which in turn can be put in the form of the following linear program, thanks to the introduction of the additional variables \(\xi _i\), \(i=1,\ldots ,m\) and \(\zeta _l\), \(l=1,\ldots ,k\):

$$\begin{aligned} (W_{t+1}, \varGamma _{t+1}) = \arg \min _{W, \varGamma , \xi , \zeta } \quad&\frac{1}{m}\sum _{i=1}^m \xi _i + \frac{1}{k}\sum _{l=1}^k\left( \zeta _l - b_l^Tw^{j_l} + \gamma ^{j_l} + 1\right) \\ \text {subject to} \quad&\xi _i \ge 0, \quad i=1, \ldots , m\\&\xi _i \ge a_i^Tw^j - \gamma ^j + 1, \quad j=1, \dots , h, \ i=1, \ldots , m\\&\zeta _l \ge 0, \quad l=1,\ldots ,k\\&\zeta _l \ge b_l^Tw^j - \gamma ^j - 1, \quad j=1, \dots , h, \ l=1,\ldots ,k. \end{aligned}$$
(17)
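
For small instances, the linear program (17) can be assembled and solved with an off-the-shelf LP solver. The following dense sketch uses scipy.optimize.linprog instead of the CPLEX library employed in the experiments; the variable ordering and helper names are our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def dca_lp_step(W_t, Gamma_t, A, B):
    """One DCA iteration for problem (11): build and solve the LP (17).
    A dense sketch for small instances, not the authors' implementation."""
    m, n = A.shape
    k, _ = B.shape
    h = W_t.shape[1]
    nW = n * h                       # variable order: vec(W), Gamma, xi, zeta
    nvar = nW + h + m + k
    j_l = np.argmax(B @ W_t - Gamma_t - 1.0, axis=1)      # index (14)

    # objective of (17) (constant terms dropped)
    c = np.zeros(nvar)
    c[nW + h: nW + h + m] = 1.0 / m                       # xi_i
    c[nW + h + m:] = 1.0 / k                              # zeta_l
    for l in range(k):
        j = j_l[l]
        c[j * n:(j + 1) * n] -= B[l] / k                  # -(1/k) b_l^T w^{j_l}
        c[nW + j] += 1.0 / k                              # +(1/k) gamma^{j_l}

    # inequality constraints, one block per hyperplane j
    rows, rhs = [], []
    for j in range(h):
        for i in range(m):            # a_i^T w^j - gamma^j - xi_i <= -1
            r = np.zeros(nvar)
            r[j * n:(j + 1) * n] = A[i]
            r[nW + j] = -1.0
            r[nW + h + i] = -1.0
            rows.append(r); rhs.append(-1.0)
        for l in range(k):            # b_l^T w^j - gamma^j - zeta_l <= 1
            r = np.zeros(nvar)
            r[j * n:(j + 1) * n] = B[l]
            r[nW + j] = -1.0
            r[nW + h + m + l] = -1.0
            rows.append(r); rhs.append(1.0)

    bounds = [(None, None)] * (nW + h) + [(0, None)] * (m + k)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    x = res.x
    W_new = x[:nW].reshape(h, n).T    # columns are the new w^j
    Gamma_new = x[nW:nW + h]
    return W_new, Gamma_new
```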

Summing up, the DCA-based algorithm for the minimization of (11) can be stated as follows:

[Algorithm: h-PolSepDC, the DCA-based scheme for the minimization of (11) — step 1: choose an initial configuration \((W_0, \varGamma _0)\) and set \(t=0\); step 2: compute the indices (14) and solve the linear program (17) to obtain \((W_{t+1}, \varGamma _{t+1})\); step 3: if \(e(W_{t+1}, \varGamma _{t+1}) = e(W_{t}, \varGamma _{t})\), stop, otherwise set \(t=t+1\) and return to step 2.]
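
A compact rendering of this scheme, built on the hypothetical helpers classification_error and dca_lp_step sketched earlier, could look as follows; the initialization and tolerance are illustrative choices of ours, not the authors' settings.

```python
import numpy as np

# Overall DCA loop for minimizing (11), using the hypothetical helpers
# classification_error and dca_lp_step sketched above.  The stopping
# test mirrors step 3 of the algorithm: stop when no descent occurs.

def h_pol_sep_dc(A, B, h, max_iter=100, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    W, Gamma = rng.normal(size=(n, h)), np.zeros(h)       # step 1: initial hyperplanes
    e_old = classification_error(W, Gamma, A, B)[0]
    for t in range(max_iter):
        W_new, Gamma_new = dca_lp_step(W, Gamma, A, B)    # step 2: solve LP (17)
        e_new = classification_error(W_new, Gamma_new, A, B)[0]
        if e_old - e_new <= tol:                          # step 3: no descent, stop
            break
        W, Gamma, e_old = W_new, Gamma_new, e_new
    return W, Gamma, e_old
```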

The above algorithm is a descent method. It is easy to verify that if \(e(W_{t+1}, \varGamma _{t+1}) = e(W_{t}, \varGamma _{t})\), then \((W_{t}, \varGamma _{t})\) is a critical point, i.e.,

$$\begin{aligned} \partial {\hat{e}}_1(W_t, \varGamma _t) \cap \partial {\hat{e}}_2(W_t, \varGamma _t)\ne \emptyset \,. \end{aligned}$$

Hence, this result provides the stopping criterion at step 3. For more theoretical details and the convergence theorem, see Pham Dinh and Le Thi Hoai (2014), Pham Dinh (2005).

Table 1 Datasets
Table 2 The number of variables/constraints
Table 3 Training and testing correctness percentage
Table 4 CPU time (secs)
Table 5 Numerical results

Following the SVM paradigm, with the aim of obtaining a good generalization capability, we have added to \(\hat{e}_1(W, \varGamma )\) in (11) the margin term:

$$\begin{aligned} \frac{1}{2}\sum _{j=1}^h \Vert w^j\Vert ^2, \end{aligned}$$
(18)

thus obtaining the following DC model:

$$\begin{aligned} {\bar{e}}(W, \varGamma ) = \left( \frac{C}{2}\sum _{j=1}^h \Vert w^j\Vert ^2+{\hat{e}}_1(W, \varGamma )\right) - \hat{e}_2(W, \varGamma ), \end{aligned}$$
(19)

where \(C>0\) is a trade-off parameter between the two objectives of maximizing the margin and minimizing the classification error. The minimization of (19) can be addressed by DCA, too. In this case, at each iteration we have to solve a quadratic program that differs from the linear program (17) only in the quadratic margin term (18). Consequently, the algorithmic scheme is unchanged, except that at step 2 a quadratic program has to be solved.
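
As an illustration of the modified step 2, the quadratic subproblem can be written down with a generic modeling tool. The sketch below uses cvxpy rather than the CPLEX/MATLAB setup of the experiments; the helper name and the default value of C are placeholders of ours.

```python
import numpy as np
import cvxpy as cp

def dca_qp_step(W_t, Gamma_t, A, B, C=1.0):
    """One DCA iteration for the margin model (19): the subproblem is the
    LP (17) plus the quadratic term (18).  Sketch using cvxpy as a generic
    QP solver (the paper's experiments rely on CPLEX instead)."""
    m, n = A.shape
    k, h = B.shape[0], W_t.shape[1]
    j_l = np.argmax(B @ W_t - Gamma_t - 1.0, axis=1)      # index (14)

    W = cp.Variable((n, h))
    gamma = cp.Variable(h)
    xi = cp.Variable(m, nonneg=True)
    zeta = cp.Variable(k, nonneg=True)
    cons = []
    for j in range(h):
        cons += [A @ W[:, j] - gamma[j] + 1 <= xi,        # error constraints on A
                 B @ W[:, j] - gamma[j] - 1 <= zeta]      # error constraints on B
    # linear part of the objective coming from the linearization (15)
    lin = sum(B[l] @ W[:, j_l[l]] - gamma[j_l[l]] for l in range(k)) / k
    obj = cp.Minimize(C / 2 * cp.sum_squares(W)           # margin term (18)
                      + cp.sum(xi) / m + cp.sum(zeta) / k - lin)
    cp.Problem(obj, cons).solve()
    return W.value, gamma.value
```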

Table 6 Percentage of correctness with h-PolSepDC (\(h\ge 2\))

4 Numerical experiments

In the numerical experiments, we have implemented two DCA codes:

  • h-PolSepDC, where we minimize (11) (the separation problem with no separation margin) by solving at each iteration the linear program (17);

  • h-PolSepDC-QP, where we minimize (19) (a margin has been accounted for) by solving at each iteration a quadratic program.

In particular, we have chosen \(h=2\) according to the hyperparameter tuning performed in Astorino and Gaudioso (2002).

Moreover, since the roles of the sets \({\mathcal {A}}\) and \({\mathcal {B}}\) are not symmetric in the definition of polyhedral separability, in the numerical experiments one has to decide which set plays the role of \({\mathcal {A}}\) and which plays the role of \({\mathcal {B}}\) in each dataset. We have taken as \({\mathcal {A}}\) the set with the smaller number of points, following, also on this issue, the rule adopted in Astorino and Gaudioso (2002).

We have used MATLAB R2015b calling the CPLEX library, on a 2.6 GHz Intel Core i7 processor under the OS X 10.12.6 operating system.

To evaluate the impact of the DC decomposition of the error function (7) with respect to the classic non-smooth optimization approach, we have reimplemented in MATLAB the algorithm proposed in Astorino and Gaudioso (2002) (code 2-PolSep). Finally, for the sake of completeness, we have also used the standard MATLAB SVM package to solve the linear separation classification problem (code SVM-LINEAR).

We have considered several test problems drawn from the binary classification literature, which are described in Table 1. In particular, all datasets are taken from the LIBSVM (Library for Support Vector Machines) repository (Chang and Lin 2011), except for g50c and g10n, which are described in Chapelle and Zien (2005).

For all datasets, we have performed a standard tenfold cross-validation protocol; Table 2 summarizes the LP/QP problems solved at each fold, in terms of the number of variables and constraints.
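
A hedged sketch of such a protocol is given below: it assumes binary labels with 0 denoting class \({\mathcal {A}}\), uses scikit-learn's KFold for the splits, assigns a test point to \({\mathcal {A}}\) when it falls inside the polyhedron, and relies on the hypothetical h_pol_sep_dc routine sketched earlier; none of these details are taken from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

# Tenfold cross-validation sketch (shuffling, seeding and the decision
# rule are assumptions of ours, not the authors' protocol details).

def tenfold_correctness(X, y, h=2, seed=0):
    def correctness(W, Gamma, Xs, ys):
        # assign to class A if the point lies in the polyhedron
        # intersection_j { x : x^T w^j <= gamma^j }
        inside = np.all(Xs @ W - Gamma <= 0.0, axis=1)
        return np.mean(inside == (ys == 0))               # label 0 = class A
    tr_acc, te_acc = [], []
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        A, B = X[tr][y[tr] == 0], X[tr][y[tr] == 1]
        W, Gamma, _ = h_pol_sep_dc(A, B, h)                # sketched earlier
        tr_acc.append(correctness(W, Gamma, X[tr], y[tr]))
        te_acc.append(correctness(W, Gamma, X[te], y[te]))
    return np.mean(tr_acc), np.mean(te_acc)
```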

For each approach, in the columns train and test of Table 3 we report the average percentage of training and testing correctness, respectively. The best results in terms of testing correctness have been underlined.

A preliminary tuning of the parameter C in the 2-PolSepDC-QP and SVM-LINEAR codes has been performed, and we have selected, for each dataset, the value optimizing the performance on the testing set.

Table 7 Percentage of correctness with h-PolSepDC-QP (\(h\ge 2\))

The numerical results indicate the good performance, in terms of correctness, of the DC-based approaches with respect to both the algorithm of Astorino and Gaudioso (2002) and standard SVM. In particular, the DC model equipped with margin maximization (code 2-PolSepDC-QP) has provided the best testing correctness on 13 out of 15 datasets. By comparing 2-PolSepDC and 2-PolSepDC-QP, we observe that the addition of the margin provides better testing correctness except on the galaxy and a9a datasets. On the other hand, 2-PolSepDC achieves better training correctness except on the Tic-Tac-Toe and letter-a datasets. This indicates that the classifier provided by 2-PolSepDC-QP has a higher generalization capability.

As for computation time (see Table 4), the DC decomposition has been much more effective than the 2-PolSep method. Moreover, the increase in computation time with respect to the single-hyperplane separation model SVM-LINEAR has not been particularly severe.

2-PolSep and 2-PolSepDC are different algorithms for solving the same problem, i.e., the minimization of (11). By comparing the objective function values obtained by the two codes when started from the same initial point, we note that the second approach provides better solutions, although the difference is not particularly significant.

Since in the definition of polyhedral separability the roles of the sets \({\mathcal {A}}\) and \({\mathcal {B}}\) are not symmetric, we also compare the results in terms of recall, specificity, precision and F1-score (see Table 5). The trend of these key performance indicators confirms the good quality of the DC-based approaches.

For completeness, we have also run both codes h-PolSepDC and h-PolSepDC-QP with \(h>2\). The running time is not dramatically larger, but the numerical experiments show no significant improvement in terms of correctness. Even worse, in some cases the improvement in training performance is not accompanied by an improvement in testing performance, which suggests that a large value of h, the number of hyperplanes, yields classifiers with a smaller generalization capability. Some of these results are reported in Tables 6 and 7.

5 Conclusions

We have adopted a difference-of-convex decomposition of the error function in polyhedral separation and have tackled the resulting optimization problem via the DCA algorithm.

The numerical results we have obtained demonstrate the good performance of the approach both in terms of classification correctness and computation time.

Future research will investigate the integration of feature selection (Gaudioso et al. 2017) and polyhedral separation, aimed at detecting a possibly smaller subset of significant attributes in terms of classification correctness.