1 Introduction

The support vector machine (SVM) proposed by Vapnik et al. [1] is a machine learning method based on the Vapnik-Chervonenkis (VC) dimension theory and the structural risk minimization principle of statistical learning. It has been widely used in many fields, such as text recognition [2], feature extraction [3], multiview learning [4, 5], and sample screening [6]. The method uses the margin maximization strategy to find the optimal classification hyperplane that classifies the training data correctly with a sufficiently large certainty factor. This means that the hyperplane not only separates positive and negative samples but also classifies samples close to the hyperplane with a sufficiently large certainty factor. Therefore, the SVM has many advantages in the classification and prediction of unknown samples. In addition, the SVM is formalized as a convex quadratic programming problem. However, with the continuous growth of the scale of data, some problems with the SVM have been exposed. For example, the SVM has a high training complexity of O(m^3), where m is the number of training samples; it therefore has difficulty coping with the processing requirements of large-scale data, and its anti-noise performance is weak. To this end, researchers have proposed some improved methods. Mangasarian et al. [7] proposed the proximal support vector machine (PSVM). This algorithm uses equality constraints to transform the quadratic programming problem into a problem of solving linear equations, which improves the training speed. On this basis, they further proposed the generalized eigenvalue proximal support vector machine (GEPSVM) [8]. In their algorithm, by relaxing the parallelism constraint on the hyperplanes, the original problem is transformed into solving two generalized eigenvalue problems.

Inspired by the PSVM and GEPSVM, Jayadeva et al. [9] proposed the twin support vector machine (TWSVM), which minimizes the empirical risk and turns the single large quadratic programming problem into a pair of smaller quadratic programming problems. Unlike the GEPSVM, the TWSVM requires each hyperplane to be at least one unit away from the samples of the other class. The complexity of the algorithm is roughly 1/4 that of the SVM, which further saves computation time. It has been shown that the TWSVM has more favorable properties than the SVM in terms of computational complexity and classification accuracy. Since the TWSVM has excellent classification performance, it has attracted increasing attention, and many variants have been proposed. For example, the twin bounded support vector machine (TBSVM) proposed by Shao et al. [10] adds a regularization term to the objective function, which further improves the generalization performance of the TWSVM. In addition, researchers have proposed different twin support vector machines by exploring the structures of datasets and the different roles of samples [11,12,13,14,15,16,17,18].

As described above, support vector machine models are mainly built with the hinge loss function, which is sensitive to noise and unstable under resampling. To cope with these problems, Huang et al. [19] introduced the pinball loss into the support vector machine and proposed the Pin-SVM, which handles noise sensitivity and resampling instability better than previous SVMs. Subsequently, researchers introduced the pinball loss into twin support vector machines and proposed different versions [20,21,22]. However, these models still need to solve two quadratic programming problems in the dual space. Regarding solving the model in the original space, Mangasarian et al. [23] studied the support vector machine with a smoothing technique and proposed the smooth support vector machine (SSVM). Moreover, researchers applied the smoothing technique to twin support vector machines and proposed several corresponding algorithms [24,25,26,27]. Nevertheless, the twin support vector machines mentioned above are mainly solved in the dual space to obtain the required decision surfaces, and even the twin support vector machines that are solved in the original space are still based on the hinge loss function.

The contributions of this paper are as follows:

  1. We introduce the pinball loss into the twin bounded support vector machine to construct a twin bounded support vector machine model with pinball loss, which overcomes the noise sensitivity and resampling instability of the hinge loss function.

  2. To handle the nondifferentiable term in the objective function of the model, we construct a smooth approximation function for the pinball loss function so that the model can be solved directly in the original space. On this basis, the smooth twin bounded support vector machine with pinball loss is proposed.

  3. A proof of the convergence of the algorithm is given theoretically.

  4. We select UCI datasets and artificially generated datasets to validate the presented algorithm and compare its performance with representative algorithms.

The remainder of this paper is organized as follows. Section 2 recalls three kinds of support vector machine models and analyzes the pinball loss function; the corresponding smooth approximation function for the pinball loss is also constructed. Section 3 introduces the smooth twin bounded support vector machine model with pinball loss, solves the model with the Newton-Armijo method, and on this basis proposes the smooth twin bounded support vector machine algorithm with pinball loss. Section 4 provides the theoretical analysis and proof of convergence for the presented algorithm. Experiments are conducted on UCI datasets and artificially generated datasets in Section 5, where we also discuss another approximation function for the pinball loss based on the integral of the sigmoid function and report experiments on a toy dataset; the detailed derivations are provided in the appendix. Finally, the conclusion is given.

2 Related works

In this section, we briefly review the formulations of the smooth support vector machine (SSVM), twin support vector machine (TWSVM), twin bounded support vector machine (TBSVM) and a smooth approximation function for the pinball loss function.

2.1 SSVM

Lee et al. [23] introduced a smoothing technique into the SVM and derived a new formulation of the SVM.

The standard support vector machine is

$$ {\displaystyle \begin{array}{c}\kern0.3em \underset{\left(w,b,\xi \right)\in {\mathfrak{R}}^{n+1+m}}{\min}\kern0.9300001em v{e}^T\xi +\frac{1}{2}\parallel w{\parallel}^2\\ {}\kern1.65em s.t.\kern1.98em D\left( Aw- eb\right)+\xi \ge e,\\ {}\kern4.95em \xi \ge 0.\end{array}} $$
(1)

The matrix A contains m points in n-dimensional space, v > 0, w ∈ R^n, and b ∈ R. ξ is the slack variable, e is a column vector of ones of appropriate dimension, and D is an m × m diagonal matrix whose diagonal entries are +1 or −1. In the smooth approach, the margin is measured with respect to both w and b (hence the term ‖w‖^2 + b^2) and the slack is penalized by its squared 2-norm, giving the following modified SVM problem:

$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{\left(w,b,\xi \right)\in {\mathfrak{R}}^{n+1+m}}{\min}\kern0.9300001em \frac{v}{2}{\xi}^T\xi +\frac{1}{2}\left(\parallel w{\parallel}^2+{b}^2\right)\\ {}\kern1.65em s.t.\kern1.98em D\left( Aw- eb\right)+\xi \ge e,\\ {}\kern4.95em \xi \ge 0.\end{array}} $$
(2)

According to [28], the plus function is defined as

$$ {(u)}_{+}=\mathit{\max}\left\{u,0\right\}, $$
(3)

and ξ is given by

$$ \xi =\kern0.66em {\left(e-D\left( Aw- eb\right)\right)}_{+} $$
(4)

Furthermore, the equivalent unconstrained problem is formulated as

$$ \kern0.50em \underset{w,b}{\mathit{\min}}\kern0.30em \;\frac{v}{2}\parallel {\left(e-D\left( Aw- eb\right)\right)}_{+}{\parallel}_2^2+\frac{1}{2}\left(\parallel w{\parallel}^2+{b}^2\right). $$
(5)

It is clear that (5) is strongly convex and has a unique solution. However, its objective is not twice differentiable at zero because of the plus function term. Therefore, the integral of the sigmoid function

$$ \kern0.1em p\left(x,\alpha \right)=x+\frac{1}{\alpha}\log \left(1+{e}^{-\alpha x}\right),\alpha >0, $$
(6)

is used here to replace the first term in (5), and a smooth support vector machine can be obtained:

$$ \kern0.50em \underset{w,b}{\mathit{\min}}\kern0.60em {\varPhi}_{\alpha}\left(w,b\right):= \underset{w,b}{\mathit{\min}}\kern0.1em \;\;\frac{v}{2}\parallel p\left(e-D\left( Aw- eb\right),\alpha \right){\parallel}_2^2+\frac{1}{2}\left(\parallel w{\parallel}^2+{b}^2\right). $$
(7)

It can be seen that (7) is smooth and twice differentiable, so it can be solved by using the Newton-Armijo method [29,30,31].
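As an illustration of the smoothing step (our own sketch, not code from [23]), the following NumPy snippet evaluates p(x, α) from (6) and compares it with the plus function; the approximation error shrinks as α grows.

```python
import numpy as np

def plus(x):
    # Plus function (x)_+ = max(x, 0), elementwise.
    return np.maximum(x, 0.0)

def p_smooth(x, alpha):
    # Integral of the sigmoid, eq. (6): p(x, alpha) = x + log(1 + exp(-alpha*x)) / alpha,
    # written in the numerically stable form max(x, 0) + log1p(exp(-alpha*|x|)) / alpha.
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-alpha * np.abs(x))) / alpha

x = np.linspace(-2.0, 2.0, 9)
for alpha in (1.0, 5.0, 100.0):
    gap = np.max(np.abs(p_smooth(x, alpha) - plus(x)))
    print(f"alpha = {alpha:6.1f}, max |p(x,alpha) - (x)_+| = {gap:.6f}")
```

The maximum gap occurs at x = 0, so increasing α tightens the approximation uniformly.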

2.2 TWSVM

Consider a binary classification dataset X = {(xi, yi)| i = 1, 2, …, m}, xi ∈ Rn, yi ∈ {+1, −1}. It contains m1 positive samples and m2 negative samples, which are represented by the matrices A and B, respectively, with m1 + m2 = m. Jayadeva et al. [9] proposed the twin support vector machine (TWSVM) using the hinge loss function, turning the optimization problem into a pair of smaller quadratic programming problems. Two nonparallel classification hyperplanes \( {f}_1(x)={w}_1^Tx+{b}_1=0 \) and \( {f}_2(x)={w}_2^Tx+{b}_2=0 \) need to be found to separate the samples correctly.

The primal problems of the TWSVM can be expressed as follows:

$$ {\displaystyle \begin{array}{c}\kern0.1em \underset{w_1,{b}_1,{\xi}_1}{\min}\kern1.1em \frac{1}{2}{\left\Vert A{w}_1+{e}_1{b}_1\right\Vert}^2+{c}_1{e_2}^T{\xi}_1\\ {}\kern0.9000001em s.t.\kern1.2em -\left(B{w}_1+{e}_2{b}_1\right)+{\xi}_1\ge {e}_2,\kern0.3em \end{array}} $$
(8)
$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{w_2,{b}_2,{\xi}_2}{\min}\kern1.1em \frac{1}{2}{\left\Vert B{w}_2+{e}_2{b}_2\right\Vert}^2+{c}_2{e_1}^T{\xi}_2\\ {}\kern0.9000001em s.t.\kern1.2em \left(A{w}_2+{e}_1{b}_2\right)+{\xi}_2\ge {e}_1,\kern0.3em \end{array}} $$
(9)

where c1, c2 > 0, ξ1 and ξ2 are slack variables, and e1 and e2 are column vectors of ones of appropriate dimensions. For optimization problems (8) and (9), the Lagrange method and the K.K.T. optimality conditions are used to obtain their dual problems:

$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{\alpha }{\max}\kern1.1em {e_2}^T\alpha -\frac{1}{2}{\alpha}^TG{\left({H}^TH\right)}^{-1}{G}^T\alpha \\ {}\kern0.9000001em s.t.\kern1.2em 0\le \alpha \le {c}_1,\end{array}} $$
(10)
$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{\gamma }{\max}\kern1.1em {e_1}^T\gamma -\frac{1}{2}{\gamma}^TP{\left({Q}^TQ\right)}^{-1}{P}^T\gamma \\ {}\kern0.9000001em s.t.\kern1.2em 0\le \gamma \le {c}_2,\end{array}} $$
(11)

where α ≥ 0 and γ ≥ 0 are the vectors of Lagrange multipliers, H = [A  e1], G = [B  e2], P = [A  e1], and Q = [B  e2].

2.3 TBSVM

Shao et al. [10] introduced a regularization term into the TWSVM and proposed a twin bounded support vector machine (TBSVM). The primal problems of the TBSVM can be expressed as follows:

$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{w_1,{b}_1,{\xi}_1}{\min}\kern1.1em \frac{1}{2}{\left\Vert A{w}_1+{e}_1{b}_1\right\Vert}^2+\frac{1}{2}{c}_3\left({\left\Vert {w}_1\right\Vert}^2+{b_1}^2\right)+{c}_1{e_2}^T{\xi}_1\\ {}\kern0.9000001em s.t.\kern1.2em -\left(B{w}_1+{e}_2{b}_1\right)+{\xi}_1\ge {e}_2,\kern0.3em \end{array}} $$
(12)
$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{w_2,{b}_2,{\xi}_2}{\min}\kern1.1em \frac{1}{2}{\left\Vert B{w}_2+{e}_2{b}_2\right\Vert}^2+\frac{1}{2}{c}_4\left({\left\Vert {w}_2\right\Vert}^2+{b_2}^2\right)+{c}_2{e_1}^T{\xi}_2\\ {}\kern0.9000001em s.t.\kern1.2em \left(A{w}_2+{e}_1{b}_2\right)+{\xi}_2\ge {e}_1,\kern0.3em \end{array}} $$
(13)

where c1, c2, c3, c4 > 0, ξ1 and ξ2 are slack variables, and e1 and e2 are column vectors of ones of appropriate dimensions. For optimization problems (12) and (13), the Lagrange method and the K.K.T. optimality conditions are used to obtain their dual problems:

$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{\alpha }{\max}\kern1.1em {e_2}^T\alpha -\frac{1}{2}{\alpha}^TG{\left({H}^TH+{c}_3I\right)}^{-1}{G}^T\alpha \\ {}\kern0.9000001em s.t.\kern1.2em 0\le \alpha \le {c}_1,\end{array}} $$
(14)
$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{\gamma}{\max}\kern1.1em {e_1}^T\gamma -\frac{1}{2}{\gamma}^TP{\left({Q}^TQ+{c}_4I\right)}^{-1}{P}^T\gamma \\ {}\kern0.9000001em s.t.\kern1.2em 0\le \gamma \le {c}_2,\end{array}} $$
(15)

where α ≥ 0, γ ≥ 0.

2.4 Pinball loss and its smooth approximation function

In support vector machines and twin support vector machines, the hinge loss function is typically used to measure classification errors. However, this loss is sensitive to noise and unstable under resampling. Therefore, this paper uses the pinball loss function to study the twin bounded support vector machine.

We know that the Chen-Harker-Kanzow-Smale function [28] is one of the approximations of the plus function, and it is denoted by

$$ \frac{u+\sqrt{u^2+4{\varepsilon}^2}}{2}, $$
(16)

where ε > 0.

The pinball loss function can be rewritten as

$$ {L}_{\tau }(u)={(u)}_{+}+{\left(-\tau u\right)}_{+},\kern0.75em \tau \in \left[0,1\right]. $$
(17)

From (17), we know that the left and right derivatives of the function at zero are

$$ \underset{\varDelta u\to {0}^{-}}{\lim}\frac{L_{\tau}\left(0+\varDelta u\right)-{L}_{\tau }(0)}{\varDelta u}=\frac{-\tau \left(0+\varDelta u\right)-0}{\varDelta u}=-\tau \kern0.9000001em \mathrm{and} $$
$$ \underset{\varDelta u\to {0}^{+}}{\lim}\frac{L_{\tau}\left(0+\varDelta u\right)-{L}_{\tau }(0)}{\varDelta u}=\frac{0+\varDelta u-0}{\varDelta u}=1. $$

It can be seen that the left and right derivatives of the pinball loss function at zero are not equal, that is, the function is not differentiable at zero. To this end, a smooth function ϕτ(u, ε) is used to approximate it by using (16) and (17), where ε is a sufficiently small parameter, and the function ϕτ(u, ε) is shown in (18):

$$ \kern0.1em {\phi}_{\tau}\left(u,\varepsilon \right)=\frac{u+\sqrt{u^2+4{\varepsilon}^2}}{2}+\frac{-\tau u+\sqrt{\tau^2{u}^2+4{\varepsilon}^2}}{2}. $$
(18)

From (18), it is known that this function has derivatives of arbitrary order. Figure 1 plots the pinball loss function and its approximation function ϕτ(u, ε) for τ = 0.5 and ε = 10^−6. It can be seen intuitively that the approximation function ϕτ(u, ε) is smooth everywhere when ε is small enough.

Fig. 1 Pinball loss function and its smooth approximation function
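As a quick numerical sanity check (our own illustration, not part of the original derivation), the following NumPy snippet implements L_τ(u) from (17) and ϕ_τ(u, ε) from (18) and measures their gap, which stays within [0, 2ε] as proven in Section 4:

```python
import numpy as np

def pinball(u, tau):
    # Pinball loss, eq. (17): L_tau(u) = (u)_+ + (-tau*u)_+.
    u = np.asarray(u, dtype=float)
    return np.maximum(u, 0.0) + np.maximum(-tau * u, 0.0)

def phi(u, tau, eps):
    # Smooth approximation, eq. (18), built from two CHKS terms (16).
    u = np.asarray(u, dtype=float)
    return 0.5 * (u + np.sqrt(u**2 + 4*eps**2)) \
         + 0.5 * (-tau*u + np.sqrt(tau**2 * u**2 + 4*eps**2))

tau, eps = 0.5, 1e-6
u = np.linspace(-3.0, 3.0, 100001)
gap = phi(u, tau, eps) - pinball(u, tau)
# The gap stays within [0, 2*eps] (up to floating-point rounding).
print(f"min gap = {gap.min():.3e}, max gap = {gap.max():.3e}, 2*eps = {2*eps:.1e}")
```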

Moreover, we also study another approximation function for the pinball loss function that is defined as follows:

$$ \kern0.1em {p}_{\tau}\left(x,\alpha \right)=p\left(x,\alpha \right)+p\left(-\tau x,\alpha \right)=x+\frac{1}{\alpha}\mathit{\log}\left(1+{e}^{-\alpha x}\right)+\left[-\tau x+\frac{1}{\alpha}\mathit{\log}\left(1+{e}^{\alpha \tau x}\right)\right], $$

where α > 0.

Below, we introduce a smooth twin bounded support vector machine with pinball loss based on the approximation function ϕτ(u, ε). For the other approximation function pτ(x, α), the detailed process is presented in the appendix.
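For completeness, the sigmoid-based approximation p_τ(x, α) can be sketched in the same style (again our own illustration, with helper names of our choosing); here the gap to L_τ shrinks as α grows, so a large α plays the role that a small ε plays for ϕ_τ:

```python
import numpy as np

def p_smooth(x, alpha):
    # Integral-of-sigmoid smoothing of the plus function, eq. (6), in stable form.
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-alpha * np.abs(x))) / alpha

def p_tau(x, tau, alpha):
    # Alternative smooth pinball approximation: p_tau(x, alpha) = p(x, alpha) + p(-tau*x, alpha).
    return p_smooth(x, alpha) + p_smooth(-tau * np.asarray(x, dtype=float), alpha)

def pinball(u, tau):
    u = np.asarray(u, dtype=float)
    return np.maximum(u, 0.0) + np.maximum(-tau * u, 0.0)

tau = 0.5
u = np.linspace(-3.0, 3.0, 100001)
for alpha in (5.0, 1000.0):
    gap = np.max(np.abs(p_tau(u, tau, alpha) - pinball(u, tau)))
    print(f"alpha = {alpha:7.1f}, max |p_tau - L_tau| = {gap:.6f}")
```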

3 Smooth twin bounded support vector machine with pinball loss

3.1 Linear case

To overcome the shortcomings of the hinge loss function, pinball loss is introduced into the twin bounded support vector machine in this paper, and the optimization problems are obtained as follows:

$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{w_1,{b}_1,\xi }{\min}\kern.5em \frac{1}{2}\sum \limits_{i=1}^{m_1}{\left({w}_1^T{x}_i^{(1)}+{b}_1\right)}^2+\frac{1}{2}{c}_3\left(\parallel {w}_1{\parallel}^2+{b}_1^2\right)+{c}_1\sum \limits_{j=1}^{m_2}{\xi}_j\\ {}s.t.\kern.5em -\left({w}_1^T{x}_j^{(2)}+{b}_1\right)+{\xi}_j\ge 1,\kern0.3em \\ {}\kern.5em -\left({w}_1^T{x}_j^{(2)}+{b}_1\right)-\frac{\xi_j}{\tau_2}\le 1,\end{array}} $$
(19)
$$ {\displaystyle \begin{array}{c}\kern0.5em \underset{w_2,{b}_2,\eta }{\min}\kern.3em \frac{1}{2}\sum \limits_{q=1}^{m_2}{\left({w}_2^T{x}_q^{(2)}+{b}_2\right)}^2+\frac{1}{2}{c}_4\left(\parallel {w}_2{\parallel}^2+{b}_2^2\right)+{c}_2\sum \limits_{p=1}^{m_1}{\eta}_p\\ {}\kern0.9000001em s.t.\kern.5em \left({w}_2^T{x}_p^{(1)}+{b}_2\right)+{\eta}_p\ge 1,\\ {}\kern2.2em \left({w}_2^T{x}_p^{(1)}+{b}_2\right)-\frac{\eta_p}{\tau_1}\le 1.\end{array}} $$
(20)

For (19) and (20), the solutions can be obtained by solving the corresponding quadratic programming problems in the dual space, and the required classification surfaces are then obtained; this method is referred to as Pin-GTBSVM. In the following, we mainly introduce the iterative solution of (19) and (20) in the original space.

First, we convert (19) and (20) into unconstrained optimization problems and obtain the following model:

$$ \kern0.40em \underset{w_1,{b}_1}{\mathit{\min}}\kern0.20em \;{\varPhi}_1\left({w}_1,{b}_1;{\tau}_2\right)=\underset{w_1,{b}_1}{\mathit{\min}}\kern0.60em \frac{1}{2}\sum \limits_{i=1}^{m_1}{\left({w}_1^T{x}_i^{(1)}+{b}_1\right)}^2+\frac{1}{2}{c}_3\left({\left\Vert {w}_1\right\Vert}^2+{b_1}^2\right)+{c}_1\sum \limits_{j=1}^{m_2}{L}_{\tau_2}\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right) $$
(21)
$$ \kern0.40em \underset{w_2,{b}_2}{\mathit{\min}}\;{\varPhi}_1\left({w}_2,{b}_2;{\tau}_1\right)=\underset{w_2,{b}_2}{\mathit{\min}}\kern0.80em \frac{1}{2}\sum \limits_{q=1}^{m_2}{\left({w}_2^T{x}_q^{(2)}+{b}_2\right)}^2+\frac{1}{2}{c}_4\left(\parallel {w}_2{\parallel}^2+{b}_2^2\right)+{c}_2\sum \limits_{p=1}^{m_1}{L}_{\tau_1}\left(1-{w}_2^T{x}_p^{(1)}-{b}_2\right) $$
(22)

It can be seen that the objective functions of optimization problems (21) and (22) are continuous but not smooth. For this reason, we use the smooth function ϕ(⋅, ⋅) to approximate \( {L}_{\tau_2}\left(\cdot \right) \) and \( {L}_{\tau_1}\left(\cdot \right) \), namely,

$$ {\phi}_{\tau_2}\left({w}_1^T{x}_j^{(2)}+{b}_1+1,\varepsilon \right)\approx {L}_{\tau_2}\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right), $$
(23)
$$ {\phi}_{\tau_1}\left(1-{w}_2^T{x}_p^{(1)}-{b}_2,\varepsilon \right)\approx {L}_{\tau_1}\left(1-{w}_2^T{x}_p^{(1)}-{b}_2\right). $$
(24)

Here, we rewrite

$$ {\phi}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)={\phi}_{\tau_2}\left({w}_1^T{x}_j^{(2)}+{b}_1+1,\varepsilon \right), $$
(25)
$$ {\phi}_2\left({w}_2,{b}_2;{\tau}_1,\varepsilon \right)={\phi}_{\tau_1}\left(1-{w}_2^T{x}_p^{(1)}-{b}_2,\varepsilon \right). $$
(26)

We replace the last terms of (21) and (22) with (25) and (26), respectively, to obtain the following model:

$$ \kern0.1em \underset{w_1,{b}_1}{\mathit{\min}}\kern0.30em \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)=\underset{w_1,{b}_1}{\mathit{\min}}\kern0.40em {c}_1\sum \limits_{j=1}^{m_2}{\phi}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\frac{1}{2}\sum \limits_{i=1}^{m_1}{\left({w}_1^T{x}_i^{(1)}+{b}_1\right)}^2+\frac{1}{2}{c}_3\left(\parallel {w}_1{\parallel}^2+{b}_1^2\right) $$
(27)
$$ \underset{w_2,{b}_2}{\mathit{\min}}\kern0.30em \varPhi \left({w}_2,{b}_2;{\tau}_1,\varepsilon \right)=\underset{w_2,{b}_2}{\mathit{\min}}\kern0.30em {c}_2\sum \limits_{p=1}^{m_1}{\phi}_2\left({w}_2,{b}_2;{\tau}_1,\varepsilon \right)+\frac{1}{2}\sum \limits_{q=1}^{m_2}{\left({w}_2^T{x}_q^{(2)}+{b}_2\right)}^2+\frac{1}{2}{c}_4\left(\parallel {w}_2{\parallel}^2+{b}_2^2\right) $$
(28)

where ci > 0(i = 1,2,3,4).
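To make the unconstrained objective concrete, the following minimal NumPy sketch (our own illustration, not the authors' code; the function and variable names are ours) evaluates Φ(w1, b1; τ2, ε) of (27) given a matrix A of positive samples and a matrix B of negative samples; the objective of (28) is analogous with the roles of the two classes swapped.

```python
import numpy as np

def phi_tau(u, tau, eps):
    # Smooth pinball approximation phi_tau(u, eps), eq. (18).
    return 0.5 * (u + np.sqrt(u**2 + 4*eps**2)) \
         + 0.5 * (-tau*u + np.sqrt(tau**2 * u**2 + 4*eps**2))

def objective_1(w1, b1, A, B, c1, c3, tau2, eps=1e-6):
    # Smooth unconstrained objective of eq. (27) for the first hyperplane.
    # A: (m1, n) positive samples, B: (m2, n) negative samples.
    fA = A @ w1 + b1            # w1^T x_i^(1) + b1 for all positive samples
    fB = B @ w1 + b1 + 1.0      # w1^T x_j^(2) + b1 + 1 for all negative samples
    return (c1 * np.sum(phi_tau(fB, tau2, eps))
            + 0.5 * np.sum(fA**2)
            + 0.5 * c3 * (np.dot(w1, w1) + b1**2))

# Tiny usage example with random placeholder data (illustrative only).
rng = np.random.default_rng(0)
A, B = rng.normal(size=(40, 2)), rng.normal(loc=3.0, size=(50, 2))
print(objective_1(np.zeros(2), 0.0, A, B, c1=1.0, c3=0.1, tau2=0.5))
```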

3.2 Nonlinear case

In the nonlinear case, by introducing a kernel function that maps the input space to the feature space, a nonlinear problem can be converted into a linear problem in the feature space. In the feature space, the two nonparallel classification hyperplanes to be found are K(xT, CT)u1 + b1 = 0 and K(xT, CT)u2 + b2 = 0, where CT = [X(1)T X(2)T]T, \( {X}^{(1)}=\left[{x}_1^{(1)},{x}_2^{(1)},\dots, {x}_{m1}^{(1)}\right] \), and \( {X}^{(2)}=\left[{x}_1^{(2)},{x}_2^{(2)},\dots, {x}_{m2}^{(2)}\right] \). The corresponding optimization problems are as follows:

$$ {\displaystyle \begin{array}{c}\kern0.1em \underset{u_1,{b}_1,\xi }{\min}\kern.2em \frac{1}{2}\sum \limits_{i=1}^{m_1}{\left(K\left({x}_i^{(1)T},{C}^T\right){u}_1+{b}_1\right)}^2+\frac{1}{2}{c}_3\left(\parallel {u}_1{\parallel}^2+{b}_1^2\right)+{c}_1\sum \limits_{j=1}^{m_2}{\xi}_j\\ {}\kern0.9000001em s.t.\kern1.2em -\left(K\left({x}_j^{(2)T},{C}^T\right){u}_1+{b}_1\right)+{\xi}_j\ge 1,\kern0.3em \\ {}\kern2.6em -\left(K\left({x}_j^{(2)T},{C}^T\right){u}_1+{b}_1\right)-\frac{\xi_j}{\tau_2}\le 1,\end{array}} $$
(29)
$$ {\displaystyle \begin{array}{c}\kern0.1em \underset{u_2,{b}_2,\eta }{\min}\kern.3em \frac{1}{2}\sum \limits_{q=1}^{m_2}{\left(K\left({x}_q^{(2)T},{C}^T\right){u}_2+{b}_2\right)}^2+\frac{1}{2}{c}_4\left({\left\Vert {u}_2\right\Vert}^2+{b_2}^2\right)+{c}_2\sum \limits_{p=1}^{m_1}{\eta}_p\\ {}\kern0.9000001em s.t.\kern1.2em \left(K\left({x_p^{(1)}}^T,{C}^T\right){u}_2+{b}_2\right)+{\eta}_p\ge 1,\\ {}\kern1.6em \left(K\left({x_p^{(1)}}^T,{C}^T\right){u}_2+{b}_2\right)-\frac{\eta_p}{\tau_1}\le 1.\end{array}} $$
(30)

Similar to the linear case, the following model is obtained after the smooth and unconstrained processing of (29) and (30):

$$ \kern0.1em \underset{u_1,{b}_1}{\mathit{\min}}\kern0.30em \varPhi \left({u}_1,{b}_1;{\tau}_2,\varepsilon \right)=\underset{u_1,{b}_1}{\mathit{\min}}\kern0.30em {c}_1\sum \limits_{j=1}^{m_2}{\phi}_1\left({u}_1,{b}_1;{\tau}_2,\varepsilon \right)+\frac{1}{2}\sum \limits_{i=1}^{m_1}{\left(K\left({x_i^{(1)}}^T,{C}^T\right){u}_1+{b}_1\right)}^2+\frac{1}{2}{c}_3\left({\left\Vert {u}_1\right\Vert}^2+{b_1}^2\right) $$
(31)
$$ \underset{u_2,{b}_2}{\mathit{\min}}\kern0.30em \varPhi \left({u}_2,{b}_2;{\tau}_1,\varepsilon \right)=\underset{u_2,{b}_2}{\mathit{\min}}\kern0.30em {c}_2\sum \limits_{p=1}^{m_1}{\phi}_2\left({u}_2,{b}_2;{\tau}_1,\varepsilon \right)+\frac{1}{2}\sum \limits_{q=1}^{m_2}{\left(K\left({x_q^{(2)}}^T,{C}^T\right){u}_2+{b}_2\right)}^2+\frac{1}{2}{c}_4\left({\left\Vert {u}_2\right\Vert}^2+{b_2}^2\right) $$
(32)

where ci > 0(i = 1,2,3,4).
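In the kernel case, the only structural change is that the sample matrices are replaced by kernel matrices K(·, C^T). A brief sketch (ours; the Gaussian kernel below matches the one used in the experiments of Section 5, and the data are random placeholders) of how these matrices could be formed:

```python
import numpy as np

def rbf_kernel(X, C, theta):
    # K(x, y) = exp(-theta * ||x - y||^2), evaluated for all rows of X against all rows of C.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2.0 * X @ C.T
    return np.exp(-theta * np.maximum(sq, 0.0))

# A: (m1, n) positive samples, B: (m2, n) negative samples; C stacks all training samples.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(40, 2)), rng.normal(loc=3.0, size=(50, 2))
C = np.vstack([A, B])
K_A = rbf_kernel(A, C, theta=0.5)   # rows K(x_i^(1)T, C^T) used in (31)
K_B = rbf_kernel(B, C, theta=0.5)   # rows K(x_j^(2)T, C^T) used in (31)
print(K_A.shape, K_B.shape)          # (40, 90) (50, 90)
```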

Although the objective functions in Eqs. (27)–(28) and (31)–(32) are different, their solution procedures are similar. Therefore, taking optimization problem (27) as an example, the Newton-Armijo iteration method is used below to provide an algorithm for finding the nonparallel hyperplanes; the resulting method is denoted as Pin-SGTBSVM.

Pin-SGTBSVM Algorithm:

  Step 1: Choose an initial point \( \left({w}_1^0,{b}_1^0\right)\in {R}^{n+1} \) and an error tolerance η.

  Step 2: If \( \left\Vert \nabla \varPhi \left({w}_1^i,{b}_1^i;{\tau}_2,\varepsilon \right)\right\Vert <\eta \) or the required number of iterations is reached, take (w1^i, b1^i) as the solution and stop; otherwise, compute the Newton direction d^i by solving the linear system ∇^2Φ(w1^i, b1^i; τ2, ε)d^i = −∇Φ(w1^i, b1^i; τ2, ε).

  Step 3: Set (w1^(i+1), b1^(i+1)) = (w1^i, b1^i) + λ_i d^i, where the step size λ_i is the largest value in {1, 1/2, 1/4, …} satisfying the Armijo condition \( \varPhi \left({w_1}^i,{b_1}^i;{\tau}_2,\varepsilon \right)-\varPhi \left(\left({w_1}^i,{b_1}^i\right)+{\lambda}_i{d}^i;{\tau}_2,\varepsilon \right)\ge -\delta {\lambda}_i\nabla \varPhi {\left({w_1}^i,{b_1}^i;{\tau}_2,\varepsilon \right)}^T{d}^i \), δ ∈ [0, 1/2]; δ is fixed at 1/2 here. Then return to Step 2. A generic sketch of this iteration is given below.
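The following generic Newton-Armijo loop is a minimal sketch of Steps 1-3 (our own simplification: the objective, gradient, and Hessian are passed in as callables rather than hard-coded):

```python
import numpy as np

def newton_armijo(phi, grad, hess, z0, eta=1e-4, delta=0.5, max_iter=50):
    # Generic Newton-Armijo iteration for a smooth, strictly convex objective phi.
    # phi(z), grad(z), hess(z): objective value, gradient, and Hessian at z = (w1, b1).
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        g = grad(z)
        if np.linalg.norm(g) < eta:           # Step 2: stopping criterion
            break
        d = np.linalg.solve(hess(z), -g)      # Step 2: Newton direction
        lam = 1.0                              # Step 3: Armijo step size from {1, 1/2, 1/4, ...}
        while phi(z) - phi(z + lam * d) < -delta * lam * (g @ d) and lam > 1e-8:
            lam *= 0.5
        z = z + lam * d                        # Step 3: update and return to Step 2
    return z
```

In practice, phi, grad, and hess would be instantiated with Φ(w1, b1; τ2, ε) of (27) and the analytic gradient and Hessian given in Section 4, and analogously for (28).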

For an unknown sample x, it is assigned to class i (i = +1 or −1) by using the following decision rule:

$$ {\displaystyle \begin{array}{c} class(x)=\mathit{\operatorname{sgn}}\left(\frac{w_1^Tx+{b}_1}{\left\Vert {w}_1\right\Vert }+\frac{w_2^Tx+{b}_2}{\left\Vert {w}_2\right\Vert}\right)\ \mathrm{or}\\ {} class(x)=\mathit{\operatorname{sgn}}\left(\frac{K\left({x}^T,{C}^T\right){u}_1+{b}_1}{\left\Vert {u}_1\right\Vert }+\frac{K\left({x}^T,{C}^T\right){u}_2+{b}_2}{\left\Vert {u}_2\right\Vert}\right),\end{array}} $$

where sgn denotes the sign function.
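A minimal sketch (ours) of the linear decision rule above:

```python
import numpy as np

def classify_linear(x, w1, b1, w2, b2):
    # Assigns +1 or -1 using the norm-scaled signed distances to the two hyperplanes.
    s = (w1 @ x + b1) / np.linalg.norm(w1) + (w2 @ x + b2) / np.linalg.norm(w2)
    return 1 if s >= 0 else -1
```

The kernel version is identical with w_k^T x + b_k replaced by K(x^T, C^T)u_k + b_k and ‖w_k‖ by ‖u_k‖.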

4 Convergence of the algorithm

In Section 3, the Pin-SGTBSVM algorithm is used to obtain the solution sequences \( \left\{\left({w}_1^i,{b}_1^i\right)\right\} \) and \( \left\{\left({w}_2^i,{b}_2^i\right)\right\} \). This section studies the convergence of these sequences.

Theorem 1 For the functions Φ(w1, b1; τ2, ε), Φ(w2, b2; τ1, ε), Φ1(w1, b1; τ2), and Φ1(w2, b2; τ1), the following conclusions hold:

  (1) Φ(w1, b1; τ2, ε) and Φ(w2, b2; τ1, ε) are continuous and arbitrarily smooth functions with respect to (w1, b1) and (w2, b2), respectively, for arbitrary (w1, b1) ∈ R^(n+1) and (w2, b2) ∈ R^(n+1).

  (2) For arbitrary (w1, b1) ∈ R^(n+1) and (w2, b2) ∈ R^(n+1), when ε > 0, we have

    $$ {\displaystyle \begin{array}{c}{\varPhi}_1\left({w}_1,{b}_1;{\tau}_2\right)\le \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)\le {\varPhi}_1\left({w}_1,{b}_1;{\tau}_2\right)+2{c}_1{m}_2\varepsilon, \\ {}{\varPhi}_1\left({w}_2,{b}_2;{\tau}_1\right)\le \varPhi \left({w}_2,{b}_2;{\tau}_1,\varepsilon \right)\le {\varPhi}_1\left({w}_2,{b}_2;{\tau}_1\right)+2{c}_2{m}_1\varepsilon .\end{array}} $$
  (3) Φ(w1, b1; τ2, ε) and Φ(w2, b2; τ1, ε) are continuously differentiable and strictly convex functions for any ε > 0.

Proof:

(1) According to the smoothness of the constructed function, conclusion (1) is obviously true.

(2) We bound the difference between the smooth function and the pinball loss function:

    $$ \kern0.1em {\phi}_{\tau}\left(u,\varepsilon \right)-{L}_{\tau }(u)=\left\{\begin{array}{c}\frac{u+\sqrt{u^2+4{\varepsilon}^2}}{2}+\frac{-\tau u+\sqrt{\tau^2{u}^2+4{\varepsilon}^2}}{2}-u\kern0.96em ,u\ge 0,\\ {}\frac{u+\sqrt{u^2+4{\varepsilon}^2}}{2}+\frac{-\tau u+\sqrt{\tau^2{u}^2+4{\varepsilon}^2}}{2}-\left(-\tau u\right),u<0.\end{array}\right. $$

Then, we can obtain

$$ {\displaystyle \begin{array}{c}\kern0.1em \\ {}{\phi}_{\tau}\left(u,\varepsilon \right)-{L}_{\tau }(u)=\left\{\begin{array}{c}\frac{\sqrt{u^2+4{\varepsilon}^2}-u}{2}+\frac{\sqrt{\tau^2{u}^2+4{\varepsilon}^2}-\tau u}{2}\kern1.8em ,u\ge 0,\\ {}\frac{\sqrt{u^2+4{\varepsilon}^2}+u}{2}+\frac{\sqrt{\tau^2{u}^2+4{\varepsilon}^2}+\tau u}{2}\kern1.68em ,u<0,\end{array}\right.\end{array}} $$

so 0 ≤ ϕτ(u, ε) − Lτ(u) ≤ 2ε.
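The last inequality can be made explicit: for a ≥ 0 and ε > 0,

$$ 0\le \sqrt{a^2+4{\varepsilon}^2}-a\le \sqrt{a^2+4a\varepsilon +4{\varepsilon}^2}-a=\left(a+2\varepsilon \right)-a=2\varepsilon . $$

Applying this with a = |u| and a = τ|u| (recall τ ∈ [0, 1]) shows that each of the two terms in ϕτ(u, ε) − Lτ(u) lies in [0, ε], so their sum lies in [0, 2ε].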

As a result, Φ1(w1, b1; τ2) ≤ Φ(w1, b1; τ2, ε) ≤ Φ1(w1, b1; τ2) + 2c1m2ε.

In the same way, we can obtain

$$ {\varPhi}_1\left({w}_2,{b}_2;{\tau}_1\right)\le \varPhi \left({w}_2,{b}_2;{\tau}_1,\varepsilon \right)\le {\varPhi}_1\left({w}_2,{b}_2;{\tau}_1\right)+2{c}_2{m}_1\varepsilon $$

(3) From (1), Φ(w1, b1; τ2, ε) is a continuous and smooth function, so its gradient is

$$ \nabla \Phi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)=\left(\begin{array}{c}\frac{\partial \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {w}_1}\\ {}\frac{\partial \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {b}_1}\end{array}\right). $$

For Φ(w1, b1; τ2, ε), its first-order partial derivatives with respect to w1 and b1 are as follows:

$$ {\displaystyle \begin{array}{l}\frac{\partial \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {w}_1}=\frac{c_1}{2}\sum \limits_{j=1}^{m_2}\left({x}_j^{(2)}+\frac{x_j^{(2)}\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}{\sqrt{{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2}}+\left(-{\tau}_2{x}_j^{(2)}+\frac{\tau_2^2{x}_j^{(2)}\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}{\sqrt{\tau_2^2{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2}}\right)\right)\\ {}+\sum \limits_{i=1}^{m_1}{x}_i^{(1)}\left({w}_1^T{x}_i^{(1)}+{b}_1\right)+{c}_3{w}_1,\\ {}\begin{array}{l}\frac{\partial \varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {b}_1}=\frac{c_1}{2}\sum \limits_{j=1}^{m_2}\left(1+\frac{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}{\sqrt{{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2}}+\left(-{\tau}_2+\frac{\tau_2^2\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}{\sqrt{\tau_2^2{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2}}\right)\right)\\ {}+\sum \limits_{i=1}^{m_1}\left({w}_1^T{x}_i^{(1)}+{b}_1\right)+{c}_3{b}_1.\end{array}\end{array}} $$

We further obtain

$$ {\nabla}^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)=\left(\begin{array}{cc}\frac{\partial^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{{\partial w}_1^2}& \frac{\partial^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {b}_1\partial {w}_1}\\ {}\frac{\partial^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{\partial {w}_1\partial {b}_1}& \frac{\partial^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)}{{\partial b}_1^2}\end{array}\right). $$

The second-order partial derivatives of the function Φ(w1, b1; τ2, ε) with respect to w1 and b1 are

$$ {\nabla}^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)=\left[\begin{array}{cc}\sum \limits_{j=1}^{m_2}{x}_j^{(2)}{x}_j^{(2)T}{\lambda}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)}{x}_i^{(1)T}+{c}_3{I}_n& \sum \limits_{j=1}^{m_2}{x}_j^{(2)}{\lambda}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)}\\ {}\sum \limits_{j=1}^{m_2}{x}_j^{(2)T}{\lambda}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)T}& \sum \limits_{j=1}^{m_2}{\lambda}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+{m}_1+{c}_3\end{array}\right] $$

where

$$ {\lambda}_1\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)=\frac{c_1}{2}\left(\frac{4{\varepsilon}^2}{{\left({\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2\right)}^{\raisebox{1ex}{$3$}\!\left/ \!\raisebox{-1ex}{$2$}\right.}}+\frac{\tau_2^24{\varepsilon}^2}{{\left({\tau}_2^2{\left({w}_1^T{x}_j^{(2)}+{b}_1+1\right)}^2+4{\varepsilon}^2\right)}^{\raisebox{1ex}{$3$}\!\left/ \!\raisebox{-1ex}{$2$}\right.}}\right). $$

It can be seen that λ1(w1, b1; τ2, ε) > 0. For arbitrary \( {\xi}_1^T=\left({\xi}^T,{\xi}_0\right)\in {R}^{n+1} \) with ξ1 ≠ 0 and ξ ∈ R^n, we have

$$ {\displaystyle \begin{array}{l}{\xi}_1^T{\nabla}^2\varPhi \left({w}_1,{b}_1;{\tau}_2,\varepsilon \right){\xi}_1={\xi}^T\left(\sum \limits_{j=1}^{m_2}{x}_j^{(2)}{x}_j^{(2)T}{\lambda}_{\mathsf{1}}\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)}{x}_i^{(1)T}+{c}_3{I}_n\right)\xi \\ {}+{\xi}_0\left(\sum \limits_{j=1}^{m_2}{x}_j^{(2)T}{\lambda}_{\mathsf{1}}\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)T}\right)\xi \\ {}\begin{array}{l}+{\xi}^T\left(\sum \limits_{j=1}^{m_2}{x}_j^{(2)}{\lambda}_{\mathsf{1}}\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+\sum \limits_{i=1}^{m_1}{x}_i^{(1)}\right){\xi}_0\\ {}+{\xi}_0\left(\sum \limits_{j=1}^{m_2}{\lambda}_{\mathsf{1}}\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)+{m}_1+{c}_3\right){\xi}_0\\ {}={\lambda}_{\mathsf{1}}\left({w}_1,{b}_1;{\tau}_2,\varepsilon \right)\sum \limits_{j=1}^{m_2}{\left({x}_j^{(2)T}\xi +{\xi}_0\right)}^2+\sum \limits_{i=1}^{m_1}{\left({x}_i^{(1)T}\xi +{\xi}_0\right)}^2+{c}_3{\xi}_1^T{\xi}_1>0.\end{array}\end{array}} $$

This shows that ∇^2Φ(w1, b1; τ2, ε) is a positive definite matrix. Therefore, Φ(w1, b1; τ2, ε) is a continuously differentiable and strictly convex function. The same conclusion can be obtained for Φ(w2, b2; τ1, ε).

Theorem 2 The Pin-SGTBSVM algorithm converges to the solutions of problems (21) and (22); that is, if \( \left\{\left({w}_1^i,{b}_1^i\right)\right\} \) and \( \left\{\left({w}_2^i,{b}_2^i\right)\right\} \) are the solution sequences of (27) and (28), respectively, and (w1*, b1*) and (w2*, b2*) are the minimum points of (27) and (28), then \( \underset{i\to \infty }{\lim}\left({w_1}^i,{b_1}^i\right)=\left({w_1}^{\ast },{b_1}^{\ast}\right) \) and \( \underset{i\to \infty }{\lim}\left({w_2}^i,{b_2}^i\right)=\left({w_2}^{\ast },{b_2}^{\ast}\right) \).

Proof:

According to Theorem 1, Φ(w1, b1; τ2, ε) and Φ(w2, b2; τ1, ε) are continuously differentiable, strictly convex functions, so each has a unique minimum point. Moreover, 0 ≤ Φ(w1, b1; τ2, ε) − Φ1(w1, b1; τ2) ≤ 2c1m2ε and 0 ≤ Φ(w2, b2; τ1, ε) − Φ1(w2, b2; τ1) ≤ 2c2m1ε, so the sequences {(w1^i, b1^i)} and {(w2^i, b2^i)} converge to (w1*, b1*) and (w2*, b2*), respectively.

5 Experimental results and analysis

To validate the performance of the proposed Pin-SGTBSVM algorithm, we select the German, Haberman, CMC, Fertility, WPBC, Ionosphere, and Live-disorders datasets from the UCI database [32] for experimentation. For comparison, several representative algorithms are selected, including the TWSVM [9], TBSVM [10], Pin-GTWSVM [21] (TWSVM based on pinball loss and solved in the dual space), Pin-GTBSVM (TBSVM based on pinball loss and solved in the dual space), and Pin-SGTWSVM (TWSVM based on pinball loss and solved in the original space). The kernel function used in the experiments is K(x, y) = exp(−θ × ‖x − y‖^2), where θ is the kernel parameter. Tenfold cross-validation is used to find the optimal parameter value in the range [10^−6, 10^5]. The values of τ1 and τ2 are chosen from {0.5, 0.8, 1}, and the parameters ci > 0 (i = 1, 2, 3, 4) are chosen from [2^−10, 2^10]. The value of ε in the experiments is 10^−6, the value of η in the algorithm is 10^−4, and the reported experimental results are the average accuracy and standard deviation. All experiments in this paper are run on an Intel (R) Core (TM) i7-5500U CPU @ 2.40 GHz with 4 GB of memory under MATLAB R2016a.

5.1 UCI datasets

In this subsection, we compare the TWSVM, TBSVM, Pin-GTWSVM, and Pin-GTBSVM algorithms, which are solved by the traditional dual method, with the Pin-SGTWSVM and the presented Pin-SGTBSVM, which are solved by the iterative method. The experimental results are shown in Table 1, where the accuracy (%) and standard deviation are abbreviated as Acc and sd, respectively. It can be seen that the Pin-SGTBSVM algorithm is superior to the other five methods on five of the seven datasets; it obtains the same result as Pin-SGTWSVM on the Haberman dataset, and on the Fertility dataset it obtains the same result as the other five algorithms. In addition, the Pin-SGTWSVM algorithm achieves higher accuracy on six of the seven datasets, but its accuracy on the German dataset is inferior to that of the TWSVM algorithm.

Table 1 Performance of different algorithms on UCI datasets with nonlinear kernel

To further analyze the influence of the parameters on the algorithms, we selected 6 datasets and conducted experiments with the Pin-GTWSVM, Pin-GTBSVM, Pin-SGTWSVM, and Pin-SGTBSVM algorithms for different values of τ. As shown in Fig. 2, although all of these algorithms use the pinball loss function, different parameter values have a noticeable impact on the classification accuracy of the Pin-SGTBSVM and Pin-SGTWSVM algorithms, whereas their influence on the Pin-GTBSVM and Pin-GTWSVM algorithms is small.

Fig. 2 Influence of different parameter values on the algorithms: (a) CMC, (b) Ionosphere, (c) German, (d) WPBC, (e) Haberman, (f) Live-disorders

Moreover, Gaussian characteristic (feature) noise with mean 0 and variance σ^2 is added to the selected UCI datasets for the ratios r = 5% and 10% to test the noise sensitivity of the algorithms, where σ^2 is the variance of each feature multiplied by the ratio r. The experimental results are shown in Table 2, where Acc and sd denote the accuracy (%) and standard deviation, respectively. When the noise ratio is 5%, the proposed algorithm achieves the best classification performance on five of the seven datasets; its performance on the German dataset is worse than that of the TWSVM algorithm but better than that of the TBSVM algorithm. A similar conclusion can be reached when the noise ratio is 10%.

Table 2 Performance of different algorithms on datasets with noise
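The noise injection described above can be sketched as follows (our own illustration of the stated recipe, assuming the zero-mean Gaussian noise, with variance equal to each feature's variance multiplied by r, is added to every sample):

```python
import numpy as np

def add_feature_noise(X, r, seed=0):
    # X: (m, n) data matrix; r: noise ratio (e.g., 0.05 or 0.10).
    # Adds N(0, sigma_j^2) noise to feature j, with sigma_j^2 = r * var(feature j).
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(r * np.var(X, axis=0))
    return X + rng.normal(0.0, 1.0, size=X.shape) * sigma

# Usage: X_noisy_5 = add_feature_noise(X, 0.05); X_noisy_10 = add_feature_noise(X, 0.10)
```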

To further validate the anti-noise performance of the proposed algorithm, the real-world Waveform dataset (with noise) from UCI is selected for this experiment. It comes in two versions, Waveform (v1) and Waveform (v2), each containing 5000 samples with 21 (v1) or 40 (v2) real-valued features belonging to three categories. In Waveform (v1), Gaussian noise with a mean of 0 and variance of 1 is added to each feature. Waveform (v2) augments Waveform (v1) with 19 additional Gaussian noise features with a mean of 0 and variance of 1. Since the dataset poses a three-class problem, two categories of samples are taken from the dataset at a time, and the results of ten experiments are averaged for each of the three combinations. The experimental results are shown in Table 3.

Table 3 Classification performance of different algorithms on Waveform datasets with noise

From Table 3, it can be seen that the proposed algorithm achieves the best classification performance on both the Waveform (v1) and Waveform (v2) datasets. It is worth mentioning that the algorithms take very different amounts of time to classify these datasets: the proposed algorithm and the Pin-SGTWSVM algorithm, which use the iterative technique, take 28.4918 s / 31.0971 s and 35.8810 s / 35.0951 s on the two datasets, respectively. The main reason is that the proposed algorithm solves systems of linear equations instead of quadratic programming problems.

5.2 Friedman test

In the following, we mainly use the Friedman test to evaluate the classification performance of the proposed Pin-SGTBSVM algorithm on the UCI datasets. The Friedman test is a nonparametric statistical test that uses ranks to determine whether there are significant differences among multiple population distributions. It makes full use of the information in the samples and is suitable for comparing different classifiers across many datasets.

First, the null hypothesis and the alternative hypothesis are given:

  • H0: There is no significant difference among the 6 classifiers;

  • H1: There are significant differences among the 6 classifiers.

We compare k classifiers (k = 6) on N datasets (N = 7). The classifiers are ranked according to their accuracy on each dataset, with the most accurate classifier receiving the smallest rank ri. Table 4 shows the average ranks of the different algorithms on the UCI datasets.

Table 4 Average ranks of different algorithms on UCI datasets

In the Friedman test, the chi-square statistic is

$$ {\chi}_F^2=\frac{12N}{k\left(k+1\right)}\left[\sum \limits_{j=1}^k{R}_j^2-\frac{k{\left(k+1\right)}^2}{4}\right], $$

where \( {R}_j=\frac{1}{N}\sum \limits_{i=1}^N{r}_i \) denotes the average rank of the j-th classifier over the N datasets. The statistic has (k−1) degrees of freedom. In addition, on the basis of the Friedman statistic, the following F statistic is used for testing:

$$ {F}_F=\frac{\left(N-1\right){\chi}_F^2}{N\left(k-1\right)-{\chi}_F^2}. $$

Its degrees of freedom are (k−1) and (k−1) × (N−1). According to the data in Table 4, we obtain \( {\chi}_F^2=16.8556 \) and FF = 5.5739. The critical value of F(5, 30) at the 0.05 significance level is 2.53, and 5.5739 > 2.53, so the null hypothesis H0 is rejected. That is, there are significant differences among the 6 classifiers, and the classification performance of the Pin-SGTBSVM algorithm on the above 7 datasets is better than those of the other five algorithms.
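The reported F values can be reproduced from the formulas above; a quick check (ours) with k = 6, N = 7 and the χ²_F values stated in the text:

```python
k, N = 6, 7  # number of classifiers and of datasets

def friedman_F(chi2, k, N):
    # F statistic defined in the text, computed from the Friedman chi-square statistic.
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

for chi2 in (16.8556, 14.2857, 15.5886):
    print(f"chi2_F = {chi2:8.4f} -> F_F = {friedman_F(chi2, k, N):.4f}")
# Approximately reproduces the values reported in the text (5.5739, 4.1379, 4.8184);
# tiny differences in the last decimal can come from rounding of chi2_F.
```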

According to the same method, the corresponding average ranks are calculated by using the classification results with different noise ratios from Table 2, as shown in Table 5.

Table 5 Average ranks of different algorithms on UCI datasets with noise

In the same way, two pairs of null and alternative hypotheses are given as follows:

  • H0(r = 0.05): There is no significant difference among the 6 classifiers on the datasets with 5% noise added;

  • H1(r = 0.05): There are significant differences among the 6 classifiers on the datasets with 5% noise added.

  • H0(r = 0.1): There is no significant difference among the 6 classifiers on the datasets with 10% noise added;

  • H1(r = 0.1): There are significant differences among the 6 classifiers on the datasets with 10% noise added.

The test values calculated from Table 5 are \( {\chi}_{F\left(r=0.05\right)}^2=14.2857 \), FF(r = 0.05) = 4.1379, \( {\chi}_{F\left(r=0.1\right)}^2=15.5886 \), and FF(r = 0.1) = 4.8184. Since the F values 4.1379 and 4.8184 at the two noise levels are both greater than 2.53, the hypotheses H0 (r = 0.05) and H0 (r = 0.1) are rejected. That is, there are significant differences among the 6 classifiers, and the classification performance of Pin-SGTBSVM on the selected datasets with different noise levels is better than those of the other five algorithms.

In addition, we compare the proposed Pin-SGTBSVM and Pin-SGTWSVM algorithms with the TWSVM, TBSVM, Pin-GTWSVM, and Pin-GTBSVM algorithms on 7 datasets, as shown in Fig. 3. It can be seen that the algorithms using the iterative method perform significantly better on 6 noise-free datasets and 5 datasets with noise, where r = 0, 0.05 and 0.1 represent no noise, the 5% noise ratio and the 10% noise ratio, respectively.

Fig. 3 Comparisons of different algorithms on 7 datasets

5.3 Artificial datasets

In this section, the six algorithms are tested on two artificial datasets, Art1 and Halfmoons [27]. In Art1, the samples of each class follow a Gaussian distribution; each class consists of two clusters of 100 samples, where the positive samples satisfy xi(yi = 1, i = 1, 2, …, 100) ~ N(μ1, Σ1) and xi(yi = 1, i = 101, 102, …, 200) ~ N(μ2, Σ1), and the negative samples satisfy xj(yj = −1, j = 1, 2, …, 100) ~ N(μ3, Σ2) and xj(yj = −1, j = 101, 102, …, 200) ~ N(μ4, Σ2), with μ1 = [5, 1], μ2 = [1, 0], μ3 = [0, 5], μ4 = [4, 7], Σ1 = [0.5, 0; 0, 4], and Σ2 = [1, 0; 0, 1]. The two classes of the Halfmoons dataset form two half moons, each containing 250 samples. The experimental results are shown in Fig. 4. It can be seen that, in the linear case, the computational time of the algorithms that use the iterative method is lower on both datasets than that of the other algorithms.
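The Art1 dataset described above can be regenerated in a few lines of NumPy (our own sketch using the stated means and covariances; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = [5, 1], [1, 0]          # positive-class cluster means
mu3, mu4 = [0, 5], [4, 7]          # negative-class cluster means
S1 = [[0.5, 0], [0, 4]]            # positive-class covariance
S2 = [[1, 0], [0, 1]]              # negative-class covariance

X_pos = np.vstack([rng.multivariate_normal(mu1, S1, 100),
                   rng.multivariate_normal(mu2, S1, 100)])
X_neg = np.vstack([rng.multivariate_normal(mu3, S2, 100),
                   rng.multivariate_normal(mu4, S2, 100)])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(200), -np.ones(200)])
print(X.shape, y.shape)            # (400, 2) (400,)
```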

Fig. 4 Performance of different algorithms with linear kernel on artificial datasets

At the same time, we also add 5% and 10% Gaussian characteristic noise to the above two datasets. The experimental results are shown in Fig. 5. It can be seen that the accuracy of the proposed algorithm on both datasets is higher than those of the other five algorithms.

Fig. 5 Performance of different algorithms with Gaussian kernel on artificial datasets: (a) Art1, (b) Halfmoons

5.4 NDC datasets

To further study the classification performance of the proposed algorithm on large-scale datasets, experiments are carried out on the NDC [33] datasets, and 5% characteristic noise is added to the generated datasets. Detailed information about the datasets is shown in Table 6. The experimental results are shown in Fig. 6.

Table 6 Information about the NDC datasets
Fig. 6 Classification performance of different algorithms on the NDC datasets: (a) noise free, (b) with noise

It can be seen that the Pin-SGTBSVM algorithm achieves higher classification accuracy on the NDC datasets than the other five algorithms. We also analyze the execution time of the algorithms. The experiments show that the running times of the Pin-SGTBSVM and Pin-SGTWSVM algorithms become much lower than those of the other four algorithms as the sizes of the NDC datasets continue to grow; their time complexity is better than that of the algorithms that do not use the iterative method, and the execution time of the Pin-SGTBSVM algorithm is slightly lower than that of the Pin-SGTWSVM algorithm. To clarify the comparison of the time consumed by the six algorithms on the NDC datasets, the logarithms of the execution times are reported, and the results are shown in Fig. 7.

Fig. 7 Comparisons of the time performance of different algorithms on the NDC datasets: (a) noise free, (b) with noise

5.5 Toy dataset

In this subsection, we compare the smooth twin bounded support vector machine models built with the two approximations of the pinball loss, ϕτ(u, ε) and pτ(u, α). The algorithms using the function pτ(u, α) are denoted as Pin-PSGTWSVM (α = 5), Pin-PSGTWSVM (α = 1000), Pin-PSGTBSVM (α = 5), and Pin-PSGTBSVM (α = 1000). Experiments are carried out on the toy dataset (named "crossplane") shown in Fig. 8, and the results are given in Table 7.

Fig. 8 Toy dataset

Table 7 Comparison of performance achieved by different algorithms

From Table 7, it can be seen that the accuracy of the Pin-SGTBSVM algorithm is the highest in the linear case, followed by that of Pin-PSGTWSVM (α = 5). In the nonlinear case, the accuracies of the Pin-SGTBSVM and Pin-SGTWSVM algorithms are higher than those of the other algorithms. These results indicate that the approximation ϕτ(u, ε) performs better than pτ(u, α).

6 Conclusion

Aiming at the shortcomings of the twin support vector machine, we introduce the pinball loss function into the twin bounded support vector machine. By constructing a smooth approximation function, a smooth twin bounded support vector machine model with pinball loss is obtained. On this basis, a smooth twin bounded support vector machine algorithm with pinball loss is proposed, and the convergence of the algorithm is proven theoretically. We compare the proposed algorithm with other representative algorithms on UCI datasets and artificial datasets. While maintaining accuracy, the proposed method alleviates the noise sensitivity and resampling instability of the twin support vector machine and also improves the time complexity, which demonstrates its effectiveness.