1 Introduction

Support vector machines (SVMs) are optimization-based numerical methods for data classification problems [9, 24] that are generally formulated as linear or convex optimization problems. SVMs have become one of the most widely used methods for binary classification, which separates data into two desired groups, and have found applications in numerous fields of science [9], engineering [15] and medicine [12, 14, 27, 46]. These methods must inherently contend with data uncertainty arising from the presence of noise in the training data.

Robust optimization, which was pioneered in the 1970s for treating uncertain linear programming problems, has emerged as a powerful approach for decision-making in the face of data uncertainty. It treats uncertainty as deterministic, but does not limit data values to point estimates. Two decades after the advent of robust optimization, in the late 1990s, Ben-Tal et al. [4, 5] provided a highly successful, computationally tractable treatment of the robust optimization approach for linear as well as nonlinear optimization problems under data uncertainty [21, 28, 45].

In this framework, one associates with the uncertain SVM classification problem under feature uncertainty its robust counterpart, where the uncertain constraints are enforced for every possible value of the data within their prescribed uncertainty sets [6, 7, 29, 41]. In this paper, we consider a broad class of robust SVM classification problems under general spectrahedron uncertainty sets [10, 42]. The spectrahedron uncertainty set covers the most commonly used uncertainty sets of numerically solvable robust optimization models, such as boxes, balls and ellipsoids [4, 6, 7]. The robust counterpart is, in general, a hard nonlinear optimization problem with infinitely many constraints, and we reformulate it as a numerically tractable equivalent conic linear program using conic duality [3, 4] and a support function technique [3, 18]. We show that the robust counterpart reduces to a second-order cone program in the cases where the uncertainty sets are ellipsoids, balls or boxes; these second-order cone programs can be solved more efficiently.

In the case of the box-uncertainty model, employing Lagrangian duality [17, 19], we also provide a new robust SVM classifier by transforming the robust counterpart into a convex quadratic program with non-negative variables, leading to a very efficient computational scheme via a simple iterative algorithm. This approach, which was inspired by the Lagrangian support vector machine developed by Mangasarian et al. [16, 34], extends the pq-SVM developed by Dunbar et al. [14] to robust SVMs. Computational results on a range of datasets indicate that our methods allow for greater classification accuracy than conventional SVMs, in addition to selecting smaller groups of highly correlated features.

The conic duality-based robust SVM methods were also applied to a new dataset, Enroll-HD, which contains 36,953 sets of observations on 32 physical features from subjects with, or at risk of, Huntington disease (HD). HD is a neurodegenerative movement disorder with motor (relating to movement), cognitive and psychiatric manifestations caused by an inherited mutation in the Huntingtin (HTT) gene [1, 11, 33]. Characterizing the onset of the disease in subjects harbouring the causative mutation in terms of its associated features is of significant clinical and research importance. Our robust SVM methods also performed well on the Enroll-HD dataset, achieving accuracies of over 95% and selecting meaningful features for classifying subjects as having manifest (post-onset) or non-manifest HD.

The outline of the paper is as follows. Section 2 develops robust SVM data classification models. Section 3 presents equivalent conic program reformulations of these robust classification models for various classes of uncertainty sets. Section 4 provides a robust classification scheme in the case of box uncertainty and gives a simple iterative algorithm to find robust classifiers. Section 5 provides results on the computational experiments on three publicly available datasets. Section 6 describes the Enroll-HD dataset, the performance of conic duality-based robust methods on this dataset and the implications of these results for the characterization of HD onset. Section 7 concludes with a brief discussion of further work. The appendix provides additional technical details on spectrahedra, lists the features contained in the Enroll-HD dataset and gives the proof of (linear) convergence of our iterative algorithm.

2 Robust Optimization-Based Data Classification

In this section, we introduce the SVM formulation and describe the so-called robust SVM formulations. We begin by fixing the notation that will be used later in the paper. Given a vector \(x\in \mathbb {R}^n\), |x| denotes the vector consisting of the absolute value of each component \(x_i\) for \(i = 1,2,\ldots , n\). The zero vector in \(\mathbb {R}^n\) is denoted by \(0_n\). For a vector \(x \in \mathbb {R}^n\), we write \(x \ge 0_n\) if every component satisfies \(x_i\ge 0\) for \(i = 1,2,\ldots , n\). The \(n\times n\) identity matrix is denoted by \(I_n\) (or \(I_{n\times n}\)). The \(n \times n\) matrix of zeros is denoted by \({0}_{n\times n}\) (or simply 0 if the dimension is clear). The vector of all ones in \(\mathbb {R}^n\) is denoted by \(e_n\). We denote by \(\mathbb {S}^n\) the space of all real-valued \(n\times n\) symmetric matrices. For a vector \(x\in \mathbb {R}^n\), the p-norm, for \(1\le p < \infty \), is defined as \( \Vert x\Vert _p = \left( \sum _{i=1}^n |x_i|^p\right) ^{\frac{1}{p}} \), with \( \Vert x\Vert _{\infty } = \max _{1\le i\le n}|x_i|. \) For convenience, we also write \( \Vert x\Vert _2 \equiv \Vert x\Vert \). For a matrix \(A\in \mathbb {R}^{m\times n}\), its norm (or 2-norm) is denoted by \(\Vert A\Vert \) and is given by \(\Vert A\Vert = \delta _{max}(A)\), where \(\delta _{max}(A)\) is the largest singular value of A. If \(A\in \mathbb {R}^{n\times n}\) is symmetric, this equals the largest eigenvalue magnitude, \(\max _{1\le i\le n}|\lambda _{i}(A)|\). For a vector \(x \in \mathbb {R}^n\), \(\mathrm{diag}(x)\) denotes a diagonal matrix in \(\mathbb {R}^{n\times n}\) whose entries consist of the elements of x. The gradient of the scalar function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) with respect to the vector x is denoted by \(\nabla _{x} f(x)\). The closed unit ball of \(\mathbb {R}^n\) is denoted by \(\mathbb {B}_n\).

Consider two sets of data \(\mathcal A\) and \(\mathcal B\) whose elements are vectors in \(\mathbb {R}^s\). The SVM classifier distinguishes between these two datasets by attempting to separate the m data points into one of two open halfspaces with minimal error, each halfspace containing only those datapoints that correspond to the set \(\mathcal A\) or \(\mathcal B\), respectively, where m is the cardinality of \(\mathcal A\cup \mathcal B\). Each datapoint \(u_i\), \(i=1,\ldots ,m\), has a corresponding class label \(\alpha _i\in \left\{ -1,1\right\} \) according to whether it is contained in the set \(\mathcal A\) or \(\mathcal B\). The classifier used in the standard (linear) SVM formulation is a hyperplane of the form:

$$\begin{aligned} u^Tw +\gamma =0, \end{aligned}$$
(2.1)

where w is the normal to the surface of the hyperplane and \(\gamma \) determines the location of the hyperplane relative to the origin. To construct the SVM classifier, the margin (denoted by \(M = 2/\Vert w\Vert \)) between the planes

$$\begin{aligned}&u^Tw + \gamma = +1 \end{aligned}$$
(2.2)
$$\begin{aligned}&u^Tw + \gamma = -1, \end{aligned}$$
(2.3)

is maximized, subject to the condition that each plane bounds one of the sets (the so-called “hard-margin” case). The optimal classifier lies midway between these two bounding planes.

Often the data are not linearly separable, and so cannot be correctly classified by linear bounding hyperplanes. This situation results in the following “soft-margin” SVM formulation with the tuning parameter \(\lambda \) (see [4, 9]):

$$\begin{aligned} (SVM)&\displaystyle \min _{(w,\gamma , \xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^m}&\lambda \Vert w \Vert _2^2 + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\xi _i \ge 0, \; \alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1, \ i=1,\ldots ,m, \end{aligned}$$

where \((u_{i},\alpha _{i})\in \mathbb {R}^{s}\times \{-1,1\}\) are the given training data, \(\alpha _{i}\) is the class label for each data \(u_{i}\), and the number of nonzero entries in the slack vector \(\xi \) is the number of errors the classifier makes on the training data.

The soft-margin classification via a doubly regularized support vector machine (DrSVM) examined in [4, Section 12.1.1] and [14] can be formulated as:

$$\begin{aligned} (DrSVM)&\displaystyle \min _{(w,\gamma , \xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^m}&\lambda _1 \Vert w \Vert _1 + \lambda _2 \Vert w \Vert _2^2 + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\xi _i \ge 0, \; \alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1, \ i=1,\ldots ,m, \end{aligned}$$

where \(\lambda _1\ge 0\) and \(\lambda _2 \ge 0\) are tuning parameters. The (DrSVM) formulation, which incorporates the 1-norm, is known to generate sparse solutions. When a linear classifier is used, solution sparsity implies that the separating hyperplane depends on only a few input features. This makes the doubly regularized approach a very effective tool for feature selection in classification problems [8, 14, 16, 24].
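To make the formulation concrete, the following is a minimal sketch of (DrSVM) written with the CVXPY modelling package in Python. This is purely for illustration and uses hypothetical toy data; it is not the software used for the experiments in Section 5, which were run in MATLAB with MOSEK and YALMIP. Setting \(\lambda _1=0\) recovers the standard soft-margin (SVM).

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: m points u_i in R^s with labels alpha_i in {-1, +1}.
rng = np.random.default_rng(0)
m, s = 40, 5
U = rng.normal(size=(m, s))                      # rows are the training points u_i
alpha = np.sign(U[:, 0] + 0.1 * rng.normal(size=m))

lam1, lam2 = 0.1, 1.0                            # tuning parameters; lam1 = 0 gives (SVM)

w = cp.Variable(s)
gamma = cp.Variable()
xi = cp.Variable(m, nonneg=True)                 # slack variables xi_i >= 0

# Objective: lam1*||w||_1 + lam2*||w||_2^2 + sum_i xi_i
objective = cp.Minimize(lam1 * cp.norm1(w) + lam2 * cp.sum_squares(w) + cp.sum(xi))
# Constraints: alpha_i*(u_i^T w + gamma) + xi_i >= 1, i = 1, ..., m
constraints = [cp.multiply(alpha, U @ w) + alpha * gamma + xi >= 1]

cp.Problem(objective, constraints).solve()
print("w =", np.round(w.value, 3), " gamma =", round(float(gamma.value), 3))
```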

The soft-margin SVM model (SVM) is a convex quadratic optimization problem with finitely many linear inequality constraints. By introducing auxiliary variables, the doubly regularized support vector machine (DrSVM) can also be equivalently reformulated as a convex quadratic optimization problem with finitely many linear inequality constraints. Noting that the feasible regions of these optimization models are nonempty, the celebrated Frank–Wolfe theorem [2] ensures that optimal solutions always exist for these two optimization models.

In practice, the given data \(u_{i}\), \(i=1,\ldots ,m\), are often uncertain. We assume that these data are subject to the following spectrahedral data uncertainty parameterized with the radius parameter \(r_i \ge 0\):

$$\begin{aligned} u_{i}\in \mathcal {U}_{i}(r_i)=\overline{u}_{i}+r_{i}\mathcal {V}_i, \end{aligned}$$

where each \(\mathcal {V}_i\), \(i=1,\ldots ,m\), is a bounded spectrahedron given by

$$\begin{aligned} \mathcal {V}_i=\{(v_{i1},\ldots ,v_{is}) \in \mathbb {R}^s: A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)} \succeq 0\}, \end{aligned}$$

with \(A_l^{(i)}\), \(l=0,1,\ldots ,s\), being symmetric \((p \times p)\) matrices. The spectrahedral uncertainty encompasses many important commonly used uncertainty sets such as polyhedral uncertainty sets (where all \(A_l^{(i)}\) are diagonal matrices), ball uncertainty sets, ellipsoidal uncertainty sets and their intersections. We assume that the label \(\alpha _i\) is free of uncertainty.

Let \(r=(r_{1},\ldots ,r_{m}).\) Then, the robust support vector machine can be stated as:

$$\begin{aligned} (RSVM_{r})&\displaystyle \min _{(w,\gamma ,\xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^m}&\ \lambda _1 \Vert w \Vert _1 + \lambda _2 \Vert w \Vert _2^2 + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\xi _i \ge 0, \; \alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1,\ \forall \,u_{i}\in \mathcal {U}_{i}(r_i),\ i=1,\ldots ,m. \end{aligned}$$

Note that a robust support vector machine model problem is, in general, a semi-infinite convex optimization problem. Note also that an optimal solution exists for \((RSVM_r)\) whenever the robust feasible set F is nonempty, where \(F=\{(w,\gamma ,\xi ): \xi _i \ge 0, \; \alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1,\ \forall \,u_{i}\in \mathcal {U}_{i}(r_i),\ i=1,\ldots ,m\}\), and the label sets \(\mathcal {I}_{\mathcal {A}}\) and \(\mathcal {I}_{\mathcal {B}}\) are both nonempty, where \(\mathcal {I}_{\mathcal {A}}=\{1 \le i \le m: \alpha _i=1\}\) and \(\mathcal {I}_{\mathcal {B}}=\{1 \le i \le m: \alpha _i=-1\}\). To see this, denote the objective function of \((RSVM_r)\) by f and let the optimal value of \((RSVM_r)\) be \(\inf (RSVM_r)\). As the robust feasible set is nonempty and the objective function f is bounded below by 0, \(\inf (RSVM_r)\) is a non-negative real number. Let \((w^k,\gamma ^k,\xi ^k)\) be a minimizing sequence, that is, \((w^k,\gamma ^k,\xi ^k) \in F\) and \(f(w^k,\gamma ^k,\xi ^k) \rightarrow \inf (RSVM_r)\). From the definition of f and the fact that \(\xi ^k \in \mathbb {R}^m_+\), we see that \(\{w^k\}\) and \(\{\xi ^k\}\) are bounded sequences. Now, \((w^k,\gamma ^k,\xi ^k) \in F\) shows that \( \min _{u_i \in \mathcal {U}_{i}(r_i)}\{\alpha _{i} u_{i}^{T}w^k\}+ \alpha _i \gamma ^k+\xi _i^k \ge 1\), \(i=1,\ldots ,m\). Take \(i \in \mathcal {I}_{\mathcal {A}}\) and \(i' \in \mathcal {I}_{\mathcal {B}}\). Then, \(\gamma ^k=\alpha _i \gamma ^k\) and \(-\gamma ^k=\alpha _{i'} \gamma ^k\) are both bounded below. So, \(\{\gamma ^k\}\) is bounded, and hence \(\{(w^k,\gamma ^k,\xi ^k)\}\) is also a bounded sequence in F. As F is a closed set and f is a continuous function, it follows that an optimal solution exists for \((RSVM_r)\).

Figure 1 presents an illustration for both robust and non-robust SVM classifiers. On the left, we see the separating hyperplane and two bounding hyperplanes found by solving a standard (non-robust) SVM; on the right, we see the corresponding hyperplanes found by solving a robust SVM with box uncertainty: \(\mathcal {V}_i = \{v\in \mathbb {R}^2 : \Vert v\Vert _\infty \le 1\}\).

Fig. 1 a Classifier determined by the standard SVM. b Classifier determined by the robust SVM with box uncertainty. The uncertainty set around each datapoint is shown

In the next section, we will turn our attention to reformulating the robust SVM \((RSVM_{r})\) into a numerically tractable optimization problem. Without loss of generality throughout this paper, we always assume that an optimal solution for the robust SVM model \((RSVM_r)\) exists.

3 SDP Formulations for Robust SVM via Conic Duality

In this section, we first show that the robust support vector machine problem under more general spectrahedron uncertainty can be equivalently reformulated as a semi-definite programming problem via a support function technique and conic duality [3, 4]. We then derive simple numerically tractable formulations for the cases where the uncertainty sets are ellipsoids, balls and boxes.

We begin by establishing a simple lemma which shows that the robust support vector machine problem is equivalent to a nonsmooth convex optimization problem with finitely many inequality constraints. As we see later in the section, this lemma allows us to easily achieve an equivalent semi-definite programming reformulation for \((RSVM_r)\). To do this, we define the support function of a closed convex and bounded set \(C\subset \mathbb {R}^s\) by

$$\begin{aligned} \sigma _C(x)=\max \{x^Tz: z \in C\}. \end{aligned}$$

Then, the support function, \(\sigma _C(\cdot )\), is a convex function and closed-form formulae for the support function are known for various cases of C, such as balls and boxes. For instance, if \(C=\{z\in \mathbb {R}^s : \Vert z\Vert _p \le 1\}\), then \(\sigma _C(x)=\Vert x\Vert _q\), where \(1/p+1/q =1\) and \(p>1\). When \(p=1\), \(\sigma _C(x)=\Vert x\Vert _\infty \). For details, see [3, 4].
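As a quick numerical illustration of these closed-form formulae (a sketch with hypothetical numbers), the maximizer in the definition of \(\sigma _C\) is \(z=\mathrm{sign}(x)\) for the unit box and \(z=x/\Vert x\Vert _2\) for the unit ball, so the support values agree with the corresponding dual norms:

```python
import numpy as np

x = np.array([1.5, -0.3, 2.0])                 # hypothetical test vector

# C = {z : ||z||_inf <= 1} (unit box): maximizer z = sign(x), value ||x||_1
print(x @ np.sign(x), np.linalg.norm(x, 1))    # both print 3.8

# C = {z : ||z||_2 <= 1} (unit ball): maximizer z = x/||x||_2, value ||x||_2
print(x @ (x / np.linalg.norm(x)), np.linalg.norm(x))
```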

Consider the following nonsmooth convex optimization problem which we associate with \((RSVM_r)\):

$$\begin{aligned} (AP_{r})&\displaystyle \min _{(w,\gamma ,t,\mu ,\xi )\in \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m}&\ \lambda _1 \sum _{l=1}^s t_l + \lambda _2 \mu + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; \; w_l \le t_l, \; -w_l \le t_l,\ l=1,\dots ,s\\&\xi _i \ge 0, \; \sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i}w)- \alpha _i \gamma -\xi _i +1 \le 0 ,\ i=1,\ldots ,m, \end{aligned}$$

where for each \(i=1,2,\ldots , m\), the support function \(\sigma _{\mathcal {U}_{i}(r_i)}(\cdot )\) is a convex function.

Lemma 3.1

(Equivalent nonsmooth convex program) Consider the robust support vector machine problem \((RSVM_r)\) with a bounded spectrahedron uncertainty set. Then,

$$\begin{aligned} \min (RSVM_r)=\min (AP_r). \end{aligned}$$

Moreover, \((w,\gamma ,\xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^m\) is a solution for \((RSVM_r)\) if and only if there exist \(t \in \mathbb {R}^s\) and \(\mu \in \mathbb {R}\) such that \((w,\gamma ,t,\mu ,\xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^s \times \mathbb {R}\times \mathbb {R}^m\) is a solution for \((AP_r)\).

Proof

The robust SVM problem can be equivalently rewritten as follows:

$$\begin{aligned}&\displaystyle \min _{(w,\gamma ,t,\mu ,\xi )\in \mathbb {R}^{s}\times \mathbb {R}\times \mathbb {R}^s \times \mathbb {R}\times \mathbb {R}^m}&\ \lambda _1 \sum _{l=1}^s t_l + \lambda _2 \mu + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\Vert w\Vert _2^2 \le \mu , \; \; |w_l| \le t_l, \ l=1,\dots ,s\\&\xi _i \ge 0, \; -\alpha _i(w^Tu_i + \gamma ) - \xi _i + 1 \le 0,\ \forall \ u_{i}\in \mathcal {U}_{i}(r_i),\ i=1,\ldots ,m. \end{aligned}$$

Note that \(|w_l| \le t_l\) is equivalent to \(-w_l \le t_l\) and \(w_l \le t_l\) for all \(l=1,\dots ,s\). Moreover, \(\Vert w\Vert _2^2 \le \mu \) can be equivalently rewritten in terms of conic constraints as follows.

$$\begin{aligned} \Vert w\Vert _2^2 \le \mu \ \Leftrightarrow \ \Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}. \end{aligned}$$

To finish the proof, we only need to show that for all \(i=1,\ldots ,m\),

$$\begin{aligned} \alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1,\ \forall \,u_{i}\in \mathcal {U}_{i}(r_i) \ \Leftrightarrow \ \sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i}w)- \alpha _i \gamma -\xi _i +1 \le 0. \end{aligned}$$
(3.1)

To see this, observe that \(\alpha _{i}(u_{i}^{T}w+\gamma )+\xi _i \ge 1,\ \forall \,u_{i}\in \mathcal {U}_{i}(r_i)\) is equivalent to the system,

$$1-\alpha _{i} u_{i}^{T}w-\alpha _i \gamma -\xi _i \le 0,\ \forall \,u_{i}\in \mathcal {U}_{i}(r_i),$$

which is in turn equivalent to

$$\begin{aligned} 0\ge & {} \max _{u_{i}\in \mathcal {U}_{i}(r_i)}\{1-\alpha _{i} u_{i}^{T}w-\alpha _i \gamma -\xi _i\} \\= & {} 1-\alpha _i \gamma -\xi _i + \max \{ u_{i}^{T}(-\alpha _{i} w): u_{i}\in \mathcal {U}_{i}(r_i)\} \\= & {} 1-\alpha _i \gamma -\xi _i + \sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i} w). \end{aligned}$$

This means that (3.1) holds, and so, the conclusion follows. \(\square \)

We now consider the following semi-definite program which can easily be shown to be equivalent to \((RSVM_r)\) by conic duality [3, 4].

$$\begin{aligned} (SDP_{r})&\displaystyle \min _{\begin{array}{c} (w,\gamma ,t,\mu ,\xi )\in \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m \\ W_i\in \mathbb {S}^p, \ i=1,\dots ,m \end{array}}&\ \lambda _1 \sum _{l=1}^s t_l + \lambda _2 \mu + \sum _{i=1}^m \xi _i \\&{ \text{ s.t. } }&\Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; \; w_l \le t_l, \; -w_l \le t_l,\ l=1,\dots ,s\\&W_i \succeq 0, \; \; -\alpha _i(w^T\overline{u}_i + \gamma )\\&- \xi _i + 1 + \mathrm{Trace}(W_i A_0^{(i)}) \le 0, \ i=1,\ldots ,m \\&- r_i \alpha _{i}w_l + \mathrm{Trace}(W_i A_l^{(i)})=0, \ l=1,\ldots ,s, \, i=1,\ldots ,m \\&\xi _i\ge 0, \ i=1,\dots ,m \end{aligned}$$
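Before stating the equivalence formally, we note that \((SDP_r)\) can be set up directly in a modelling language. The following is a minimal CVXPY sketch (Python) under the assumption that each \(\mathcal {V}_i\) is the unit box written as a spectrahedron (so \(p=2s\)); the hypothetical data, the choice of CVXPY and the SCS solver are for illustration only and are not the software used in our experiments (Section 5).

```python
import numpy as np
import cvxpy as cp

# Hypothetical data; each V_i is the unit box written as a spectrahedron:
# A_0 = I_{2s}, A_l = diag(e_l, -e_l), giving 1 + v_l >= 0 and 1 - v_l >= 0.
rng = np.random.default_rng(1)
m, s = 10, 4
p = 2 * s
ubar = rng.normal(size=(m, s))
alpha = np.sign(rng.normal(size=m))
r = 0.1 * np.ones(m)
lam1, lam2 = 0.1, 1.0
A0 = [np.eye(p) for _ in range(m)]
A = [[np.diag(np.concatenate([np.eye(s)[l], -np.eye(s)[l]])) for l in range(s)]
     for _ in range(m)]

w, gamma, mu = cp.Variable(s), cp.Variable(), cp.Variable()
t = cp.Variable(s)
xi = cp.Variable(m, nonneg=True)
W = [cp.Variable((p, p), PSD=True) for _ in range(m)]       # dual matrix variables W_i >= 0

cons = [cp.norm(cp.hstack([w, (1 - mu) / 2])) <= (1 + mu) / 2, w <= t, -w <= t]
for i in range(m):
    cons.append(-alpha[i] * (ubar[i] @ w + gamma) - xi[i] + 1
                + cp.trace(W[i] @ A0[i]) <= 0)
    cons += [-r[i] * alpha[i] * w[l] + cp.trace(W[i] @ A[i][l]) == 0 for l in range(s)]

prob = cp.Problem(cp.Minimize(lam1 * cp.sum(t) + lam2 * mu + cp.sum(xi)), cons)
prob.solve(solver=cp.SCS)      # any SDP-capable solver installed with CVXPY
print(prob.status, np.round(w.value, 3))
```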

Theorem 3.1

(Spectrahedral uncertainty: semi-definite program) Consider the robust support vector machine problem \((RSVM_r)\) with spectrahedral uncertainty, and its associated semi-definite program problem \((SDP_r)\). Suppose that for each \(i=1,\ldots ,m\), the interior of the spectrahedron \(\mathcal {V}_i\) is nonempty, i.e., there exists \(\overline{v}_i=(\overline{v}_{i1},\ldots ,\overline{v}_{is}) \in \mathbb {R}^s\) such that \( A_0^{(i)}+\sum _{l=1}^s \overline{v}_{il} A_l^{(i)} \succ 0, i=1,\ldots ,m. \) Then,

$$\begin{aligned} \min (RSVM_r)=\min (SDP_r). \end{aligned}$$

Moreover, \((w,\gamma ,\xi )\) is a solution of \((RSVM_r)\) if and only if there exist \(W_i \succeq 0\), \(i=1,\dots ,m\), \(t \in \mathbb {R}^s\) and \(\mu \in \mathbb {R}\) such that \((w,\gamma ,t,\mu ,\xi ,W_1,\dots ,W_m)\) is a solution of \((SDP_r)\).

Proof

By Lemma 3.1, the robust SVM problem is equivalent to

$$\begin{aligned} (AP_{r})&\displaystyle \min _{(w,\gamma ,t,\mu ,\xi )\in \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m}&\ \lambda _1 \sum _{l=1}^s t_l + \lambda _2 \mu + \sum _{i=1}^m \xi _i \\&\text{ s.t. }&\Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; \; w_l\le t_l, \\&-w_l \le t_l,\ l=1,\dots ,s\\&\sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i}w)- \alpha _i \gamma -\xi _i +1 \le 0 ,\ i=1,\ldots ,m \\&\xi _i \ge 0, \ i=1,\ldots ,m. \end{aligned}$$

Recall that \(\mathcal {U}_{i}(r_i)=\overline{u}_i+r_i \mathcal {V}_i\) where \( \mathcal {V}_i=\{(v_{i1},\ldots ,v_{is}) \in \mathbb {R}^s: A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)} \succeq 0\}, \) with \(A_l^{(i)}\), \(l=0,1,\ldots ,s\), being symmetric \((p \times p)\) matrices. To see the conclusion, it suffices to show that for each \(i=1,\ldots ,m\),

$$\begin{aligned} \sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i}w)- \alpha _i \gamma -\xi _i +1 \le 0 \end{aligned}$$
(3.2)

is equivalent to the existence of \(W_i \succeq 0\) such that

$$\begin{aligned} \left\{ \begin{array}{c} -\alpha _{i}(w^T\overline{u}_i + \gamma ) - \xi _i + 1 + \mathrm{Trace}(W_i A_0^{(i)}) \le 0 \\ - r_i \alpha _{i}w_l + \mathrm{Trace}(W_i A_l^{(i)})=0, \ l=1,\ldots ,s. \end{array} \right. \end{aligned}$$
(3.3)

Suppose that (3.2) holds. Then, the following implication holds:

$$\begin{aligned} A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)} \succeq 0 \implies (-\alpha _{i}w)^T(\overline{u}_i + r_iv_i)-\alpha _i \gamma -\xi _i +1 \le 0. \end{aligned}$$
(3.4)

As the interior point condition holds, it follows from the conic duality theorem [3, 4] that there exists \(W_i \succeq 0\) such that

$$\begin{aligned}&(-\alpha _{i}w)^T(\overline{u}_i + r_iv_i)-\alpha _i \gamma -\xi _i +1 + \mathrm{Trace}\big (W_i (A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)}) \big ) \le 0, \nonumber \\&\forall \, v_i=(v_{i1},\ldots ,v_{is}) \in \mathbb {R}^s. \end{aligned}$$
(3.5)

Note that the validity of the affine inequality, \(a^Tx+b \le 0\) for all \(x \in \mathbb {R}^s\), with \(a \in \mathbb {R}^s\) and \(b \in \mathbb {R}\), means that \(a=0_s\) and \(b \le 0\). Thus, we see that (3.5) is equivalent to (3.3).

Conversely, suppose that for each \(i=1,\ldots ,m\), there exists \(W_i \succeq 0\) such that (3.3) holds. Then, (3.5) holds. Consequently,

$$\begin{aligned} \sigma _{\mathcal {U}_{i}(r_i)}(-\alpha _{i}w)- \alpha _i \gamma -\xi _i +1= & {} \max _{v_i \in \mathcal {V}_i} (-\alpha _{i}w)^T(\overline{u}_i + r_iv_i)-\alpha _i \gamma -\xi _i +1 \\\le & {} \max _{v_i \in \mathcal {V}_i}\{- \mathrm{Trace}\big (W_i (A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)}) \big )\} \\\le & {} 0, \end{aligned}$$

where the last inequality follows from \(W_i \succeq 0\) and \(v_i \in \mathcal {V}_i\) (and so, \(A_0^{(i)}+\sum _{l=1}^s v_{il} A_l^{(i)} \succeq 0\)). So, (3.2) holds.

Hence, (3.2) is equivalent to the existence of \(W_i \succeq 0\) such that (3.3) holds and the conclusion follows. \(\square \)

We now derive numerically tractable formulations for \((RSVM_r)\) in terms of the second-order cone programs, under the uncertainty sets that take the form of an ellipsoid, ball or box. Although these equivalent formulations and the associated duality results may be derived from \((SDP_r)\) and Theorem 3.1, respectively, by appropriately choosing the matrices \(A_l^{(i)}, l=1,\ldots ,s, i=1,\ldots ,m\), of the spectrahedron \(\mathcal {V}_i \), in the interest of simplicity, we present the results from the model \((AP_r)\) and Lemma 3.1. Related special cases for the standard robust SVM models, where \(\lambda _1=0\) or \(\lambda _2=0\), can be found in [4, 6].

Ellipsoidal Uncertainty Consider the case where the uncertainty sets \(\mathcal {V}_i\) are ellipsoids in the sense that

$$\begin{aligned} \mathcal {V}_i = \{ v_i\in \mathbb {R}^s : v_i^T M_i^{-1} v_i \le 1 \} \end{aligned}$$
(3.6)

for some \(M_i \succ 0\). Let \(M_i= L_iL_i^T\) with \(L_i\) being an invertible matrix. We associate with this case the following second-order cone program:

$$\begin{aligned} (SOCP_{r,E})&\displaystyle \min _{\begin{array}{c} (w,\gamma ,\xi ,t,\mu ) \in \ \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m\times \mathbb {R}^s\times \mathbb {R} \end{array}}&\lambda _1 \sum _{j=1}^{s}{t_j} + \lambda _2\mu + \sum _{i=1}^{m}{\xi _i} \\&\text{ subject } \text{ to }&\Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; w_j \le t_j, \;\\&-w_j \le t_j, \ j=1,\dots ,s \\&-\alpha _i(w^T\bar{u}_i + \gamma ) -\xi _i + 1\\&+ r_i \Vert L_i^Tw\Vert _2 \le 0, \ i=1,\dots ,m \\&\xi _i \ge 0, \ i=1,\dots ,m \end{aligned}$$

Proposition 3.1

(Ellipsoidal uncertainty: second-order cone program) For the robust support vector machine problem \((RSVM_r)\) under ellipsoidal uncertainty, as defined in (3.6), and its associated second-order cone problem \((SOCP_{r,E})\), it holds that

$$\begin{aligned} \min (RSVM_r)=\min (SOCP_{r,E}). \end{aligned}$$

Moreover, \((w,\gamma ,\xi )\) is a solution of \((RSVM_r)\) under ellipsoidal uncertainty if and only if there exist \(t\in \mathbb {R}^s\) and \(\mu \in \mathbb {R}\) such that \((w,\gamma ,\xi ,t,\mu )\) is a solution of \((SOCP_{r,E})\).

Proof

In the case of ellipsoidal uncertainty as defined in (3.6), the support function \(\sigma _{\mathcal {U}_i(r_i)}(-\alpha _i w) \) can be expressed as:

$$\begin{aligned} \sigma _{\mathcal {U}_i(r_i)}(-\alpha _i w)= & {} -\alpha _i w^T\overline{u}_i +r _i \max \{ (-\alpha _i w)^Tv_i: v_i^T M_i^{-1} v_i \le 1\} \\= & {} -\alpha _i w^T\overline{u}_i +r _i \max \{ (-\alpha _i w)^Tv_i: \Vert L_i^{-1} v_i\Vert _2 \le 1\} \\= & {} -\alpha _i w^T\overline{u}_i +r _i \max \{ (-\alpha _i)(L_i^T w)^Tz_i: \Vert z_i\Vert _2 \le 1\} \\= & {} -\alpha _i w^T\overline{u}_i +r _i \Vert (-\alpha _i) L_i^T w\Vert _2 \\= & {} -\alpha _i w^T\overline{u}_i +r _i \Vert L_i^T w\Vert _2, \end{aligned}$$

where the last two equalities follow from the support function formula and \(\alpha _i \in \{-1,1\}\), respectively. Thus, the conclusion follows from Lemma 3.1. \(\square \)
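A minimal CVXPY sketch of \((SOCP_{r,E})\) with hypothetical data and a single shared ellipsoid shape matrix \(M=LL^T\) (an assumption made only to keep the example short) is given below; replacing the term \(r_i\Vert L^Tw\Vert _2\) with \(r_i\Vert w\Vert _2\) or \(r_i\Vert w\Vert _1\) yields the ball and box programs \((SOCP_{r,B})\) and \((SOCP_{r,\infty })\) treated next.

```python
import numpy as np
import cvxpy as cp

# Hypothetical data; one shared ellipsoid M = L L^T for all i (an assumption).
rng = np.random.default_rng(2)
m, s = 50, 6
ubar = rng.normal(size=(m, s))
alpha = np.sign(rng.normal(size=m))
r = 0.2 * np.ones(m)
L = np.linalg.cholesky(np.eye(s) + 0.1 * np.ones((s, s)))   # M = L L^T, positive definite
lam1, lam2 = 0.1, 1.0

w, gamma, mu = cp.Variable(s), cp.Variable(), cp.Variable()
t = cp.Variable(s)
xi = cp.Variable(m, nonneg=True)

cons = [cp.norm(cp.hstack([w, (1 - mu) / 2])) <= (1 + mu) / 2, w <= t, -w <= t]
# Robustified margin constraints: -alpha_i(w^T ubar_i + gamma) - xi_i + 1 + r_i ||L^T w||_2 <= 0
cons += [-alpha[i] * (ubar[i] @ w + gamma) - xi[i] + 1 + r[i] * cp.norm(L.T @ w) <= 0
         for i in range(m)]

prob = cp.Problem(cp.Minimize(lam1 * cp.sum(t) + lam2 * mu + cp.sum(xi)), cons)
prob.solve()
print(prob.status, np.round(w.value, 3))
```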

Ball Uncertainty We now consider the case where the perturbation sets \(\mathcal {V}_i\) are unit balls:

$$\begin{aligned} \mathcal {V}_i = \{ v_i\in \mathbb {R}^s : \Vert v_i \Vert _2^2 \le 1 \} \end{aligned}$$
(3.7)

In this case, we consider the following second-order cone program:

$$\begin{aligned} (SOCP_{r,B})&\displaystyle \min _{\begin{array}{c} (w,\gamma ,\xi ,t,\mu ) \in \ \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m\times \mathbb {R}^s\times \mathbb {R} \end{array}}&\lambda _1 \sum _{l=1}^{s}{t_l} + \lambda _2\mu + \sum _{i=1}^{m}{\xi _i} \\&\text{ subject } \text{ to }&\Vert (w, \frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; w_l \le t_l, \;\\&-w_l \le t_l, \; \ l=1,\dots ,s \\&-\alpha _i(w^T\bar{u}_i + \gamma ) - \xi _i + 1\\&+ r_i \Vert w\Vert _2 \le 0, \ i=1,\dots ,m \\&\xi _i\ge 0, \ i=1,\dots ,m. \end{aligned}$$

Corollary 3.1

(Ball uncertainty: simple second-order cone program) For the robust support vector machine problem \((RSVM_r)\) under ball uncertainty, as defined in (3.7), and its associated second-order cone problem \((SOCP_{r,B})\), it holds that

$$\begin{aligned} \min (RSVM_r)=\min (SOCP_{r,B}). \end{aligned}$$

Moreover, \((w,\gamma ,\xi )\) is a solution of \((RSVM_r)\) under ball uncertainty if and only if there exist \(t\in \mathbb {R}^s\) and \(\mu \in \mathbb {R}\) such that \((w,\gamma ,\xi ,t,\mu )\) is a solution of \((SOCP_{r,B})\).

Proof

The result follows immediately from Proposition 3.1, since \(\Vert v_i\Vert _2^2 = v_i^T I_s^{-1}v_i\), and so \(L_i = I_s\). \(\square \)

Box Uncertainty Finally, we consider the case where the perturbation sets \(\mathcal {V}_i\) are unit boxes:

$$\begin{aligned} \mathcal {V}_i = \{v_i\in \mathbb {R}^s : \Vert v_i\Vert _\infty \le 1\} \end{aligned}$$
(3.8)

We associate with this case the following second-order cone program (see [6]):

$$\begin{aligned} (SOCP_{r,\infty })&{ \displaystyle \min _{(w,\gamma ,\xi ,t,\mu )\ \in \ \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m\times \mathbb {R}^s\times \mathbb {R}}}&\lambda _1\sum _{l=1}^{s}{t_l} + \lambda _2\mu + \sum _{i=1}^{m}{\xi _i} \\&\text {subject to}&\Vert (w,\frac{1-\mu }{2}) \Vert _2 \le \frac{1+\mu }{2}, \; w_l \le t_l, \; \\&-w_l\le t_l, \ l=1,\dots ,s \\&-\alpha _i(w^T\overline{u}_i + \gamma ) -\xi _i +1\\&+r_i\Vert w\Vert _1 \le 0, \ i=1,\dots ,m \\&\xi _i\ge 0, \ i=1,\dots ,m. \end{aligned}$$

Proposition 3.2

(Robust SVM under Box Uncertainty) For the robust support vector machine problem \((RSVM_r)\) under box uncertainty, as defined in (3.8), and its associated second-order cone program problem \((SOCP_{r,\infty })\), it holds that

$$\begin{aligned} \min (RSVM_r) = \min (SOCP_{r,\infty }). \end{aligned}$$

Moreover, \((w,\gamma ,\xi )\) is a solution of \((RSVM_r)\) under box uncertainty if and only if there exist \(t\in \mathbb {R}^s\) and \(\mu \in \mathbb {R}\) such that \((w,\gamma ,\xi ,t,\mu )\) is a solution of \((SOCP_{r,\infty })\).

Proof

In the case of box uncertainty (that is, \(\mathcal {U}_i(r_i)=\overline{u}_i+r_i \mathcal {V}_i\) with \(\mathcal {V}_i = \{ v_i\in \mathbb {R}^s : \Vert v_i\Vert _{\infty } \le 1 \}\)), the support function \(\sigma _{\mathcal {U}_i(r_i)}(-\alpha _i w) \) can be expressed as:

$$\begin{aligned} \sigma _{\mathcal {U}_i(r_i)}(-\alpha _i w)= & {} -\alpha _i w^T\overline{u}_i +r _i \max \{ (-\alpha _i w)^Tv_i: \Vert v_i\Vert _{\infty } \le 1\} \\= & {} -\alpha _i w^T\overline{u}_i +r _i \Vert (-\alpha _i) w\Vert _1 \\= & {} -\alpha _i w^T\overline{u}_i +r _i \Vert w\Vert _1. \end{aligned}$$

Thus, the conclusion follows from Lemma 3.1. \(\square \)

4 A New Robust pq-SVM for Efficient Classification

In this section, we derive an efficient scheme for finding a robust classifier under box uncertainty by extending the approach in [14, 16, 27] and using a variable transformation and Lagrangian duality [18] to reformulate the robust SVM model \((SOCP_{r,\infty })\) into a simple non-negative quadratic program.

4.1 QP Reformulation via Lagrangian Duality

Recall that the problem \((SOCP_{r,\infty })\) can be equivalently rewritten as:

$$\begin{aligned}&{\displaystyle \min _{{(w,\gamma ,\xi )\in \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m}}}&\lambda _1 \Vert w\Vert _1 + \lambda _2 \Vert w \Vert _2^2 + e_m^T\xi \nonumber \\&\text {subject to}&\alpha _i(\bar{u}_i^Tw + \gamma ) + \xi _i - r_i\Vert w\Vert _1 \ge 1, i=1,2,\dots , m \nonumber \\&\xi \ge 0_m. \end{aligned}$$
(4.1)

Define vectors \(p,q\in \mathbb {R}^s_+\) by

$$\begin{aligned} p_i = {\left\{ \begin{array}{ll} 0, &{} \text {for } w_i \le 0 \\ w_i, &{} \text {for } w_i> 0, \end{array}\right. } \quad q_i = {\left\{ \begin{array}{ll} -w_i, &{} \text {for } w_i \le 0 \\ 0, &{} \text {for } w_i > 0, \end{array}\right. } \end{aligned}$$
(4.2)

for \(i=1,\dots ,s\). Then, it is easy to see that

$$\begin{aligned} w = p - q, \quad p, q \ge 0_s \ \text{ and } \ p^Tq=0. \end{aligned}$$

Consequently, we can rewrite \(\Vert w\Vert _2^2 \) and \(\Vert w\Vert _1\) as \(\Vert w\Vert _2^2 = \Vert p\Vert _2^2 + \Vert q\Vert _2^2\) and \(\Vert w\Vert _1 = e_s^T(p+q)\). So, the problem \((SOCP_{r,\infty })\) can be rewritten as:

$$\begin{aligned}&{ \displaystyle \min _{{(p,q,\gamma ,\xi )\in \mathbb {R}^s\times \mathbb {R}^s\times \mathbb {R}\times \mathbb {R}^m}}}&\lambda _1e_s^T(p+q)+ \lambda _2\Vert p\Vert _2^2 + \lambda _2 \Vert q\Vert _2^2 + e_m^T\xi \\&\text {subject to}&\alpha _i(\bar{u}_i^T(p-q) + \gamma ) + \xi _i - r_ie_s^T(p+q) \ge 1, \, i=1,2,\dots , m \\&p,q \ge 0_{s}, \ \xi \ge 0_m. \end{aligned}$$

Now, we define:

$$\begin{aligned} y = \begin{bmatrix} \xi \\ p \\ q \end{bmatrix} \in \mathbb {R}^{m+2s}, \quad C = \begin{bmatrix} 0_{m \times m} &{} &{} \\ &{} \lambda _2 I_s &{} \\ &{} &{} \lambda _2 I_s \end{bmatrix}, \text{ and } b = \begin{bmatrix} \frac{1}{\lambda _1} e_m \\ e_s \\ e_s \end{bmatrix}\ . \end{aligned}$$

Let

$$\begin{aligned} \hat{D} = \begin{bmatrix} I_{m} &{} 0_{m\times 2s} \\ 0_{2s\times m} &{} 0_{2s\times 2s} \end{bmatrix} \end{aligned}$$
(4.3)

and

$$\begin{aligned} \hat{U}= \begin{bmatrix} 0_{m \times m} &{} U_1 &{} U_2 \\ 0_{2s\times m} &{} 0_{2s\times s} &{} 0_{2s\times s} \end{bmatrix} \end{aligned}$$
(4.4)

where

$$\begin{aligned} U_1= \begin{bmatrix} \alpha _1 \overline{u}_1^T- r_1 e_s^T \\ \vdots \\ \alpha _m \overline{u}_m^T- r_m e_s^T \end{bmatrix} \in \mathbb {R}^{m \times s} \text{ and } U_2= \begin{bmatrix} -\alpha _1 \overline{u}_1^T- r_1 e_s^T \\ \vdots \\ -\alpha _m \overline{u}_m^T- r_m e_s^T \end{bmatrix} \in \mathbb {R}^{m \times s} \ . \end{aligned}$$

Define further that

$$\begin{aligned}&A = \begin{bmatrix} \alpha _1 &{} &{} \\ &{} \ddots &{} \\ &{} &{} \alpha _m \end{bmatrix} \in \mathbb {R}^{m\times m}, \quad \hat{A} = \begin{bmatrix} A^{-1} &{} \\ &{} 0_{2s \times 2s} \end{bmatrix}\in \mathbb {R}^{(m+2s)\times (m+2s)} \text{ and } \\&\quad \hat{e}= \begin{bmatrix} A e_m \\ 0_s \\ 0_s \end{bmatrix} \in \mathbb {R}^{m+2s}. \end{aligned}$$

Then, problem \((SOCP_{r,\infty })\) can be rewritten as the following convex quadratic programming problem:

$$\begin{aligned}&{\displaystyle \min _{{(y, \gamma )\in \mathbb {R}^{m+2s}\times \mathbb {R}}}}&y^T C y+ \lambda _1 b^T y \\&\text {subject to}&\hat{D}(\hat{U} y + \hat{e} \gamma )+y \ge \hat{A}\hat{e} \\&y \ge 0_{m+2s}, \end{aligned}$$

where \( b \ge 0\). By removing the linear term of the objective function via regularization, as in [14, 16, 27] and by regularizing \(\gamma \), we arrive at the regularized problem

$$\begin{aligned}&{ \displaystyle \min _{ {(y,\gamma )\in \mathbb {R}^{m+2s}\times \mathbb {R}}}}&\quad \frac{1}{2}y^TCy + \frac{1}{2}\gamma ^2 + \frac{\nu }{2} y^T y\\&\text {subject to}&\quad \hat{D}(\hat{U} y + \hat{e} \gamma )+y \ge \hat{A}\hat{e}, \end{aligned}$$

where \(\nu > 0\). Note that the regularization makes the non-negativity condition \(y \ge 0_{m+2s}\) redundant. The Lagrangian dual of the regularized problem is given by

$$\begin{aligned} { \displaystyle \max _{{(y,\gamma ,u)\in \mathbb {R}^{m+2s}\times \mathbb {R}\times \mathbb {R}^{m+2s}}}} \mathcal {L}(y,\gamma ,u) \ \text { subject to } \ \nabla _{y,\gamma }\mathcal {L}(y,\gamma ,u)=0, \ u \ge 0_{m+2s}, \end{aligned}$$

where

$$\begin{aligned} \mathcal {L}(y,\gamma ,u) = \frac{1}{2} y^T (C + \nu { I_{m+2s}}) y + \frac{1}{2}\gamma ^2 - u^T(\hat{D}(\hat{U} y + \hat{e} \gamma )+y - \hat{A}\hat{e}), \end{aligned}$$

and

$$\begin{aligned}&\nabla _y\mathcal {L}(y,\gamma ,u) = (C+\nu { I_{m+2s}})y - ( (\hat{D}\hat{U})^T + { I_{m+2s}})u = 0_{m+2s} \implies y \nonumber \\&\quad = (C + \nu { I_{m+2s}})^{-1}(\hat{D}\hat{U} + { I_{m+2s}})^Tu \end{aligned}$$
(4.5)
$$\begin{aligned}&\nabla _\gamma \mathcal {L}(y,\gamma ,u) = \gamma - (\hat{D}\hat{e})^Tu = 0 \implies \gamma = \hat{e}^T\hat{D}^Tu. \end{aligned}$$
(4.6)

Eliminating the equality constraints, the dual can be written as:

$$\begin{aligned}&{ \displaystyle \min _{u\in \mathbb {R}^{m+2s}}}&\frac{1}{2}u^T\left( (\hat{D}\hat{U} + I)(C + \nu { I_{m+2s}})^{-1}(\hat{D}\hat{U} + { I_{m+2s}})^T + \hat{D}\hat{e}\hat{e}^T\hat{D}^T \right) u - (\hat{A}\hat{e})^Tu \\&\text { subject to }&\ u \ge 0_{m+2s}. \end{aligned}$$

Define the matrix \(Q\in \mathbb {R}^{(m+2s)\times (m+2s)}\) and the vector \(\eta \in \mathbb {R}^{m+2s}\) by

$$\begin{aligned} Q = (\hat{D}\hat{U} + { I_{m+2s}})(C + \nu { I_{m+2s}})^{-1}(\hat{D}\hat{U} + { I_{m+2s}})^T + \hat{D}\hat{e}\hat{e}^T\hat{D}, \quad \eta = \hat{A}\hat{e}. \end{aligned}$$
(4.7)

Then, a direct verification shows that \((\hat{D}\hat{U} + { I_{m+2s}})^Td=0_{m+2s}\) if and only if \(d=0_{m+2s}\), and so, Q is positive definite. So, we arrive at a simple strictly convex quadratic programming problem over non-negative orthant:

$$\begin{aligned} (QP) \quad { \displaystyle \min _{u\in \mathbb {R}^{m+2s}}} \frac{1}{2}u^T Q u - \eta ^T u \ \text { subject to } \ u \ge 0_{m+2s}. \end{aligned}$$
(4.8)

Notice that (QP) no longer has the hyperparameter \(\lambda _1\) in its formulation. Having found a solution u of (4.8), we can then retrieve a solution to our original problem, via the dual equality constraints (4.5) and (4.6), i.e.,

$$\begin{aligned}&{ y = (C + \nu { I_{m+2s}})^{-1}(\hat{D}\hat{U} + I_{m+2s})^Tu}, \quad p = \begin{bmatrix} y_{m+1} \\ \vdots \\ y_{m+s} \end{bmatrix}, \quad q = \begin{bmatrix} y_{m+s+1} \\ \vdots \\ y_{m+2s} \end{bmatrix}, \\&\quad w = p - q, \quad \gamma = \hat{e}^T\hat{D}u. \end{aligned}$$
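The assembly of Q and \(\eta \) in (4.7) and the recovery of \((w,\gamma )\) above involve only elementary matrix operations. The following is a minimal NumPy sketch of these steps, following the definitions of \(\hat{D}\), \(\hat{U}\), \(\hat{A}\), \(\hat{e}\) and C given earlier; the function and variable names are ours and are purely illustrative.

```python
import numpy as np

def robust_pq_svm_matrices(ubar, alpha, r, lam2, nu):
    """Assemble Q and eta of (4.7) from the training data (a sketch)."""
    m, s = ubar.shape
    n = m + 2 * s
    # Rows of U1 are alpha_i*ubar_i^T - r_i*e_s^T; rows of U2 are -alpha_i*ubar_i^T - r_i*e_s^T.
    U1 = alpha[:, None] * ubar - r[:, None] * np.ones((m, s))
    U2 = -alpha[:, None] * ubar - r[:, None] * np.ones((m, s))
    Uhat = np.zeros((n, n))
    Uhat[:m, m:m + s] = U1
    Uhat[:m, m + s:] = U2
    Dhat = np.zeros((n, n))
    Dhat[:m, :m] = np.eye(m)
    ehat = np.concatenate([alpha, np.zeros(2 * s)])              # (A e_m, 0_s, 0_s)
    C = np.diag(np.concatenate([np.zeros(m), lam2 * np.ones(2 * s)]))
    G = Dhat @ Uhat + np.eye(n)                                  # D_hat U_hat + I
    Cinv = np.linalg.inv(C + nu * np.eye(n))                     # (C + nu I)^{-1}, diagonal
    Q = G @ Cinv @ G.T + np.outer(Dhat @ ehat, Dhat @ ehat)
    eta = np.concatenate([np.ones(m), np.zeros(2 * s)])          # A_hat e_hat = (e_m, 0_s, 0_s)
    return Q, eta, G, Cinv, Dhat, ehat

def recover_w_gamma(u, G, Cinv, Dhat, ehat, m, s):
    """Retrieve (w, gamma) from a dual solution u via (4.5)-(4.6)."""
    y = Cinv @ G.T @ u
    p, q = y[m:m + s], y[m + s:m + 2 * s]
    return p - q, float(ehat @ Dhat @ u)
```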

4.2 Efficient Iterative Scheme

To solve (QP), we propose a variation of the LSVM algorithm put forth in [34]. Recall from [34] that the point u is an optimal solution for (QP) if and only if

$$\begin{aligned} 0_{m+2s} \le u \perp (Qu -\eta ) \ge 0_{m+2s}. \end{aligned}$$

Notice that

$$\begin{aligned} 0_{m+2s} \le a \perp b \ge 0_{m+2s}, \, a,b \in \mathbb {R}^{m+2s} \iff \exists \alpha >0 \ \text{ such } \text{ that } \ a = \max (a-\alpha b, 0_{m+2s}), \end{aligned}$$
(4.9)

where the \(k^{th}\) component of \(\max (z, 0_{m+2s})\) is \(\max (z_k, 0)\), \(k=1,\ldots ,m+2s\). Unlike the LSVM Algorithm, where it is taken that \(a = Qu - \eta \) and \(b = u\), we propose instead to take \(a = u\) and \(b = Qu - \eta \). We therefore arrive at the optimality condition

$$\begin{aligned} u = \max (u - \alpha (Qu - \eta ), 0_{m+2s}), \ \alpha > 0 \end{aligned}$$

from which we derive the simple iterative scheme:

$$\begin{aligned} u^{(i+1)} = \max (u^{(i)} - \alpha (Qu^{(i)} - \eta ), 0_{m+2s}), \ i = 0, 1, \dots \end{aligned}$$
(4.10)

with a starting point given by \(u^{(0)} = Q^{-1}\eta \).

The difference between our proposed iterative scheme and the LSVM Algorithm is that we require the inversion of the matrix Q only once, to define the starting point for the iteration, whereas the LSVM Algorithm requires solving a linear system in Q at each step of the iteration. We also require only a single matrix multiplication per iteration.

The complete iterative algorithm required to return the optimal solution \(u^*\) of (QP) is given below, and its proof of (linear) convergence is given in Appendix C.

Algorithm 1: Iterative scheme (4.10) for solving (QP)

Notice that maxiter is a threshold for the maximum number of iterations, tol is the convergence tolerance of the algorithm, and \(\alpha > 0\) is a pre-selected constant.
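Since the pseudocode of Algorithm 1 is given as a figure, the following NumPy sketch shows the iteration (4.10) with the starting point \(u^{(0)}=Q^{-1}\eta \); the stopping rule based on the change in successive iterates is our assumption about how tol is applied.

```python
import numpy as np

def solve_qp_iterative(Q, eta, alpha_step, tol=1e-4, maxiter=1000):
    """Iterative scheme (4.10) for (QP): u <- max(u - alpha*(Q u - eta), 0)."""
    u = np.linalg.solve(Q, eta)                 # starting point u^(0) = Q^{-1} eta
    for _ in range(maxiter):
        u_next = np.maximum(u - alpha_step * (Q @ u - eta), 0.0)
        if np.linalg.norm(u_next - u) <= tol:   # assumed stopping rule on successive iterates
            return u_next
        u = u_next
    return u
```

Combined with the sketch in Section 4.1, one would call solve_qp_iterative on the matrices returned by robust_pq_svm_matrices and then recover \((w,\gamma )\) with recover_w_gamma.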

5 Experiments with Real-World Data Sets

In this section, we evaluate several different SVM models, derived from our above results, against some real-world datasets, which are all available from the UCI Machine Learning repository [13]. The aim is to compare the models in terms of accuracy, computational expense and feature selection.

5.1 Experimental Setup

Three datasets are used for the comparison of our models:

  • The Wisconsin Breast Cancer (Diagnostic) (WBCD) [13]: this dataset describes the (binary) classification problem of labelling tumours as either malignant or benign. The dataset contains 569 instances, 30 features and a 63/37 split of the two classes.

  • Cylinder Bands (CYLBN) [13]: this dataset describes the problem of mitigating delays known as “cylinder bands” in rotogravure printing. It contains 541 instances, 33 features and a 58/42 class split.

  • Pima Indians Diabetes (PIMA) [39]: this dataset describes the problem of classifying diabetes in patients. It contains 768 instances, 8 features and a 35/65 class split.

Each dataset had its features standardized and was then split into 80%/20% training and test sets. For each of our models, we then perform the following. We tune the model hyperparameters in their cross-validation range via fivefold cross-validation on the training set. The model is then fitted to the full training set, to obtain the optimal solution \(w^*, \gamma ^*\), and the training accuracy is recorded. We next determine the features selected by the model. This is done by considering the significance (weighting) of each feature in \(w^*\). More precisely, feature k is considered significant by a model if

$$\begin{aligned} \left| \frac{w_k^*}{\displaystyle \max _{1\le j \le s}{w_j^*}}\right| > 0.05. \end{aligned}$$
(5.1)

Having done this, we set the value of \(w_k^*\) to zero for each insignificant feature. Note that this is a stricter version of the feature selection methods employed in [14, 27]. Finally, the model predicts the classification of all datapoints \(u_i\) in the test set: \(\alpha _i := \mathrm{sign}((w^*)^Tu_i + \gamma ^*)\), and we record the test accuracy.
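A small NumPy sketch of this thresholding step is given below; we read the denominator in (5.1) as the largest weight in magnitude, which is an assumption on our part.

```python
import numpy as np

def apply_feature_selection(w_star, threshold=0.05):
    """Zero out weights deemed insignificant by criterion (5.1)."""
    ratio = np.abs(w_star) / np.max(np.abs(w_star))   # denominator: largest-magnitude weight
    keep = ratio > threshold
    return np.where(keep, w_star, 0.0), np.flatnonzero(keep)
```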

5.2 Classification Methods

We will apply the following classification methods to each of the datasets. For robust models, we will assume that the radius of robustness is constant for each datapoint: \(r_i = r\), \(i=1,\dots ,m\).

  • The standard SVM model (SVM). This method does not consider uncertainty in the datapoints. We refer to this method as SVM.

  • The ball uncertainty robust SVM model \((SOCP_{r,B})\). We refer to this method as Ball-SVM.

  • The box-uncertainty robust SVM model \((SOCP_{r,\infty })\). We refer to this method as Box-SVM.

  • The pq-robust SVM over box-uncertainty (QP), which we solved by Algorithm 1. We set \(\texttt {maxiter} = 1000\), \(\texttt {tol} = 10^{-4}\), and \(\alpha \) was tuned from \(10^{-16}\) to \(10^{-4}\). We refer to this method as Box-pq-SVM.

For our experiments, the cross-validation range for each hyperparameter was determined as follows. For the first three methods, \(\lambda _1\) and \(\lambda _2\) were tuned over values \(2^k\), \(k\in \{-10,\dots ,4\}\). (Note that SVM does not tune the parameter \(\lambda _1\), which is set to 0.) For the pq-SVM, \(\nu \) and \(\lambda _2\) were tuned over the range \(2^j\), \(j\in \{-10,-9,\dots ,10\}\). This is due to the sensitivity of the pq-SVM to hyperparameters.

For all of our robust methods (Ball-SVM, Box-SVM, and Box-pq -SVM), the radius of robustness was tuned as follows. We set a lower bound of \(2^{-20}\), and an upper bound given by a simple heuristic: for each datapoint in one class, we calculate the maximum distance from it to any point in the other class. We then take the minimum of these distances and finally use as our upper bound the maximum over the two classes. Formally, we can define our upper bound U as:

$$\begin{aligned} U = \max {\left\{ \min _{u\in \mathcal {A}}{\max _{v\in \mathcal {B}}{\Vert u-v\Vert _2}}, \min _{v\in \mathcal {B}}{\max _{u\in \mathcal {A}}{\Vert u-v\Vert _2}} \right\} }. \end{aligned}$$

Having obtained our upper bound on the radius, we then tuned our radius of robustness within an exponential range over this lower and upper bound.

Our choice of heuristic for the upper bound is justified as follows: for any radius of robustness larger than this value, consider a (non-trivial) true classifier. It is certain that every datapoint would simultaneously have a point in its uncertainty set on one side of the classifier, and another point in its uncertainty set on the other side. In this case, it is clear that the robust SVM will select a trivial solution, i.e., a majority class prediction.
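A minimal sketch of this heuristic, assuming SciPy is available for the pairwise Euclidean distances, is as follows.

```python
import numpy as np
from scipy.spatial.distance import cdist

def radius_upper_bound(A_pts, B_pts):
    """Heuristic upper bound U on the radius of robustness described above."""
    D = cdist(A_pts, B_pts)                  # D[i, j] = ||a_i - b_j||_2
    return max(D.max(axis=1).min(),          # min over u in A of max distance to points in B
               D.max(axis=0).min())          # min over v in B of max distance to points in A
```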

5.3 Results

All computations were performed using a 3.2GHz Intel(R) Core(TM) i7-8700 and 16GB of RAM, equipped with MATLAB R2019B. All optimization problems were solved via the MOSEK software [36], handled through the YALMIP interface [31].

Table 1 shows the results for each of the classification methods on each of the datasets. The columns are interpreted as follows:

  • Dataset: as written.

  • Instances: number of instances in the dataset.

  • Model: the classification method used.

  • Type: either nominal (non-robust) or robust.

  • Train Acc: Accuracy obtained on the training set, as a percentage.

  • Test Acc: Accuracy obtained on the test set, as a percentage.

  • Fit Time (s): CPU time (seconds) taken to solve the final optimization problem, after cross validation.

  • Features: the number of selected features from each model, out of the total number of features in the dataset.

Table 1 Results for each classification method on each dataset

5.4 Discussion

Overall, it is apparent from Table 1 that on all three datasets, for both testing accuracy and number of selected features, the standard SVM is outperformed by both the Box-SVM and Ball-SVM robust methods.

Box-SVM–Best performing robust method This method produces the highest testing accuracy on all three datasets. It is also consistently economical in its selection of features. We see it as the most consistent of the four methods.

Box-pq-SVM–Faster robust method This method is remarkably efficient for reasonably sized datasets, and much faster than our other two robust methods.

Robust methods in applications Regarding the choice of which of the three robust methods to use in an application, one should consider which subset of accuracy, feature selection and computational time is most desired. None of our robust methods outperforms the others on all three counts. We do note that the Box-pq-SVM method requires storage of an \((m+2s)\times (m+2s)\) matrix which, even if defined sparsely, can consume significant amounts of physical memory for very large datasets.

Finally, we can also compare our results to others in the recent literature on robust optimization methods for SVMs, namely [6]. On WBCD and PIMA, our methods achieve nearly identical accuracies to the feature-robust method presented in [6] (which is also designed for box uncertainty), with a slightly higher accuracy achieved by our methods on PIMA.

We do note that, whilst the heuristic we use for defining an upper bound on the tuning range for the radius of robustness is simple, it is both effective and computationally efficient even for very large datasets. Characterizing an upper bound on the radius within a mathematical framework remains a topic for further study.

6 Application to Characterization of Huntington Disease Onset

HD is a neurodegenerative movement disorder caused by a mutation in the HTT gene with motor, cognitive and psychiatric manifestations [1, 11, 33]. It is one of the most common monogenic neurological diseases in the developed world [1]. From its onset, typically in the fourth decade of life, the signs and symptoms of HD progress inexorably until certain death. No treatment currently available can alter this course.

Defining the onset of HD is of significant clinical and research importance. Studies tracking the progression of the motor manifestations of HD over time have shown that there is an acceleration in motor decline around the time of onset diagnosis [30, 32]. Thus, making a clinical diagnosis of HD onset heralds a poor prognosis for the patient and, in addition to carrying significant emotional weight, can have important implications for key life decisions such as family planning. Secondly, clearly defining disease onset is paramount for establishing endpoints in clinical trials where the efficacy of putative disease-modifying therapies in the future could be measured by their ability to delay onset in subjects harbouring the HTT mutation [25].

The current formal diagnosis of HD onset or “manifest HD” (mHD) is based on the motor manifestations of the disease according to the motor component of the Unified Huntington Disease Rating Scale (UHDRS). The UHDRS has been shown to have a high degree of internal consistency and interrater reliability [26]. It contains 31 items relating to characteristic motor abnormalities in HD, each scored from 0 (normal) to 4 (significantly abnormal), together with their sum, the total motor score (TMS). Based on these scores, the clinician assigns a “diagnostic confidence level” (DCL) between 0 (normal) and 4 (\(\ge \) 99% confidence that the motor abnormalities are unequivocal signs of HD), representing their confidence that the subject’s motor abnormalities are due to mHD. A diagnosis of mHD is made when the DCL is 4 [33]. However, there are currently no rules relating the scores from the 31 items and the TMS to the DCL, and so the DCL rating relies on the clinician’s expertise. Given the significance of making a diagnosis of HD onset, a standardized and objective means of arriving at the diagnosis from the motor assessment is needed [11, 33].

6.1 Conic Duality Methods

We applied the conic duality-based robust SVM methods to data from the Enroll-HD study consisting of 36,953 sets of motor scores and corresponding DCLs from patients with HD around the world. Subjects with a DCL of 4 were defined as having mHD, whilst those with a DCL less than 4 (i.e. 0, 1, 2 and 3) were defined as having non-manifest HD (nHD). There were 19,303 (52%) cases of mHD and 17,650 (48%) cases of nHD defined in this way, a roughly even split between the two groups, minimizing classification bias. Data from the 31 motor items from the UHDRS and the TMS (32 features in total) for these subjects were used as potential features to predict classification by conic duality-based robust SVM models.

Due to the physical memory limitations of our hardware and the size of the dataset, the Box-pq-SVM method was not applied to the Enroll-HD dataset.

6.2 Results

Both box and ball robust SVM models achieved similarly high accuracies of over 95% in both the training and testing phases (Table 2).

Table 2 Results for each classification method on Enroll-HD dataset

The features that were selected by each model are given in Table 3. The descriptions of all features are given in Appendix B.

Table 3 Motor features selected by the classifier found by each method

6.3 Discussion

In this application, we have used conic duality-based robust SVM methods to establish a highly accurate classification of HD disease status based solely on UHDRS motor scores.

Although the models achieved very high classification accuracies, not all features were selected, suggesting that the feature selection aspect of these models successfully eliminated unnecessary variables that may interfere with prediction. The Ball-SVM achieved a marginally higher classification accuracy than the Box-SVM at the expense of selecting more features. All of the features selected by the Box-SVM were also selected by the Ball-SVM.

The two features that were selected by neither model, vertical saccade velocity and right-sided arm rigidity, are members of two pairs of features examining identical aspects of motor function on different sides of the body (left/right) or in different planes (horizontal/vertical). It would not be surprising if only one feature from each pair were sufficient, and more efficient, for prediction. Alternatively, this may reflect an inherent asymmetry in HD [37, 44].

Notably, none of the five features relating to dystonia (involuntary muscle contractions) were selected by the Box-SVM model. This may reflect the particular disease phenotype or disease stage of the study population. In typical adult-onset HD, choreiform (“dance-like”) movements, which were features that were selected by both models, dominate early in the disease course, whereas dystonia is not prominent until the later stages [1].

Similarly, gait abnormality, which is a composite effect of multiple other motor abnormalities including late features like dystonia, was not selected by the Box-SVM model [43]. TMS was also not selected by this model, suggesting that it does not strongly influence classification when its component items are used as features.

7 Conclusion and Further Work

In this paper, by employing a support function approach and conic duality of convex optimization, we first presented a numerically tractable semi-definite program reformulation for a general robust SVM data classification problem, where the uncertainty sets are spectrahedra. A spectrahedron, which is an important generalization of a polyhedron, has played a key role in a range of fields, including algebraic geometry and semi-definite optimization. It also encompasses the most commonly used uncertainty sets of robust optimization models, such as boxes, balls and ellipsoids. We have shown that the conic duality-based robust SVMs with box and ball uncertainty sets achieved classification accuracies of over 95% on the large Enroll-HD dataset and provided important information about the features that characterize HD onset.

As an alternative to the second-order cone program reformulation of the robust SVM with box-uncertainty sets, we also presented a new efficient iterative scheme, Box-pq-SVM, for solving the robust SVM by reformulating it as a simple convex quadratic optimization problem via Lagrangian duality. We have demonstrated through computational studies on a range of datasets that these robust classification methods allow for greater classification accuracies than conventional support vector machines, in addition to selecting groups of highly correlated features.

Further work is planned to examine the generalizability of the Enroll-HD robust classifier on other HD datasets with the aim of producing a simple and reliable clinical decision support tool to aid in the identification of patients with manifest HD. It would also be of interest to study how the Box-pq -SVM approach can be extended to treating large-scale data sets.