Abstract
In this paper, we propose a new algorithm that extends the support vector machine (SVM) for binary classification to multicategory classification. The proposed method is based on sequential binary classification. In a forward step, we classify a target class by excluding the possibility of labeling it as any other class; in a backward step, we remove the already classified classes and repeat the same procedure for the remaining ones. The proposed algorithm relies on SVM for each binary classification and uses only the feasible data at each step; therefore, the method guarantees convergence and entails a light computational burden. We prove Fisher consistency of the proposed forward–backward SVM (FB-SVM) and obtain a stochastic bound for the predicted misclassification rate. We conduct extensive simulations and analyze real-world data to demonstrate the superior performance of FB-SVM; for example, FB-SVM achieves a classification accuracy much higher than the current standard for predicting conversion from mild cognitive impairment to Alzheimer's disease.
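As a rough illustration of the sequential scheme described above, the following sketch fits one binary SVM per class in turn and keeps only the samples that the current classifier rejects, assigning the last remaining class by default. The class ordering, the unweighted `LinearSVC` stand-in, and the synthetic data are simplified placeholders for illustration only, not the paper's exact weighted FB-SVM procedure.

```python
# Minimal sketch of a sequential (forward-step) multicategory scheme:
# peel off one class at a time with a binary SVM, restricting each
# subsequent fit to samples the current classifier rejects.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC


def fit_sequential_svm(X, y, class_order):
    """Fit one binary SVM per class (except the last) in the given order."""
    models = []
    mask = np.ones(len(y), dtype=bool)  # samples still in play
    for c in class_order[:-1]:
        ybin = np.where(y[mask] == c, 1, -1)  # class c vs. the rest
        clf = LinearSVC(C=1.0).fit(X[mask], ybin)
        models.append((c, clf))
        # keep only samples that are not class c AND are rejected by clf
        keep = (y[mask] != c) & (clf.decision_function(X[mask]) < 0)
        idx = np.where(mask)[0][keep]
        mask = np.zeros(len(y), dtype=bool)
        mask[idx] = True
    return models, class_order[-1]


def predict_sequential(models, default_class, X):
    """Assign the first class whose SVM accepts the point; else the default."""
    pred = np.full(len(X), default_class)
    undecided = np.ones(len(X), dtype=bool)
    for c, clf in models:
        accept = undecided & (clf.decision_function(X) > 0)
        pred[accept] = c
        undecided &= ~accept
    return pred


X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
models, default = fit_sequential_svm(X, y, class_order=[0, 1, 2])
acc = (predict_sequential(models, default, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Note the structural point this sketch shares with FB-SVM: each binary problem is solved only on the data still feasible at that step, so later classifiers never see samples already assigned earlier in the sequence.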
References
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101(473), 138–156 (2006)
Bredensteiner, E.J., Bennett, K.P.: Multicategory classification by support vector machines. In: Computational Optimization. Springer, pp. 53–79 (1999)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2002)
Cui, Y., Liu, B., Luo, S., Zhen, X., Fan, M., Liu, T., Zhu, W., Park, M., Jiang, T., Jin, J.S., et al.: Identification of conversion from mild cognitive impairment to Alzheimer’s disease using multivariate predictors. PLoS ONE 6(7), e21896 (2011)
Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
Dogan, U., Glasmachers, T., Igel, C.: A unified view on multi-class support vector classification. J. Mach. Learn. Res. 17, 1–32 (2016)
Hill, S.I., Doucet, A.: A framework for kernel-based multi-category classification. J. Artif. Intell. Res. (JAIR) 30, 525–564 (2007)
Kreßel, U.H.G.: Pairwise classification and support vector machines. In: Advances in Kernel Methods. MIT Press, pp. 255–268 (1999)
Lauer, F., Guermeur, Y.: MSVMpack: a multi-class support vector machine package. J. Mach. Learn. Res. 12, 2293–2296 (2011)
Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)
Liu, Y.: Fisher consistency of multicategory support vector machines. In: International Conference on Artificial Intelligence and Statistics, pp. 291–298 (2007)
Liu, Y., Shen, X.: Multicategory \(\psi \)-learning. J. Am. Stat. Assoc. 101(474), 500–509 (2006)
Liu, Y., Yuan, M.: Reinforced multicategory support vector machines. J. Comput. Graph. Stat. 20(4), 901–919 (2011)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Tewari, A., Bartlett, P.L.: On the consistency of multiclass classification methods. J. Mach. Learn. Res. 8, 1007–1025 (2007)
Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
Weiner, M.W., Aisen, P.S., Jack, C.R., Jagust, W.J., Trojanowski, J.Q., Shaw, L., Saykin, A.J., Morris, J.C., Cairns, N., Beckett, L.A., et al.: The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimer’s Dementia 6(3), 202–211 (2010)
Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pp. 219–224 (1999)
Zhang, T.: Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 5, 1225–1251 (2004)
Acknowledgements
This work is supported by NIH Grants R01GM124104, NS073671, NS082062, NUL1 RR025747, Alzheimer’s Disease Neuroimaging Initiative (ADNI) (U01 AG024904, DOD ADNI, W81XWH-12-2-0012), and a pilot award from the Gillings Innovation Lab at the University of North Carolina. The authors acknowledge the investigators within the ADNI who contributed to the design and implementation of ADNI.
Appendices
Proof of Theorem 3.1
We start from class label k and follow the order in FB-SVM. First, we show \({{\mathcal {D}}}^*({\varvec{X}})=k\) if and only if \(P_k({\varvec{X}})=\max _{h=1}^k P_h({\varvec{X}})\). For any \({\varvec{X}}\) with \({{\mathcal {D}}}^*({\varvec{X}})=k\), by the definition of \({{\mathcal {D}}}^*\), there exists a permutation \((j_1,\ldots ,j_{k-1})\) of \(\{1,\ldots ,k-1\}\) such that \({{\mathcal {D}}}^{*(k)}_l({\varvec{X}})=-1\) for \(l=j_1,\ldots ,j_{k-1}\). That is,
On the other hand, from the estimation of \({\widehat{f}}_{j_1}\), it is clear that \(f^{*}_{j_1}\) is the minimizer of the expectation of the weighted hinge loss corresponding to \(V_{n,j_1}\), which is given by
Simple algebra following standard SVM theory gives
That is, \(f^*_{j_1}({\varvec{X}})<0\) is equivalent to \(P_{j_1}({\varvec{X}})<1/k\). Now, in the next step, because we restrict to data with \(Y_i\ne j_1\) and \(f^*_{j_1}(X_i)<0\), it is clear that \(f^{*}_{j_2}\) minimizes
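For reference, the standard weighted-hinge fact being invoked can be stated generically as follows; the exact weights are determined by the definition of \(V_{n,j_1}\) in the main text, and the weights \(w_{+}\), \(w_{-}\) below are schematic placeholders.

```latex
% Population minimizer of a class-weighted hinge loss:
% with weight w_+ on class j_1 and w_- on the rest, the minimizer
% f^* thresholds the class probability at w_- / (w_+ + w_-), so
% choosing w_+ / w_- = k - 1 produces the 1/k threshold used above.
\[
  f^{*}(\boldsymbol{x}) < 0
  \;\Longleftrightarrow\;
  w_{+}\,P_{j_1}(\boldsymbol{x}) < w_{-}\bigl(1 - P_{j_1}(\boldsymbol{x})\bigr)
  \;\Longleftrightarrow\;
  P_{j_1}(\boldsymbol{x}) < \frac{w_{-}}{w_{+}+w_{-}}
  = \frac{1}{k}
  \quad\text{when } \frac{w_{+}}{w_{-}} = k-1 .
\]
```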
Thus, we conclude that
That is, \(f^*_{j_2}({\varvec{X}})<0\) if and only if
We continue the same arguments and establish the relationship between \(f^{*}_{j_l}\) and \(P_{j_l}\) as
In other words, we obtain that for this subject with \(f_{j_1}^*({\varvec{X}})<0, \ldots , f_{j_{k-1}}^*({\varvec{X}})<0\), it holds that
Starting from the last inequality, we have
so we obtain
Therefore, \(P_{k}({\varvec{X}})=\max _{l=1}^k P_{l}({\varvec{X}})\).
For the other direction, suppose that \(P_k({\varvec{X}})=\max _{l=1}^k P_l({\varvec{X}})\). We order \(P_1({\varvec{X}}),\ldots ,P_{k-1}({\varvec{X}})\) to obtain a sequence with \(P_{j_1}({\varvec{X}})\le P_{j_2}({\varvec{X}})\le \cdots \le P_{j_{k-1}}({\varvec{X}})\). We consider the corresponding classification rules in the same order. Because all inequalities in (A.2) and (A.3) hold, the equivalence between \(f^*_{j_l}\) and \(P_{j_l}\) makes it straightforward to see that
In other words, \({{\mathcal {D}}}^*({\varvec{X}})=k\). Hence, we have proved that the FB-SVM rule agrees with the Bayes rule in classifying subjects into class k.
To prove consistency for the remaining classes, recall that FB-SVM obtains the rule for class \((k-1)\) conditional on \(Y\ne k\) and \({{\mathcal {D}}}^*({\varvec{X}})\ne k\). Using the same argument as above, we conclude that
where \({\widetilde{P}}_{k-1,l}({\varvec{X}})\) is the conditional probability of \(Y=l\) given \(X={\varvec{X}}\), \(Y\ne k\), and \({{\mathcal {D}}}^*(X)\ne k\). Clearly, \({\widetilde{P}}_{k-1,l}({\varvec{X}})\) is proportional to \(P_l({\varvec{X}})\) for \(l=1,\ldots ,k-1\). Moreover, \({{\mathcal {D}}}^*({\varvec{X}})\ne k\) implies that \(P_k({\varvec{X}})\) cannot be the maximum. Therefore,
That is,
We continue this proof for the remaining classes and finally obtain Theorem 3.1.
Proof of Theorem 3.2
We first examine the difference
Clearly,
From Theorem 3.1, for any \({\varvec{x}}\) in the domain of \({\varvec{X}}\), we let \(j_1({\varvec{x}}), j_2({\varvec{x}}), \ldots , j_{k-1}({\varvec{x}})\) be the permutation of \(\{1,\ldots ,k-1\}\) such that
Then, \({{\mathcal {D}}}^*({\varvec{x}})= k\) implies that \(f_{j_l({\varvec{x}})}^*({\varvec{x}})<0\) for every \(l=1,\ldots ,k-1\). On the other hand, \(\widehat{{\mathcal {D}}}({\varvec{x}})\ne k\) implies that, for this particular permutation, there exists some \(l\in \{1,\ldots ,k-1\}\) such that \({\widehat{f}}_{j_l}({\varvec{x}})>0\), so that \(\widehat{f}_{j_l}({\varvec{x}})f_{j_l}^*({\varvec{x}})<0\). Therefore, we obtain
Hence, it suffices to bound each term on the right-hand side of the above inequality.
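Schematically, the union bound just described takes the following form; this display is a restatement of the containment argument above, not the paper's exact inequality (which involves the difference \(\Delta_k\) rather than this raw misclassification probability).

```latex
% If D-hat(X) != k while D*(X) = k, at least one of the k-1 fitted
% classifiers disagrees in sign with its population limit, so
\[
  P\bigl(\widehat{\mathcal {D}}(\boldsymbol{X})\neq k,\;
         \mathcal {D}^{*}(\boldsymbol{X})=k\bigr)
  \;\le\;
  \sum_{l=1}^{k-1}
  P\bigl(\widehat{f}_{j_l}(\boldsymbol{X})\,
         f^{*}_{j_l}(\boldsymbol{X})<0\bigr).
\]
```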
When \(l=1\), under conditions (C.1)–(C.4), Theorem 8.25 in Steinwart and Christmann [16] gives a constant \(C_1\) such that, with probability at least \(1-3e^{-\epsilon }\),
where
with any constant \(\epsilon _0>0\) and \(d/(d+\tau )<p<2\). According to Lemma 5 in Bartlett et al. [2] and condition (C.2), this gives
where \(\alpha =q/(1+q)\).
When \(l=2\), because \(Z_{ij_2}\) is no longer defined if \(Z_{ij_1}=1\), we extend the definition by setting \(Z_{ij_2}=1\) whenever \(Z_{ij_1}=1\). We then consider the following minimization
which is equivalent to minimizing
and
Thus, it is obvious that the optimal estimator for g, denoted by \({\widehat{g}}\), is given as
Similarly, the optimal estimator that minimizes the limit is given as
We then apply to \({\widehat{g}}\) the same arguments used by Steinwart and Christmann [16] to prove Theorem 8.25 and obtain
with probability at least \(1-3e^{-\epsilon }\) for a constant \(C_2\). The second term on the right-hand side arises because the estimation is conditional on a random set with \(Z_{ij_1}\widehat{f}_{j_1}({\varvec{X}}_i)>0\). On the other hand, from the previous result at \(l=1\), this term is bounded by \(C_1Q_n(\epsilon )\) with probability at least \(1-3e^{-\epsilon }\). We conclude that, with probability at least \(1-6e^{-\epsilon }\), it holds that
for \(C_3=C_2(1+C_1)\). From the fact that \({\widehat{g}}=g^*=1\) if \(Z_{j_1}=1\), we have that with a probability at least \(1-3e^{-\epsilon }\),
Thus, Lemma 5 in Bartlett et al. [2] gives
We continue the same arguments for \(l=3,\ldots ,k-1\) to obtain
with probability at least \(1-3l e^{-\epsilon }\). Hence, with probability at least \(1-[3k(k-1)/2]e^{-\epsilon }\), \(\Delta _k\le CQ_n(\epsilon )^{\alpha }\) for a constant \(C\).
Similarly, we can examine the difference
We follow exactly the same arguments as before, considering all possible permutations of \(\{1,\ldots ,k-2\}\) and \(l=1,\ldots ,k-2\). The only difference is that the random set is now restricted to subjects with \(Y_{i}\ne k\) and \(\widehat{{\mathcal {D}}}^{(k)}=-1\); however, by the previous conclusion, the probability of the latter event differs from that of \(Y_i\ne k\) and \({{\mathcal {D}}}^{*(k)}=-1\) by at most \(CQ_n(\epsilon )\). Therefore, with probability at least \(1-[3k(k-1)/2+3(k-1)(k-2)/2]e^{-\epsilon }\), we obtain \(\Delta _{k-1}\le CQ_n(\epsilon )^{\alpha }\) for another constant \(C\). We continue the same arguments for \(\Delta _{l}, l=k-2,\ldots ,1\), where \(\Delta _l=P(Y=l, \widehat{{\mathcal {D}}}({\varvec{X}})\ne l)-P(Y=l, {{\mathcal {D}}}^*({\varvec{X}})\ne l)\). Finally, combining all these results, we conclude that
with a probability at least \(1-C'e^{-\epsilon }\), where \(C'\) is a constant depending on k. Theorem 3.2 holds.
Cite this article
Zhou, X., Wang, Y. & Zeng, D. Multicategory Classification Via Forward–Backward Support Vector Machine. Commun. Math. Stat. 8, 319–339 (2020). https://doi.org/10.1007/s40304-019-00179-2