Multicategory Classification Via Forward–Backward Support Vector Machine

Communications in Mathematics and Statistics

Abstract

In this paper, we propose a new algorithm to extend support vector machine (SVM) for binary classification to multicategory classification. The proposed method is based on a sequential binary classification algorithm. We first classify a target class by excluding the possibility of labeling as any other classes using a forward step of sequential SVM; we then exclude the already classified classes and repeat the same procedure for the remaining classes in a backward step. The proposed algorithm relies on SVM for each binary classification and utilizes only feasible data in each step; therefore, the method guarantees convergence and entails light computational burden. We prove Fisher consistency of the proposed forward–backward SVM (FB-SVM) and obtain a stochastic bound for the predicted misclassification rate. We conduct extensive simulations and analyze real-world data to demonstrate the superior performance of FB-SVM, for example, FB-SVM achieves a classification accuracy much higher than the current standard for predicting conversion from mild cognitive impairment to Alzheimer’s disease.
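The sequential scheme described above can be sketched in a few lines of code. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the class ordering (class_order) and the helper names are hypothetical, scikit-learn's SVC with balanced class weights stands in for the step-specific weighted hinge loss analyzed in the Appendix (which uses weights \((m-1)/m\) and \(1/m\) when \(m\) classes remain), and the forward and backward passes are collapsed into a single sweep in which a test point is assigned to the first class whose decision value is nonnegative and otherwise receives the last remaining class.

```python
# Schematic sketch of a forward-backward-style sequential binary-SVM classifier.
# Illustration only; function and argument names are hypothetical.
import numpy as np
from sklearn.svm import SVC

def sequential_svm_fit(X, y, class_order, C=1.0, kernel="rbf"):
    """Fit one binary SVM per step; class_order lists classes from the one
    peeled off first to the one assigned last."""
    models = []
    mask = np.ones(len(y), dtype=bool)           # feasible training data for this step
    for j in class_order[:-1]:
        yj = np.where(y[mask] == j, 1, -1)       # class j versus the rest
        if (yj == 1).sum() == 0 or (yj == -1).sum() == 0:
            continue                             # degenerate step; skip it
        clf = SVC(C=C, kernel=kernel, class_weight="balanced")
        clf.fit(X[mask], yj)
        models.append((j, clf))
        # keep only samples whose label is not j and that the rule also calls "not j"
        keep = (y[mask] != j) & (clf.decision_function(X[mask]) < 0)
        idx = np.flatnonzero(mask)
        mask = np.zeros(len(y), dtype=bool)
        mask[idx[keep]] = True
        if mask.sum() == 0:                      # nothing left to classify
            break
    return models, class_order[-1]

def sequential_svm_predict(models, last_class, X):
    """Assign the first class whose decision value is nonnegative; points that
    survive every step receive the last remaining class."""
    pred = np.full(X.shape[0], last_class, dtype=object)
    undecided = np.ones(X.shape[0], dtype=bool)
    for j, clf in models:
        hit = undecided & (clf.decision_function(X) >= 0)
        pred[hit] = j
        undecided &= ~hit
    return pred
```

The feature retained in this sketch is that each step is trained only on the data still feasible after the previous steps, which is what keeps the per-step binary problems small and the overall procedure convergent, as noted above.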

References

  1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)

  2. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101(473), 138–156 (2006)

  3. Bredensteiner, E.J., Bennett, K.P.: Multicategory classification by support vector machines. In: Computational Optimization. Springer, pp. 53–79 (1999)

  4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)

  5. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2002)

  6. Cui, Y., Liu, B., Luo, S., Zhen, X., Fan, M., Liu, T., Zhu, W., Park, M., Jiang, T., Jin, J.S., et al.: Identification of conversion from mild cognitive impairment to Alzheimer’s disease using multivariate predictors. PLoS ONE 6(7), e21896 (2011)

  7. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)

  8. Dogan, U., Glasmachers, T., Igel, C.: A unified view on multi-class support vector classification. J. Mach. Learn. Res. 17, 1–32 (2016)

  9. Hill, S.I., Doucet, A.: A framework for kernel-based multi-category classification. J. Artif. Intell. Res. (JAIR) 30, 525–564 (2007)

  10. Kreßel, U.H.G.: Pairwise classification and support vector machines. In: Advances in Kernel Methods. MIT Press, pp. 255–268 (1999)

  11. Lauer, F., Guermeur, Y.: MSVMpack: a multi-class support vector machine package. J. Mach. Learn. Res. 12, 2293–2296 (2011)

  12. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)

  13. Liu, Y.: Fisher consistency of multicategory support vector machines. In: International Conference on Artificial Intelligence and Statistics, pp. 291–298 (2007)

  14. Liu, Y., Shen, X.: Multicategory \(\psi \)-learning. J. Am. Stat. Assoc. 101(474), 500–509 (2006)

  15. Liu, Y., Yuan, M.: Reinforced multicategory support vector machines. J. Comput. Graph. Stat. 20(4), 901–919 (2011)

  16. Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)

  17. Tewari, A., Bartlett, P.L.: On the consistency of multiclass classification methods. J. Mach. Learn. Res. 8, 1007–1025 (2007)

  18. Vapnik, V.N., Vapnik, V.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)

  19. Weiner, M.W., Aisen, P.S., Jack, C.R., Jagust, W.J., Trojanowski, J.Q., Shaw, L., Saykin, A.J., Morris, J.C., Cairns, N., Beckett, L.A., et al.: The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimer’s Dementia 6(3), 202–211 (2010)

  20. Weston, J., Watkins, C., et al.: Support vector machines for multi-class pattern recognition. ESANN 99, 219–224 (1999)

  21. Zhang, T.: Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 5, 1225–1251 (2004)

Acknowledgements

This work is supported by NIH Grants R01GM124104, NS073671, NS082062, NUL1 RR025747, Alzheimer’s Disease Neuroimaging Initiative (ADNI) (U01 AG024904, DOD ADNI, W81XWH-12-2-0012), and a pilot award from the Gillings Innovation Lab at the University of North Carolina. The authors acknowledge the investigators within the ADNI who contributed to the design and implementation of ADNI.

Author information

Correspondence to Donglin Zeng.

Appendices

Proof of Theorem 3.1

We start from class label k and follow the order in FB-SVM. First, we show \({{\mathcal {D}}}^*({\varvec{X}})=k\) if and only if \(P_k({\varvec{X}})=\max _{h=1}^k P_h({\varvec{X}})\). For any \({\varvec{X}}\) with \({{\mathcal {D}}}^*({\varvec{X}})=k\), by the definition of \({{\mathcal {D}}}^*\), there exists a permutation \((j_1,\ldots ,j_{k-1})\) of \(\{1,\ldots ,k-1\}\) such that \({{\mathcal {D}}}^{*(k)}_l({\varvec{X}})=-1\) for \(l=j_1,\ldots ,j_{k-1}\). That is,

$$\begin{aligned} f^{*}_{j_1}({\varvec{X}})<0,\quad f^{*}_{j_2}({\varvec{X}})<0,\dots ,f^{*}_{j_{k-1}}({\varvec{X}})<0. \end{aligned}$$
(A.1)

On the other hand, from the estimation of \({\widehat{f}}_{j_1}\), it is clear that \(f^{*}_{j_1}\) is the minimizer of the expectation of a weighted hinge loss corresponding to \(V_{n,j_1}\), which is given by

$$\begin{aligned}&E\left[ \frac{k-1}{k}I(Y=j_1)[1-f({\varvec{X}})]_{+}+\frac{1}{k}I(Y\ne j_1)[1+f({\varvec{X}})]_{+}\right] \\&\quad = E\left[ \frac{k-1}{k} P_{j_1}(X)[1-f({\varvec{X}})]_{+} +\frac{1}{k}(1-P_{j_1}(X))[1+f({\varvec{X}})]_{+}\right] . \end{aligned}$$

Simple algebra following standard SVM theory gives

$$\begin{aligned} \text {sign}(f^{*}_{j_1}({\varvec{X}}))= \text {sign}\left( P_{j_1}({\varvec{X}}) (k-1)-(1-P_{j_1}({\varvec{X}}))\right) =\text {sign}\left( P_{j_1}({\varvec{X}})- \frac{1}{k}\right) . \end{aligned}$$
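To spell out the algebra, fix \({\varvec{X}}\) and write \(a=\frac{k-1}{k}P_{j_1}({\varvec{X}})\) and \(b=\frac{1}{k}\{1-P_{j_1}({\varvec{X}})\}\) (shorthand introduced only for this remark). The pointwise objective \(a[1-f]_{+}+b[1+f]_{+}\) is piecewise linear in \(f\), attains its minimum on \([-1,1]\), and on that interval equals \(a+b+f(b-a)\); hence the minimizer is \(f=1\) when \(a>b\) and \(f=-1\) when \(a<b\), so that

$$\begin{aligned} \text {sign}(f^{*}_{j_1}({\varvec{X}}))=\text {sign}(a-b)=\text {sign}\left( P_{j_1}({\varvec{X}})(k-1)-\{1-P_{j_1}({\varvec{X}})\}\right) , \end{aligned}$$

which is the display above.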

That is, \(f^*_{j_1}({\varvec{X}})<0\) is equivalent to \(P_{j_1}({\varvec{X}})<1/k\). Now, in the next step, because we restrict to data with \(Y_i\ne j_1\) and \(f^*_{j_1}(X_i)<0\), it is clear that \(f^{*}_{j_2}\) minimizes

$$\begin{aligned}&E\left[ \frac{k-2}{k-1}I(Y=j_2)[1-f({\varvec{X}})]_{+}+\frac{1}{k-1}I(Y\ne j_2)[1+f({\varvec{X}})]_{+}\right. \\&\quad \qquad \left. \Big | Y\ne j_1, f^*_{j_1}(X)<0\right] \\&\quad = E\left[ \frac{1}{k-1} \frac{1}{1-P_{j_1}(X)}\left\{ P_{j_2}(X) (k-2)[1-f({\varvec{X}})]_{+}\right. \right. \\&\qquad \qquad \left. \left. +\,(1-P_{j_1}(X)-P_{j_2}(X))[1+f({\varvec{X}})]_{+} \right\} \Big |P_{j_1}(X)<1/k\right] . \end{aligned}$$

Thus, we conclude that

$$\begin{aligned} \text {sign}(f^{*}_{j_2}({\varvec{X}}))=\text {sign} \left( P_{j_2}({\varvec{X}}) (k-2)-(1-P_{j_1}({\varvec{X}})-P_{j_2}({\varvec{X}}))\right) \times I\{P_{j_1}({\varvec{X}})<1/k\} . \end{aligned}$$

That is, \(f^*_{j_2}({\varvec{X}})<0\) if and only if

$$\begin{aligned} \frac{P_{j_2}({\varvec{X}})}{1-P_{j_1}({\varvec{X}})}<\frac{1}{k-1}. \end{aligned}$$
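Indeed, on the event \(\{P_{j_1}({\varvec{X}})<1/k\}\), the sign expression for \(f^{*}_{j_2}({\varvec{X}})\) displayed above is negative precisely when

$$\begin{aligned} P_{j_2}({\varvec{X}})(k-2)<1-P_{j_1}({\varvec{X}})-P_{j_2}({\varvec{X}}), \quad \text {that is,}\quad P_{j_2}({\varvec{X}})(k-1)<1-P_{j_1}({\varvec{X}}), \end{aligned}$$

which rearranges to the stated inequality.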

We continue the same arguments and establish the relationship between \(f^{*}_{j_l}\) and \(P_{j_l}\) as

$$\begin{aligned} \text {sign}(f^{*}_{j_l}({\varvec{X}}))=\text {sign}\left( \frac{P_{j_l}({\varvec{X}})}{\sum _{h=l}^{k-1} P_{j_h}({\varvec{X}})+P_{k}({\varvec{X}})}-\frac{1}{k-l+1}\right) . \end{aligned}$$

In other words, we obtain that for this subject with \(f_{j_1}^*({\varvec{X}})<0, \ldots , f_{j_{k-1}}^*({\varvec{X}})<0\), it holds that

$$\begin{aligned}&P_{j_1}({\varvec{X}})<\frac{1}{k},\quad \frac{P_{j_2}({\varvec{X}})}{1-P_{j_1}({\varvec{X}})} <\frac{1}{k-1},\cdots , \end{aligned}$$
(A.2)
$$\begin{aligned}&\frac{P_{j_{k-2}}({\varvec{X}})}{P_{j_{k-2}}({\varvec{X}})+P_{j_{k-1}}({\varvec{X}})+P_{k}({\varvec{X}})}<\frac{1}{3}, \quad \frac{P_{j_{k-1}}({\varvec{X}})}{P_{j_{k-1}}({\varvec{X}})+P_{k}({\varvec{X}})}<\frac{1}{2} . \end{aligned}$$
(A.3)

Starting from the last inequality, we have

$$\begin{aligned}&P_{j_{k-1}}({\varvec{X}})<P_{k}({\varvec{X}}),\quad P_{j_{k-2}}({\varvec{X}})< \frac{1}{3}\left( P_{j_{k-2}}({\varvec{X}})+P_{j_{k-1}}({\varvec{X}})+P_{k}({\varvec{X}})\right) , \\&\quad \dots , P_{j_2}({\varvec{X}})<\frac{\sum _{h=2}^{k-1}P_{j_h}({\varvec{X}})+P_{k}({\varvec{X}})}{k-1},\quad P_{j_1}({\varvec{X}})<\frac{1}{k}, \end{aligned}$$

and hence we obtain

$$\begin{aligned} P_{j_{k-1}}({\varvec{X}})<P_{k}({\varvec{X}}), \quad P_{j_{k-2}}({\varvec{X}})<P_{k}({\varvec{X}}), \dots , P_{j_2}({\varvec{X}})<P_{k}({\varvec{X}}), \quad P_{j_1}({\varvec{X}})<\frac{1}{k}. \end{aligned}$$

Therefore, \(P_{k}({\varvec{X}})=\max _{l=1}^k P_{l}({\varvec{X}})\).

For the other direction, suppose that \(P_k({\varvec{X}})=\max _{l=1}^k P_l({\varvec{X}})\). We order \(P_1({\varvec{X}}),\ldots ,P_{k-1}({\varvec{X}})\) to obtain a sequence with \(P_{j_1}({\varvec{X}})\le P_{j_2}({\varvec{X}})\le \cdots \le P_{j_{k-1}}({\varvec{X}})\le P_{k}({\varvec{X}})\). We consider the corresponding classification rules for the same sequence. Because all inequalities in (A.2) and (A.3) hold, from the equivalence between \(f^*_{j_l}\) and \(P_{j_l}\), it is straightforward to see that

$$\begin{aligned} f^*_{j_1}({\varvec{X}})<0, \ldots , f^*_{j_{k-1}}({\varvec{X}})<0. \end{aligned}$$

In other words, \({{\mathcal {D}}}^*({\varvec{X}})=k\). Hence, the FB-SVM rule agrees with the Bayes rule in classifying subjects into class k.

To prove consistency for the remaining classes, note that FB-SVM obtains the rule for class \(k-1\) conditional on \(Y\ne k\) and \({{\mathcal {D}}}^*({\varvec{X}})\ne k\). Using the same argument as above, we conclude that

$$\begin{aligned} {{{\mathcal {D}}}}^*({\varvec{X}})=(k-1) \text { if and only if } (k-1)=\text {argmax}_{l=1}^{k-1}{{\widetilde{P}}}_{k-1,l}({\varvec{X}}), \end{aligned}$$

where \({\widetilde{P}}_{k-1,l}({\varvec{X}})\) is the conditional probability of \(Y=l\) given \({\varvec{X}}\), \(Y\ne k\), and \({{\mathcal {D}}}^*({\varvec{X}})\ne k\). Clearly, \({\widetilde{P}}_{k-1,l}({\varvec{X}})=P_l({\varvec{X}})/\{1-P_k({\varvec{X}})\}\) on this event, so it is proportional to \(P_l({\varvec{X}})\) for \(l=1,\ldots ,k-1\). Moreover, \({{\mathcal {D}}}^*({\varvec{X}})\ne k\) implies that \(P_k({\varvec{X}})\) cannot be the maximum. Therefore,

$$\begin{aligned} (k-1)=\text {argmax}_{l=1}^{k-1}{\widetilde{P}}_{k-1,l}({\varvec{X}})= \text {argmax}_{l=1}^{k}P_{l}({\varvec{X}}). \end{aligned}$$

That is,

$$\begin{aligned} {{{\mathcal {D}}}}^*({\varvec{X}})=(k-1) \text { if and only if } (k-1)= {\text {argmax}}_{l=1}^{k}P_{l}({\varvec{X}}). \end{aligned}$$

We continue this proof for the remaining classes and finally obtain Theorem 3.1.

Proof of Theorem 3.2

We first examine the difference

$$\begin{aligned} \Delta _k=P(Y=k, \widehat{{\mathcal {D}}}({\varvec{X}})\ne k)-P(Y=k, \mathcal{D}^{*}({\varvec{X}})\ne k). \end{aligned}$$

Clearly,

$$\begin{aligned} \Delta _k\le P(Y=k, \widehat{{\mathcal {D}}}({\varvec{X}})\ne k, {{\mathcal {D}}}^*({\varvec{X}})=k). \end{aligned}$$

From Theorem 3.1, for any \({\varvec{x}}\) in the domain of \({\varvec{X}}\), we let \(j_1({\varvec{x}}), j_2({\varvec{x}}), \ldots , j_{k-1}({\varvec{x}})\) be the permutation of \(\{1,\ldots ,k-1\}\) such that

$$\begin{aligned} P(Y=j_1({\varvec{x}})|{\varvec{X}}={\varvec{x}})<\cdots <P(Y=j_{k-1}({\varvec{x}})|{\varvec{X}}={\varvec{x}}). \end{aligned}$$

Then, \({{\mathcal {D}}}^*({\varvec{x}})= k\) implies that \(f_{j_l({\varvec{x}})}^*({\varvec{x}})<0\) for any \(l=1,\ldots ,k-1\). On the other hand, \(\widehat{{\mathcal {D}}}({\varvec{x}})\ne k\) implies that for this particular permutation, there exists some \(l=1,\ldots ,k-1\) such that \({\widehat{f}}_{j_l}({\varvec{x}})>0\) and hence \(\widehat{f}_{j_l}({\varvec{x}})f_{j_l}^*({\varvec{x}})<0\). Therefore, we obtain

$$\begin{aligned} \Delta _k&\le P\left( \cup _{j}\left\{ Y=k, \text {there exists some} \, l=1,\ldots ,k-1 \text { such that} \, \widehat{f}_{j_l}({\varvec{X}})f_{j_l}^*({\varvec{X}})<0 \right\} \right) \\&\le \sum _jP\left( Y=k, \text {there exists some} \, l=1,\ldots ,k-1 \, \text {such that} \, {\widehat{f}}_{j_l}({\varvec{X}})f_{j_l}^*({\varvec{X}})<0 \right) \\&\le \sum _j\sum _{l=1}^{k-1} P\left( Z_{j_1}=-1,\ldots ,Z_{j_{k-1}}=-1, \widehat{f}_{j_l}({\varvec{X}})f_{j_l}^*({\varvec{X}})<0\right) . \end{aligned}$$

(The last step uses the fact that \(Y=k\) implies \(Z_{j_1}=\cdots =Z_{j_{k-1}}=-1\), together with a union bound over \(l\).) Hence, it suffices to bound each term on the right-hand side of the above inequality.

When \(l=1\), under conditions (C.1)–(C.4), it follows from Theorem 8.25 in Steinwart and Christmann [16] that there exists a constant \(C_1\) such that, with probability at least \(1-3e^{-\epsilon }\),

$$\begin{aligned} P(Z_{j_1}{\widehat{f}}_{j_1}({\varvec{X}})<0)-P(Z_{j_1}f_{j_1}^*({\varvec{X}})<0)\le C_1Q_n(\epsilon ), \end{aligned}$$

where

$$\begin{aligned} Q_{n}(\epsilon )=\left\{ \lambda _n^{\frac{\tau }{2+\tau }} \sigma _n^{-\frac{d\tau }{d+\tau }}+\sigma _n^{-\beta } +\epsilon \left( n\lambda _n^{p}\sigma _n^{\frac{1-p}{1+\epsilon _0d}} \right) ^{-\frac{q+1}{q+2-p}}\right\} \end{aligned}$$

with any constant \(\epsilon _0>0\) and \(d/(d+\tau )<p<2\). According to Lemma 5 in Bartlett et al. [2] and condition (C.2), this gives

$$\begin{aligned} P({\widehat{f}}_{j_1}({\varvec{X}}) f_{j_1}^*({\varvec{X}})<0)\le [C_1Q_n(\epsilon )]^{\alpha }, \end{aligned}$$

where \(\alpha =q/(1+q)\).

When \(l=2\), because \(Z_{ij_2}\) is not defined when \(Z_{ij_1}=1\), we extend the definition by setting \(Z_{ij_2}=1\) whenever \(Z_{ij_1}=1\). We then consider the following minimization

$$\begin{aligned} n^{-1}\sum _{i=1}^n I(Z_{ij_1}{\widehat{f}}_{j_1}({\varvec{X}}_i)>0) (1-Z_{ij_2} g(Z_{ij_1}, {\varvec{X}}_i))_++\lambda _n (\Vert g(1, {\varvec{x}})\Vert +\Vert g(-1, {\varvec{x}})\Vert ), \end{aligned}$$

which is equivalent to minimizing

$$\begin{aligned} n^{-1}\sum _{i=1}^n I(Z_{ij_1}=1, {\widehat{f}}_{j_1}({\varvec{X}}_i)>0) (1-g(1, {\varvec{X}}_i))_++\lambda _n \Vert g(1, {\varvec{x}})\Vert \end{aligned}$$

and

$$\begin{aligned} n^{-1}\sum _{i=1}^n I(Z_{ij_1}=-1, {\widehat{f}}_{j_1}({\varvec{X}}_i)<0) (1-Z_{ij_2}g(-1, {\varvec{X}}_i))_+ +\lambda _n \Vert g(-1, {\varvec{x}})\Vert . \end{aligned}$$

Thus, the optimal estimator for g, denoted by \({\widehat{g}}\), is given by

$$\begin{aligned} {\widehat{g}}(1, {\varvec{x}})=1, \ \ {\widehat{g}}(-1, {\varvec{x}})={\widehat{f}}_{j_2}({\varvec{x}}). \end{aligned}$$

Similarly, the optimal estimator that minimizes the limit is given as

$$\begin{aligned} g^*(1, {\varvec{x}})=1, \ \ g^*(-1, {\varvec{x}})=f_{j_2}^*({\varvec{x}}). \end{aligned}$$

We then apply to \({\widehat{g}}\) the same arguments used by Steinwart and Christmann [16] to prove Theorem 8.25 and obtain

$$\begin{aligned}&P(Z_{j_2} {\widehat{g}}(Z_{j_1}, {\varvec{X}})<0)-P(Z_{j_2}g^*(Z_{j_1},{\varvec{X}})<0) \\&\quad \le C_2\left\{ Q_n(\epsilon ) +|P(Z_{j_1}\widehat{f}_{j_1}({\varvec{X}})>0)-P(Z_{j_1}f_{j_1}^*({\varvec{X}})>0)|\right\} \end{aligned}$$

with probability at least \(1-3e^{-\epsilon }\) for a constant \(C_2\). The second term on the right-hand side arises because the estimation is conditional on the random set \(\{i: Z_{ij_1}\widehat{f}_{j_1}({\varvec{X}}_i)>0\}\). On the other hand, from the previous result for \(l=1\), this term is bounded by \(C_1Q_n(\epsilon )\) with probability at least \(1-3e^{-\epsilon }\). We conclude that with probability at least \(1-6e^{-\epsilon }\), it holds that

$$\begin{aligned} P(Z_{j_2} {\widehat{g}}(Z_{j_1}, {\varvec{X}})<0)-P(Z_{j_2}g^*(Z_{j_1},{\varvec{X}})<0)\le C_3 Q_n(\epsilon ) \end{aligned}$$

for \(C_3=C_2(1+C_1)\). From the fact that \({\widehat{g}}=g^*=1\) if \(Z_{j_1}=1\), we have that with probability at least \(1-6e^{-\epsilon }\),

$$\begin{aligned}&P(Z_{j_1}=-1, Z_{j_2}{\widehat{f}}_{j_2}({\varvec{X}})<0) -P(Z_{j_1}=-1, Z_{j_2}f_{j_2}^*({\varvec{X}})<0) \le C_3 Q_n(\epsilon ). \end{aligned}$$

Thus, Lemma 5 in Bartlett et al. [2] gives

$$\begin{aligned} P(Z_{j_1}=-1, {\widehat{f}}_{j_2}({\varvec{X}})f_{j_2}^*({\varvec{X}})<0)\le [C_3Q_n(\epsilon )]^{\alpha }. \end{aligned}$$

We continue the same arguments for \(l=3,\ldots ,k-1\) to obtain

$$\begin{aligned}&E\left[ I\left\{ Z_{j_l}{\widehat{f}}_{j_l}({\varvec{X}})<0, Z_{j_{l-1}}=-1,\ldots , Z_{j_1}=-1\right\} \right. \\&\quad \left. -\,I\left\{ Z_{j_l}f_{j_l}^*({\varvec{X}})<0, Z_{j_{l-1}}=-1,\ldots , Z_{j_1}=-1\right\} \right] \le C_lQ_n(\epsilon ) \end{aligned}$$

with probability at least \(1-3l e^{-\epsilon }\). Hence, with probability at least \(1-[3k(k-1)/2]e^{-\epsilon }\), \(\Delta _k\le CQ_n(\epsilon )^{\alpha }\) for a constant C.

Similarly, we can examine the difference

$$\begin{aligned} \Delta _{k-1}=P(Y=k-1, \widehat{{\mathcal {D}}}({\varvec{X}})\ne k-1)-P(Y=k-1, {{\mathcal {D}}}^*({\varvec{X}})\ne k-1). \end{aligned}$$

We follow exactly the same arguments as before by considering all possible permutations of \(\{1,\ldots ,k-2\}\) and \(l=1,\ldots ,k-2\). The only difference in the argument is that the random set is restricted to subjects with \(Y_{i}\ne k\) and \(\widehat{{\mathcal {D}}}^{(k)}=-1\). However, the probability of the latter differs from the probability of the event \(\{Y_i\ne k, {{\mathcal {D}}}^{*(k)}=-1\}\) by at most \(CQ_n(\epsilon )\), from the previous conclusion. Therefore, we obtain that with probability at least \(1-[3k(k-1)/2+3(k-1)(k-2)/2]e^{-\epsilon }\), \(\Delta _{k-1}\le CQ_n(\epsilon )^{\alpha }\) for another constant C. We continue the same arguments for \(\Delta _{l}, l=k-2,\ldots ,1\), where \(\Delta _l=P(Y=l, \widehat{{\mathcal {D}}}({\varvec{X}})\ne l)-P(Y=l, {{\mathcal {D}}}^*({\varvec{X}})\ne l).\) Finally, by combining all these results, we conclude that

$$\begin{aligned} P(Y\ne \widehat{{\mathcal {D}}}({\varvec{X}}))\le P(Y\ne \mathcal{D}^*({\varvec{X}}))+CQ_n(\epsilon )^{\alpha } \end{aligned}$$

with probability at least \(1-C'e^{-\epsilon }\), where \(C'\) is a constant depending on k. This proves Theorem 3.2.

Cite this article

Zhou, X., Wang, Y. & Zeng, D. Multicategory Classification Via Forward–Backward Support Vector Machine. Commun. Math. Stat. 8, 319–339 (2020). https://doi.org/10.1007/s40304-019-00179-2
