Abstract
In this paper, we propose a new algorithm that extends the support vector machine (SVM) for binary classification to multicategory classification. The proposed method is based on sequential binary classification. In a forward step, we classify a target class by excluding the possibility of labeling it as any other class; in a backward step, we remove the already classified classes and repeat the same procedure for the remaining ones. The proposed algorithm relies on SVM for each binary classification and uses only the feasible data at each step; therefore, the method guarantees convergence and entails a light computational burden. We prove Fisher consistency of the proposed forward–backward SVM (FB-SVM) and obtain a stochastic bound for the predicted misclassification rate. We conduct extensive simulations and analyze real-world data to demonstrate the superior performance of FB-SVM; for example, FB-SVM achieves a classification accuracy much higher than the current standard for predicting conversion from mild cognitive impairment to Alzheimer's disease.
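As a rough illustration of the sequential scheme described above, the following sketch fits one binary SVM per class in turn and keeps only the samples that the current classifier rejects, assigning the last remaining class by default. The class ordering, the unweighted `LinearSVC` stand-in, and the synthetic data are simplified placeholders for illustration only, not the paper's exact weighted FB-SVM procedure.

```python
# Minimal sketch of a sequential (forward-step) multicategory scheme:
# peel off one class at a time with a binary SVM, restricting each
# subsequent fit to samples the current classifier rejects.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC


def fit_sequential_svm(X, y, class_order):
    """Fit one binary SVM per class (except the last) in the given order."""
    models = []
    mask = np.ones(len(y), dtype=bool)  # samples still in play
    for c in class_order[:-1]:
        ybin = np.where(y[mask] == c, 1, -1)  # class c vs. the rest
        clf = LinearSVC(C=1.0).fit(X[mask], ybin)
        models.append((c, clf))
        # keep only samples that are not class c AND are rejected by clf
        keep = (y[mask] != c) & (clf.decision_function(X[mask]) < 0)
        idx = np.where(mask)[0][keep]
        mask = np.zeros(len(y), dtype=bool)
        mask[idx] = True
    return models, class_order[-1]


def predict_sequential(models, default_class, X):
    """Assign the first class whose SVM accepts the point; else the default."""
    pred = np.full(len(X), default_class)
    undecided = np.ones(len(X), dtype=bool)
    for c, clf in models:
        accept = undecided & (clf.decision_function(X) > 0)
        pred[accept] = c
        undecided &= ~accept
    return pred


X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
models, default = fit_sequential_svm(X, y, class_order=[0, 1, 2])
acc = (predict_sequential(models, default, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Note the structural point this sketch shares with FB-SVM: each binary problem is solved only on the data still feasible at that step, so later classifiers never see samples already assigned earlier in the sequence.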
References
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 101(473), 138–156 (2006)
Bredensteiner, E.J., Bennett, K.P.: Multicategory classification by support vector machines. In: Computational Optimization. Springer, pp. 53–79 (1999)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2002)
Cui, Y., Liu, B., Luo, S., Zhen, X., Fan, M., Liu, T., Zhu, W., Park, M., Jiang, T., Jin, J.S., et al.: Identification of conversion from mild cognitive impairment to Alzheimer’s disease using multivariate predictors. PLoS ONE 6(7), e21896 (2011)
Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
Dogan, U., Glasmachers, T., Igel, C.: A unified view on multi-class support vector classification. J. Mach. Learn. Res. 17, 1–32 (2016)
Hill, S.I., Doucet, A.: A framework for kernel-based multi-category classification. J. Artif. Intell. Res. (JAIR) 30, 525–564 (2007)
Kreßel, U.H.G.: Pairwise classification and support vector machines. In: Advances in Kernel Methods. MIT Press, pp. 255–268 (1999)
Lauer, F., Guermeur, Y.: MSVMpack: a multi-class support vector machine package. J. Mach. Learn. Res. 12, 2293–2296 (2011)
Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)
Liu, Y.: Fisher consistency of multicategory support vector machines. In: International Conference on Artificial Intelligence and Statistics, pp. 291–298 (2007)
Liu, Y., Shen, X.: Multicategory \(\psi \)-learning. J. Am. Stat. Assoc. 101(474), 500–509 (2006)
Liu, Y., Yuan, M.: Reinforced multicategory support vector machines. J. Comput. Graph. Stat. 20(4), 901–919 (2011)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Tewari, A., Bartlett, P.L.: On the consistency of multiclass classification methods. J. Mach. Learn. Res. 8, 1007–1025 (2007)
Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
Weiner, M.W., Aisen, P.S., Jack, C.R., Jagust, W.J., Trojanowski, J.Q., Shaw, L., Saykin, A.J., Morris, J.C., Cairns, N., Beckett, L.A., et al.: The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimer’s Dementia 6(3), 202–211 (2010)
Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pp. 219–224 (1999)
Zhang, T.: Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 5, 1225–1251 (2004)
Acknowledgements
This work is supported by NIH Grants R01GM124104, NS073671, NS082062, NUL1 RR025747, Alzheimer’s Disease Neuroimaging Initiative (ADNI) (U01 AG024904, DOD ADNI, W81XWH-12-2-0012), and a pilot award from the Gillings Innovation Lab at the University of North Carolina. The authors acknowledge the investigators within the ADNI who contributed to the design and implementation of ADNI.
Appendices
Proof of Theorem 3.1
We start from class label k and follow the order in FB-SVM. First, we show \({{\mathcal {D}}}^*({\varvec{X}})=k\) if and only if \(P_k({\varvec{X}})=\max _{h=1}^k P_h({\varvec{X}})\). For any \({\varvec{X}}\) with \({{\mathcal {D}}}^*({\varvec{X}})=k\), by the definition of \({{\mathcal {D}}}^*\), there exists a permutation \((j_1,\ldots ,j_{k-1})\) of \(\{1,\ldots ,k-1\}\) such that \({{\mathcal {D}}}^{*(k)}_l({\varvec{X}})=-1\) for \(l=j_1,\ldots ,j_{k-1}\). That is,
On the other hand, from the estimation of \({\widehat{f}}_{j_1}\), it is clear that \(f^{*}_{j_1}\) is the minimizer of the expectation of the weighted hinge loss corresponding to \(V_{n,j_1}\), which is given by
Simple algebra following standard SVM theory gives
That is, \(f^*_{j_1}({\varvec{X}})<0\) is equivalent to \(P_{j_1}({\varvec{X}})<1/k\). Now, in the next step, because we restrict to data with \(Y_i\ne j_1\) and \(f^*_{j_1}(X_i)<0\), it is clear that \(f^{*}_{j_2}\) minimizes
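For reference, the standard weighted-hinge fact being invoked can be stated generically as follows; the exact weights are determined by the definition of \(V_{n,j_1}\) in the main text, and the weights \(w_{+}\), \(w_{-}\) below are schematic placeholders.

```latex
% Population minimizer of a class-weighted hinge loss:
% with weight w_+ on class j_1 and w_- on the rest, the minimizer
% f^* thresholds the class probability at w_- / (w_+ + w_-), so
% choosing w_+ / w_- = k - 1 produces the 1/k threshold used above.
\[
  f^{*}(\boldsymbol{x}) < 0
  \;\Longleftrightarrow\;
  w_{+}\,P_{j_1}(\boldsymbol{x}) < w_{-}\bigl(1 - P_{j_1}(\boldsymbol{x})\bigr)
  \;\Longleftrightarrow\;
  P_{j_1}(\boldsymbol{x}) < \frac{w_{-}}{w_{+}+w_{-}}
  = \frac{1}{k}
  \quad\text{when } \frac{w_{+}}{w_{-}} = k-1 .
\]
```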
Thus, we conclude that
That is, \(f^*_{j_2}({\varvec{X}})<0\) if and only if
We continue the same arguments and establish the relationship between \(f^{*}_{j_l}\) and \(P_{j_l}\) as
In other words, we obtain that for this subject with \(f_{j_1}^*({\varvec{X}})<0, \ldots , f_{j_{k-1}}^*({\varvec{X}})<0\), it holds that
Starting from the last inequality, we have
so we obtain
Therefore, \(P_{k}({\varvec{X}})=\max _{l=1}^k P_{l}({\varvec{X}})\).
For the other direction, suppose that \(P_k({\varvec{X}})=\max _{l=1}^k P_l({\varvec{X}})\). We order \(P_1({\varvec{X}}),\ldots ,P_{k-1}({\varvec{X}})\) to obtain a sequence with \(P_{j_1}({\varvec{X}})\le P_{j_2}({\varvec{X}})\le \cdots \le P_{j_{k-1}}({\varvec{X}})\). We consider the corresponding classification rules in the same order. Because all inequalities in (A.2) and (A.3) hold, the equivalence between \(f^*_{j_l}\) and \(P_{j_l}\) makes it straightforward to see that
In other words, \({{\mathcal {D}}}^*({\varvec{X}})=k\). Hence, we have proved that the FB-SVM rule agrees with the Bayes rule in classifying subjects into class k.
To prove consistency for the remaining classes, recall that FB-SVM obtains the rule for class \((k-1)\) conditional on \(Y\ne k\) and \({{\mathcal {D}}}^*({\varvec{X}})\ne k\). Using the same argument as above, we conclude that
where \({\widetilde{P}}_{k-1,l}({\varvec{X}})\) is the conditional probability of \(Y=l\) given \(X={\varvec{X}}\), \(Y\ne k\), and \({{\mathcal {D}}}^*(X)\ne k\). Clearly, \({\widetilde{P}}_{k-1,l}({\varvec{X}})\) is proportional to \(P_l({\varvec{X}})\) for \(l=1,\ldots ,k-1\). Moreover, \({{\mathcal {D}}}^*({\varvec{X}})\ne k\) implies that \(P_k({\varvec{X}})\) cannot be the maximum. Therefore,
That is,
We continue this proof for the remaining classes and finally obtain Theorem 3.1.
Proof of Theorem 3.2
We first examine the difference
Clearly,
From Theorem 3.1, for any \({\varvec{x}}\) in the domain of \({\varvec{X}}\), we let \(j_1({\varvec{x}}), j_2({\varvec{x}}), \ldots , j_{k-1}({\varvec{x}})\) be the permutation of \(\{1,\ldots ,k-1\}\) such that
Then, \({{\mathcal {D}}}^*({\varvec{x}})= k\) implies that \(f_{j_l({\varvec{x}})}^*({\varvec{x}})<0\) for every \(l=1,\ldots ,k-1\). On the other hand, \(\widehat{{\mathcal {D}}}({\varvec{x}})\ne k\) implies that, for this particular permutation, there exists some \(l\in \{1,\ldots ,k-1\}\) such that \({\widehat{f}}_{j_l}({\varvec{x}})>0\), so that \(\widehat{f}_{j_l}({\varvec{x}})f_{j_l}^*({\varvec{x}})<0\). Therefore, we obtain
Hence, it suffices to bound each term on the right-hand side of the above inequality.
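Schematically, the union bound just described takes the following form; this display is a restatement of the containment argument above, not the paper's exact inequality (which involves the difference \(\Delta_k\) rather than this raw misclassification probability).

```latex
% If D-hat(X) != k while D*(X) = k, at least one of the k-1 fitted
% classifiers disagrees in sign with its population limit, so
\[
  P\bigl(\widehat{\mathcal {D}}(\boldsymbol{X})\neq k,\;
         \mathcal {D}^{*}(\boldsymbol{X})=k\bigr)
  \;\le\;
  \sum_{l=1}^{k-1}
  P\bigl(\widehat{f}_{j_l}(\boldsymbol{X})\,
         f^{*}_{j_l}(\boldsymbol{X})<0\bigr).
\]
```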
When \(l=1\), under conditions (C.1)–(C.4), Theorem 8.25 in Steinwart and Christmann [16] gives a constant \(C_1\) such that, with probability at least \(1-3e^{-\epsilon }\),
where
with any constant \(\epsilon _0>0\) and \(d/(d+\tau )<p<2\). According to Lemma 5 in Bartlett et al. [2] and condition (C.2), this gives
where \(\alpha =q/(1+q)\).
When \(l=2\), because \(Z_{ij_2}\) is no longer defined if \(Z_{ij_1}=1\), we extend the definition by setting \(Z_{ij_2}=1\) whenever \(Z_{ij_1}=1\). We then consider the following minimization
which is equivalent to minimizing
and
Thus, it is obvious that the optimal estimator for g, denoted by \({\widehat{g}}\), is given as
Similarly, the optimal estimator that minimizes the limit is given as
We then apply to \({\widehat{g}}\) the same arguments used by Steinwart and Christmann [16] to prove Theorem 8.25 and obtain
with probability at least \(1-3e^{-\epsilon }\) for a constant \(C_2\). The second term on the right-hand side arises because the estimation is conditional on a random set with \(Z_{ij_1}\widehat{f}_{j_1}({\varvec{X}}_i)>0\). On the other hand, from the previous result at \(l=1\), this term is bounded by \(C_1Q_n(\epsilon )\) with probability at least \(1-3e^{-\epsilon }\). We conclude that, with probability at least \(1-6e^{-\epsilon }\), it holds that
for \(C_3=C_2(1+C_1)\). From the fact that \({\widehat{g}}=g^*=1\) if \(Z_{j_1}=1\), we have that with a probability at least \(1-3e^{-\epsilon }\),
Thus, Lemma 5 in Bartlett et al. [2] gives
We continue the same arguments for \(l=3,\ldots ,k-1\) to obtain
with probability at least \(1-3l e^{-\epsilon }\). Hence, with probability at least \(1-[3k(k-1)/2]e^{-\epsilon }\), \(\Delta _k\le CQ_n(\epsilon )^{\alpha }\) for a constant \(C\).
Similarly, we can examine the difference
We follow exactly the same arguments as before, considering all possible permutations of \(\{1,\ldots ,k-2\}\) and \(l=1,\ldots ,k-2\). The only difference is that the random set is now restricted to subjects with \(Y_{i}\ne k\) and \(\widehat{{\mathcal {D}}}^{(k)}=-1\); however, by the previous conclusion, the probability of the latter event differs from that of \(Y_i\ne k\) and \({{\mathcal {D}}}^{*(k)}=-1\) by at most \(CQ_n(\epsilon )\). Therefore, with probability at least \(1-[3k(k-1)/2+3(k-1)(k-2)/2]e^{-\epsilon }\), we obtain \(\Delta _{k-1}\le CQ_n(\epsilon )^{\alpha }\) for another constant \(C\). We continue the same arguments for \(\Delta _{l}, l=k-2,\ldots ,1\), where \(\Delta _l=P(Y=l, \widehat{{\mathcal {D}}}({\varvec{X}})\ne l)-P(Y=l, {{\mathcal {D}}}^*({\varvec{X}})\ne l)\). Finally, combining all these results, we conclude that
with a probability at least \(1-C'e^{-\epsilon }\), where \(C'\) is a constant depending on k. Theorem 3.2 holds.
Cite this article
Zhou, X., Wang, Y. & Zeng, D. Multicategory Classification Via Forward–Backward Support Vector Machine. Commun. Math. Stat. 8, 319–339 (2020). https://doi.org/10.1007/s40304-019-00179-2