Abstract
The emergence of deep learning frameworks paves the way for higher-level data abstractions and has the potential to consolidate supervised and unsupervised learning paradigms. Researchers have explored deep learning successfully in many areas, with applications in face recognition, text mining, language translation, image prediction, and action recognition. Kernel machines act as a bridge between linearity and nonlinearity for many machine learning algorithms such as support vector machines, extreme learning machines, and core vector machines. These Kernel machines map data from the input space to a Kernel-induced high-dimensional feature space in which the distribution of data points is more amenable to the classification problem under consideration. The Kernel trick transforms any machine learning algorithm that requires only inner product computations between data vectors into a Kernel-based approach by selecting an appropriate Kernel function. In Kernel-based approaches, the Kernel function thus computes the inner products between the transformed data vectors in an implicitly defined Kernel-induced feature space. Unlike neural networks, Kernel machines guarantee structural risk minimization and globally optimal solutions. Kernel machines are also theoretically tractable and perform well in practical applications. These properties motivated researchers to combine the emerging trends of deep learning with Kernel methods to build deep Kernel machines.
Researchers integrate Kernel methods and deep learning networks to retain their advantages and compensate for their limitations, and then apply the resulting deep Kernel learning approaches to improve the performance of learning algorithms in different applications. Deep Kernel machines can be built by integrating Kernel methods and deep learning architectures in several ways: using Kernel machines as the final classifier of a deep learning network, Kernelizing deep neural networks for better feature enrichment, and using deep or multiple Kernels in different tasks. This survey provides an overview of the different approaches to building deep Kernel learning architectures that enhance the properties of learning algorithms and their performance in practical applications.
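The Kernel trick mentioned in the abstract can be illustrated with a small sketch. It is not taken from the survey: the homogeneous degree-2 polynomial Kernel, the explicit feature map `phi`, and the two sample vectors are assumptions chosen only because the feature map is small enough to write out by hand. The sketch checks that evaluating the Kernel in the input space gives the same value as an inner product in the Kernel-induced feature space.

```python
import math


def dot(u, v):
    """Plain inner product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for k(x, z) = (x . z)^2 on 2-D inputs (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same quantity computed in the input space, without building phi."""
    return dot(x, z) ** 2


# Hypothetical sample vectors, used only for this illustration.
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)    # Kernel evaluation in the 2-D input space

print(explicit, implicit)
```

For this Kernel the feature space has only three dimensions, so the explicit route is still cheap. For Kernels such as the Gaussian Kernel the induced feature space is infinite-dimensional, and only the implicit computation is practical.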
Acknowledgements
This work has been supported by Kerala State Council for Science, Technology and Environment (KSCSTE) under the Fellowship No. 48/FSHP/2016/KSCSTE.
Cite this article
Nikhitha, N.K., Afzal, A.L. & Asharaf, S. Deep Kernel machines: a survey. Pattern Anal Applic 24, 537–556 (2021). https://doi.org/10.1007/s10044-020-00933-1
DOI: https://doi.org/10.1007/s10044-020-00933-1