Abstract
Active learning (AL) is a major area of interest within the field of machine learning, especially when labeled instances are difficult, time-consuming or expensive to obtain. In this paper, we review active learning methods for manifold data, in which the intrinsic manifold structure of the data is incorporated into the active learning query strategies. In addition, we present a new manifold-based active learning algorithm for Gaussian process classification. This new method uses a data-dependent kernel derived from a semi-supervised model that considers both labeled and unlabeled data. The method regularizes the smoothness of the fitted function with respect to both the ambient space and the manifold on which the data lie. The regularization parameter is treated as an additional kernel (covariance) parameter and estimated from the data, allowing the kernel to adapt to the manifold geometry of the given dataset. In our empirical experiments, comparisons with other AL methods for manifold data show faster learning performance for the proposed method. MATLAB code that reproduces all examples is provided as supplementary material.
Notes
This two-spirals experiment was conducted using the library developed by Stefano Melacci, available at http://www.dii.unisi.it/~melacci/lapsvmp/index.html.
This example is generated using the code provided by Cai and He (2012).
Here we abuse the f notation to represent a latent variable instead of the function to be learned as used in Sect. 2.
This is the number of labeled instances required for an algorithm to achieve the specified accuracy.
The regular perceptron update consists of the simple rule: if \((x_t,y_t)\) is misclassified, then \(w_{t+1}=w_{t}+y_t x_t\), where w is the weight vector. For a linear classifier, this update rule moves the classification boundary in the right direction as new instances arrive. For a detailed theoretical study, see the classic reference Rosenblatt (1958).
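The update rule in this note can be sketched in a few lines. This is a minimal illustrative implementation (not code from the paper's supplementary material), assuming data that are linearly separable through the origin; the zero initialization and repeated-epoch loop are implementation choices:

```python
import numpy as np

def perceptron(X, y, n_epochs=10):
    """Classic perceptron: X is (n, d), y has labels in {-1, +1}.
    Returns the learned weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            # misclassified (or on the boundary): apply w <- w + y_t * x_t
            if y_t * (w @ x_t) <= 0:
                w = w + y_t * x_t
    return w
```

Each correction rotates the separating hyperplane toward the misclassified instance, which is the geometric intuition behind the rule quoted above.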
It is also known as the out-of-sample error, which is a measure of accuracy of a model on the unseen instances.
References
Alaeddini A, Craft E, Meka R, Martinez S (2019) Sequential Laplacian regularized V-optimal design of experiments for response surface modeling of expensive tests: an application in wind tunnel testing. IISE Trans. https://doi.org/10.1080/24725854.2018.1508928
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Atlas LE, Cohn DA, Ladner RE (1990) Training connectionist networks with queries and selective sampling. In: Touretzky DS (ed) Advances in neural information processing systems, vol 2. Morgan-Kaufmann, Burlington, pp 566–573
Balcan MF, Beygelzimer A, Langford J (2009) Agnostic active learning. J Comput Syst Sci 75(1):78–89
Belkin M (2003) Problems of learning on manifolds. Ph.D. thesis, The University of Chicago
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6):1373–1396
Belkin M, Niyogi P (2005) Towards a theoretical foundation for Laplacian-based manifold methods. In: Proceedings of conference on learning theory
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
Cai D, He X (2012) Manifold adaptive experimental design for text categorization. IEEE Trans Knowl Data Eng 24(4):707–719
Chaudhuri K, Kakade SM, Netrapalli P, Sanghavi S (2015) Convergence rates of active learning for maximum likelihood estimation. Adv Neural Inf Process Syst 28:1090–1098
Chen C, Chen Z, Bu J, Wang C, Zhang L, Zhang C (2010) G-optimal design with Laplacian regularization. In: Proceedings of the twenty-fourth AAAI conference on artificial intelligence, vol 1, pp 413–418
Chu W, Ghahramani Z (2005) Preference learning with Gaussian processes. In: Proceedings of the 22nd international conference on machine learning, ICML’05. ACM, New York, NY, USA, pp 137–144. https://doi.org/10.1145/1102351.1102369
Chu W, Sindhwani V, Ghahramani Z, Keerthi SS (2007) Relational learning with Gaussian processes. In: Proceedings of the 19th international conference on neural information processing systems, pp 289–296
Cohn D (1994) Neural network exploration using optimal experiment design. Adv Neural Inf Process Syst 6:679–686
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221. https://doi.org/10.1007/BF00993277
Coifman R, Lafon S, Lee A, Maggioni M, Nadler B, Warner F, Zucker S (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci 102(21):7426–7431
Dasgupta S, Hsu D, Monteleoni C (2007) A general agnostic active learning algorithm. Adv Neural Inf Process Syst 20:353–360
Dasgupta S, Kalai AT, Monteleoni C (2009) Analysis of perceptron-based active learning. J Mach Learn Res 10:281–299
Donoho D, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc Natl Acad Sci 100(10):5591–5596
Evans LPG, Adams NM, Anagnostopoulos C (2015) Estimating optimal active learning via model retraining improvement
Fedorov VV (1972) Theory of optimal experiments. Academic Press, Cambridge
Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2):133–168
Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. http://arxiv.org/abs/1703.02910
Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th international conference on machine learning, ICML’07. ACM, New York, NY, USA, pp 353–360. https://doi.org/10.1145/1273496.1273541
He X (2010) Laplacian regularized D-optimal design for active learning and its application to image retrieval. IEEE Trans Image Process 19(1):254–263
Hein M, Audibert JY, von Luxburg U (2005) From graphs to manifolds–weak and strong pointwise consistency of graph Laplacians. In: Proceedings of the 18th conference on learning theory
Houlsby N, Huszár F, Ghahramani Z, Lengyel M (2011) Bayesian active learning for classification and preference learning
Joshi AJ, Porikli F, Papanikolopoulos N (2009) Multiclass active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2372–2379
Kapoor A, Grauman K, Urtasun R, Darrell T (2007) Active learning with Gaussian processes for object categorization. In: IEEE 11th international conference on computer vision, vol 2
Lafon S (2004) Diffusion maps and geometric harmonics. Ph.D. thesis, Yale University
Li X, Guo Y (2013) Adaptive active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 859–866
Li C, Liu H, Cai D (2014) Active learning on manifolds. Neurocomputing 123:398–405
McCallum A, Nigam K (1998) Employing EM and pool-based active learning for text classification. In: Proceedings of the fifteenth international conference on machine learning, ICML’98. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 350–358. http://dl.acm.org/citation.cfm?id=645527.757765. Retrieved 18 Dec 2019
Minka TP (2001) A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology
Nickisch H, Rasmussen CE (2008) Approximations for binary Gaussian process classification. J Mach Learn Res 9:2035–2078
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65:386
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Roy N, McCallum A (2001) Toward optimal active learning through Monte Carlo estimation of error reduction. In: Proceedings of the international conference on machine learning
Seeger M (2003) Bayesian Gaussian process models: Pac-Bayesian generalisation error bounds and sparse approximations. Ph.D. thesis, University of Edinburgh
Seeger M (2005) Expectation propagation for exponential families
Settles B (2010) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison
Settles B (2012) Active learning. Morgan & Claypool, New York
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth annual workshop on computational learning theory, COLT’92. ACM, New York, NY, USA, pp 287–294. https://doi.org/10.1145/130385.130417
Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
Sindhwani V, Niyogi P, Belkin M (2005) Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings, twenty second international conference on machine learning
Sindhwani V, Chu W, Keerthi SS (2007) Semi-supervised Gaussian process classifiers. In: International joint conference on artificial intelligence, pp 1059–1064
Sun S, Hussain Z, Shawe-Taylor J (2014) Manifold-preserving graph reduction for sparse semi-supervised learning. Neurocomputing 124:13–21
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the ninth ACM international conference on multimedia, MULTIMEDIA’01. ACM, New York, NY, USA, pp 107–118. https://doi.org/10.1145/500141.500159
Wahba G (1990) Spline models for observational data. Society for Industrial and Applied Mathematics, Philadelphia
Xu H, Yu L, Davenport MA, Zha H (2017) Active manifold learning via a unified framework for manifold landmarking. http://arxiv.org/abs/1710.09334
Yao G, Lu K, He X (2013) G-optimal feature selection with Laplacian regularization. Neurocomputing 119:175–181
Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd international conference on machine learning
Yu K, Zhu S, Xu W, Gong Y (2008) Non-greedy active learning for text categorization using convex transductive experimental design. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, pp 635–642
Zeng J, Lesnikowski A, Alvarez JM (2018) The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. http://arxiv.org/abs/1811.12535
Zhou J, Sun S (2014) Active learning of Gaussian processes with manifold-preserving graph reduction. Neural Comput Appl 25:1615–1625
Zhu X, Ghahramani Z, Lafferty J (2003a) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the twentieth international conference on machine learning
Zhu X, Lafferty J, Ghahramani Z (2003b) Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data
Acknowledgements
We thank the anonymous referees and the editors for useful suggestions that have significantly improved the presentation of this paper.
Funding
Funding was provided by National Science Foundation (US) (Grant No. 1537987).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00695-x, https://doi.org/10.1007/s11749-019-00696-w.
The authors gratefully acknowledge NSF Grant CMMI 1537987.
Appendices
Appendix A: Expectation propagation approximation
As discussed before, for GP classification (GPC) we need to approximate the posterior distribution of the latent variables. Nickisch and Rasmussen (2008) provide a comprehensive overview of methods for approximate inference in the GPC model for binary classification. The algorithms they evaluated include the Laplace approximation, Expectation Propagation, KL-divergence minimization, variational bounds, factorial variational inference and Markov chain Monte Carlo. They conclude that the “Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight.” Thus, in the SSGP-AL algorithm, Expectation Propagation (EP) is chosen as the approximation method for GPC. To simplify notation, we present the standard EP algorithm for supervised GPC. In our paper, we use a data-dependent kernel \(\tilde{\mathcal {K}}\) that also utilizes the information in the unlabeled instances, so the posterior distribution will also be conditioned on the graph \(\mathcal {G}\).
In EP, the problem of a non-Gaussian likelihood function is solved by a local likelihood approximation in the form of an un-normalized Gaussian function in the latent variable \(f_{\mathbf{x}_i}\), i.e.,

$$P(y_i|f_{\mathbf{x}_i}) \simeq t_i(f_{\mathbf{x}_i}) = \tilde{Z}_i\, N(f_{\mathbf{x}_i}\,|\,\tilde{\mu }_i, \tilde{ \sigma }_i^2),$$

where \( \tilde{\mu }_i, \tilde{ \sigma }_i^2 \) and \(\tilde{Z}_i\) are local approximation parameters. After we have the local approximations, the posterior distribution \(P(\mathbf{f}_L|Y_L)\) can be approximated by \(q(\mathbf{f}_L|Y_L)\), defined as

$$q(\mathbf{f}_L|Y_L) = \frac{1}{Z_{EP}}\, P(\mathbf{f}_L) \prod _{i} t_i(f_{\mathbf{x}_i}),$$

where the product runs over the labeled instances. Note that \(Z_{EP}=q(Y_L)\) is the marginal likelihood (also known as the evidence or normalization constant) associated with the approximate posterior distribution \(q(\mathbf{f}_L|Y_L)\). Letting \( \tilde{{\varvec{\mu }}}\) be the vector of the \(\tilde{\mu }_i\) and \({\tilde{\varvec{\Sigma }}}\) a diagonal matrix with elements \({\tilde{\varvec{\Sigma }}}_{ii}=\tilde{ \sigma }_i^2\), we can rewrite the product of the independent local approximations \(t_i\) as

$$\prod _{i} t_i(f_{\mathbf{x}_i}) = N(\mathbf{f}_L\,|\,\tilde{{\varvec{\mu }}}, {\tilde{\varvec{\Sigma }}}) \prod _{i} \tilde{Z}_i.$$

By the property of the product of two Gaussians, it can be easily shown that

$$q(\mathbf{f}_L|Y_L) = N(\mathbf{f}_L\,|\,{\varvec{\mu }}, {\varvec{\Sigma }}),$$

where

$${\varvec{\Sigma }} = \left( \mathbf{K}_{LL}^{-1} + {\tilde{\varvec{\Sigma }}}^{-1} \right) ^{-1}, \qquad {\varvec{\mu }} = {\varvec{\Sigma }}\, {\tilde{\varvec{\Sigma }}}^{-1} \tilde{{\varvec{\mu }}}.$$
Given the approximate posterior distribution \(q(\mathbf{f}_L|Y_L)\), one can compute the approximate posterior distribution of the latent variable \(f_{\mathbf{x}_t}\) at a test point \(\mathbf{x}_t\), given by

$$q(f_{\mathbf{x}_t}|Y_L) = N(f_{\mathbf{x}_t}\,|\,\mu _t, \sigma _t^2),$$

where

$$\mu _t = \mathbf{k}_t^{\top } \left( \mathbf{K}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \tilde{{\varvec{\mu }}}, \qquad \sigma _t^2 = \mathcal {K}(\mathbf{x}_t,\mathbf{x}_t) - \mathbf{k}_t^{\top } \left( \mathbf{K}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \mathbf{k}_t,$$

and \(\mathbf{k}_t\) is the vector of covariances between \(\mathbf{x}_t\) and the labeled instances. It can be shown that the predictive probability at test point \(\mathbf{x}_t\) is given by

$$P(y_t = 1|Y_L) = \Phi \left( \frac{\mu _t}{\sqrt{1+\sigma _t^2}} \right) ,$$

where \(\Phi \) denotes the standard normal CDF.
In order to get the posterior distribution \(q(\mathbf{f}_L|Y_L)\), we first need to compute the local approximations \(t_i\) and their corresponding parameters \( \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i\). The key idea behind the EP algorithm is to update the individual \(t_i\) approximations sequentially. We start with some current posterior approximation \(q(f_{\mathbf{x}_i}|Y_L)\sim N(\mu _i, \sigma _i^2)\), and after leaving out the local approximation \(t_i\), one obtains the so-called cavity distribution

$$q_{-i}(f_{\mathbf{x}_i}) = N(f_{\mathbf{x}_i}\,|\,\mu _{-i}, \sigma _{-i}^2),$$

where

$$\sigma _{-i}^2 = \left( \sigma _i^{-2} - \tilde{ \sigma }_i^{-2} \right) ^{-1}, \qquad \mu _{-i} = \sigma _{-i}^2 \left( \sigma _i^{-2} \mu _i - \tilde{ \sigma }_i^{-2} \tilde{\mu }_i \right) .$$
This result can be easily checked by multiplying Eqs. (37) and (48) to recover \(q(f_{\mathbf{x}_i}|Y_L)\). Next, the cavity distribution \(q_{-i}(f_{\mathbf{x}_i})\) is combined with the exact likelihood function \(P(y_i|f_{\mathbf{x}_i})\) to obtain a non-Gaussian distribution, which can then be approximated by a desired Gaussian distribution \(\hat{q}(f_{\mathbf{x}_i})\).
Finally, we compute the local approximation parameters \( \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i\) of the approximation \(t_i\) by minimizing the Kullback–Leibler (KL) divergence between \(\hat{q}(f_{\mathbf{x}_i})\) and \(q_{-i}(f_{\mathbf{x}_i})t_i\). It can be shown that this minimization is equivalent to matching the first and second moments of the two distributions. The zeroth-order (normalization constant), first-order and second-order moments of the distribution \(\hat{q}(f_{\mathbf{x}_i})\) can be shown to be

$$\hat{Z}_i = \Phi (z_i), \qquad \hat{\mu }_i = \mu _{-i} + \frac{y_i\, \sigma _{-i}^2\, N(z_i)}{\Phi (z_i) \sqrt{1+\sigma _{-i}^2}}, \qquad \hat{\sigma }_i^2 = \sigma _{-i}^2 - \frac{\sigma _{-i}^4\, N(z_i)}{(1+\sigma _{-i}^2)\, \Phi (z_i)} \left( z_i + \frac{N(z_i)}{\Phi (z_i)} \right) ,$$

where \(z_i=y_i \mu _{-i}/\sqrt{1+\sigma _{-i}^2}\), and \(N(\cdot )\) and \(\Phi (\cdot )\) denote the standard normal PDF and CDF, respectively. From matching the above moments, the local approximation parameters of \(t_i\) are

$$\tilde{ \sigma }_i^2 = \left( \hat{\sigma }_i^{-2} - \sigma _{-i}^{-2} \right) ^{-1}, \qquad \tilde{\mu }_i = \tilde{ \sigma }_i^2 \left( \hat{\sigma }_i^{-2} \hat{\mu }_i - \sigma _{-i}^{-2} \mu _{-i} \right) , \qquad \tilde{Z}_i = \hat{Z}_i \sqrt{2\pi } \sqrt{\sigma _{-i}^2 + \tilde{ \sigma }_i^2}\, \exp \left( \frac{(\mu _{-i} - \tilde{\mu }_i)^2}{2(\sigma _{-i}^2 + \tilde{ \sigma }_i^2)} \right) .$$
After updating the site parameters, the approximate posterior distribution \(q(\mathbf{f}_L|Y_L)\) is updated using Eq. (41). The pseudo-code for the EP algorithm is provided in Algorithm 2.
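The EP iteration described in this appendix can be sketched compactly in code. The following is a minimal illustrative implementation (not the paper's supplementary MATLAB code) for standard supervised binary GPC with a probit likelihood; a plain Gram matrix K stands in for the data-dependent kernel \(\tilde{\mathcal {K}}\), and as a simplification the posterior is refreshed once per sweep rather than after every single site update:

```python
import numpy as np
from math import erf

def _pdf(z):  # standard normal density N(z)
    return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

def _cdf(z):  # standard normal CDF Phi(z)
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def ep_gpc(K, y, n_sweeps=30):
    """EP site-parameter updates for binary GPC with a probit likelihood.
    K: (n, n) Gram matrix on labeled points; y: labels in {-1, +1}.
    Returns the natural site parameters (nu_tilde, tau_tilde)."""
    n = len(y)
    nu_t = np.zeros(n)     # natural site mean: mu_tilde_i / sigma2_tilde_i
    tau_t = np.zeros(n)    # site precision: 1 / sigma2_tilde_i
    Sigma, mu = K.copy(), np.zeros(n)   # current posterior q(f_L | Y_L)
    for _ in range(n_sweeps):
        for i in range(n):
            # cavity distribution q_{-i}: remove site i from the posterior
            tau_c = 1.0 / Sigma[i, i] - tau_t[i]
            nu_c = mu[i] / Sigma[i, i] - nu_t[i]
            mu_c, s2_c = nu_c / tau_c, 1.0 / tau_c
            # moments of cavity x exact probit likelihood
            z = y[i] * mu_c / np.sqrt(1.0 + s2_c)
            r = _pdf(z) / _cdf(z)
            mu_hat = mu_c + y[i] * s2_c * r / np.sqrt(1.0 + s2_c)
            s2_hat = s2_c - s2_c ** 2 * r * (z + r) / (1.0 + s2_c)
            # site update by moment matching
            tau_t[i] = 1.0 / s2_hat - tau_c
            nu_t[i] = mu_hat / s2_hat - nu_c
        # refresh posterior: Sigma = (K^{-1} + Sigma_tilde^{-1})^{-1}
        S = np.diag(tau_t)
        Sigma = K - K @ np.linalg.solve(np.eye(n) + S @ K, S @ K)
        mu = Sigma @ nu_t
    return nu_t, tau_t

def predict_proba(K, K_star, k_ss, nu_t, tau_t):
    """Predictive P(y* = +1) at test points (columns of K_star)."""
    n = K.shape[0]
    S = np.diag(tau_t)
    A = np.eye(n) + S @ K
    mu_s = K_star.T @ (nu_t - S @ np.linalg.solve(A, K @ nu_t))
    s2_s = k_ss - np.einsum('ij,ij->j', K_star, np.linalg.solve(A, S @ K_star))
    return np.array([_cdf(m / np.sqrt(1.0 + s)) for m, s in zip(mu_s, s2_s)])
```

The natural (precision/precision-times-mean) parametrization used here is numerically more convenient than updating \(\tilde{\mu }_i, \tilde{\sigma }_i^2\) directly, but the two are equivalent.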
Appendix B: Estimation of the GP hyperparameters
In the new approach presented in Sect. 4, we estimate the GP hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\) by maximizing the logarithm of \(\tilde{Z}_{EP}\) in
where \(\tilde{Z}_{EP}\) is an approximation for marginal likelihood \(P(Y_L|\mathcal {G})\) in (31) by using Expectation Propagation. Note that \(\tilde{Z}_{EP}=q(Y_L|\mathcal {G})\) is the marginal likelihood associated with the approximate posterior distribution \(q(\mathbf{f}_L|Y_L,\mathcal {G})\). Let \( \tilde{{\varvec{\mu }}}\) be a vector containing the \(\tilde{\mu }_i\) and \(\tilde{\varvec{\Sigma }}\) be a diagonal matrix with elements \(\tilde{\varvec{\Sigma }}(i,i)=\tilde{ \sigma }_i^2\). We can rewrite the product of the independent local approximations \(t_i\) as
Based on the Expectation Propagation algorithm, the approximate marginal likelihood \(\tilde{Z}_{EP}\) is given by

$$\tilde{Z}_{EP} = \int P(\mathbf{f}_L|\mathcal {G}) \prod _{i} t_i(f_{\mathbf{x}_i})\, d\mathbf{f}_L.$$

Given \(P(\mathbf{f}_L|\mathcal {G}) \sim N(\mathbf {0}, {\tilde{\varvec{K}}}_{LL})\) and Eq. (59), and using the property of the product of two multivariate Gaussians, it can be shown that

$$\tilde{Z}_{EP} = \left( \prod _{i} \tilde{Z}_i \right) N(\tilde{{\varvec{\mu }}}\,|\,\mathbf {0}, {\tilde{\varvec{K}}}_{LL} + {\tilde{\varvec{\Sigma }}}) \int N(\mathbf{f}_L\,|\,{\varvec{\omega }}, {\varvec{\Omega }})\, d\mathbf{f}_L,$$

where

$${\varvec{\Omega }} = \left( {\tilde{\varvec{K}}}_{LL}^{-1} + {\tilde{\varvec{\Sigma }}}^{-1} \right) ^{-1}, \qquad {\varvec{\omega }} = {\varvec{\Omega }}\, {\tilde{\varvec{\Sigma }}}^{-1} \tilde{{\varvec{\mu }}},$$
and \(\int N(\mathbf{f}_L | {\varvec{\omega }} , {\varvec{\Omega }} )\, d\mathbf{f}_L =1 \) by the definition of a multivariate Gaussian PDF. The log-likelihood is then

$$\log \tilde{Z}_{EP} = \sum _{i} \log \tilde{Z}_i - \frac{1}{2} \log \left| 2\pi \left( {\tilde{\varvec{K}}}_{LL} + {\tilde{\varvec{\Sigma }}} \right) \right| - \frac{1}{2}\, \tilde{{\varvec{\mu }}}^{\top } \left( {\tilde{\varvec{K}}}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \tilde{{\varvec{\mu }}},$$

where the explicit form of \(\tilde{Z}_i\) is derived in the EP process, and \(\mu _{-i}\) and \(\sigma _{-i}\) are the parameters of the cavity distributions.
Optimization of \(\log \tilde{Z}_{EP}\) with respect to the hyperparameters of a data-based kernel function \(\tilde{\mathcal {K}}\) requires evaluation of the partial derivatives from Eq. (66). Seeger (2005) demonstrated that the derivatives of the last three terms in Eq. (66) with respect to the hyperparameters vanish. As a result, only the derivatives of the first two terms need to be considered.
Let \({\varvec{\theta }}=\{\theta _j\}_{j=1}^m\) be the collection of hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\). Given Eq. (66), the derivative of the log-likelihood \(\log \tilde{Z}_{EP}\) with respect to hyperparameter \(\theta _j\) can be reduced to

$$\frac{\partial \log \tilde{Z}_{EP}}{\partial \theta _j} = \frac{1}{2}\, \tilde{{\varvec{\mu }}}^{\top } \left( \tilde{\mathbf {K}}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j} \left( \tilde{\mathbf {K}}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \tilde{{\varvec{\mu }}} - \frac{1}{2}\, \mathrm {tr} \left( \left( \tilde{\mathbf {K}}_{LL} + {\tilde{\varvec{\Sigma }}} \right) ^{-1} \frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j} \right) .$$
In this paper, we choose a radial basis function (RBF) kernel as the base kernel \(\mathcal {K}\) in Eq. (29), i.e.,

$$\mathcal {K}(\mathbf{x}, \mathbf{x}') = \exp \left( - \frac{\Vert \mathbf{x} - \mathbf{x}' \Vert ^2}{2 \sigma _{\mathrm {rbf}}^2} \right) .$$
There are thus two hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\): the regularization parameter \(\lambda _R\) and the range (or length-scale) parameter \(\sigma _{\mathrm {rbf}}\) of the RBF kernel \(\mathcal {K}\). Both hyperparameters are optimized on the log scale. The derivatives of the Gram matrix \(\tilde{\mathbf {K}}\) with respect to \(\log \sigma _{\mathrm {rbf}}\) and \(\log \lambda _R\) are derived as
and for the RBF kernel \(\mathcal {K}\), one can easily get

$$\frac{\partial \mathbf {K}}{\partial \log \sigma _{\mathrm {rbf}}} = -2\, \mathbf {K} \circ \log \mathbf {K},$$
where \(\circ \) is the Hadamard product between two matrices, and \(\log \mathbf {K}\) is a notation that represents taking logarithm of each element in matrix \(\mathbf {K}\) (not the logarithm of matrix \(\mathbf {K}\)). Note that in (68) \(\partial \tilde{\mathbf {K}}_{LL} / \partial \theta _j\) is just a submatrix, associated with the labeled data \(\mathbf {X}_L\), of \(\partial \tilde{\mathbf {K}} / \partial \theta _j\), since the marginal likelihood \(\tilde{Z}_{EP}\) can only be computed for the labeled data.
In summary, the hyperparameters \(\lambda _R\) and \(\sigma _{\mathrm {rbf}}\) in the data-based kernel \(\tilde{\mathcal {K}}\) can both be estimated by maximizing the log-likelihood (66). This optimization problem can be solved by using a gradient-based algorithm with the derivatives provided in Eqs. (70–72).
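The Hadamard-product identity for the RBF log-length-scale derivative can be checked numerically. A minimal sketch, assuming the parametrization \(k(\mathbf{x},\mathbf{x}') = \exp (-\Vert \mathbf{x}-\mathbf{x}'\Vert ^2 / (2\sigma ^2))\) (the factor \(-2\) is the same for any \(\exp (-r^2/(a\sigma ^2))\) parametrization, since \(\partial k / \partial \log \sigma = -2\,k \log k\) elementwise):

```python
import numpy as np

def rbf_gram(X, log_sigma):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    parameterized by log sigma as in the log-scale optimization above."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * np.exp(2.0 * log_sigma)))

def rbf_gram_grad_logsigma(K):
    """dK / d(log sigma) = -2 * K o log K, with o the Hadamard product and
    log K taken elementwise (the diagonal contributes 0 since log 1 = 0)."""
    # K > 0 everywhere for the RBF kernel, so the elementwise log is safe
    return -2.0 * K * np.log(K)
```

A central finite difference on `log_sigma` reproduces the analytic gradient to numerical precision, which is a useful sanity check before plugging the derivative into a gradient-based optimizer.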
Cite this article
Li, H., Del Castillo, E. & Runger, G. On active learning methods for manifold data. TEST 29, 1–33 (2020). https://doi.org/10.1007/s11749-019-00694-y