On active learning methods for manifold data

Li, Hang; Del Castillo, Enrique; Runger, George

doi:10.1007/s11749-019-00694-y

On active learning methods for manifold data

Invited Paper
Published: 02 January 2020

Volume 29, pages 1–33, (2020)
Cite this article

TEST Aims and scope Submit manuscript

941 Accesses
5 Citations
Explore all metrics

Abstract

Active learning is a major area of interest within the field of machine learning, especially when the labeled instances are very difficult, time-consuming or expensive to obtain. In this paper, we review various active learning methods for manifold data, where the intrinsic manifold structure of data is also incorporated into the active learning query strategies. In addition, we present a new manifold-based active learning algorithm for Gaussian process classification. This new method uses a data-dependent kernel derived from a semi-supervised model that considers both labeled and unlabeled data. The method performs a regularization on the smoothness of the fitted function with respect to both the ambient space and the manifold where the data lie. The regularization parameter is treated as an additional kernel (covariance) parameter and estimated from the data, permitting adaptation of the kernel to the given dataset manifold geometry. Comparisons with other AL methods for manifold data show faster learning performance in our empirical experiments. MATLAB code that reproduces all examples is provided as supplementary materials.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Active learning of Gaussian processes with manifold-preserving graph reduction

Article 06 June 2014

Active Learning with Clustering and Unsupervised Feature Learning

Selecting Influential Examples: Active Learning with Expected Model Output Changes

Notes

This two-spirals experiment was conducted using the library developed by Stefano Melacci, available at http://www.dii.unisi.it/~melacci/lapsvmp/index.html.
This example is generated using the code provided by Cai and He (2012).
Here we abuse the f notation to represent a latent variable instead of the function to be learned as used in Sect. 2.
This is the number of labeled instances required for an algorithm to achieve the specified accuracy.
The regular perceptron update consists of the simple rule: if $(x_t,y_t)$ is misclassified, then $w_{t+1}=w_{t}+y_t x_t$, where w is a weight variable. For a linear classifier, this update rule moves the classification boundary in the right direction as new instances arrive. For a detailed theoretical study, see the classic reference Rosenblatt (1958).
It is also known as the out-of-sample error, which is a measure of accuracy of a model on the unseen instances.
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt.

References

Alaeddini A, Craft E, Meka R, Martinez S (2019) Sequential Laplacian regularized V-optimal design of experiments for response surface modeling of expensive tests: an application in wind tunnel testing. IISE Trans. https://doi.org/10.1080/24725854.2018.1508928
Article Google Scholar
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Article MathSciNet Google Scholar
Atlas LE, Cohn DA, Ladner RE (1990) Training connectionist networks with queries and selective sampling. In: Touretzky DS (ed) Advances in neural information processing systems, vol 2. Morgan-Kaufmann, Burlington, pp 566–573
Google Scholar
Balcan MF, Beygelzimer A, Langford J (2009) Agnostic active learning. J Comput Syst Sci 75(1):78–89
Article MathSciNet Google Scholar
Belkin M (2003) Problems of learning on manifolds. Ph.D. thesis, The University of Chicago
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6):1373–1396
Article Google Scholar
Belkin M, Niyogi P (2005) Towards a theoretical foundation for Laplacian-based manifold methods. In: Proceedings of conference on learning theory
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
MathSciNet MATH Google Scholar
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
MATH Google Scholar
Cai D, He X (2012) Manifold adaptive experimental design for text categorization. IEEE Trans Knowl Data Eng 24(4):707–719
Article Google Scholar
Chaudhuri K, Kakade SM, Netrapalli P, Sanghavi S (2015) Convergence rates of active learning for maximum likelihood estimation. Adv Neural Inf Process Syst 28:1090–1098
Google Scholar
Chen C, Chen Z, Bu J, Wang C, Zhang L, Zhang C (2010) G-optimal design with Laplacian regularization. In: Proceedings of the twenty-fourth AAAI conference on artificial intelligence, vol 1, pp 413–418
Chu W, Ghahramani Z (2005) Preference learning with Gaussian processes. In: Proceedings of the 22nd international conference on machine learning, ICML’05. ACM, New York, NY, USA, pp 137–144. https://doi.org/10.1145/1102351.1102369
Chu W, Sindhwani V, Ghahramani Z, Keerthi SS (2007) Relational learning with Gaussian processes. In: Proceedings of the 19th international conference on neural information processing systems, pp 289–296
Cohn D (1994) Neural network exploration using optimal experiment design. Adv Neural Inf Process Syst 6:679–686
Google Scholar
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221. https://doi.org/10.1007/BF00993277
Article Google Scholar
Coifman R, Lafon S, Lee A, Maggioni M, Nadler B, Warner F, Zuker S (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci 102(21):7426–7431
Article Google Scholar
Dasgupta S, Hsu D, Monteleoni C (2007) A general agnostic active learning algorithm. Adv Neural Inf Process Syst 20:353–360
Google Scholar
Dasgupta S, Kalai AT, Monteleoni C (2009) Analysis of perceptron-based active learning. J Mach Learn Res 10:281–299
MathSciNet MATH Google Scholar
Donoho D, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc Natl Acad Sci 100(10):5591–5596
Article MathSciNet Google Scholar
Evans LPG, Adams NM, Anagnostopoulos C (2015) Estimating optimal active learning via model retraining improvement
Fedorov VV (1972) Theory of optimal experiments. Academic Press, Cambridge
Google Scholar
Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2):133–168
Article Google Scholar
Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. http://arxiv.org/abs/1703.02910
Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th international conference on machine learning, ICML’07. ACM, New York, NY, USA, pp 353–360. https://doi.org/10.1145/1273496.1273541
He X (2010) Laplacian regularized D-optimal design for active learning and its application to image retrieval. IEEE Trans Imgae Process 19(1):254–263
Article MathSciNet Google Scholar
Hein M, Audibert JY, von Luxburg U (2005) From graphs to manifolds–weak and strong pointwise consistency of graph Laplacians. In: Proceedings of the 18th conference on learning theory (2005)
Houlsby N, Huszár F, Ghahramani Z, Lengyel M (2011) Bayesian active learning for classification and preference learning
Joshi AJ, Porikli F, Papanikolopoulos N (2009) Multiclass active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2372–2379
Kapoor A, Grauman K, Urtasun R, Darrell T (2007) Active learning with Gaussian processes for object categorization. In: IEEE 11th international conference on computer vision, vol 2
Lafon S (2004) Diffusion maps and geometric harmonics. Ph.D. thesis, Yale University
Li X, Guo Y (2013) Adaptive active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 859–866
Li C, Liu H, Cai D (2014) Active learning on manifolds. Neurocomputing 123:398–405
Article Google Scholar
McCallum A, Nigam K (1998) Employing EM and pool-based active learning for text classification. In: Proceedings of the fifteenth international conference on machine learning, ICML’98. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 350–358. http://dl.acm.org/citation.cfm?id=645527.757765. Retrieved 18 Dec 2019
Minka TP (2001) A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology
Nickisch H, Rasmussen CE (2008) Approximations for binary Gaussian process classification. J Mach Learn Res 9:2035–2078
MathSciNet MATH Google Scholar
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge
MATH Google Scholar
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65:386
Article Google Scholar
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
Article Google Scholar
Roy N, Mccallum A (2001) Toward optimal active learning through Monte Carlo estimation of error reduction. In: Proceedings of the international conference on machine learning
Seeger M (2003) Bayesian Gaussian process models: Pac-Bayesian generalisation error bounds and sparse approximations. Ph.D. thesis, University of Edinburgh
Seeger M (2005) Expectation propagation for exponential families
Settles B (2010) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison
Settles B (2012) Active learning. Morgan & Claypool, New York
MATH Google Scholar
Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth annual workshop on computational learning theory, COLT’92. ACM, New York, NY, USA, pp 287–294. https://doi.org/10.1145/130385.130417
Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
Article MathSciNet Google Scholar
Sindhwani V, Niyogi P, Belkin M (2005) Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings, twenty second international conference on machine learning
Sindhwani V, Chu W, Keerthi SS (2007) Semi-supervised Gaussian process classifiers. In: International joint conference on artificial intelligence, pp 1059–1064
Sun S, Hussain Z, Shawe-Taylor J (2014) Manifold-preserving graph reduction for sparse semi-supervised learning. Neurocomputing 124:13–21
Article Google Scholar
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323
Article Google Scholar
Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the ninth ACM international conference on multimedia, MULTIMEDIA’01. ACM, New York, NY, USA, pp 107–118. https://doi.org/10.1145/500141.500159
Wahba G (1990) Spline models for observational data. Society for Industrial and Applied Mathematics, Philadelphia
Book Google Scholar
Xu H, Yu L, Davenport MA, Zha H (2017) Active manifold learning via a unified framework for manifold landmarking. http://arxiv.org/abs/1710.09334
Yao G, Lu K, He X (2013) G-optimal feature selection with Laplacian regularization. Neurocomputing 119:175–181
Article Google Scholar
Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd international conference on machine learning
Yu K, Zhu S, Xu W, Gong Y (2008) Non-greedy active learning for text categorization using convex transductive experimental design. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, pp 635–642 (2008)
Zeng J, Lesnikowski A, Alvarez JM (2018) The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. http://arxiv.org/abs/1811.12535
Zhou J, Sun S (2014) Active learning of gaussian processes with manifold-preserving graph reduction. Neural Comput Appl 25:1615–1625
Article Google Scholar
Zhu X, Ghahramani Z, Lafferty J (2003a) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the twentieth international conference on machine learning
Zhu X, Lafferty J, Ghahramani Z (2003b) Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data

Download references

Acknowledgements

We thank the anonymous referees and the editors for useful suggestions that have significantly improved the presentation of this paper.

Funding

Funding was provided by National Science Foundation (US) (Grant No. 1537987).

Author information

Authors and Affiliations

Harold and Inge Marcus Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, State College, USA
Hang Li & Enrique Del Castillo
Department of Statistics, The Pennsylvania State University, State College, USA
Enrique Del Castillo
Department of Biomedical Informatics, Arizona State University, Tempe, USA
George Runger

Authors

Hang Li
View author publications
You can also search for this author in PubMed Google Scholar
Enrique Del Castillo
View author publications
You can also search for this author in PubMed Google Scholar
George Runger
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Enrique Del Castillo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00695-x, https://doi.org/10.1007/s11749-019-00696-w.

The authors gratefully acknowledge NSF Grant CMMI 1537987.

Appendices

Appendix A: Expectation propagation approximation

As discussed before, for GP Classification (GPC), we need to approximate the posterior distribution of latent variables. Nickisch and Rasmussen (2008) provide a comprehensive overview of different methods for approximate inference in GPC model for binary classification. The algorithms they evaluated includes Laplace Approximation, Expectation Propagation, KL-Divergence Minimization, Variational Bounds, Factorial Variational and Markov Chain Monte Carlo. They draw the conclusion that “Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight.” Thus, in the SSGP-AL algorithm, Expectation Propagation (EP) is chosen as the approximation method for GPC. In order to simplify notational complexities, we present the standard EP algorithm for supervised GPC. In our paper, we use a data-dependent kernel $\tilde{\mathcal {K}}$ that also utilizes the information in the unlabeled instances, so the posterior distribution will also condition on the graph $\mathcal {G}$.

In EP, the problem of non-Gaussian likelihood function is solved by a local likelihood approximation in the form of an un-normalized Gaussian function in the latent variable $f_{\mathbf{x}_i}$, i.e.,

$$\begin{aligned} P(y_i|f_{x_i}) \simeq t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)\triangleq \tilde{Z}_i N(f_{\mathbf{x}_i}|\tilde{\mu }_i,\tilde{ \sigma }_i^2) \end{aligned}$$

(37)

where $ \tilde{\mu }_i, \tilde{ \sigma }_i^2 $ and $\tilde{Z}_i$ are local approximation parameters. After we have the local approximations, the posterior distribution $P(\mathbf{f}_L|Y_L)$ can be approximated by $q(\mathbf{f}_L|Y_L)$, defined as

$$\begin{aligned} q(\mathbf{f}_L|Y_L) \triangleq \frac{\prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2) N(\mathbf{0}, \mathbf{K}_{LL})}{Z_{EP}} \end{aligned}$$

(38)

Note that $Z_{EP}=q(Y_L)$ is the marginal likelihood (also known as evidence or normalization constant) associated with the approximate posterior distribution $q(\mathbf{f}_L|Y_L)$. Let $ \tilde{{\varvec{\mu }}}$ be a vector of $\tilde{\mu }_i$ and ${\tilde{\varvec{\Sigma }}}$ be a diagonal matrix with elements ${\tilde{\Sigma }}_{ii}=\tilde{ \sigma }_i^2$, we can rewrite the product of the independent local approximations $t_i$ as

$$\begin{aligned} \prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)=N(\tilde{\mu }, {\tilde{\Sigma }}) \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$

(39)

By the property of product of two Gaussians, it can be easily shown that

$$\begin{aligned} q(\mathbf f_L|Y_L)\sim N(\mathbf \mu , \Sigma ) \end{aligned}$$

(40)

where

$$\begin{aligned} {\varvec{\mu }}=\Sigma \tilde{\Sigma }\tilde{\mu } \text { and } {\varvec{\Sigma }}=(\mathbf{K}_{LL}^{-1}+{\tilde{\varvec{\Sigma }}}^{-1})^{-1}. \end{aligned}$$

(41)

Given the approximate posterior distribution $q(f_L|Y_L)$, one can compute the approximate posterior distribution of latent variables $f_{\mathbf{x}_t}$ at test point $\mathbf{x}_t$, given by

$$\begin{aligned} P(f_{\mathbf{x}_t}|Y_L)= \int P(f_{\mathbf{x}_t}|f_L) P(f_L|Y_L) df_L \approx N(\mu _t, \sigma ^2_t), \end{aligned}$$

(42)

where

$$\begin{aligned} \mu _t= & {} \mathbf{K}_{Lt}^T \mathbf{K}_{LL}^{-1}{\varvec{\mu }} \end{aligned}$$

(43)

$$\begin{aligned} \sigma _t^2= & {} \mathcal {K}(\mathbf{x}_t,\mathbf{x}_t)-\mathbf{K}_{Lt}^T(\mathbf{K}_{LL}^{-1}+{\tilde{\varvec{\Sigma }}})\mathbf{K}_{Lt}\end{aligned}$$

(44)

$$\begin{aligned} \mathbf{K}_{Lt}= & {} [\mathcal {K}(\mathbf{x}_t,\mathbf{x}_1),\ldots ,\mathcal {K}(\mathbf{x}_t,\mathbf{x}_l)]^T \end{aligned}$$

(45)

It can be shown that the predictive probability at test point $\mathbf{x}_t$ is given by

$$\begin{aligned} q(y_t=1|Y_L)= & {} \Phi \left( \frac{\mu _t}{\sqrt{1+\sigma _t^2}}\right) \end{aligned}$$

(46)

$$\begin{aligned}= & {} \Phi \left( \frac{\mathbf{K}_{Lt}^T (\mathbf{K}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}{{\tilde{\varvec{\mu }}}}}{\sqrt{1+\mathcal {K}(\mathbf{x}_t,\mathbf{x}_t)-\mathbf{K}_{Lt}^T(\mathbf{K}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}\mathbf{K}_{Lt}}}\right) \end{aligned}$$

(47)

In order to get the posterior distribution $q(\mathbf{f}_L|Y_L)$, we first need to compute the local approximation $t_i$ and its corresponding parameters $ \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i$. The key idea behind the EP algorithm is to update the individual $t_i$ approximations sequentially. We start with some current posterior approximation $q(f_{\mathbf{x}_i}|Y_L)\sim N(\mu _i, \sigma _i^2)$, and after leaving out the local approximation $t_i$, one obtains the so-called cavity distribution

$$\begin{aligned} q_{-i}(f_{\mathbf{x}_i})\triangleq N(f_{\mathbf{x}_i}|\mu _{-i}, \sigma _{-i}^2) \end{aligned}$$

(48)

where

$$\begin{aligned} \mu _{-i}= & {} \sigma _{-i}^2 \mu _i - \tilde{\sigma }_i^{-2}\tilde{\mu }_i ) \end{aligned}$$

(49)

$$\begin{aligned} \sigma _{-i}^2= & {} (\sigma _i^{-2}-\tilde{\sigma }_i^{-2})^{-1} \end{aligned}$$

(50)

This result can be easily checked by multiplying Eqs. (37) and (48) to recover $q(f_{\mathbf{x}_i}|Y_L)$. Next combine the cavity distribution $q_{-i}(f_{\mathbf{x}_i})$ with the exact likelihood function $P(y_i|f_{\mathbf{x}_i})$ to obtain a non-Gaussian distribution which can then be approximated by a desired Gaussian distribution $\hat{q}(f_{\mathbf{x}_i})$

$$\begin{aligned} q_{-i}(f_{\mathbf{x}_i})P(y_i|f_{\mathbf{x}_i})\simeq \hat{Z}_i N(\hat{\mu }_i,\hat{\sigma }_i)\triangleq \hat{q}(f_{\mathbf{x}_i}). \end{aligned}$$

(51)

Finally, we compute the local approximation parameters $ \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i$ of the approximation $t_i$ by minimizing the Kullback–Leibler (KL) divergence between $\hat{q}(f_{\mathbf{x}_i})$ and $q_{-i}(f_{\mathbf{x}_i})t_i$. It can be shown this minimization is equivalent to matching the first and second moments of these two distributions. The zeroth-order (normalization constant), first-order and second-order moments of the distribution $\hat{q}(f_{\mathbf{x}_i})$ can be shown to be

$$\begin{aligned} \hat{Z}_i= & {} \Phi (z_i) \end{aligned}$$

(52)

$$\begin{aligned} \hat{\mu }_i= & {} \mu _{-i}+\frac{y_i \sigma _{-i}^2 N(z_i)}{\Phi (z_i) \sqrt{1+\sigma _{-i}^2}} \end{aligned}$$

(53)

$$\begin{aligned} \hat{\sigma }_i^2= & {} \sigma _{-i}^2 - \frac{\sigma _{-i}^4 N(z_i)}{(1+\sigma _{-i}^2)\Phi (z_i)}\left( z_i+\frac{N(z_i)}{\Phi (z_i)}\right) \end{aligned}$$

(54)

where $z_i=y_i \mu _{-i}/\sqrt{(1+\sigma _{-i}^2)}$. From matching the above moments, the local approximation parameters of $t_i$ are:

$$\begin{aligned} \tilde{\mu }_i= & {} \tilde{\sigma }_i^2 (\hat{\sigma }_i^{-2}\hat{\mu }_i - \sigma _{-i}^{-2} \mu _{-i}) \end{aligned}$$

(55)

$$\begin{aligned} \tilde{\sigma }_i^2= & {} (\hat{\sigma }_i^{-2}-\sigma _{-i}^{-2})^{-1} \end{aligned}$$

(56)

$$\begin{aligned} \tilde{Z}_i= & {} \Phi (z_i) \sqrt{2\pi } \sqrt{\sigma _{-i}^2+\tilde{\sigma }_i^2}\mathrm {exp}\left( \frac{(\mu _{-i}-\tilde{\mu }_i)^2}{2(\sigma _{-i}^2+\tilde{\sigma }_i^2)}\right) \end{aligned}$$

(57)

After updating the site parameters, the approximate posterior distribution $q(\mathbf{f}_L|Y_L)$ is updated using Eq. (41). The pseudo-code for the EP algorithm is provided in Algorithm 2.

Appendix B: Estimation of the GP hyperparameters

In the new approach presented in Sect. 4, we estimate the GP hyperparameters in the data-based kernel $\tilde{\mathcal {K}}$ by maximizing the logarithm of $\tilde{Z}_{EP}$ in

$$\begin{aligned} q(\mathbf{f}_L|Y_L,\mathcal {G}) \triangleq \frac{\prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2) N(\mathbf{0}, \tilde{\mathbf{K}}_{LL})}{\tilde{Z}_{EP}} \end{aligned}$$

(58)

where $\tilde{Z}_{EP}$ is an approximation for marginal likelihood $P(Y_L|\mathcal {G})$ in (31) by using Expectation Propagation. Note that $\tilde{Z}_{EP}=q(Y_L|\mathcal {G})$ is the marginal likelihood associated with the approximate posterior distribution $q(\mathbf{f}_L|Y_L,\mathcal {G})$. Let $ \tilde{{\varvec{\mu }}}$ be a vector containing the $\tilde{\mu }_i$ and $\tilde{\varvec{\Sigma }}$ be a diagonal matrix with elements $\tilde{\varvec{\Sigma }}(i,i)=\tilde{ \sigma }_i^2$. We can rewrite the product of the independent local approximations $t_i$ as

$$\begin{aligned} \prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)=N(\tilde{\mu }, \tilde{\varvec{\Sigma }}) \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$

(59)

Based on the Expectation Propagation algorithm, the approximate marginal likelihood $\tilde{Z}_{EP}$ is given by

$$\begin{aligned} \tilde{Z}_{EP} = \int q(Y_L,\mathbf{f}_L |\mathcal {G} ) d\mathbf{f}_L = \int P(\mathbf{f}_L | \mathcal {G}) \prod _{i=1}^l t_i(f_{\mathbf{x}_i}|\tilde{Z_i}, \tilde{\mu }_i, \tilde{\sigma }_i^2) d\mathbf{f}_L \end{aligned}$$

(60)

Given $P(\mathbf {f_L}|\mathcal {G}) \sim N(\mathbf {0}, {\tilde{\varvec{K}}}_{LL})$ and Eq. (59), and using the property of the product of two multivariate Gaussians, it can be shown that:

$$\begin{aligned} \tilde{Z}_{EP}=Z_0^{-1} \prod _{i=1}^l \tilde{Z}_i \int N(\mathbf{f}_L | {\varvec{\omega }} , {\varvec{\Omega }} ) \mathbf{f}_L =Z_0^{-1} \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$

(61)

where

$$\begin{aligned} Z_0^{-1}= & {} (2\pi )^{l/2} |{\tilde{\mathbf {K}}}_{LL}+{\tilde{\varvec{\Sigma }}}|^{-1/2} \mathrm {exp}\left( -\frac{1}{2}{\tilde{\mu }}^\top ({\tilde{\mathbf {K}}}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}{\tilde{\mu }}\right) \end{aligned}$$

(62)

$$\begin{aligned} {\varvec{\omega }}= & {} {\varvec{\Omega }}{\tilde{\varvec{\Sigma }}}^{-1} {\tilde{\mu }} \end{aligned}$$

(63)

$$\begin{aligned} {\varvec{\Omega }}= & {} ({\tilde{\mathbf {K}}}_{LL}^{-1}+{\tilde{\varvec{\Sigma }}}^{-1})^{-1} \end{aligned}$$

(64)

and $\int N(\mathbf{f}_L | {\varvec{\omega }} , {\varvec{\Omega }} ) \mathbf{f}_L =1 $ by the definition of multivariate Gaussian PDF. Then the log-likelihood yields

$$\begin{aligned} \log \tilde{Z}_{EP}= & {} \mathrm {log} (Z_0^{-1})+\sum _{i=1}^l \mathrm {log }(\tilde{Z}_i) \end{aligned}$$

(65)

$$\begin{aligned}= & {} -\frac{1}{2} \mathrm {log} |\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }}|-\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu } +\sum _{i=1}^l \mathrm {log} \; \Phi \left( \frac{y_i \mu _{-i}}{\sqrt{1+\sigma ^2_{-i}}}\right) \nonumber \\&+\frac{1}{2}\sum _{i=1}^l \mathrm {log}\left( \sigma ^2_{-i}+\tilde{\sigma }^2_i\right) +\sum _{i=1}^l \frac{(\mu _{-i}-\tilde{\mu }_i)^2}{2(\sigma _{-i}^2+\tilde{\sigma }_i^2)} \end{aligned}$$

(66)

where the explicit form of $\tilde{Z}_i$ is derived in the EP process, $\mu _{-i}$ and $\sigma _{-i}$ are the parameters in the cavity distribution.

Optimization of $\log \tilde{Z}_{EP}$ with respect to the hyperparameters of a data-based kernel function $\tilde{\mathcal {K}}$ requires evaluation of the partial derivatives from Eq. (66). Seeger (2005) demonstrated that the derivatives of the last three terms in Eq. (66) with respect to the hyperparameters vanish. As a result, only the derivatives of the first two terms need to be considered.

Let ${\varvec{\theta }}=\{\theta _j\}_{j=1}^m$ be the collection of hyperparameters in the data-based kernel $\tilde{\mathcal {K}}$. Given Eq. (66), the derivative of log-likelihood $\log \tilde{Z}_{EP}$ with respect to hyperparameter $\theta _j$ can be reduced to

$$\begin{aligned} \frac{\partial \log \tilde{Z}_{EP}}{\partial \theta _j}= & {} \frac{\partial }{\partial \theta _j}\left( -\frac{1}{2} \mathrm {log} |\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }}|-\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu }\right) \end{aligned}$$

(67)

$$\begin{aligned}= & {} -\frac{1}{2} {{\,\mathrm{tr}\,}}\left( (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j}\right) \nonumber \\&+\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j}(\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu } \end{aligned}$$

(68)

In this paper, we choose a radial basis function (RBF) kernel as the base kernel $\mathcal {K}$ in Eq. (29), i.e.,

$$\begin{aligned} \mathcal {K} (\mathbf{x}_i, \mathbf{x}_j)=\exp \left( -\frac{(\mathbf{x}_i-\mathbf{x}_j)^\top (\mathbf{x}_i - \mathbf{x}_j)}{2\sigma ^2_{\mathrm {rbf}}}\right) \end{aligned}$$

(69)

Thus, note there are two hyperparameters in the data-based kernel $\tilde{\mathcal {K}}$, which include the regularization parameter $\lambda _R$ and the range (or length-scale) parameter $\sigma _{\mathrm {rbf}}$ in the RBF kernel $\mathcal {K}$. Both of the hyperparameters are optimized under the log-scale. The derivatives of Gram matrix $\tilde{\mathbf {K}}$ with respect to $\log \sigma _{\mathrm {rbf}}$ and $\log \lambda _R$ are derived as

$$\begin{aligned} \frac{\partial \tilde{\mathbf {K}}}{\partial (\log \sigma _{\mathrm {rbf}})}= & {} (\mathbf {I}+\lambda _R \mathbf {K} L)^{-1} \frac{\partial \mathbf {K}}{\partial (\log \sigma _{\mathrm {rbf}})} (\mathbf {I}+\lambda _R L \mathbf {K} )^{-1} \end{aligned}$$

(70)

$$\begin{aligned} \frac{\partial \tilde{\mathbf {K}}}{\partial (\log \lambda _R)}= & {} -(\mathbf {K}^{-1}+\lambda _R L)^{-1} \lambda _R L (\mathbf {K}^{-1}+\lambda _R L)^{-1} \end{aligned}$$

(71)

and for the RBF kernel $\mathcal {K}$, one can easily get

$$\begin{aligned} \frac{\partial \mathbf {K}}{\partial (\log \sigma _{\mathrm {rbf}})} =-2\mathbf {K} \circ \log \mathbf {K} \end{aligned}$$

(72)

where $\circ $ is the Hadamard product between two matrices, and $\log \mathbf {K}$ is a notation that represents taking logarithm of each element in matrix $\mathbf {K}$ (not the logarithm of matrix $\mathbf {K}$). Note that in (68) $\partial \tilde{\mathbf {K}}_{LL} / \partial \theta _j$ is just a submatrix, associated with the labeled data $\mathbf {X}_L$, of $\partial \tilde{\mathbf {K}} / \partial \theta _j$, since the marginal likelihood $\tilde{Z}_{EP}$ can only be computed for the labeled data.

In summary, the hyperparameters $\lambda _R$ and $\sigma _{\mathrm {rbf}}$ in the data-based kernel $\tilde{\mathcal {K}}$ can both be estimated by maximizing the log-likelihood (66). This optimization problem can be solved by using a gradient-based algorithm with the derivatives provided in Eqs. (70–72).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Li, H., Del Castillo, E. & Runger, G. On active learning methods for manifold data. TEST 29, 1–33 (2020). https://doi.org/10.1007/s11749-019-00694-y

Download citation

Published: 02 January 2020
Issue Date: March 2020
DOI: https://doi.org/10.1007/s11749-019-00694-y

Keywords

Mathematics Subject Classification

62

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

On active learning methods for manifold data

Abstract

Access this article

Similar content being viewed by others

Active learning of Gaussian processes with manifold-preserving graph reduction

Active Learning with Clustering and Unsupervised Feature Learning

Selecting Influential Examples: Active Learning with Expected Model Output Changes

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix A: Expectation propagation approximation

Appendix B: Estimation of the GP hyperparameters

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

On active learning methods for manifold data

Abstract

Access this article

Similar content being viewed by others

Active learning of Gaussian processes with manifold-preserving graph reduction

Active Learning with Clustering and Unsupervised Feature Learning

Selecting Influential Examples: Active Learning with Expected Model Output Changes

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix A: Expectation propagation approximation

Appendix B: Estimation of the GP hyperparameters

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation