
On active learning methods for manifold data

  • Invited Paper
  • Published in: TEST

Abstract

Active learning is a major area of interest within machine learning, especially when labeled instances are difficult, time-consuming, or expensive to obtain. In this paper, we review active learning methods for manifold data, in which the intrinsic manifold structure of the data is incorporated into the active learning query strategies. In addition, we present a new manifold-based active learning algorithm for Gaussian process classification. The new method uses a data-dependent kernel derived from a semi-supervised model that considers both labeled and unlabeled data. The method regularizes the smoothness of the fitted function with respect to both the ambient space and the manifold on which the data lie. The regularization parameter is treated as an additional kernel (covariance) parameter and estimated from the data, permitting adaptation of the kernel to the manifold geometry of the given dataset. Comparisons with other active learning methods for manifold data show faster learning performance in our empirical experiments. MATLAB code that reproduces all examples is provided as supplementary material.


Notes

  1. This two-spirals experiment was conducted using the library developed by Stefano Melacci, available at http://www.dii.unisi.it/~melacci/lapsvmp/index.html.

  2. This example is generated using the code provided by Cai and He (2012).

  3. Here we abuse notation: f represents a latent variable rather than the function to be learned, as in Sect. 2.

  4. This is the number of labeled instances required for an algorithm to achieve the specified accuracy.

  5. The regular perceptron update consists of the simple rule: if \((x_t,y_t)\) is misclassified, then \(w_{t+1}=w_{t}+y_t x_t\), where w is the weight vector. For a linear classifier, this update rule moves the classification boundary in the right direction as new instances arrive (a minimal sketch is given after these notes). For a detailed theoretical study, see the classic reference Rosenblatt (1958).

  6. It is also known as the out-of-sample error, a measure of a model's accuracy on unseen instances.

  7. http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt.
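
For illustration, the following is a minimal Python sketch of the perceptron update mentioned in note 5; the function name and signature are ours and not part of the paper's supplementary MATLAB code.

```python
import numpy as np

def perceptron_update(w, x_t, y_t):
    """One online step: if (x_t, y_t) with y_t in {-1, +1} is misclassified, move w toward it."""
    if y_t * (w @ x_t) <= 0:      # misclassified (or on the decision boundary)
        w = w + y_t * x_t         # w_{t+1} = w_t + y_t x_t
    return w
```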

References

  • Alaeddini A, Craft E, Meka R, Martinez S (2019) Sequential Laplacian regularized V-optimal design of experiments for response surface modeling of expensive tests: an application in wind tunnel testing. IISE Trans. https://doi.org/10.1080/24725854.2018.1508928

  • Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404

  • Atlas LE, Cohn DA, Ladner RE (1990) Training connectionist networks with queries and selective sampling. In: Touretzky DS (ed) Advances in neural information processing systems, vol 2. Morgan-Kaufmann, Burlington, pp 566–573

  • Balcan MF, Beygelzimer A, Langford J (2009) Agnostic active learning. J Comput Syst Sci 75(1):78–89

  • Belkin M (2003) Problems of learning on manifolds. Ph.D. thesis, The University of Chicago

  • Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6):1373–1396

  • Belkin M, Niyogi P (2005) Towards a theoretical foundation for Laplacian-based manifold methods. In: Proceedings of conference on learning theory

  • Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434

  • Bishop C (2006) Pattern recognition and machine learning. Springer, New York

  • Cai D, He X (2012) Manifold adaptive experimental design for text categorization. IEEE Trans Knowl Data Eng 24(4):707–719

  • Chaudhuri K, Kakade SM, Netrapalli P, Sanghavi S (2015) Convergence rates of active learning for maximum likelihood estimation. Adv Neural Inf Process Syst 28:1090–1098

  • Chen C, Chen Z, Bu J, Wang C, Zhang L, Zhang C (2010) G-optimal design with Laplacian regularization. In: Proceedings of the twenty-fourth AAAI conference on artificial intelligence, vol 1, pp 413–418

  • Chu W, Ghahramani Z (2005) Preference learning with Gaussian processes. In: Proceedings of the 22nd international conference on machine learning, ICML’05. ACM, New York, NY, USA, pp 137–144. https://doi.org/10.1145/1102351.1102369

  • Chu W, Sindhwani V, Ghahramani Z, Keerthi SS (2007) Relational learning with Gaussian processes. In: Proceedings of the 19th international conference on neural information processing systems, pp 289–296

  • Cohn D (1994) Neural network exploration using optimal experiment design. Adv Neural Inf Process Syst 6:679–686

  • Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15(2):201–221. https://doi.org/10.1007/BF00993277

  • Coifman R, Lafon S, Lee A, Maggioni M, Nadler B, Warner F, Zuker S (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci 102(21):7426–7431

  • Dasgupta S, Hsu D, Monteleoni C (2007) A general agnostic active learning algorithm. Adv Neural Inf Process Syst 20:353–360

  • Dasgupta S, Kalai AT, Monteleoni C (2009) Analysis of perceptron-based active learning. J Mach Learn Res 10:281–299

  • Donoho D, Grimes C (2003) Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc Natl Acad Sci 100(10):5591–5596

  • Evans LPG, Adams NM, Anagnostopoulos C (2015) Estimating optimal active learning via model retraining improvement

  • Fedorov VV (1972) Theory of optimal experiments. Academic Press, Cambridge

  • Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2):133–168

  • Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. http://arxiv.org/abs/1703.02910

  • Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th international conference on machine learning, ICML’07. ACM, New York, NY, USA, pp 353–360. https://doi.org/10.1145/1273496.1273541

  • He X (2010) Laplacian regularized D-optimal design for active learning and its application to image retrieval. IEEE Trans Image Process 19(1):254–263

  • Hein M, Audibert JY, von Luxburg U (2005) From graphs to manifolds–weak and strong pointwise consistency of graph Laplacians. In: Proceedings of the 18th conference on learning theory

  • Houlsby N, Huszár F, Ghahramani Z, Lengyel M (2011) Bayesian active learning for classification and preference learning

  • Joshi AJ, Porikli F, Papanikolopoulos N (2009) Multiclass active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2372–2379

  • Kapoor A, Grauman K, Urtasun R, Darrell T (2007) Active learning with Gaussian processes for object categorization. In: IEEE 11th international conference on computer vision, vol 2

  • Lafon S (2004) Diffusion maps and geometric harmonics. Ph.D. thesis, Yale University

  • Li X, Guo Y (2013) Adaptive active learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 859–866

  • Li C, Liu H, Cai D (2014) Active learning on manifolds. Neurocomputing 123:398–405

  • McCallum A, Nigam K (1998) Employing EM and pool-based active learning for text classification. In: Proceedings of the fifteenth international conference on machine learning, ICML’98. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 350–358. http://dl.acm.org/citation.cfm?id=645527.757765. Retrieved 18 Dec 2019

  • Minka TP (2001) A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology

  • Nickisch H, Rasmussen CE (2008) Approximations for binary Gaussian process classification. J Mach Learn Res 9:2035–2078

  • Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, Cambridge

  • Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65:386

  • Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326

  • Roy N, Mccallum A (2001) Toward optimal active learning through Monte Carlo estimation of error reduction. In: Proceedings of the international conference on machine learning

  • Seeger M (2003) Bayesian Gaussian process models: Pac-Bayesian generalisation error bounds and sparse approximations. Ph.D. thesis, University of Edinburgh

  • Seeger M (2005) Expectation propagation for exponential families

  • Settles B (2010) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison

  • Settles B (2012) Active learning. Morgan & Claypool, New York

  • Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the fifth annual workshop on computational learning theory, COLT’92. ACM, New York, NY, USA, pp 287–294. https://doi.org/10.1145/130385.130417

  • Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423

  • Sindhwani V, Niyogi P, Belkin M (2005) Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings, twenty second international conference on machine learning

  • Sindhwani V, Chu W, Keerthi SS (2007) Semi-supervised Gaussian process classifiers. In: International joint conference on artificial intelligence, pp 1059–1064

  • Sun S, Hussain Z, Shawe-Taylor J (2014) Manifold-preserving graph reduction for sparse semi-supervised learning. Neurocomputing 124:13–21

  • Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323

  • Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the ninth ACM international conference on multimedia, MULTIMEDIA’01. ACM, New York, NY, USA, pp 107–118. https://doi.org/10.1145/500141.500159

  • Wahba G (1990) Spline models for observational data. Society for Industrial and Applied Mathematics, Philadelphia

  • Xu H, Yu L, Davenport MA, Zha H (2017) Active manifold learning via a unified framework for manifold landmarking. http://arxiv.org/abs/1710.09334

  • Yao G, Lu K, He X (2013) G-optimal feature selection with Laplacian regularization. Neurocomputing 119:175–181

  • Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd international conference on machine learning

  • Yu K, Zhu S, Xu W, Gong Y (2008) Non-greedy active learning for text categorization using convex transductive experimental design. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, pp 635–642

  • Zeng J, Lesnikowski A, Alvarez JM (2018) The relevance of Bayesian layer positioning to model uncertainty in deep Bayesian active learning. http://arxiv.org/abs/1811.12535

  • Zhou J, Sun S (2014) Active learning of Gaussian processes with manifold-preserving graph reduction. Neural Comput Appl 25:1615–1625

  • Zhu X, Ghahramani Z, Lafferty J (2003a) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the twentieth international conference on machine learning

  • Zhu X, Lafferty J, Ghahramani Z (2003b) Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data

Acknowledgements

We thank the anonymous referees and the editors for useful suggestions that have significantly improved the presentation of this paper.

Funding

Funding was provided by National Science Foundation (US) (Grant No. 1537987).

Author information

Corresponding author

Correspondence to Enrique Del Castillo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This invited paper is discussed in comments available at: https://doi.org/10.1007/s11749-019-00695-x, https://doi.org/10.1007/s11749-019-00696-w.

The authors gratefully acknowledge NSF Grant CMMI 1537987.

Appendices

Appendix A: Expectation propagation approximation

As discussed before, for GP classification (GPC), we need to approximate the posterior distribution of the latent variables. Nickisch and Rasmussen (2008) provide a comprehensive overview of approximate inference methods for the binary GPC model. The algorithms they evaluate include the Laplace approximation, Expectation Propagation, KL-divergence minimization, variational bounds, factorial variational methods and Markov chain Monte Carlo. They conclude that “Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight.” Thus, in the SSGP-AL algorithm, Expectation Propagation (EP) is chosen as the approximation method for GPC. To simplify notation, we present the standard EP algorithm for supervised GPC. In our paper, we use a data-dependent kernel \(\tilde{\mathcal {K}}\) that also exploits the information in the unlabeled instances, so the posterior distribution is additionally conditioned on the graph \(\mathcal {G}\).

In EP, the problem of non-Gaussian likelihood function is solved by a local likelihood approximation in the form of an un-normalized Gaussian function in the latent variable \(f_{\mathbf{x}_i}\), i.e.,

$$\begin{aligned} P(y_i|f_{\mathbf{x}_i}) \simeq t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)\triangleq \tilde{Z}_i N(f_{\mathbf{x}_i}|\tilde{\mu }_i,\tilde{ \sigma }_i^2) \end{aligned}$$
(37)

where \( \tilde{\mu }_i, \tilde{ \sigma }_i^2 \) and \(\tilde{Z}_i\) are local approximation parameters. After we have the local approximations, the posterior distribution \(P(\mathbf{f}_L|Y_L)\) can be approximated by \(q(\mathbf{f}_L|Y_L)\), defined as

$$\begin{aligned} q(\mathbf{f}_L|Y_L) \triangleq \frac{\prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2) N(\mathbf{0}, \mathbf{K}_{LL})}{Z_{EP}} \end{aligned}$$
(38)

Note that \(Z_{EP}=q(Y_L)\) is the marginal likelihood (also known as evidence or normalization constant) associated with the approximate posterior distribution \(q(\mathbf{f}_L|Y_L)\). Let \( \tilde{{\varvec{\mu }}}\) be a vector of the \(\tilde{\mu }_i\) and \({\tilde{\varvec{\Sigma }}}\) be a diagonal matrix with elements \({\tilde{\Sigma }}_{ii}=\tilde{ \sigma }_i^2\); then we can rewrite the product of the independent local approximations \(t_i\) as

$$\begin{aligned} \prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)=N(\tilde{\mu }, {\tilde{\Sigma }}) \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$
(39)

By the property of product of two Gaussians, it can be easily shown that

$$\begin{aligned} q(\mathbf f_L|Y_L)\sim N(\mathbf \mu , \Sigma ) \end{aligned}$$
(40)

where

$$\begin{aligned} {\varvec{\mu }}={\varvec{\Sigma }}{\tilde{\varvec{\Sigma }}}^{-1}\tilde{{\varvec{\mu }}} \text { and } {\varvec{\Sigma }}=(\mathbf{K}_{LL}^{-1}+{\tilde{\varvec{\Sigma }}}^{-1})^{-1}. \end{aligned}$$
(41)
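
For concreteness, the following is a minimal NumPy sketch of Eq. (41), recovering the Gaussian approximate posterior from the site parameters; the function and variable names (posterior_moments, K_LL, mu_tilde, sigma2_tilde) are illustrative and not taken from the paper's supplementary MATLAB code. In practice, a numerically stabler Cholesky-based formulation (Rasmussen and Williams 2006) would be preferred over explicit matrix inverses.

```python
import numpy as np

def posterior_moments(K_LL, mu_tilde, sigma2_tilde):
    """Eq. (41): Sigma = (K_LL^{-1} + Sigma_tilde^{-1})^{-1}, mu = Sigma Sigma_tilde^{-1} mu_tilde."""
    Sigma_tilde_inv = np.diag(1.0 / sigma2_tilde)                 # site precisions on the diagonal
    Sigma = np.linalg.inv(np.linalg.inv(K_LL) + Sigma_tilde_inv)  # approximate posterior covariance
    mu = Sigma @ (mu_tilde / sigma2_tilde)                        # approximate posterior mean
    return mu, Sigma
```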

Given the approximate posterior distribution \(q(f_L|Y_L)\), one can compute the approximate posterior distribution of latent variables \(f_{\mathbf{x}_t}\) at test point \(\mathbf{x}_t\), given by

$$\begin{aligned} P(f_{\mathbf{x}_t}|Y_L)= \int P(f_{\mathbf{x}_t}|f_L) P(f_L|Y_L) df_L \approx N(\mu _t, \sigma ^2_t), \end{aligned}$$
(42)

where

$$\begin{aligned} \mu _t= & {} \mathbf{K}_{Lt}^T \mathbf{K}_{LL}^{-1}{\varvec{\mu }} \end{aligned}$$
(43)
$$\begin{aligned} \sigma _t^2= & {} \mathcal {K}(\mathbf{x}_t,\mathbf{x}_t)-\mathbf{K}_{Lt}^T(\mathbf{K}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}\mathbf{K}_{Lt}\end{aligned}$$
(44)
$$\begin{aligned} \mathbf{K}_{Lt}= & {} [\mathcal {K}(\mathbf{x}_t,\mathbf{x}_1),\ldots ,\mathcal {K}(\mathbf{x}_t,\mathbf{x}_l)]^T \end{aligned}$$
(45)

It can be shown that the predictive probability at test point \(\mathbf{x}_t\) is given by

$$\begin{aligned} q(y_t=1|Y_L)= & {} \Phi \left( \frac{\mu _t}{\sqrt{1+\sigma _t^2}}\right) \end{aligned}$$
(46)
$$\begin{aligned}= & {} \Phi \left( \frac{\mathbf{K}_{Lt}^T (\mathbf{K}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}{{\tilde{\varvec{\mu }}}}}{\sqrt{1+\mathcal {K}(\mathbf{x}_t,\mathbf{x}_t)-\mathbf{K}_{Lt}^T(\mathbf{K}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}\mathbf{K}_{Lt}}}\right) \end{aligned}$$
(47)
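
A short sketch of the predictive computations in Eqs. (43)–(47) for a single test point might look as follows; `kernel`, `X_L`, `mu_tilde` and `sigma2_tilde` are assumed, illustrative names, and the predictive mean is computed in the equivalent form of Eq. (47) rather than via \(\mathbf{K}_{LL}^{-1}{\varvec{\mu }}\).

```python
import numpy as np
from scipy.stats import norm

def predictive_probability(kernel, X_L, x_t, mu_tilde, sigma2_tilde):
    """Approximate P(y_t = 1 | Y_L) at test point x_t, Eqs. (43)-(47)."""
    K_LL = kernel(X_L, X_L)                               # Gram matrix of labeled inputs
    K_Lt = kernel(X_L, x_t[None, :]).ravel()              # Eq. (45)
    B = K_LL + np.diag(sigma2_tilde)                      # K_LL + Sigma_tilde
    mu_t = K_Lt @ np.linalg.solve(B, mu_tilde)            # predictive mean, Eq. (47)
    var_t = kernel(x_t[None, :], x_t[None, :])[0, 0] \
            - K_Lt @ np.linalg.solve(B, K_Lt)             # predictive variance, Eq. (44)
    return norm.cdf(mu_t / np.sqrt(1.0 + var_t))          # Eq. (46)
```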

In order to obtain the posterior distribution \(q(\mathbf{f}_L|Y_L)\), we first need to compute the local approximations \(t_i\) and their corresponding parameters \( \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i\). The key idea behind the EP algorithm is to update the individual \(t_i\) approximations sequentially. We start from the current posterior approximation \(q(f_{\mathbf{x}_i}|Y_L)\sim N(\mu _i, \sigma _i^2)\) and, after leaving out the local approximation \(t_i\), obtain the so-called cavity distribution

$$\begin{aligned} q_{-i}(f_{\mathbf{x}_i})\triangleq N(f_{\mathbf{x}_i}|\mu _{-i}, \sigma _{-i}^2) \end{aligned}$$
(48)

where

$$\begin{aligned} \mu _{-i}= & {} \sigma _{-i}^2 (\sigma _i^{-2}\mu _i - \tilde{\sigma }_i^{-2}\tilde{\mu }_i ) \end{aligned}$$
(49)
$$\begin{aligned} \sigma _{-i}^2= & {} (\sigma _i^{-2}-\tilde{\sigma }_i^{-2})^{-1} \end{aligned}$$
(50)

This result can be easily checked by multiplying Eqs. (37) and (48) to recover \(q(f_{\mathbf{x}_i}|Y_L)\). Next, the cavity distribution \(q_{-i}(f_{\mathbf{x}_i})\) is combined with the exact likelihood function \(P(y_i|f_{\mathbf{x}_i})\) to obtain a non-Gaussian distribution, which is then approximated by a Gaussian distribution \(\hat{q}(f_{\mathbf{x}_i})\)

$$\begin{aligned} q_{-i}(f_{\mathbf{x}_i})P(y_i|f_{\mathbf{x}_i})\simeq \hat{Z}_i N(f_{\mathbf{x}_i}|\hat{\mu }_i,\hat{\sigma }_i^2)\triangleq \hat{q}(f_{\mathbf{x}_i}). \end{aligned}$$
(51)

Finally, we compute the local approximation parameters \( \tilde{\mu }_i, \tilde{ \sigma }_i^2, \tilde{Z}_i\) of the approximation \(t_i\) by minimizing the Kullback–Leibler (KL) divergence between \(\hat{q}(f_{\mathbf{x}_i})\) and \(q_{-i}(f_{\mathbf{x}_i})t_i\). It can be shown that this minimization is equivalent to matching the first and second moments of these two distributions. The zeroth-order (normalization constant), first-order and second-order moments of the distribution \(\hat{q}(f_{\mathbf{x}_i})\) can be shown to be

$$\begin{aligned} \hat{Z}_i= & {} \Phi (z_i) \end{aligned}$$
(52)
$$\begin{aligned} \hat{\mu }_i= & {} \mu _{-i}+\frac{y_i \sigma _{-i}^2 N(z_i)}{\Phi (z_i) \sqrt{1+\sigma _{-i}^2}} \end{aligned}$$
(53)
$$\begin{aligned} \hat{\sigma }_i^2= & {} \sigma _{-i}^2 - \frac{\sigma _{-i}^4 N(z_i)}{(1+\sigma _{-i}^2)\Phi (z_i)}\left( z_i+\frac{N(z_i)}{\Phi (z_i)}\right) \end{aligned}$$
(54)

where \(z_i=y_i \mu _{-i}/\sqrt{(1+\sigma _{-i}^2)}\). From matching the above moments, the local approximation parameters of \(t_i\) are:

$$\begin{aligned} \tilde{\mu }_i= & {} \tilde{\sigma }_i^2 (\hat{\sigma }_i^{-2}\hat{\mu }_i - \sigma _{-i}^{-2} \mu _{-i}) \end{aligned}$$
(55)
$$\begin{aligned} \tilde{\sigma }_i^2= & {} (\hat{\sigma }_i^{-2}-\sigma _{-i}^{-2})^{-1} \end{aligned}$$
(56)
$$\begin{aligned} \tilde{Z}_i= & {} \Phi (z_i) \sqrt{2\pi } \sqrt{\sigma _{-i}^2+\tilde{\sigma }_i^2}\mathrm {exp}\left( \frac{(\mu _{-i}-\tilde{\mu }_i)^2}{2(\sigma _{-i}^2+\tilde{\sigma }_i^2)}\right) \end{aligned}$$
(57)

After updating the site parameters, the approximate posterior distribution \(q(\mathbf{f}_L|Y_L)\) is updated using Eq. (41). The pseudo-code for the EP algorithm is provided in Algorithm 2.

[Algorithm 2: pseudo-code for the EP algorithm]
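
Since the pseudo-code figure is not reproduced here, the following compact Python sketch illustrates the EP sweep of Eqs. (48)–(56) combined with the posterior update of Eq. (41); it is an unoptimized illustration under our own naming conventions, not the paper's Algorithm 2 or its MATLAB implementation.

```python
import numpy as np
from scipy.stats import norm

def ep_sweeps(K_LL, y, n_sweeps=10):
    """Sequential EP site updates for binary GPC with probit likelihood (y in {-1, +1})."""
    l = len(y)
    mu_tilde = np.zeros(l)                    # site means
    sigma2_tilde = np.full(l, 1e6)            # site variances (large = nearly uninformative start)
    for _ in range(n_sweeps):
        # posterior moments from the current sites, Eq. (41)
        Sigma = np.linalg.inv(np.linalg.inv(K_LL) + np.diag(1.0 / sigma2_tilde))
        mu = Sigma @ (mu_tilde / sigma2_tilde)
        for i in range(l):
            # cavity distribution, Eqs. (49)-(50)
            var_cav = 1.0 / (1.0 / Sigma[i, i] - 1.0 / sigma2_tilde[i])
            mu_cav = var_cav * (mu[i] / Sigma[i, i] - mu_tilde[i] / sigma2_tilde[i])
            # moments of the tilted distribution, Eqs. (52)-(54)
            z = y[i] * mu_cav / np.sqrt(1.0 + var_cav)
            r = norm.pdf(z) / norm.cdf(z)
            mu_hat = mu_cav + y[i] * var_cav * r / np.sqrt(1.0 + var_cav)
            var_hat = var_cav - var_cav ** 2 * r * (z + r) / (1.0 + var_cav)
            # site parameters by moment matching, Eqs. (55)-(56)
            sigma2_tilde[i] = 1.0 / (1.0 / var_hat - 1.0 / var_cav)
            mu_tilde[i] = sigma2_tilde[i] * (mu_hat / var_hat - mu_cav / var_cav)
    return mu_tilde, sigma2_tilde
```

A production implementation would instead use rank-one posterior updates and a numerically robust parameterization, as described by Rasmussen and Williams (2006); the sketch simply recomputes Eq. (41) once per sweep for clarity.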

Appendix B: Estimation of the GP hyperparameters

In the new approach presented in Sect. 4, we estimate the GP hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\) by maximizing the logarithm of \(\tilde{Z}_{EP}\) in

$$\begin{aligned} q(\mathbf{f}_L|Y_L,\mathcal {G}) \triangleq \frac{\prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2) N(\mathbf{0}, \tilde{\mathbf{K}}_{LL})}{\tilde{Z}_{EP}} \end{aligned}$$
(58)

where \(\tilde{Z}_{EP}\) is the Expectation Propagation approximation of the marginal likelihood \(P(Y_L|\mathcal {G})\) in (31). Note that \(\tilde{Z}_{EP}=q(Y_L|\mathcal {G})\) is the marginal likelihood associated with the approximate posterior distribution \(q(\mathbf{f}_L|Y_L,\mathcal {G})\). Let \( \tilde{{\varvec{\mu }}}\) be a vector containing the \(\tilde{\mu }_i\) and \(\tilde{\varvec{\Sigma }}\) be a diagonal matrix with elements \(\tilde{\varvec{\Sigma }}(i,i)=\tilde{ \sigma }_i^2\). We can rewrite the product of the independent local approximations \(t_i\) as

$$\begin{aligned} \prod _{i=1}^l t_i (f_{\mathbf{x}_i}|\tilde{Z}_i, \tilde{\mu }_i,\tilde{ \sigma }_i^2)=N(\tilde{\mu }, \tilde{\varvec{\Sigma }}) \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$
(59)

Based on the Expectation Propagation algorithm, the approximate marginal likelihood \(\tilde{Z}_{EP}\) is given by

$$\begin{aligned} \tilde{Z}_{EP} = \int q(Y_L,\mathbf{f}_L |\mathcal {G} ) d\mathbf{f}_L = \int P(\mathbf{f}_L | \mathcal {G}) \prod _{i=1}^l t_i(f_{\mathbf{x}_i}|\tilde{Z_i}, \tilde{\mu }_i, \tilde{\sigma }_i^2) d\mathbf{f}_L \end{aligned}$$
(60)

Given \(P(\mathbf {f_L}|\mathcal {G}) \sim N(\mathbf {0}, {\tilde{\varvec{K}}}_{LL})\) and Eq. (59), and using the property of the product of two multivariate Gaussians, it can be shown that:

$$\begin{aligned} \tilde{Z}_{EP}=Z_0^{-1} \prod _{i=1}^l \tilde{Z}_i \int N(\mathbf{f}_L | {\varvec{\omega }} , {\varvec{\Omega }} ) \, d\mathbf{f}_L =Z_0^{-1} \prod _{i=1}^l \tilde{Z}_i \end{aligned}$$
(61)

where

$$\begin{aligned} Z_0^{-1}= & {} (2\pi )^{-l/2} |{\tilde{\mathbf {K}}}_{LL}+{\tilde{\varvec{\Sigma }}}|^{-1/2} \mathrm {exp}\left( -\frac{1}{2}{\tilde{\mu }}^\top ({\tilde{\mathbf {K}}}_{LL}+{\tilde{\varvec{\Sigma }}})^{-1}{\tilde{\mu }}\right) \end{aligned}$$
(62)
$$\begin{aligned} {\varvec{\omega }}= & {} {\varvec{\Omega }}{\tilde{\varvec{\Sigma }}}^{-1} {\tilde{\mu }} \end{aligned}$$
(63)
$$\begin{aligned} {\varvec{\Omega }}= & {} ({\tilde{\mathbf {K}}}_{LL}^{-1}+{\tilde{\varvec{\Sigma }}}^{-1})^{-1} \end{aligned}$$
(64)

and \(\int N(\mathbf{f}_L | {\varvec{\omega }} , {\varvec{\Omega }} ) \, d\mathbf{f}_L =1 \) by the definition of a multivariate Gaussian PDF. Taking logarithms then yields

$$\begin{aligned} \log \tilde{Z}_{EP}= & {} \mathrm {log} (Z_0^{-1})+\sum _{i=1}^l \mathrm {log }(\tilde{Z}_i) \end{aligned}$$
(65)
$$\begin{aligned}= & {} -\frac{1}{2} \mathrm {log} |\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }}|-\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu } +\sum _{i=1}^l \mathrm {log} \; \Phi \left( \frac{y_i \mu _{-i}}{\sqrt{1+\sigma ^2_{-i}}}\right) \nonumber \\&+\frac{1}{2}\sum _{i=1}^l \mathrm {log}\left( \sigma ^2_{-i}+\tilde{\sigma }^2_i\right) +\sum _{i=1}^l \frac{(\mu _{-i}-\tilde{\mu }_i)^2}{2(\sigma _{-i}^2+\tilde{\sigma }_i^2)} \end{aligned}$$
(66)

where the explicit form of \(\tilde{Z}_i\) is derived in the EP process, and \(\mu _{-i}\) and \(\sigma _{-i}^2\) are the parameters of the cavity distribution.
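
Under the same illustrative naming as in the sketches above, and assuming the cavity parameters \((\mu_{-i}, \sigma_{-i}^2)\) were recorded during the final EP sweep, Eq. (66) could be evaluated as follows.

```python
import numpy as np
from scipy.stats import norm

def log_Z_ep(K_LL, y, mu_tilde, sigma2_tilde, mu_cav, var_cav):
    """Approximate log marginal likelihood, Eq. (66)."""
    B = K_LL + np.diag(sigma2_tilde)                        # K_tilde_LL + Sigma_tilde
    _, logdet = np.linalg.slogdet(B)
    quad = mu_tilde @ np.linalg.solve(B, mu_tilde)
    z = y * mu_cav / np.sqrt(1.0 + var_cav)
    return (-0.5 * logdet
            - 0.5 * quad
            + np.sum(norm.logcdf(z))                        # sum_i log Phi(z_i)
            + 0.5 * np.sum(np.log(var_cav + sigma2_tilde))
            + np.sum((mu_cav - mu_tilde) ** 2 / (2.0 * (var_cav + sigma2_tilde))))
```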

Optimization of \(\log \tilde{Z}_{EP}\) with respect to the hyperparameters of a data-based kernel function \(\tilde{\mathcal {K}}\) requires evaluation of the partial derivatives from Eq. (66). Seeger (2005) demonstrated that the derivatives of the last three terms in Eq. (66) with respect to the hyperparameters vanish. As a result, only the derivatives of the first two terms need to be considered.

Let \({\varvec{\theta }}=\{\theta _j\}_{j=1}^m\) be the collection of hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\). Given Eq. (66), the derivative of log-likelihood \(\log \tilde{Z}_{EP}\) with respect to hyperparameter \(\theta _j\) can be reduced to

$$\begin{aligned} \frac{\partial \log \tilde{Z}_{EP}}{\partial \theta _j}= & {} \frac{\partial }{\partial \theta _j}\left( -\frac{1}{2} \mathrm {log} |\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }}|-\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu }\right) \end{aligned}$$
(67)
$$\begin{aligned}= & {} -\frac{1}{2} {{\,\mathrm{tr}\,}}\left( (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j}\right) \nonumber \\&+\frac{1}{2}\tilde{\mu }^\top (\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\frac{\partial \tilde{\mathbf {K}}_{LL}}{\partial \theta _j}(\tilde{\mathbf {K}}_{LL}+\tilde{\varvec{\Sigma }})^{-1}\tilde{\mu } \end{aligned}$$
(68)

In this paper, we choose a radial basis function (RBF) kernel as the base kernel \(\mathcal {K}\) in Eq. (29), i.e.,

$$\begin{aligned} \mathcal {K} (\mathbf{x}_i, \mathbf{x}_j)=\exp \left( -\frac{(\mathbf{x}_i-\mathbf{x}_j)^\top (\mathbf{x}_i - \mathbf{x}_j)}{2\sigma ^2_{\mathrm {rbf}}}\right) \end{aligned}$$
(69)

Note that there are two hyperparameters in the data-based kernel \(\tilde{\mathcal {K}}\): the regularization parameter \(\lambda _R\) and the range (or length-scale) parameter \(\sigma _{\mathrm {rbf}}\) of the RBF kernel \(\mathcal {K}\). Both hyperparameters are optimized on the log scale. The derivatives of the Gram matrix \(\tilde{\mathbf {K}}\) with respect to \(\log \sigma _{\mathrm {rbf}}\) and \(\log \lambda _R\) are

$$\begin{aligned} \frac{\partial \tilde{\mathbf {K}}}{\partial (\log \sigma _{\mathrm {rbf}})}= & {} (\mathbf {I}+\lambda _R \mathbf {K} L)^{-1} \frac{\partial \mathbf {K}}{\partial (\log \sigma _{\mathrm {rbf}})} (\mathbf {I}+\lambda _R L \mathbf {K} )^{-1} \end{aligned}$$
(70)
$$\begin{aligned} \frac{\partial \tilde{\mathbf {K}}}{\partial (\log \lambda _R)}= & {} -(\mathbf {K}^{-1}+\lambda _R L)^{-1} \lambda _R L (\mathbf {K}^{-1}+\lambda _R L)^{-1} \end{aligned}$$
(71)

and for the RBF kernel \(\mathcal {K}\), one can easily get

$$\begin{aligned} \frac{\partial \mathbf {K}}{\partial (\log \sigma _{\mathrm {rbf}})} =-2\mathbf {K} \circ \log \mathbf {K} \end{aligned}$$
(72)

where \(\circ \) denotes the Hadamard (element-wise) product between two matrices, and \(\log \mathbf {K}\) denotes taking the logarithm of each element of matrix \(\mathbf {K}\) (not the matrix logarithm of \(\mathbf {K}\)). Note that in (68) \(\partial \tilde{\mathbf {K}}_{LL} / \partial \theta _j\) is simply the submatrix of \(\partial \tilde{\mathbf {K}} / \partial \theta _j\) associated with the labeled data \(\mathbf {X}_L\), since the marginal likelihood \(\tilde{Z}_{EP}\) can only be computed for the labeled data.
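
A sketch of the gradient computation combining Eqs. (68) and (70)–(72) is given below. It assumes the data-based Gram matrix takes the form \(\tilde{\mathbf{K}}=(\mathbf{K}^{-1}+\lambda _R L)^{-1}\), which is consistent with Eqs. (70)–(71); `K` is the RBF Gram matrix over all labeled and unlabeled points, `L` the graph Laplacian, `labeled_idx` indexes the labeled block, and the site parameters come from the EP approximation. All names are illustrative, not taken from the supplementary MATLAB code.

```python
import numpy as np

def grad_log_Z_ep(K, L, lambda_R, labeled_idx, mu_tilde, sigma2_tilde):
    """Gradient of Eq. (66) w.r.t. (log sigma_rbf, log lambda_R), via Eqs. (68), (70)-(72)."""
    n = K.shape[0]
    I = np.eye(n)
    K_tilde = np.linalg.inv(np.linalg.inv(K) + lambda_R * L)      # data-based Gram matrix (assumed form)
    # Eq. (72): dK/d(log sigma_rbf) = -2 K o log K  (element-wise logarithm)
    dK = -2.0 * K * np.log(K)
    # Eq. (70)
    dKt_dlogsigma = np.linalg.inv(I + lambda_R * K @ L) @ dK @ np.linalg.inv(I + lambda_R * L @ K)
    # Eq. (71)
    dKt_dloglambda = -K_tilde @ (lambda_R * L) @ K_tilde
    # Eq. (68), evaluated on the labeled block for each hyperparameter
    Kt_LL = K_tilde[np.ix_(labeled_idx, labeled_idx)]
    C = Kt_LL + np.diag(sigma2_tilde)
    alpha = np.linalg.solve(C, mu_tilde)
    grads = []
    for dKt in (dKt_dlogsigma, dKt_dloglambda):
        dKt_LL = dKt[np.ix_(labeled_idx, labeled_idx)]
        grads.append(-0.5 * np.trace(np.linalg.solve(C, dKt_LL))
                     + 0.5 * alpha @ dKt_LL @ alpha)
    return np.array(grads)   # [d/d(log sigma_rbf), d/d(log lambda_R)]
```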

In summary, the hyperparameters \(\lambda _R\) and \(\sigma _{\mathrm {rbf}}\) in the data-based kernel \(\tilde{\mathcal {K}}\) can both be estimated by maximizing the log-likelihood (66). This optimization problem can be solved by using a gradient-based algorithm with the derivatives provided in Eqs. (70)–(72).

Cite this article

Li, H., Del Castillo, E. & Runger, G. On active learning methods for manifold data. TEST 29, 1–33 (2020). https://doi.org/10.1007/s11749-019-00694-y
