Fast Gaussian Process Regression for Big Data
Introduction
Gaussian Processes (GPs) are attractive tools for supervised learning on complex datasets where traditional parametric methods may not be effective. Parametric methods make rigid a priori assumptions about the nature and form of the model; a linear regression model is one such choice. This choice may not adapt as the data changes: for example, updates to the data might suggest that a non-linear model is more appropriate. Finding a suitable parametric form for a complex dataset is very challenging without prior experience with the data. Gaussian Processes offer flexibility in this regard. Rather than committing to a rigid functional form between the response and the predictor variables, we only need to commit to a broad family of functions suitable for the problem by specifying an appropriate covariance function. This can simplify the effort required to pick a good model when confronted with an unfamiliar dataset that requires a complex model. Gaussian Processes are also easier to use than alternatives such as neural networks [1], and they offer some practical advantages over Support Vector Machines (SVMs) [2]: they provide uncertainty estimates with predictions, the kernel and regularization parameters can be learned directly from the data instead of through cross-validation, and feature selection can be incorporated into the learning algorithm.
For regression, exact inference is possible with Gaussian Processes. To apply Gaussian Processes to classification, we must resort to approximate inference techniques such as Markov Chain Monte Carlo, the Laplace approximation, or variational inference. Even though exact inference is possible for Gaussian Process regression, computing the solution requires matrix inversion. For a dataset with n rows or data instances, the time complexity of matrix inversion is O(n³), and the space complexity of storing the n × n kernel matrix is O(n²). This restricts the applicability of the technique to small or moderate-sized datasets. In this paper we present an algorithm that uses subset selection and ideas borrowed from bootstrap aggregation to mitigate this problem. A parallel implementation of the algorithm is also possible and can further improve performance.
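To make the bottleneck concrete, the following is a minimal sketch of exact GP regression with a squared-exponential kernel; the kernel choice, variable names, and noise value are illustrative assumptions, not taken from this paper. The Cholesky factorization of the n × n kernel matrix is the O(n³) step, and materializing K is the O(n²) storage cost.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 l^2)).
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    # Exact GP regression: O(n^2) memory for K, O(n^3) time for the Cholesky.
    n = X.shape[0]
    K = rbf_kernel(X, X)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^{-1} y
    K_star = rbf_kernel(X_star, X)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_star, X_star) - v.T @ v
    return mean, np.diag(cov)
```

At n in the millions, neither the factorization nor the storage of K is feasible on commodity hardware, which is what motivates the subset-based approach presented in this paper.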
The rest of this paper is organized as follows. In section 2, we motivate the problem. In section 3, we present our solution, which is based on combining estimators developed on subsets of the data. The subsets are selected by simple random sampling with replacement (similar to what is done in the bootstrap). The size of the selected subset is a key aspect of this algorithm and is determined empirically; we present two methods to determine it. We also present the motivating ideas leading to the final form of the algorithm. When the model has an additive structure of univariate components, this has attractive implications for the convergence rate [3], and an additive model worked well for the datasets used in this study. Relevant facts from minimax theory for non-parametric regression, consistent with the experimental results reported in this work, are presented.
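To preview those minimax facts (in our notation, which may differ from the paper's): for a k-times differentiable regression function of d covariates, the minimax rate of convergence is dimension-dependent, whereas an additive model attains the one-dimensional rate (see Stone [3]):

```latex
% Minimax risk for nonparametric regression with d covariates and
% smoothness k:
\inf_{\hat f}\ \sup_{f}\ \mathbb{E}\,\lVert \hat f - f \rVert^{2} \asymp n^{-2k/(2k+d)}
% If f is additive, f(x) = \sum_{j=1}^{d} f_j(x_j), each component is
% univariate and the achievable rate improves to
n^{-2k/(2k+1)}
```

The additive rate is free of the dimension d, which is why a comparatively small sample can suffice when an additive model fits the data.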
In section 4, we present a brief summary of related work. Applying Gaussian Processes to large datasets has attracted a lot of interest from the machine learning research community, and we connect the algorithm reported in this work to that research. Selecting parameters for an algorithm is an arduous task; however, this algorithm has only two important parameters, the subset size and the number of estimators. We provide guidelines for picking these parameters based on detailed experiments across a range of datasets.
In section 5, we provide experimental results that give insight into the effect of the parameters associated with the algorithm. In section 6, we illustrate the application of our algorithm to synthetic and real-world datasets, including datasets with over a million instances. We compare the performance of the proposed method to the Sparse Gaussian Process [4] and the Stochastic Variational Gaussian Process [5]; these algorithms require inputs similar to those of the proposed method and are therefore applicable in a similar context. We also compare the estimates obtained from the reported algorithm with two other popular methods for regression on large datasets: Gradient Boosted Trees (using XGBoost [6]) and the Generalized Additive Model (GAM) [7]. Results from the experiments performed as part of this study show that accuracies from the proposed method are comparable to those obtained from Gradient Boosted Trees or GAMs. However, there are some distinct advantages to using a Gaussian Process model. A Gaussian Process model yields uncertainty estimates directly, whereas methods like Gradient Boosted Trees do not provide this, at least directly [8], [9]. A Gaussian Process model is also directly interpretable [2, pp. 5-6] in comparison to methods like Gradient Boosted Trees or neural networks. Therefore, the proposed method can yield both explanatory and predictive models. It is also possible to use stacking [10] to combine the estimates from the proposed model with those obtained from a competing model (like Gradient Boosted Trees) and obtain higher accuracies; combining a Gaussian Process solution with XGBoost has been done by [11].
In section 7, we present the conclusions from this work. The contributions of this work are as follows. The proposed method for performing Gaussian Process regression on large datasets has a very simple implementation in comparison to other alternatives, with similar levels of accuracy. The algorithm has two key parameters, the subset size and the number of estimators, and detailed guidelines for picking them are provided. The choice of a method to scale Gaussian Process regression to large datasets depends on the characteristics of the problem; this is discussed in section 4. The proposed method is most effective for problems where the response depends on a small number of features and the kernel characteristics are unknown. In such cases, exploratory data analysis can be used to arrive at appropriate kernel choices [12]. Additive models may work well for these problems, and appropriate preprocessing such as principal component analysis can be used, if needed, to create additive models. The rate of convergence for additive models is attractive [13], which implies that we can build very effective models with a small proportion of the samples. Sparse Gaussian Processes [14] and Stochastic Variational Gaussian Processes [5] are also appropriate for such problems, but they require a more complex implementation and may require extensive effort to tune the optimization component of the algorithm (see [5]). Results of the experiments conducted as part of this study show that the proposed method can match or exceed the performance of these methods.
Problem formulation
A Gaussian Process model for a response y with additive noise can be represented as:

y = f(x) + η

Here:
- y represents the observed response.
- x represents an input vector of covariates.
- η represents the noise. The noise terms are assumed to be independent, identically distributed (IID) random variables drawn from a normal distribution with variance σ².
- f represents the function being modeled. It is modeled as a Gaussian Process with mean function m(x) and covariance function k(x, x′).
If we want to make a prediction for the value of the response at a new test input x_*, we condition the joint Gaussian distribution of the training and test outputs on the observed data.
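For reference, the standard predictive equations for this model (see, e.g., Rasmussen and Williams [15]) are shown below; the symbols K and k_* follow the common textbook notation rather than anything specific to this paper:

```latex
% Posterior predictive mean and variance at a test input x_*:
\bar{f}_{*} = k_{*}^{\top} \left( K + \sigma^{2} I \right)^{-1} y
\mathbb{V}[f_{*}] = k(x_{*}, x_{*}) - k_{*}^{\top} \left( K + \sigma^{2} I \right)^{-1} k_{*}
% K   : n x n covariance matrix over the training inputs
% k_* : covariances between x_* and the n training inputs
```

Factorizing the n × n matrix K + σ²I is the O(n³) bottleneck discussed in the introduction.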
Proposed solution
Since GP regression is effective and versatile on complex datasets in comparison to parametric methods, a solution to the bottlenecks mentioned above will enable us to apply GP regression to large, complex datasets, which we encounter routinely in this age of big data. Our solution is based on developing estimators on smaller subsets of the data. The size of the subset and the desired accuracy are key parameters for the algorithm.
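The following is a minimal sketch of that subset-bagging idea, assuming scikit-learn's GaussianProcessRegressor and an RBF-plus-noise kernel; the kernel choice and the simple averaging of posterior means are illustrative assumptions, not necessarily the exact aggregation scheme of the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_subset_gps(X, y, subset_size, n_estimators, seed=None):
    """Fit one exact GP per random subset, drawn by simple random
    sampling with replacement, in the spirit of bootstrap aggregation."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.choice(len(X), size=subset_size, replace=True)
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
        gp.fit(X[idx], y[idx])  # O(subset_size^3), not O(n^3)
        models.append(gp)
    return models

def predict(models, X_star):
    # Aggregate by averaging the individual posterior means.
    preds = np.stack([m.predict(X_star) for m in models])
    return preds.mean(axis=0)
```

Because each estimator only ever sees subset_size points, the per-estimator cost is O(subset_size³) regardless of n, and the loop over estimators is embarrassingly parallel.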
Related work
Rasmussen and Williams [15, Chapter 8] provide a detailed discussion of the approaches used to apply Gaussian Process regression to large datasets. The study by Quiñonero-Candela et al. [26] is another detailed review of the various approaches. The choice of a method appropriate for a regression task depends on the problem context; therefore, we discuss the work related to scaling Gaussian Process regression taking the problem context into account.
Effect of the parameters
Selecting algorithm parameters appropriate for a machine learning task is arduous for all practitioners. To alleviate this difficulty, we provide guidelines for parameter selection based on detailed experimentation (a sketch of one empirical selection strategy follows the list below). The proposed algorithm has three parameters:
1. The dataset size
2. The subset size
3. The number of estimators
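As a rough illustration of empirical subset-size selection (a generic plateau heuristic, not necessarily either of the two methods presented in section 3), one can grow the subset until the held-out RMSE stops improving; fit_subset_gps and predict refer to the sketch in the proposed-solution section:

```python
import numpy as np
# fit_subset_gps and predict are the helpers sketched earlier.

def pick_subset_size(X, y, X_val, y_val,
                     sizes=(256, 512, 1024, 2048), n_estimators=10, tol=0.01):
    """Return the first subset size at which validation RMSE stops
    improving by more than a relative tolerance `tol`."""
    prev_rmse = np.inf
    for s in sizes:
        models = fit_subset_gps(X, y, subset_size=s, n_estimators=n_estimators)
        rmse = np.sqrt(np.mean((predict(models, X_val) - y_val) ** 2))
        if prev_rmse - rmse < tol * prev_rmse:
            return s
        prev_rmse = rmse
    return sizes[-1]
```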
Application of the algorithm
In this section we describe the results of applying the algorithm reported in this work to the datasets described in section 5.1.
Conclusion
There are many methods to scale Gaussian Process regression to large datasets, and the appropriate choice depends on the problem context. For example, if Gaussian Process regression has been applied to similar datasets, then the kernel types and the kernel hyper-parameters may be known, and methods such as the Nyström method are suitable. If we only seek estimates at a small number of test points, methods such as the Locally Approximate Gaussian Process may be suitable (see section 4 for a detailed discussion).
Acknowledgement
The authors would like to thank the reviewers for their many insightful suggestions and comments that have helped improve the presentation of the paper significantly.
References
- Stacked generalization, Neural Netw. (1992)
- GEFCom2012 hierarchical load forecasting: gradient boosting machines and Gaussian processes, Int. J. Forecast. (2014)
- Gaussian Processes for Machine Learning (2006)
- A tutorial on Gaussian processes (or why I don't use SVMs), MLSS workshop talk by Zoubin Ghahramani
- Additive regression and other nonparametric models, Ann. Stat. (1985)
- Variational learning of inducing variables in sparse Gaussian processes
- Gaussian processes for big data
- XGBoost: a scalable tree boosting system
- The Elements of Statistical Learning (2001)
- Probabilistic estimation of respiratory rate using Gaussian processes
- Probabilistic baseline estimation via Gaussian process
- Automatic Model Construction with Gaussian Processes
- Optimal global rates of convergence for nonparametric regression, Ann. Stat.
- Sparse Gaussian processes using pseudo-inputs
- Gaussian Processes for Machine Learning
- Pattern Recognition, vol. 128
- Random forests, Mach. Learn.
- Generalized product of experts for automatic and principled fusion of Gaussian process predictions
- Introduction to Nonparametric Estimation
- A Distribution-Free Theory of Nonparametric Regression
- Minimax-optimal nonparametric regression in high dimensions, Ann. Stat.
- Bayesian manifold regression, Ann. Stat.