Fast Gaussian Process Regression for Big Data
Introduction
Gaussian Processes (GPs) are attractive tools for supervised learning on complex datasets where traditional parametric methods may not be effective. Parametric methods make rigid a priori assumptions about the nature and form of the model; a linear regression model is one such choice. This choice may not adapt as the data changes: for example, updates to the data might suggest that a non-linear model is more appropriate. Finding a suitable parametric form for a complex dataset is very challenging without prior experience with the data. Gaussian Processes offer flexibility in this regard. Rather than committing to a rigid functional form between the response and the predictor variables, we only need to commit to a broad family of functions suitable for the problem by specifying an appropriate covariance function. This can simplify the effort required to pick a good model when confronted with an unfamiliar dataset that requires a complex model. Gaussian Processes are also easier to use than alternatives such as neural networks [1], and they offer some practical advantages over Support Vector Machines (SVMs) [2]: they provide uncertainty estimates with predictions, the kernel and regularization parameters can be learned directly from the data instead of through cross-validation, and feature selection can be incorporated into the learning algorithm.
For regression, exact inference is possible with Gaussian Processes. To apply Gaussian Processes to classification, we must resort to approximate inference techniques such as Markov Chain Monte Carlo, the Laplace approximation, or variational inference. Even though exact inference is possible for Gaussian Process regression, computing the solution requires matrix inversion. For a dataset with n rows or data instances, the time complexity of matrix inversion is O(n³), and the space complexity of storing the n × n kernel matrix is O(n²). This restricts the applicability of the technique to small or moderate-sized datasets. In this paper we present an algorithm that uses subset selection and ideas borrowed from bootstrap aggregation to mitigate this problem. A parallel implementation of the algorithm is also possible and can further improve performance.
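To make the bottleneck concrete, the following is a minimal sketch of exact GP regression with a squared-exponential kernel; the kernel choice, variable names, and noise value are illustrative assumptions, not taken from this paper. The Cholesky factorization of the n × n kernel matrix is the O(n³) step, and materializing K is the O(n²) storage cost.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 l^2)).
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    # Exact GP regression: O(n^2) memory for K, O(n^3) time for the Cholesky.
    n = X.shape[0]
    K = rbf_kernel(X, X)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^{-1} y
    K_star = rbf_kernel(X_star, X)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_star, X_star) - v.T @ v
    return mean, np.diag(cov)
```

At n in the millions, neither the factorization nor the storage of K is feasible on commodity hardware, which is what motivates the subset-based approach presented in this paper.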
The rest of this paper is organized as follows. In section 2, we motivate the problem. In section 3, we present our solution, which is based on combining estimators developed on subsets of the data. The subsets are selected by simple random sampling with replacement (similar to what is done in the bootstrap). The size of the selected subset is a key aspect of this algorithm and is determined empirically; we present two methods to determine it. We also present the motivating ideas leading to the final form of the algorithm. When the model has an additive structure of univariate components, this has attractive implications for the convergence rate [3], and an additive model worked well for the datasets used in this study. Relevant facts from minimax theory for non-parametric regression, consistent with the experimental results reported in this work, are presented.
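To preview those minimax facts (in our notation, which may differ from the paper's): for a k-times differentiable regression function of d covariates, the minimax rate of convergence is dimension-dependent, whereas an additive model attains the one-dimensional rate (see Stone [3]):

```latex
% Minimax risk for nonparametric regression with d covariates and
% smoothness k:
\inf_{\hat f}\ \sup_{f}\ \mathbb{E}\,\lVert \hat f - f \rVert^{2} \asymp n^{-2k/(2k+d)}
% If f is additive, f(x) = \sum_{j=1}^{d} f_j(x_j), each component is
% univariate and the achievable rate improves to
n^{-2k/(2k+1)}
```

The additive rate is free of the dimension d, which is why a comparatively small sample can suffice when an additive model fits the data.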
In section 4, we present a brief summary of related work. Applying Gaussian Processes to large datasets has attracted a lot of interest from the machine learning research community, and we connect the algorithm reported in this work to that research. Selecting parameters for an algorithm is an arduous task; however, this algorithm has only two important parameters, the subset size and the number of estimators. We provide guidelines for picking these parameters based on detailed experiments across a range of datasets.
In section 5, we provide experimental results that give insight into the effect of the parameters associated with the algorithm. In section 6, we illustrate the application of our algorithm to synthetic and real-world datasets, including datasets with over a million instances. We compare the performance of the proposed method to the Sparse Gaussian Process [4] and the Stochastic Variational Gaussian Process [5]; these algorithms require inputs similar to those of the proposed method and are therefore applicable in a similar context. We also compare the estimates obtained from the reported algorithm with two other popular methods for regression on large datasets: Gradient Boosted Trees (using XGBoost [6]) and the Generalized Additive Model (GAM) [7]. Results from the experiments performed as part of this study show that accuracies from the proposed method are comparable to those obtained from Gradient Boosted Trees or GAMs. However, there are some distinct advantages to using a Gaussian Process model. A Gaussian Process model yields uncertainty estimates directly, whereas methods like Gradient Boosted Trees do not provide this, at least directly [8], [9]. A Gaussian Process model is also directly interpretable [2, pp. 5-6] in comparison to methods like Gradient Boosted Trees or neural networks. Therefore, the proposed method can yield both explanatory and predictive models. It is also possible to use stacking [10] to combine the estimates from the proposed model with those obtained from a competing model (like Gradient Boosted Trees) and obtain higher accuracies; combining a Gaussian Process solution with XGBoost has been done by [11].
In section 7, we present the conclusions from this work. The contributions of this work are as follows. The proposed method for performing Gaussian Process regression on large datasets has a very simple implementation in comparison to other alternatives, with similar levels of accuracy. The algorithm has two key parameters, the subset size and the number of estimators, and detailed guidelines for picking them are provided. The choice of a method to scale Gaussian Process regression to large datasets depends on the characteristics of the problem; this is discussed in section 4. The proposed method is most effective for problems where the response depends on a small number of features and the kernel characteristics are unknown. In such cases, exploratory data analysis can be used to arrive at appropriate kernel choices [12]. Additive models may work well for these problems, and appropriate preprocessing such as principal component analysis can be used, if needed, to create additive models. The rate of convergence for additive models is attractive [13], which implies that we can build very effective models with a small proportion of the samples. Sparse Gaussian Processes [14] and Stochastic Variational Gaussian Processes [5] are also appropriate for such problems, but they require a more complex implementation and may require extensive effort to tune the optimization component of the algorithm (see [5]). Results of the experiments conducted as part of this study show that the proposed method can match or exceed the performance of these methods.
Problem formulation
A Gaussian Process model for a response y with additive noise can be represented as:

y = f(x) + η

Here:
- y represents the observed response.
- x represents an input vector of covariates.
- η represents the noise. The noise terms are assumed to be independent, identically distributed (IID) random variables drawn from a normal distribution with variance σ².
- f represents the function being modeled. It is modeled as a Gaussian Process with mean function m(x) and covariance function k(x, x′).
If we want to make a prediction for the value of the response at a new test input x_*, we condition the joint Gaussian distribution of the training and test outputs on the observed data.
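For reference, the standard predictive equations for this model (see, e.g., Rasmussen and Williams [15]) are shown below; the symbols K and k_* follow the common textbook notation rather than anything specific to this paper:

```latex
% Posterior predictive mean and variance at a test input x_*:
\bar{f}_{*} = k_{*}^{\top} \left( K + \sigma^{2} I \right)^{-1} y
\mathbb{V}[f_{*}] = k(x_{*}, x_{*}) - k_{*}^{\top} \left( K + \sigma^{2} I \right)^{-1} k_{*}
% K   : n x n covariance matrix over the training inputs
% k_* : covariances between x_* and the n training inputs
```

Factorizing the n × n matrix K + σ²I is the O(n³) bottleneck discussed in the introduction.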
Proposed solution
Since GP regression is effective and versatile on complex datasets in comparison to parametric methods, a solution to the bottlenecks mentioned above will enable us to apply GP regression to large, complex datasets, which we encounter routinely in this age of big data. Our solution is based on developing estimators on smaller subsets of the data. The size of the subset and the desired accuracy are key parameters for the algorithm.
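The following is a minimal sketch of that subset-bagging idea, assuming scikit-learn's GaussianProcessRegressor and an RBF-plus-noise kernel; the kernel choice and the simple averaging of posterior means are illustrative assumptions, not necessarily the exact aggregation scheme of the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_subset_gps(X, y, subset_size, n_estimators, seed=None):
    """Fit one exact GP per random subset, drawn by simple random
    sampling with replacement, in the spirit of bootstrap aggregation."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.choice(len(X), size=subset_size, replace=True)
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
        gp.fit(X[idx], y[idx])  # O(subset_size^3), not O(n^3)
        models.append(gp)
    return models

def predict(models, X_star):
    # Aggregate by averaging the individual posterior means.
    preds = np.stack([m.predict(X_star) for m in models])
    return preds.mean(axis=0)
```

Because each estimator only ever sees subset_size points, the per-estimator cost is O(subset_size³) regardless of n, and the loop over estimators is embarrassingly parallel.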
Related work
Rasmussen and Williams [15, Chapter 8] provide a detailed discussion of the approaches used to apply Gaussian Process regression to large datasets. The study by Quiñonero-Candela et al. [26] is another detailed review of the various approaches. The choice of a method appropriate for a regression task depends on the problem context; therefore, we discuss the work related to scaling Gaussian Process regression taking the problem context into account.
Effect of the parameters
Selecting algorithm parameters appropriate for a machine learning task is arduous for all practitioners. To alleviate this difficulty, we provide guidelines for parameter selection based on detailed experimentation (a sketch of one empirical selection strategy follows the list below). The proposed algorithm has three parameters:
1. The dataset size
2. The subset size
3. The number of estimators
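As a rough illustration of empirical subset-size selection (a generic plateau heuristic, not necessarily either of the two methods presented in section 3), one can grow the subset until the held-out RMSE stops improving; fit_subset_gps and predict refer to the sketch in the proposed-solution section:

```python
import numpy as np
# fit_subset_gps and predict are the helpers sketched earlier.

def pick_subset_size(X, y, X_val, y_val,
                     sizes=(256, 512, 1024, 2048), n_estimators=10, tol=0.01):
    """Return the first subset size at which validation RMSE stops
    improving by more than a relative tolerance `tol`."""
    prev_rmse = np.inf
    for s in sizes:
        models = fit_subset_gps(X, y, subset_size=s, n_estimators=n_estimators)
        rmse = np.sqrt(np.mean((predict(models, X_val) - y_val) ** 2))
        if prev_rmse - rmse < tol * prev_rmse:
            return s
        prev_rmse = rmse
    return sizes[-1]
```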
Application of the algorithm
In this section we describe the results of applying the algorithm reported in this work to the datasets described in section 5.1.
Conclusion
There are many methods to scale Gaussian Process regression to large datasets, and the appropriate choice depends on the problem context. For example, if Gaussian Process regression has been applied to similar datasets, then the kernel types and the kernel hyper-parameters may be known, and methods such as the Nyström method are suitable. If we only seek estimates at a small number of test points, methods such as the Locally Approximate Gaussian Process may be suitable (see section 4 for a detailed discussion).
Acknowledgement
The authors would like to thank the reviewers for their many insightful suggestions and comments that have helped improve the presentation of the paper significantly.
References
- Stacked generalization, Neural Netw. (1992)
- GEFCom2012 hierarchical load forecasting: gradient boosting machines and Gaussian processes, Int. J. Forecast. (2014)
- Gaussian Processes for Machine Learning (2006)
- A tutorial on Gaussian processes (or why I don't use SVMs), MLSS workshop talk by Zoubin Ghahramani
- Additive regression and other nonparametric models, Ann. Stat. (1985)
- Variational learning of inducing variables in sparse Gaussian processes
- Gaussian processes for big data
- XGBoost: a scalable tree boosting system
- The Elements of Statistical Learning (2001)
- Probabilistic estimation of respiratory rate using Gaussian processes
- Probabilistic baseline estimation via Gaussian process
- Automatic Model Construction with Gaussian Processes
- Optimal global rates of convergence for nonparametric regression, Ann. Stat.
- Sparse Gaussian processes using pseudo-inputs
- Gaussian Processes for Machine Learning
- Pattern Recognition, vol. 128
- Random forests, Mach. Learn.
- Generalized product of experts for automatic and principled fusion of Gaussian process predictions
- Introduction to Nonparametric Estimation
- A Distribution-Free Theory of Nonparametric Regression
- Minimax-optimal nonparametric regression in high dimensions, Ann. Stat.
- Bayesian manifold regression, Ann. Stat.