1 Introduction

Preference learning involves comparing a set of alternatives according to a particular quality (Fürnkranz and Hüllermeier 2010), which often leads to a divergence of opinion between people. For example, in argument mining, a sub-field of natural language processing (NLP), one goal is to rank arguments by their convincingness (Habernal and Gurevych 2016). Whether a particular argument is convincing or not depends on the reader’s point of view and prior knowledge (Lukin et al. 2017). Similarly, personal preferences affect recommender systems, which often perform better if they tailor recommendations to a specific user (Resnick and Varian 1997). Disagreements also occur when preference annotations are acquired from multiple annotators, for example, using crowdsourcing, and are often mitigated by redundant labelling (Snow et al. 2008; Banerji et al. 2010). Therefore, we require preference learning methods that can account for differences of opinion to (1) predict personal preferences for members of a crowd and (2) infer a consensus given observations from multiple users. For both tasks, our goal is to rank items or choose the preferred item from any given pair.

Recommender systems often predict a user’s preferences via collaborative filtering, which overcomes data sparsity by exploiting similarities between the preferences of different users (Resnick and Varian 1997; Koren et al. 2009). Many recommender systems are based on matrix factorisation techniques that are trained using observations of numerical ratings. However, different annotators often disagree over numerical annotations and can label inconsistently over time (Ovadia 2004; Yannakakis and Hallam 2011), as annotators may interpret the values differently: a score of 4/5, say, from one annotator may be equivalent to 3/5 from another. This problem is avoided by pairwise labelling, in which the annotator selects their preferred item from a pair. Pairwise labelling can be quicker (Kendall 1948; Kingsley and Brown 2010; Yang and Chen 2011) and more accurate (Kiritchenko and Mohammad 2017) than numerical annotation, and it facilitates a total ordering of items, as it avoids two items being assigned the same value.

Pairwise labels provided by a crowd or extracted from user logs (Joachims 2002) are often noisy and sparse, i.e., many items or users have few or no labels. This motivates a Bayesian treatment, which has been shown to benefit matrix factorisation  (Salakhutdinov and Mnih 2008) and preference learning  (Chen et al. 2013). Some previous Bayesian methods for preference learning use Gaussian processes (GPs) to account for input features of items or users (Chu and Ghahramani 2005; Houlsby et al. 2012; Khan et al. 2014). These are features that can be extracted from content or metadata, such as embeddings (Mikolov et al. 2013; Devlin et al. 2019), which are commonly used by NLP methods to represent words or documents using a numerical vector. Input features allow the model to extrapolate to new items or users and mitigate labelling errors (Felt et al. 2016). However, previous Bayesian preference learning methods that account for input features using GPs do not scale to large numbers of items, users, or pairwise labels, as their computational and memory requirements grow with the size of the dataset.

In this paper, we propose a scalable Bayesian approach to pairwise preference learning with large numbers of users or annotators. Our method, crowdGPPL, jointly models personal preferences and the consensus of a crowd through a combination of matrix factorisation and Gaussian processes. We propose a stochastic variational inference (SVI) scheme (Hoffman et al. 2013) that scales to extremely large datasets, as its memory complexity and the time complexity of each iteration are fixed independently of the size of the dataset. Our new approach opens the door to novel applications involving very large numbers of users, items and pairwise labels that would previously have exceeded computational or memory resources and would have been difficult to parallelise. We evaluate the method empirically on two real-world datasets to demonstrate the scalability of our approach, and its ability to predict both personal preferences and a consensus given preferences from thousands of users. Our results improve performance over the previous state-of-the-art (Simpson and Gurevych 2018) on a crowdsourced argumentation dataset, and show that modelling personal preferences improves predictions of the consensus, and vice versa.

2 Related work

To obtain a ranking from pairwise labels, many preference learning methods model the user’s choices as a random function of the latent utility of the items. Inferring the utilities of items allows us to rank them, estimate numerical ratings and predict pairwise labels. Many popular instances of this approach, known as a random utility model (Thurstone 1927), are variants of the Bradley-Terry (BT) model (Bradley and Terry 1952; Plackett 1975; Luce 1959), which assumes a logistic likelihood, or the Thurstone-Mosteller model  (Thurstone 1927; Mosteller 1951), which assumes a probit likelihood. Recent work on the BT model has developed computationally efficient active learning, but does not consider input features (Li et al. 2018). Another commonly-used ranking method, SVM-rank (Joachims 2002), predicts pairwise labels from input features without a random utility model, so cannot predict utilities. Gaussian process preference learning (GPPL) provides a Bayesian treatment of the random utility model, using input features to predict the utilities of test items and share information between similar items (Chu and Ghahramani 2005). As GPPL can only predict the preferences of a single user, we introduce a new, scalable approach to model individuals in a crowd.

Previous work on preference learning from crowdsourced data treats disagreements as annotation errors and infers only the consensus, rather than modelling personal preferences. For instance, Chen et al. (2013) and Wang et al. (2016) tackle annotator disagreement using Bayesian approaches that learn the labelling accuracy of each worker. Recently, Pan et al. (2018) and Han et al. (2018) introduced scalable methods that extend this idea from pairwise labels to noisy k-ary preferences, i.e., totally-ordered subsets of k items. Fu et al. (2016) improved SVM-rank by identifying outliers in crowdsourced data that correspond to probable errors, while Uchida et al. (2017) extend SVM-rank to account for different levels of confidence in each pairwise annotation expressed by the annotators. However, while these approaches differentiate the level of noise for each annotator, they ignore labelling bias as the differences between users are not random but depend on personal preferences toward particular items. With small numbers of labels per item, these biases may reduce the accuracy of the estimated consensus. Furthermore, previous aggregation methods for crowdsourced preferences do not consider item features, so cannot predict the utility of test items (Chen et al. 2013; Wang et al. 2016; Han et al. 2018; Pan et al. 2018; Li et al. 2018). Our approach goes beyond these methods by predicting personal preferences and incorporating input features.

A number of methods use matrix factorisation to predict personal preferences from pairwise labels, including Yi et al. (2013), who focus on small numbers of pairs per user, and Salimans et al. (2012), who apply Bayesian matrix factorisation to handle sparse data. Matrix factorisation represents observed ratings in a user-item matrix, which it decomposes into two matrices of lower rank than the user-item matrix, one corresponding to users and one to items. Users with similar ratings have similar columns in the user matrix, where each entry is a weight over a latent rating. By multiplying the low-dimensional representations, we can predict ratings for unseen user-item pairs. Kim et al. (2014) use a simplification that assumes that each user’s preferences depend on only one latent ranking. However, previous works combining matrix factorisation with pairwise preference labels do not account for input features. This contrasts with work on matrix factorisation with side information, where the ratings or preferences as well as input features are directly observed, including recent neural network approaches (Volkovs et al. 2017), Bayesian approaches that concatenate input feature vectors with the low-dimensional factored representations (Porteous et al. 2010), and GP-based methods (Adams et al. 2010). Besides providing a Bayesian method for matrix factorisation with both input features and pairwise labels, this paper introduces a much more scalable inference method for a GP-based model.

GPs were previously used for personal preference prediction by Guo et al. (2010), who propose a GP over the joint feature space of users and items. Since this scales cubically in the number of users, Abbasnejad et al. (2013) propose to cluster users into behavioural groups, but distinct clusters do not allow for collaborative learning between users whose preferences only partially overlap, e.g., when two users both like one genre of music but have different preferences over other genres. Khan et al. (2014) instead learn a GP for each user, then add a matrix factorisation term that performs collaborative filtering. However, this approach does not model the relationship between input features and the low-rank matrices, unlike Lawrence and Urtasun (2009), who place GP priors over latent ratings. Neither of these last two methods is fully Bayesian, as the users’ weights are optimised rather than marginalised. An alternative is the collaborative GP (collabGP) (Houlsby et al. 2012), which places GP priors over user weights and latent factors, thereby exploiting input features for both users and items. However, unlike our approach, collabGP predicts only pairwise labels, not the utilities of items, which are useful for rating and ranking, and can only be trained using pairwise labels, even if observations of the utilities are available. Furthermore, existing GP-based approaches suffer from scalability issues, and none of the previous methods jointly model the consensus as well as personal preferences in a fully-Bayesian manner.

Established methods for GP inference with non-Gaussian likelihoods, such as the Laplace approximation and expectation propagation (Rasmussen and Williams 2006), have time complexity \({\mathcal {O}}(N^3)\) with N data points and memory complexity \({\mathcal {O}}(N^2)\). For collabGP, Houlsby et al. (2012) use a sparse generalized fully independent training conditional (GFITC) approximation (Snelson and Ghahramani 2006) to reduce time complexity to \({\mathcal {O}}(PM^2 + UM^2)\) and memory complexity to \({\mathcal {O}}(PM + UM)\), where P is the number of pairwise labels, \(M \ll P\) is a fixed number of inducing points, and U is the number of users. However, this is not sufficiently scalable for very large numbers of users or pairs, due to increasing memory consumption and optimisation steps that cannot be distributed. Recent work on distributing and parallelising Bayesian matrix factorisation is not easily applicable to models that incorporate GPs  (Ahn et al. 2015; Saha et al. 2015; Vander Aa et al. 2017; Chen et al. 2018).

To handle large numbers of pairwise labels, Khan et al. (2014) subsample the data rather than learning from the complete training set. An alternative is stochastic variational inference (SVI) (Hoffman et al. 2013), which optimises a posterior approximation using a different subsample of training data at each iteration, meaning it learns from all training data over multiple iterations while limiting costs per iteration. SVI has been applied to GP regression (Hensman et al. 2013) and classification (Hensman et al. 2015), further improving scalability over earlier sparse approximations.  Nguyen and Bonilla (2014) introduce SVI for multi-output GPs, where each output is a weighted combination of latent functions. They apply their method to capture dependencies between regression tasks, treating the weights for the latent functions as hyperparameters. In this paper, we introduce a Bayesian treatment of the weights and apply SVI instead to preference learning. An SVI method for GPPL was previously introduced by Simpson and Gurevych (2018), which we detail in Sect. 4. However, as GPPL does not consider the individual preferences of users in a crowd, we propose a new model, crowdGPPL, which jointly models personal preferences and the crowd consensus using a combination of Gaussian processes and Bayesian matrix factorisation.

3 Bayesian preference learning for crowds

We assume that a pair of items, a and b, have utilities \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\), which represent their value to a user, and that \(f: {\mathbb {R}}^D \mapsto {\mathbb {R}}\) is a function of item features, where \(\varvec{x}_a\) and \(\varvec{x}_b\) are vectors of length D containing the features of items a and b, respectively. If \(f(\varvec{x}_a) > f(\varvec{x}_b)\), then a is preferred to b (written \(a \succ b\)). The outcome of a comparison between a and b is a pairwise label, y(a, b). If pairwise labels never contain errors, then \(y(a, b)=1\) if \(a \succ b\) and 0 otherwise. Given knowledge of f, we can compute the utilities of items in a test set from their features, and hence the outcomes of pairwise comparisons.

Thurstone (1927) proposed the random utility model, which relaxes the assumption that pairwise labels, y(a, b), are always consistent with the ordering of \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\). Under the random utility model, the likelihood \(p(y(a,b)=1)\) increases as \(f(\varvec{x}_a) - f(\varvec{x}_b)\) increases, i.e., as the utility of item a increases relative to the utility of item b. This reflects the greater consistency in a user’s choices when their preferences are stronger, while accommodating labelling errors or variations in a user’s choices over time. In the Thurstone-Mosteller model, noise in the observations is explained by a Gaussian-distributed noise term, \(\delta \sim {\mathcal {N}}(0, \sigma ^2)\):

$$\begin{aligned} p(y(a, b) | f(\varvec{x}_a) + \delta _{a}, f(\varvec{x}_b) + \delta _{b} ) = {\left\{ \begin{array}{ll} 1 &{} \text {if } f(\varvec{x}_a) + \delta _{a} \ge f(\varvec{x}_b) + \delta _{b} \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

Integrating out the unknown values of \(\delta _a\) and \(\delta _b\) gives:

$$\begin{aligned}&p( y(a, b) | f(\varvec{x}_a), f(\varvec{x}_b) ) \nonumber \\&\quad = \int \int p( y(a, b) | f(\varvec{x}_a) + \delta _{a}, f(\varvec{x}_b) + \delta _{b} ) {\mathcal {N}}\left( \delta _{a}; 0, \sigma ^2\right) {\mathcal {N}}\left( \delta _{b}; 0, \sigma ^2\right) d\delta _{a} d\delta _{b} = \Phi \left( z \right) , \end{aligned}$$
(2)

where \(z = \frac{f(\varvec{x}_a) - f(\varvec{x}_b)}{\sqrt{2\sigma ^2}}\), and \(\Phi\) is the cumulative distribution function of the standard normal distribution, meaning that \(\Phi (z)\) is a probit likelihood. This likelihood is also used by Chu and Ghahramani (2005) for Gaussian process preference learning (GPPL), but here we simplify the formulation by assuming that \(\sigma ^2 = 0.5\), which leads to z having a denominator of \(\sqrt{2 \times 0.5}=1\), hence \(z = f(\varvec{x}_a) - f(\varvec{x}_b)\). Instead, we model varying degrees of noise in the pairwise labels by scaling f itself, as we describe in the next section.

In practice, \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\) must be inferred from pairwise training labels, \(\varvec{y}\), to obtain a posterior distribution over their values. If this posterior is a multivariate Gaussian distribution, then the probit likelihood allows us to analytically marginalise \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\) to obtain the probability of a pairwise label:

$$\begin{aligned} p(y(a,b)| \varvec{y}) = \Phi ({\hat{z}}),&{\hat{z}} = \frac{{\hat{f}}_a - {\hat{f}}_b}{\sqrt{1 + C_{a,a} + C_{b,b} - 2C_{a,b}} }, \end{aligned}$$
(3)

where \({\hat{f}}_a\) and \({\hat{f}}_b\) are the means and \(\varvec{C}\) is the posterior covariance matrix of the multivariate Gaussian over \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\). Unlike other choices for the likelihood, such as a sigmoid, the probit allows us to compute the posterior over a pairwise label without further approximation, hence we assume this pairwise label likelihood for our proposed preference learning model.
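As a concrete illustration of Eq. 3, the following minimal sketch (in Python, with illustrative variable names) computes the posterior probability of a pairwise label from the posterior mean and covariance of the utilities:

```python
import numpy as np
from scipy.stats import norm

def pairwise_label_prob(f_hat, C, a, b):
    """Posterior probability that item a is preferred to item b (Eq. 3)."""
    # Posterior variance of f_a - f_b, plus the unit noise variance
    # contributed by the probit likelihood.
    var = 1.0 + C[a, a] + C[b, b] - 2.0 * C[a, b]
    z_hat = (f_hat[a] - f_hat[b]) / np.sqrt(var)
    return norm.cdf(z_hat)

# Example with two items whose utilities are correlated a posteriori:
f_hat = np.array([1.2, 0.3])                # posterior means
C = np.array([[0.5, 0.1], [0.1, 0.4]])      # posterior covariance
print(pairwise_label_prob(f_hat, C, 0, 1))  # > 0.5, so item 0 is preferred
```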

3.1 GPPL for single user preference learning

We can model the preferences of a single user by assuming a Gaussian process prior over the user’s utility function, \(f \sim {{\mathcal {G}}}{{\mathcal {P}}}(0, k_{\theta }/s)\), where \(k_{\theta }\) is a kernel function with hyperparameters \(\theta\) and s is an inverse scale parameter. The kernel function takes numerical item features as inputs and determines the covariance between values of f for different items. The choice of kernel function and its hyperparameters controls the shape and smoothness of the function across the feature space and is often treated as a model selection problem. Kernel functions suitable for a wide range of tasks include the squared exponential and the Matérn (Rasmussen and Williams 2006), which both make minimal assumptions but assign higher covariance to items with similar feature values. We use \(k_{\theta }\) to compute a covariance matrix \(\varvec{K}_{\theta }\), between a set of N observed items with features \(\varvec{X} = \{ \varvec{x}_1, \ldots , \varvec{x}_N \}\).

Here we extend the original definition of GPPL (Chu and Ghahramani 2005) by introducing the inverse scale, s, which is drawn from a gamma prior, \(s \sim {\mathcal {G}}(\alpha _0, \beta _0)\), with shape \(\alpha _0\) and rate \(\beta _0\). The value of 1/s determines the variance of f, and therefore the magnitude of differences between \(f(\varvec{x}_a)\) and \(f(\varvec{x}_b)\) for items a and b. This in turn affects the level of certainty in the pairwise label likelihood as per Eq. 2.
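To make the prior concrete, the sketch below (an illustration, not our actual implementation) draws a utility function from this prior, using a Matérn 3/2 kernel with per-dimension length-scales and an inverse scale s sampled from its gamma prior. We assume the shape-rate parameterisation, consistent with \({\mathbb {E}}[s] = \alpha /\beta\) in Eq. 15:

```python
import numpy as np

def matern32(X1, X2, lengthscales):
    """ARD Matern 3/2 kernel between rows of X1 and X2."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    r = np.sqrt((diff ** 2).sum(axis=2))     # scaled pairwise distances
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.normal(size=(N, D))                  # item features
alpha0, beta0 = 1.0, 100.0                   # gamma prior on s (shape, rate)
s = rng.gamma(alpha0, 1.0 / beta0)           # numpy parameterises by scale = 1/rate
K = matern32(X, X, lengthscales=np.ones(D))
# Draw utilities from the GP prior; jitter keeps the covariance positive definite.
f = rng.multivariate_normal(np.zeros(N), K / s + 1e-8 * np.eye(N))
```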

Given a set of P pairwise labels, \(\varvec{y}=\{y_1,\ldots ,y_P\}\), where \(y_p=y(a_p, b_p)\) is the preference label for items \(a_p\) and \(b_p\), we can write the joint distribution over all variables as follows:

$$\begin{aligned} p\left( \varvec{y}, \varvec{f}, s | k_{\theta }, \varvec{X}, \alpha _0, \beta _0 \right) = \prod _{p=1}^P p( y_p | \varvec{f} ) {\mathcal {N}}(\varvec{f}; \varvec{0}, \varvec{K}_{\theta }/s) {\mathcal {G}}(s; \alpha _0, \beta _0) \end{aligned}$$
(4)

where \(\varvec{f} = \{f(\varvec{x}_1),\ldots ,f(\varvec{x}_N)\}\) is a vector containing the utilities of the N items referred to by \(\varvec{y}\), and \(p( y_p | \varvec{f} ) = \Phi \left( z_p \right)\) is the pairwise likelihood (Eq. 2).

3.2 Crowd preference learning

To predict the preferences of individuals in a crowd, we could use an independent GPPL model for each user. However, by modelling all users jointly, we can exploit correlations between their interests to improve predictions when preference data is sparse, and reduce the memory cost of storing separate models. Correlations between users can arise from common interests over certain subsets of items, such as in one particular genre in a book recommendation task. Identifying such correlations helps to predict preferences from fewer observations and is the core idea of collaborative filtering (Resnick and Varian 1997) and matrix factorisation (Koren et al. 2009).

As well as individual preferences, we wish to predict the consensus by aggregating preference labels from multiple users. Individual biases of different users may affect consensus predictions, particularly when data for certain items comes from a small subset of users. The consensus could also help predict preferences of users with little or no data by favouring popular items and avoiding generally poor items. We therefore propose crowdGPPL, which jointly models the preferences of individual users as well as the underlying consensus of the crowd. Unlike previous methods for inferring the consensus, such as CrowdBT (Chen et al. 2013), we do not treat differences between users as simply the result of labelling errors, but also account for their subjective biases towards particular items.

For crowdGPPL, we represent utilities in a matrix, \(\varvec{F} \in {\mathbb {R}}^{N \times U}\), with U columns corresponding to users. Within \(\varvec{F}\), each entry \(F_{a,j} = f(\varvec{x}_a, \varvec{u}_j)\) is the utility of item a for user j with user features \(\varvec{u}_j\). We assume that \(\varvec{F} = \varvec{V}^T \varvec{W} + \varvec{t}\varvec{1}^T\) is the product of two low-rank matrices plus a column vector of consensus utilities, \(\varvec{t} \in {\mathbb {R}}^N\), where \(\varvec{W} \in {\mathbb {R}}^{C \times U}\) is a latent representation of the users, \(\varvec{V} \in {\mathbb {R}}^{C \times N}\) is a latent representation of the items, C is the number of latent components, i.e., the dimension of the latent representations, and \(\varvec{1}\) is a column vector of ones of length U. The column \(\varvec{v}_{.,a}\) of \(\varvec{V}\), and the column \(\varvec{w}_{.,j}\) of \(\varvec{W}\), are latent vector representations of item a and user j, respectively. Each row of \(\varvec{V}\), \(\varvec{v}_c=\{ v_c(\varvec{x}_1),\ldots ,v_c(\varvec{x}_N)\}\), contains evaluations of a latent function, \(v_c\sim {{\mathcal {G}}}{{\mathcal {P}}}(\varvec{0}, k_{\theta } /s^{(v)}_c)\), of item features, \(\varvec{x}_a\), where \(k_{\theta }\) is a kernel function, \(s^{(v)}_c\) is an inverse function scale, and \(\theta\) are kernel hyperparameters. The consensus utilities, \(\varvec{t} = \{t(\varvec{x}_1),\ldots ,t(\varvec{x}_N)\}\), are values of a consensus utility function over item features, \(t\sim {{\mathcal {G}}}{{\mathcal {P}}}(\varvec{0}, k_{\theta } /s^{(t)})\), which is shared across all users, with inverse scale \(s^{(t)}\). Similarly, each row of \(\varvec{W}\), \(\varvec{w}_c=\{w_c(\varvec{u}_1),\ldots ,w_c(\varvec{u}_U)\}\), contains evaluations of a latent function, \(w_c \sim {{\mathcal {G}}}{{\mathcal {P}}}(\varvec{0}, k_{\eta }/s_c^{(w)})\), of user features, \(\varvec{u}_j\), with inverse scale \(s_c^{(w)}\) and kernel hyperparameters \(\eta\). Therefore, each utility in \(\varvec{F}\) can be written as a weighted sum over the latent components:

$$\begin{aligned} f(\varvec{x}_a, \varvec{u}_j) = \sum _{c=1}^C v_c(\varvec{x}_a) w_c(\varvec{u}_j) + t(\varvec{x}_a), \end{aligned}$$
(5)

where \(\varvec{u}_j\) are the features of user j and \(\varvec{x}_a\) are the features of item a. Each latent component corresponds to a utility function for certain items, which is shared by a subset of users to differing degrees. For example, in the case of book recommendation, c could relate to science fiction novels, \(v_c\) to a ranking over them, and \(w_c\) to the degree of agreement of users with that ranking. The individual preferences of each user j deviate from a consensus across users, t, according to \(\sum _{c=1}^C v_c(\varvec{x}_a) w_c(\varvec{u}_j)\). This allows us to subtract the effect of individual biases when inferring the consensus utilities. The consensus can also help when inferring personal preferences for new combinations of users and items that are very different to those in the training data by accounting for any objective or widespread appeal that an item may have.
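The decomposition in Eq. 5 is simple linear algebra over the latent matrices. The following sketch verifies the equivalence between the matrix form \(\varvec{F} = \varvec{V}^T \varvec{W} + \varvec{t}\varvec{1}^T\) and the per-entry sum, using arbitrary values in place of GP draws purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, U, C = 100, 20, 5                 # items, users, latent components

# In crowdGPPL these are GP draws over item/user features; here we use
# arbitrary values only to illustrate the algebra of Eq. 5.
V = rng.normal(size=(C, N))          # latent item functions v_c
W = rng.normal(size=(C, U))          # latent user functions w_c
t = rng.normal(size=N)               # consensus utilities

F = V.T @ W + t[:, None]             # F = V^T W + t 1^T, shape (N, U)

# Utility of item a for user j as a weighted sum over components (Eq. 5):
a, j = 3, 7
f_aj = V[:, a] @ W[:, j] + t[a]
assert np.isclose(F[a, j], f_aj)
```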

Although the model assumes a fixed number of components, C, the GP priors over \(\varvec{w}_c\) and \(\varvec{v}_c\) act as shrinkage or ARD priors that favour values close to zero (MacKay 1995; Psorakis et al. 2011). Components that are not required to explain the data will have posterior expectations and scales \(1/s^{(v)}_c\) and \(1/s^{(w)}_c\) approaching zero. Therefore, it is not necessary to optimise the value of C by hand, provided a sufficiently large number is chosen.

Equation 5 is similar to cross-task crowdsourcing (Mo et al. 2013), which uses matrix factorisation to model annotator performance in different tasks, where \(\varvec{t}\) corresponds to the objective difficulty of a task. However, unlike crowdGPPL, they do not use GPs to model the factors, nor apply the approach to preference learning. For preference learning, collabGP (Houlsby et al. 2012) is a related model that excludes the consensus and uses values in \(\varvec{v}_c\) to represent pairs rather than individual items, so does not infer item ratings. It also omits scale parameters for the GPs that encourage shrinkage when C is larger than required.

We combine the matrix factorisation method with the preference likelihood of Eq. 2 to obtain the joint preference model for multiple users, crowdGPPL:

$$\begin{aligned}&p\left( \varvec{y}, \varvec{V}, \varvec{W}, \varvec{t}, s^{(v)}_1 , \ldots , s^{(v)}_C, s^{(w)}_1, \ldots , s^{(w)}_C, s^{(t)} | k_{\theta }, \varvec{X}, k_{\eta }, \varvec{U}, \alpha _0^{(t)}, \beta _0^{(t)}, \alpha _0^{(v)}, \beta _0^{(v)}, \alpha _0^{(w)}, \beta _0^{(w)} \right) \nonumber \\&\quad = \prod _{p=1}^P \Phi \left( z_p \right) {\mathcal {N}}\left( \varvec{t}; \varvec{0}, \varvec{K}_{\theta } /s^{(t)}\right) {\mathcal {G}}\left( {s^{(t)}}; \alpha _0^{(t)}, \beta _0^{(t)}\right) \prod _{c=1}^C \left\{ {\mathcal {N}}\left( \varvec{v}_c; \varvec{0}, \varvec{K}_{\theta } /s^{(v)}_c\right) \right. \nonumber \\&\qquad \left. {\mathcal {N}}\left( \varvec{w}_c; \varvec{0}, \varvec{L}_{\eta }/s^{(w)}_c\right) {\mathcal {G}}\left( s^{(v)}_c; \alpha _0^{(v)}, \beta _0^{(v)}\right) {\mathcal {G}}\left( s^{(w)}_c; \alpha _0^{(w)}, \beta _0^{(w)} \right) \right\} ,&\end{aligned}$$
(6)

where \(z_p = \varvec{v}_{.,a_p}^T \varvec{w}_{.,u_p} + t_{a_p} - \varvec{v}_{.,b_p}^T \varvec{w}_{.,u_p} - t_{b_p}\), index p refers to a user and a pair of items, \(\{u_p, a_p, b_p \}\), \(\varvec{U}\) is the set of feature vectors for all users, \(\varvec{K}_{\theta }\) is the prior covariance for the items as in GPPL, and \(\varvec{L}_{\eta }\) is the prior covariance for the users computed using \(k_{\eta }\).

4 Scalable inference

Given a set of pairwise training labels, \(\varvec{y}\), we aim to find the posterior over the matrix \(\varvec{F}^*=\varvec{V}^{*T} \varvec{W}^*\) of utilities for test items and test users, and the posterior over consensus utilities for test items, \(\varvec{t}^*\). The non-Gaussian likelihood (Eq. 2) makes exact inference intractable, hence previous work uses the Laplace approximation for GPPL (Chu and Ghahramani 2005) or combines expectation propagation (EP) with variational Bayes for a multi-user model (Houlsby et al. 2012). The Laplace approximation is a maximum a-posteriori solution that takes the most probable values of parameters rather than integrating over their distributions, and has been shown to perform poorly for classification compared to EP (Nickisch and Rasmussen 2008). However, a drawback of EP is that convergence is not guaranteed  (Minka 2001). More importantly, inference for a GP using either method has computational complexity \({\mathcal {O}}(N^3)\) and memory complexity \({\mathcal {O}}(N^2)\), where N is the number of data points.

The cost of inference can be reduced using a sparse approximation based on a set of inducing points, which act as substitutes for the points in the training dataset. By choosing a fixed number of inducing points, \(M \ll N\), the computational cost is cut to \({\mathcal {O}}(NM^2)\), and the memory complexity to \({\mathcal {O}}(NM)\). Inducing points must be selected either using heuristics or by optimising their positions to maximise an estimate of the marginal likelihood. One such sparse approximation is the generalized fully independent training conditional (GFITC) (Naish-Guzman and Holden 2008; Snelson and Ghahramani 2006), used by Houlsby et al. (2012) for collabGP. However, time and memory costs that grow linearly in N start to become a problem with thousands of data points, as all data must be processed in every iterative update, before any other parameters such as s are updated, making GFITC unsuitable for very large datasets (Hensman et al. 2015).

We derive a more scalable approach for GPPL and crowdGPPL using stochastic variational inference (SVI) (Hoffman et al. 2013). For GPPL, this reduces the time complexity of each iteration to \({\mathcal {O}}(P_i M^2 + P_i^2 M + M^3)\), and memory complexity to \({\mathcal {O}}(P_i M + M^2 + P_i^2)\), where \(P_i\) is a mini-batch size that we choose in advance. Neither \(P_i\) nor M are dependent on the size of the dataset, meaning that SVI can be run with arbitrarily large datasets, and other model parameters such as s can be updated before processing all data to encourage faster convergence. First, we define a suitable likelihood approximation to enable the use of SVI.

4.1 Approximating the posterior with a pairwise likelihood

The preference likelihood in Eq. 2 is not conjugate with the Gaussian process, which means there is no analytic expression for the exact posterior. For single-user GPPL, we therefore approximate the preference likelihood with a Gaussian:

$$\begin{aligned} p(\varvec{f} | \varvec{y}, s)&\propto \prod _{p=1}^P p\left( y_p | z_p\right) p\left( \varvec{f} | \varvec{K}, s\right) = \prod _{p=1}^P \Phi \left( z_p\right) {\mathcal {N}}\left( \varvec{f}; \varvec{0}, \varvec{K}/s\right) \nonumber \\&\approx \prod _{p=1}^P {\mathcal {N}}\left( y_p; \Phi (z_p), Q_{p,p}\right) {\mathcal {N}}\left( \varvec{f}; \varvec{0}, \varvec{K}/s\right) = {\mathcal {N}}\left( \varvec{y}; \Phi (\varvec{z}), \varvec{Q}\right) {\mathcal {N}}\left( \varvec{f}; \varvec{0}, \varvec{K}/s\right) , \end{aligned}$$
(7)

where \(\varvec{Q}\) is a diagonal noise covariance matrix and we omit the kernel hyperparameters, \(\theta\), to simplify notation. For crowdGPPL, we use the same approximation to the likelihood, but replace \(\varvec{f}\) with \(\varvec{F}\). We estimate the diagonals of \(\varvec{Q}\) by moment matching our approximate likelihood with \(\Phi (z_p)\), which defines a Bernoulli distribution with variance \(Q_{p,p} = \Phi (z_p)(1 - \Phi (z_p))\). However, this means that \(\varvec{Q}\) depends on \(\varvec{z}\) and therefore on \(\varvec{f}\), so the approximate posterior over \(\varvec{f}\) cannot be computed in closed form. To resolve this, we approximate \(Q_{p,p}\) using an estimated posterior over \(\Phi (z_p)\) computed independently for each pairwise label, p. We obtain this estimate by updating the parameters of the conjugate prior for the Bernoulli likelihood, which is a beta distribution with parameters \(\gamma\) and \(\lambda\). We find \(\gamma\) and \(\lambda\) by matching the moments of the beta prior to the prior mean and variance of \(\Phi (z_p)\), estimated using numerical integration. The prior over \(\Phi (z_p)\) is defined by a GP for single-user GPPL, \(p(\Phi (z_p) | \varvec{K}, \alpha _0, \beta _0)\), and a non-standard distribution for crowdGPPL. Given the observed label \(y_p\), we estimate the diagonals in \(\varvec{Q}\) as the variance of the posterior beta-Bernoulli:

$$\begin{aligned} Q_{p,p}&\approx \frac{ (\gamma + y_p)(\lambda + 1 - y_p) }{(\gamma + \lambda + 1)^2}.&\end{aligned}$$
(8)

The covariance \(\varvec{Q}\) therefore approximates the expected noise in the observations, hence captures variance due to \(\sigma\) in Eq. 2. This approximation performs well empirically for Gaussian process classification (Reece et al. 2011; Simpson et al. 2017) and classification using extended Kalman filters (Lee and Roberts 2010; Lowne et al. 2010).
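The sketch below illustrates this estimate of \(Q_{p,p}\): it approximates the prior moments of \(\Phi (z_p)\) by simple Monte Carlo sampling (standing in for the numerical integration described above), matches them to a beta distribution, and applies Eq. 8. The prior standard deviation of \(z_p\) is an assumed value chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

def beta_moment_match(m, v):
    """Beta parameters (gamma, lam) with mean m and variance v."""
    nu = m * (1.0 - m) / v - 1.0
    return m * nu, (1.0 - m) * nu

def q_pp(z_samples, y_p):
    """Approximate observation variance Q_pp for one pairwise label (Eq. 8)."""
    phi = norm.cdf(z_samples)        # samples of Phi(z_p) under the prior
    gamma, lam = beta_moment_match(phi.mean(), phi.var())
    return (gamma + y_p) * (lam + 1 - y_p) / (gamma + lam + 1) ** 2

rng = np.random.default_rng(2)
z_samples = rng.normal(0.0, 0.2, size=10000)  # assumed prior over z_p
print(q_pp(z_samples, y_p=1))
```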

Unfortunately, the nonlinear term \(\Phi (\varvec{z})\) means that the posterior is still intractable, so we replace \(\Phi (\varvec{z})\) with a linear function of \(\varvec{f}\) by taking the first-order Taylor series expansion of \(\Phi (\varvec{z})\) about the expectation \({\mathbb {E}}[\varvec{f}] = {{\hat{\varvec{f}}}}\):

$$\begin{aligned} \Phi (\varvec{z})&\approx {\tilde{\Phi }}(\varvec{z}) = \varvec{G} \left( \varvec{f}-{{\hat{\varvec{f}}}}\right) + \Phi ({{\hat{\varvec{z}}}}), \end{aligned}$$
(9)
$$\begin{aligned} G_{p,i}&= \frac{\partial \Phi ({\hat{z}}_p)}{\partial f_i} = \Phi ({\hat{z}}_p)\left( 1 - \Phi ({\hat{z}}_p)\right) \left( 2y_p - 1\right) \left( [i = a_p] - [i = b_p]\right) , \end{aligned}$$
(10)

where \({{\hat{\varvec{z}}}}\) is the expectation of \(\varvec{z}\) computed using Eq. 3, and \([i=a]=1\) if \(i=a\) and is 0 otherwise. There is a circular dependency between \({{\hat{\varvec{f}}}}\), which is needed to compute \({\hat{\varvec{z}}}\), and \(\varvec{G}\). We estimate these terms using a variational inference procedure that iterates between updating \(\varvec{f}\) and \(\varvec{G}\) (Steinberg and Bonilla 2014) as part of Algorithm 1. The complete approximate posterior for GPPL is now as follows:

$$\begin{aligned} p(\varvec{f} | \varvec{y}, s) \approx {\mathcal {N}}\left( \varvec{y}; \varvec{G} (\varvec{f}-{\mathbb {E}}[\varvec{f}]) + \Phi ({\hat{\varvec{z}}}), \varvec{Q}\right) {\mathcal {N}}\left( \varvec{f}; \varvec{0}, \varvec{K}/s\right) / Z = {\mathcal {N}}\left( \varvec{f}; {\hat{\varvec{f}}}, \varvec{C}\right) , \end{aligned}$$
(11)

where Z is a normalisation constant. Linearisation means that our approximate likelihood is conjugate to the prior, so the approximate posterior is also Gaussian. Gaussian approximations to the posterior have shown strong empirical results for classification (Nickisch and Rasmussen 2008) and preference learning (Houlsby et al. 2012), and linearisation using a Taylor expansion has been widely tested in the extended Kalman filter (Haykin 2001) as well as Gaussian processes (Steinberg and Bonilla 2014; Bonilla et al. 2016).
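For illustration, the Jacobian \(\varvec{G}\) of Eq. 10 can be computed as in the following sketch. For brevity, we form \({\hat{z}}_p\) directly from the posterior means, whereas the full procedure includes the covariance terms of Eq. 3:

```python
import numpy as np
from scipy.stats import norm

def compute_G(f_hat, pairs, y):
    """Jacobian of the linearised likelihood (Eqs. 9 and 10).

    f_hat : current posterior mean utilities, shape (N,)
    pairs : (P, 2) array of item indices (a_p, b_p)
    y     : pairwise labels in {0, 1}, shape (P,)
    """
    P, N = len(pairs), len(f_hat)
    z_hat = f_hat[pairs[:, 0]] - f_hat[pairs[:, 1]]
    phi = norm.cdf(z_hat)
    G = np.zeros((P, N))
    for p, (a, b) in enumerate(pairs):
        grad = phi[p] * (1 - phi[p]) * (2 * y[p] - 1)
        G[p, a] = grad     # [i = a_p] term of Eq. 10
        G[p, b] = -grad    # [i = b_p] term of Eq. 10
    return G
```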

4.2 SVI for single user GPPL

Using the linear approximation in the previous section, posterior inference requires inverting \(\varvec{K}\) with computational cost \({\mathcal {O}}(N^3)\) and taking an expectation with respect to s, which remains intractable. We address these problems using stochastic variational inference (SVI) with a sparse approximation to the GP that limits the size of the covariance matrices we need to invert. We introduce \(M \ll N\) inducing items with inputs \(\varvec{X}_m\), utilities \(\varvec{f}_m\), and covariance \(\varvec{K}_{mm}\). The covariance between the observed and inducing items is \(\varvec{K}_{nm}\). For clarity, we omit \(\theta\) from this point on. We assume a mean-field approximation to the joint posterior over inducing and training items that factorises between different sets of latent variables:

$$\begin{aligned} p\left( \varvec{f}, \varvec{f}_m, s | \varvec{y}, \varvec{X}, \varvec{X}_m, k_{\theta }, \alpha _0, \beta _0 \right)&\approx q\left( \varvec{f}, \varvec{f}_m, s\right) = q(s)q\left( \varvec{f}\right) q\left( \varvec{f}_m\right) , \end{aligned}$$
(12)

where q(.) are variational factors defined below. Each factor corresponds to a subset of latent variables, \(\varvec{\zeta }_i\), and takes the form \(\ln q({\varvec{\zeta }}_i) = {\mathbb {E}}_{j \ne i}[\ln p({\varvec{\zeta }}_i, \varvec{x}, \varvec{y})]\). That is, the expectation with respect to all other latent variables, \(\varvec{\zeta }_j,\forall j \ne i\), of the log joint distribution of the observations and latent variables, \({\varvec{\zeta }}_i\). To obtain the factor for \(\varvec{f}_m\), we marginalise \(\varvec{f}\) and take expectations with respect to q(s):

$$\begin{aligned} \ln q\left( \varvec{f}_m\right)&= \ln {\mathcal {N}}\left( \varvec{y}; {\tilde{\Phi }}(\varvec{z}), \varvec{Q}\right) + \ln {\mathcal {N}}\left( \varvec{f}_m; \varvec{0}, \frac{\varvec{K}_{mm}}{{\mathbb {E}}\left[ s\right] }\right) + \text {const} = \ln {\mathcal {N}}\left( \varvec{f}_m; {\hat{\varvec{f}}}_m, \varvec{S} \right) , \end{aligned}$$
(13)

where the variational parameters \({\hat{\varvec{f}}}_m\) and \(\varvec{S}\) are computed using an iterative SVI procedure described below. We choose an approximation of \(q(\varvec{f})\) that depends only on the inducing point utilities, \(\varvec{f}_m\), and is independent of the observations:

$$\begin{aligned} \ln q\left( \varvec{f}\right)&= \ln {\mathcal {N}}\left( \varvec{f}; \varvec{A} {\hat{\varvec{f}}}_m, \varvec{K} + \varvec{A} \left( \varvec{S} - \varvec{K}_{mm}/{\mathbb {E}}[s]\right) \varvec{A}^T \right) , \end{aligned}$$
(14)

where \(\varvec{A}=\varvec{K}_{nm} \varvec{K}^{-1}_{mm}\). Therefore, we no longer need to invert an \(N \times N\) covariance matrix to compute \(q(\varvec{f})\). The factor q(s) also depends only on the inducing points:

$$\begin{aligned}&\ln q(s) = {\mathbb {E}}_{q\left( \varvec{f}_m\right) }\left[ \ln {\mathcal {N}}\left( \varvec{f}_m| \varvec{0}, \varvec{K}_{mm}/s\right) \right] + \ln {\mathcal {G}}(s; \alpha _0, \beta _0) + \mathrm {const} = \ln {\mathcal {G}}(s; \alpha , \beta ),&\end{aligned}$$
(15)

where \(\alpha = \alpha _0 + \frac{M}{2}\) and \(\beta = \beta _0 + \frac{1}{2} \text {tr}\left( \varvec{K}^{-1}_{mm}\left( \varvec{S} + {\hat{\varvec{f}}}_m {\hat{\varvec{f}}}_m^T\right) \right)\). The expected value is \({\mathbb {E}}[s] = \frac{\alpha }{\beta }\).

We apply variational inference to iteratively reduce the KL-divergence between our approximate posterior and the true posterior (Eq. 12) by maximising a lower bound, \({\mathcal {L}}\), on the log marginal likelihood (detailed equations in “Appendix 1”), which is given by:

$$\begin{aligned} \ln p\left( \varvec{y} | \varvec{K}, \alpha _0, \beta _0\right) &=\text {KL}\left( q\left( \varvec{f}, \varvec{f}_m, s\right) || p\left( \varvec{f}, \varvec{f}_m, s | \varvec{y}, \varvec{K}, \alpha _0, \beta _0\right) \right) + {\mathcal {L}} \nonumber \\ {\mathcal {L}} &={\mathbb {E}}_{q(\varvec{f})}\left[ \ln p(\varvec{y} | \varvec{f})\right] + {\mathbb {E}}_{q\left( \varvec{f}_m, s\right) }\left[ \ln p\left( \varvec{f}_m, s | \varvec{K}, \alpha _0, \beta _0 \right) \right. \nonumber \\&\left. -\, \ln q\left( \varvec{f}_m\right) - \ln q(s)\right] . \end{aligned}$$
(16)

To optimise \({\mathcal {L}}\), we initialise the q factors randomly, then update each one in turn, taking expectations with respect to the other factors.

The only term in \({\mathcal {L}}\) that refers to the observations, \(\varvec{y}\), is a sum of P terms, each of which refers to one observation only. This means that \({\mathcal {L}}\) can be maximised by considering a random subset of observations at each iteration (Hensman et al. 2013). For the ith update of \(q\left( \varvec{f}_m\right)\), we randomly select \(P_i\) observations \(\varvec{y}_i = \{ y_p \forall p \in \varvec{P}_i \}\), where \(\varvec{P}_i\) is a random subset of indexes of observations, and \(P_i\) is a mini-batch size. The items referred to by the pairs in the subset are \(\varvec{N}_i = \{a_p \forall p \in \varvec{P}_i \} \cup \{ b_p \forall p \in \varvec{P}_i\}\). We perform updates using \(\varvec{Q}_i\) (rows and columns of \(\varvec{Q}\) for pairs in \(\varvec{P}_i\)), \(\varvec{K}_{im}\) and \(\varvec{A}_i\) (rows of \(\varvec{K}_{nm}\) and \(\varvec{A}\) in \(\varvec{N}_i\)), \(\varvec{G}_i\) (rows of \(\varvec{G}\) in \(\varvec{P}_i\) and columns in \(\varvec{N}_i\)), and \({\hat{\varvec{z}}}_i = \left\{ {\hat{\varvec{z}}}_p \forall p \in P_i \right\}\). The updates optimise the natural parameters of the Gaussian distribution by following the natural gradient (Hensman et al. 2015):

$$\begin{aligned} \varvec{S}^{-1}_i&= (1 - \rho _i) \varvec{S}^{-1}_{i-1} + \rho _i\left( {\mathbb {E}}[s]\varvec{K}_{mm}^{-1} + \pi _i\varvec{A}_i^T \varvec{G}^T_{i} \varvec{Q}^{-1}_i \varvec{G}_{i} \varvec{A}_{i} \right)&\end{aligned}$$
(17)
$$\begin{aligned} {{\hat{\varvec{f}}}}_{m,i}&= \varvec{S}_i \left( (1 - \rho _i) \varvec{S}^{-1}_{i-1} {\hat{\varvec{f}}}_{m,i-1} + \rho _i \pi _i \varvec{A}_{i}^{T} \varvec{G}_{i}^T \varvec{Q}_i^{-1} \left( \varvec{y}_i - \Phi ({\hat{\varvec{z}}}_i) + \varvec{G}_{i} \varvec{A}_i {\hat{\varvec{f}}}_{m,i-1} \right) \right) \end{aligned}$$
(18)

where \(\rho _i=(i + \epsilon )^{-r}\) is a mixing coefficient that controls the update rate, \(\pi _i = \frac{P}{P_i}\) weights each update according to sample size, \(\epsilon\) is a delay hyperparameter and r is a forgetting rate (Hoffman et al. 2013).
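A minimal sketch of one such update of \(q(\varvec{f}_m)\), with variable names mirroring the notation of Eqs. 17 and 18 (an illustration of the update equations, not our full implementation):

```python
import numpy as np

def svi_update(S_inv_prev, f_m_prev, K_mm_inv, E_s, A_i, G_i, Q_i_inv,
               y_i, Phi_z_i, rho_i, pi_i):
    """One stochastic update of q(f_m), following Eqs. 17 and 18.

    A_i, G_i and Q_i_inv are the mini-batch rows/columns of A, G and Q^{-1};
    Phi_z_i contains Phi(z_hat_p) for the pairs in the mini-batch.
    """
    J = G_i @ A_i                 # maps inducing utilities to pair scores
    # Natural gradient step on the precision (Eq. 17):
    S_inv = (1 - rho_i) * S_inv_prev + rho_i * (
        E_s * K_mm_inv + pi_i * J.T @ Q_i_inv @ J)
    S = np.linalg.inv(S_inv)
    # Corresponding step on the mean (Eq. 18):
    f_m = S @ ((1 - rho_i) * S_inv_prev @ f_m_prev
               + rho_i * pi_i * J.T @ Q_i_inv
               @ (y_i - Phi_z_i + J @ f_m_prev))
    return f_m, S, S_inv
```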

By performing updates in terms of mini-batches, the time complexity of Eqs. 17 and 18 is \({\mathcal {O}}(P_i M^2 + P_i^2 M + M^3)\) and memory complexity is \({\mathcal {O}}(M^2 + P_i^2 + M P_i)\). The only parameters that must be stored between iterations relate to the inducing points, hence the memory consumption does not grow with the dataset size as in the GFITC approximation used by Houlsby et al. (2012). A further advantage of stochastic updating is that the s parameter (and any other global parameters not immediately depending on the data) can be learned before the entire dataset has been processed, which means that poor initial estimates of s are rapidly improved and the algorithm can converge faster.

Algorithm 1 The SVI algorithm for GPPL (pseudocode figure not reproduced here)

The complete SVI algorithm is summarised in Algorithm 1. It uses a nested loop to learn \(\varvec{G}_i\), which avoids storing the complete matrix, \(\varvec{G}\). It is possible to distribute computation in lines 3-6 by selecting multiple random samples to process in parallel. A global estimate of \({\hat{\varvec{f}}}_m\) and \(\varvec{S}\) is passed to each compute node, which runs the loop over lines 4 to 6. The resulting updated \({\hat{\varvec{f}}}_m\) and \(\varvec{S}\) values are then passed back to a central node that combines them by taking a mean weighted by \(\pi _i\) to account for the size of each batch.

Inducing point locations can be learned as part of the variational inference procedure, which breaks convergence guarantees, or by an expensive optimisation process (Hensman et al. 2015). We obtain good performance by choosing inducing points up-front using K-means++ (Arthur and Vassilvitskii 2007) with M clusters to cluster the feature vectors, then taking the cluster centres as inducing points that represent the distribution of observations.
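For example, using scikit-learn (an assumed dependency for this sketch), the inducing points can be chosen as the K-means++ cluster centres of the input features:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_inducing_points(X, M, seed=0):
    """Pick M inducing points as cluster centres of the feature vectors."""
    km = KMeans(n_clusters=M, init='k-means++', n_init=10, random_state=seed)
    km.fit(X)
    return km.cluster_centers_
```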

The inferred distribution over the inducing points can be used to estimate the posteriors of test items, \(f(\varvec{x}^*)\), according to:

$$\begin{aligned} \varvec{f}^*&= \varvec{K}_{*m} \varvec{K}_{mm}^{-1} {\hat{\varvec{f}}}_m,&\varvec{C}^* = \varvec{K}_{**} + \varvec{K}_{*m} \varvec{K}_{mm}^{-1} (\varvec{S} - \varvec{K}_{mm} / {\mathbb {E}}[s] ) \varvec{K}_{mm}^{-1}\varvec{K}_{*m}^T , \end{aligned}$$
(19)

where \(\varvec{C}^*\) is the posterior covariance of the test items, \(\varvec{K}_{**}\) is their prior covariance, and \(\varvec{K}_{*m}\) is the covariance between test and inducing items.
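A sketch of Eq. 19, using a linear solve in place of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def predict_f(K_star_m, K_star_star, K_mm, f_m_hat, S, E_s):
    """Posterior mean and covariance for test items (Eq. 19)."""
    # A_star = K_*m K_mm^{-1}, computed via a solve rather than an inverse.
    A_star = np.linalg.solve(K_mm, K_star_m.T).T
    f_star = A_star @ f_m_hat
    C_star = K_star_star + A_star @ (S - K_mm / E_s) @ A_star.T
    return f_star, C_star
```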

4.3 SVI for crowdGPPL

We now provide the variational posterior for the crowdGPPL model defined in Eq. 6:

$$\begin{aligned}&p\left( \varvec{V}, \varvec{V}_m, \varvec{W}, \varvec{W}_m, \varvec{t}, \varvec{t}_m, s^{(v)}_1, \ldots , s^{(v)}_C, s^{(w)}_1, \ldots , s^{(w)}_C, s^{(t)} | \varvec{y}, \varvec{X}, \varvec{X}_m, \varvec{U}, \varvec{U}_m, k, \alpha _0, \beta _0 \right) \nonumber \\&\quad \approx q(\varvec{t}) q(\varvec{t}_m)q\left( s^{(t)}\right) \prod _{c=1}^{C} q(\varvec{v}_{c})q(\varvec{w}_c)q(\varvec{v}_{c,m})q(\varvec{w}_{c,m}) q\left( s^{(v)}_c\right) q\left( s^{(w)}_c\right) , \end{aligned}$$
(20)

where \(\varvec{U}_m\) are the feature vectors of inducing users and the variational q factors are defined below. We use SVI to optimise the lower bound on the log marginal likelihood (detailed in “Appendix 2”), which is given by:

$$\begin{aligned} {\mathcal {L}}_{cr} &={\mathbb {E}}_{q(\varvec{F})} [\ln p(\varvec{y} | \varvec{F})] + {\mathbb {E}}_{q\left( \varvec{t}_m, s^{(t)}\right) } \left[ \ln p\left( \varvec{t}_m, s^{(t)} | \varvec{K}_{mm}, \alpha _0^{(t)}, \beta _0^{(t)}\right) - \ln q(\varvec{t}_m) - \ln q\left( s^{(t)}\right) \right] \nonumber \\&+ \sum _{c=1}^C \bigg \{ {\mathbb {E}}_{q\left( \varvec{v}_{m,c},s^{(v)}_c\right) }\left[ \ln p\left( \varvec{v}_{m,c}, s^{(v)}_c | \varvec{K}_{mm}, \alpha _0^{(v)}, \beta _0^{(v)}\right) - \ln q(\varvec{v}_{m,c}) - \ln q\left( s_c^{(v)}\right) \right] \nonumber \\&+ {\mathbb {E}}_{q\left( \varvec{w}_{m,c}, s_c^{(w)}\right) }\left[ \ln p\left( \varvec{w}_{m,c},s^{(w)}_c | \varvec{L}_{mm}, \alpha _0^{(w)}, \beta _0^{(w)} \right) - \ln q(\varvec{w}_{m,c} ) - \ln q\left( s_c^{(w)} \right) \right] \bigg \} . \end{aligned}$$
(21)

The SVI algorithm follows the same pattern as Algorithm 1, updating each q factor in turn by computing means and covariances for \(\varvec{V}_m\), \(\varvec{W}_m\) and \(\varvec{t}_m\) instead of \(\varvec{f}_m\) (see Algorithm 2). The time and memory complexity of each update are \({\mathcal {O}}(CM_{\mathrm {items}}^3 + CM_{\mathrm {items}}^2 P_i + CM_{\mathrm {items}} P_i^2 + CM_{\mathrm {users}}^3 + CM_{\mathrm {users}}^2 P_i + CM_{\mathrm {users}} P_i^2 )\) and \({\mathcal {O}}(CM_{\mathrm {items}}^2 + P_i^2 + M_{\mathrm {items}} P_i + CM_{\mathrm {users}}^2 + M_{\mathrm {users}} P_i)\), respectively. The variational factor for the cth inducing item component is:

$$\begin{aligned} \ln q(\varvec{v}_{m,c})&= {\mathbb {E}}_{q(\varvec{t}, \varvec{w}_{m,c'}\forall c', \varvec{v}_{m,c'}\forall c'\backslash c) }\left[ \ln {\mathcal {N}}\left( \varvec{y}; {\tilde{\Phi }}(\varvec{z}), Q \right) \right] + \ln {\mathcal {N}}\left( \varvec{v}_{m,c}; \varvec{0}, \frac{\varvec{K}_{mm}}{{\mathbb {E}}\left[ s^{(v)}_c\right] }\right) + \text {const} \nonumber \\&= \ln {\mathcal {N}}\left( \varvec{v}_{m,c}; {\hat{\varvec{v}}}_{m,c}, \varvec{S}_c^{(v)} \right) , \end{aligned}$$
(22)

where posterior mean \({\hat{\varvec{v}}}_{m,c}\) and covariance \(\varvec{S}_c^{(v)}\) are computed using equations of the same form as Eqs. 17 and 18, except \(\varvec{Q}^{-1}\) is scaled by expectations over \(\varvec{w}_{m,c}\), and \({\hat{\varvec{f}}}_{m,i}\) is replaced by \({\hat{\varvec{v}}}_{m,c,i}\). The factor for the inducing points of \(\varvec{t}\) follows a similar pattern to \(\varvec{v}_{m,c}\):

$$\begin{aligned} \ln q(\varvec{t}_m)&=\, {\mathbb {E}}_{q(\varvec{w}_{m,c}\forall c, \varvec{v}_{m,c}\forall c)}\left[ \ln {\mathcal {N}}\left( \varvec{y}; {\tilde{\Phi }}(\varvec{z}), Q \right) \right] + \ln {\mathcal {N}}\left( \varvec{t}_m; \varvec{0}, \frac{\varvec{K}_{mm}}{{\mathbb {E}}[s^{(t)}]} \right) + \text {const} \nonumber \\&=\, \ln {\mathcal {N}}\left( \varvec{t}_m; {\hat{\varvec{t}}}_{m}, \varvec{S}^{(t)} \right) ,&\end{aligned}$$
(23)

where the equations for \({\hat{\varvec{t}}}\) and \(\varvec{S}^{(t)}\) are the same as Eqs. 17 and 18, except \({\hat{\varvec{f}}}_{m,i}\) is replaced by \({\hat{\varvec{t}}}_{m,i}\). Finally, the variational distribution for each inducing user’s component is:

$$\begin{aligned} \ln q(\varvec{w}_{ m,c} ) =&{\mathbb {E}}_{q(\varvec{t},\varvec{w}_{m,c'}\forall c'\backslash c, \varvec{v}_{m,c'}\forall c')}\left[ \ln {\mathcal {N}} \left( \varvec{y}; {\tilde{\Phi }}(\varvec{z}), Q \right) \right] + \ln {\mathcal {N}}\left( \varvec{w}_{ m,c}; \varvec{0}, \frac{\varvec{L}_{mm}}{{\mathbb {E}}[s^{(w)}_c]} \right) + \text {const} \nonumber \\&= \ln {\mathcal {N}}\left( \varvec{w}_{m,c}; {\hat{\varvec{w}}}_{m,c}, {\varvec{\Sigma }}_c \right) , \end{aligned}$$
(24)

where \({\hat{\varvec{w}}}_{m,c}\) and \({\varvec{\Sigma }}_{c}\) also follow the pattern of Eqs. 17 and 18, with \(\varvec{Q}^{-1}\) scaled by expectations over \(\varvec{v}_{m,c}\), and \({\hat{\varvec{f}}}_{m,i}\) replaced by \({\hat{\varvec{w}}}_{m,c,i}\). We provide the complete equations for the variational means and covariances for \(\varvec{v}_{m,c}\), \(\varvec{t}_m\) and \(\varvec{w}_{m,c}\) in “Appendix 3”. The expectations for the inverse scales, \(s^{(v)}_1,\ldots ,s^{(v)}_C\), \(s^{(w)}_1,\ldots ,s^{(w)}_C\) and \(s^{(t)}\), can be computed using Eq. 15 by substituting the corresponding terms for \(\varvec{v}_c\), \(\varvec{w}_c\) or \(\varvec{t}\) instead of \(\varvec{f}\).

Predictions for crowdGPPL can be made by computing the posterior mean utilities, \(\varvec{F}^*\), and the covariance \({\varvec{\Lambda }}_u^*\) for each user, u, in the test set:

$$\begin{aligned}&\varvec{F}^* = {\hat{\varvec{t}}}^* + \sum _{c=1}^C {\hat{\varvec{v}}}_{c}^{*T} {\hat{\varvec{w}}}_{c}^*, \qquad {\varvec{\Lambda }}_u^* = \varvec{C}_{t}^* + \sum _{c=1}^C \left( \omega _{c,u}^* \varvec{C}_{v,c}^* + {\hat{w}}_{c,u}^{*2} \varvec{C}_{v,c}^* + \omega _{c,u}^* {\hat{\varvec{v}}}_{c}^* {\hat{\varvec{v}}}_{c}^{*T}\right) ,&\end{aligned}$$
(25)

where \({\hat{\varvec{t}}}^*\), \({\hat{\varvec{v}}}_{c}^*\) and \({\hat{\varvec{w}}}_{c}^*\) are posterior test means, \(\varvec{C}_{t}^*\) and \(\varvec{C}_{v,c}^*\) are posterior covariances of the test items, and \(\omega _{c,u}^*\) is the posterior variance of the user components for u (see “Appendix 4”, Eqs. 39 to 41). The mean \(\varvec{F}^*\) and covariances \({\varvec{\Lambda }}^*_u\) can be inserted into Eq. 3 to predict pairwise labels. In practice, the full covariance terms are needed only for Eq. 3, so they need only be computed between items for which we wish to predict pairwise labels.
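As a sketch, the predictive equations in Eq. 25 can be assembled from the component posteriors as follows (shapes and names are illustrative; the per-user covariance is returned lazily, so full covariances are formed only for the items being compared):

```python
import numpy as np

def predict_crowd(t_star, C_t_star, v_star, C_v_star, w_star, omega_star):
    """Posterior mean utilities F* and per-user covariance (Eq. 25).

    t_star     : (N*,) consensus test means
    C_t_star   : (N*, N*) consensus test covariance
    v_star     : (C, N*) item component test means
    C_v_star   : (C, N*, N*) item component test covariances
    w_star     : (C, U*) user component test means
    omega_star : (C, U*) user component test variances
    """
    F_star = t_star[:, None] + v_star.T @ w_star   # (N*, U*)
    C_comp = w_star.shape[0]

    def Lambda(u):
        """Covariance of the utilities for user u."""
        cov = C_t_star.copy()
        for c in range(C_comp):
            cov += (omega_star[c, u] + w_star[c, u] ** 2) * C_v_star[c] \
                   + omega_star[c, u] * np.outer(v_star[c], v_star[c])
        return cov

    return F_star, Lambda
```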

5 Experiments

Table 1 Summary of datasets showing average counts for the training and test sets used in each fold/subsample

Our experiments test key aspects of crowdGPPL: predicting consensus utilities and personal preferences from pairwise labels, and the scalability of our proposed SVI method. In Sect. 5.1, we use simulated data to test the robustness of crowdGPPL to noise and unknown numbers of latent components. Section 5.2 compares different configurations of the model against alternative methods using the Sushi datasets (Kamishima 2003). Section 5.3 evaluates prediction performance and scalability of crowdGPPL in a high-dimensional NLP task with sparse, noisy crowdsourced preferences (UKPConvArgCrowdSample; Simpson and Gurevych 2018). Finally, Sect. 5.4 evaluates whether crowdGPPL ignores redundant components. The datasets are summarised in Table 1.

As baselines, we compare crowdGPPL against GPPL, which we train on all users’ preference labels to learn a single utility function, and GPPL-per-user, in which a separate GPPL instance is learned for each user with no collaborative learning. We also compare against the GPVU model (Khan et al. 2014) and collabGP  (Houlsby et al. 2012). CollabGP contains parameters for each pairwise label and each user, so has a larger memory footprint than our SVI scheme, which stores only the moments at the inducing points.

We test crowdBT (Chen et al. 2013) as part of a method for predicting consensus utilities from crowdsourced pairwise preferences. CrowdBT models each worker’s accuracy, assuming that the differences between workers’ labels are due to random errors rather than subjective preferences. Since crowdBT does not account for the item features, it cannot predict utilities for items that were not part of the training set. We therefore treat the posterior mean utilities produced by crowdBT as training labels for Gaussian process regression using SVI. We set the observation noise variance of the GP equal to the crowdBT posterior variance of the utilities to propagate uncertainty from crowdBT to the GP. This pipeline method, crowdBT–GP, tests whether it is sufficient to treat annotator differences as noise, in contrast to the crowdGPPL approach of modelling individual preferences.

We evaluate the methods using the following metrics: accuracy (acc), the fraction of correct pairwise labels; cross entropy error (CEE) between the posterior probabilities over pairwise labels and the true labels, which captures the quality of the pairwise posterior; and Kendall’s \(\tau\), which evaluates the ranking obtained by sorting items by predicted utility.
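These metrics can be computed, for example, as follows (a sketch; how ties and averaging are handled is a presentation choice):

```python
import numpy as np
from scipy.stats import kendalltau

def pairwise_accuracy(p_pred, y_true):
    """Fraction of pairs where the thresholded prediction matches the label."""
    return np.mean((p_pred > 0.5).astype(int) == y_true)

def cross_entropy_error(p_pred, y_true, eps=1e-12):
    """Mean cross entropy between predicted probabilities and gold labels."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def rank_correlation(f_pred, f_true):
    """Kendall's tau between predicted and gold-standard rankings."""
    return kendalltau(f_pred, f_true)[0]  # [0] extracts the statistic
```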

Fig. 1 Simulations: rank correlation between true and inferred utilities. a and b vary the level of noise in pairwise training labels, c varies the number of pairwise training labels (figure not reproduced here)

5.1 Simulated noisy data

First, we evaluate whether crowdGPPL is able to model individual preferences with varying amounts of labelling noise. We set the number of latent components to \(C=20\) and all Gamma hyperparameters for crowdGPPL, GPPL and GPPL-per-user to \(\alpha _0 = 1\), \(\beta _0 = 100\). We use Matérn 3/2 kernels with the length-scale for each dimension of the feature vector, d, chosen by a median heuristic:

$$\begin{aligned} l_{d,\mathrm {MH}} = \mathrm {median}( \{ ||x_{i,d} - x_{j,d}||, \forall i=1,\ldots ,N, \forall j=1,\ldots ,N\} ). \end{aligned}$$
(26)

This is a computationally frugal way to choose the length-scales that has been used extensively in various kernel methods (e.g., Bors and Pitas 1996; Gretton et al. 2012). The SVI hyperparameters were set to \(\rho =0.9\), \(P_i=1000\) and \(\epsilon =1\). Hoffman et al. (2013) found that higher values of \(\rho\) gave better final results but slightly slower convergence, recommending 0.9 as a good balance across several datasets, and did not find any effect from changing \(\epsilon\). We follow their recommendations and do not find it necessary to perform further tuning in our experiments. Both M and \(P_i\) are constrained in practice by the computational resources available; we investigate these further in Sect. 5.3.
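A direct transcription of Eq. 26 is shown below; it computes all \(N^2\) pairwise differences per dimension, so for large N one would subsample pairs:

```python
import numpy as np

def median_heuristic_lengthscales(X):
    """Per-dimension length-scales from the median heuristic (Eq. 26)."""
    N, D = X.shape
    ls = np.empty(D)
    for d in range(D):
        # All pairwise absolute differences in dimension d (including i = j).
        diffs = np.abs(X[:, d, None] - X[None, :, d])
        ls[d] = np.median(diffs)
    return ls
```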

In simulation (a), to test consensus prediction, we generate a \(20\times 20\) grid of points and split them into 50% training and test sets. For each gridpoint, we generate pairwise labels by drawing from the generative model of crowdGPPL with \(U=20\) users, \(C=5\), each \(s^{(v)}_c\) set to random values between 0.1 and 10, and \(s^{(w)}_c = 1, \forall c\). We vary \(s^{(t)}\) to control the noise in the consensus function. We train and test crowdGPPL with \(C=U\) and repeat the complete experiment 25 times, including generating new data.

Figure 1a shows that crowdGPPL better recovers the consensus ranking than the baselines, even as noise increases, as GPPL’s predictions are worsened by biased users who deviate consistently from the consensus. For GPPL-per-user, the consensus is simply the mean of all users’ predicted utilities, so does not benefit from sharing information between users when training. For simulation (b), we modify the previous setup by fixing \(s^{(t)} = 5\) and varying \(s^{(v)}_c,\forall c\) to evaluate the methods’ ability to recover the personal preferences of simulated users. The results in Fig. 1b show that crowdGPPL is able to make better predictions when noise is below 0.3.

We hypothesise that crowdGPPL can recover latent components given sufficient training data. In simulation (c), we generate data using the same setup as before, but fix \(s^{(t)} = s^{(v)}_c = s^{(w)}_c = 1,\forall c\) and vary the number of pairwise training labels and the number of true components through \(C_{\mathrm {true}} \in \{ 1, 3, 10, 20\}\). We match inferred components to the true components as follows: compute Pearson correlations between each unmatched true component and each unmatched inferred component; select the pair with the highest correlation as a match; repeat until all true components are matched (see the sketch below). In Fig. 1c we plot the mean correlation between matched pairs of components. For all values of \(C_{\mathrm {true}}\), increasing the number of training labels beyond 700 brings little improvement. Performance is highest when \(C_{\mathrm {true}} = 20\), possibly because the predictive model has \(C = 20\), so is a closer match to the generating model. However, crowdGPPL is able to recover latent components reasonably well for all values of \(C_{\mathrm {true}}\) given \(>500\) labels, despite mismatches between C and \(C_{\mathrm {true}}\).
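The greedy matching procedure can be sketched as follows, where V_true and V_inferred are illustrative names for the stacked true and inferred item components:

```python
import numpy as np

def match_components(V_true, V_inferred):
    """Greedily match inferred components to true ones by Pearson correlation.

    V_true     : (C_true, N) true latent item components
    V_inferred : (C, N) inferred components, with C >= C_true
    Returns the mean correlation over matched pairs.
    """
    # Cross-correlation block between true (rows) and inferred (columns).
    corr = np.corrcoef(V_true, V_inferred)[:len(V_true), len(V_true):]
    matched = []
    for _ in range(len(V_true)):
        i, j = np.unravel_index(np.nanargmax(corr), corr.shape)
        matched.append(corr[i, j])
        corr[i, :] = np.nan   # remove the matched true component
        corr[:, j] = np.nan   # remove the matched inferred component
    return float(np.mean(matched))
```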

5.2 Sushi preferences

Table 2 Predicting personal preferences on Sushi datasets, means over 25 repeats

The sushi datasets contain, for each user, a gold standard preference ranking of 10 types of sushi, from which we generate gold-standard pairwise labels. To test performance with very few training pairs, we obtain Sushi-A-small by selecting 100 users at random from the complete Sushi-A dataset, then selecting 5 pairs for training and 25 for testing per user. For Sushi-A, we select 100 users at random from the complete dataset, then split the data into training and test sets by randomly selecting 20 training and 25 test pairs per user. For Sushi-B, we use all 5000 workers, and subsample 10 training and 1 test pair per user.

We compare standard crowdGPPL with four other variants:

  • crowdGPPL\(\backslash\)inducing: does not use the sparse inducing point approximation and instead uses all the original points in the training set;

  • crowdGPPL\({\backslash \varvec{u}}\): ignores the user features;

  • crowdGPPL\({\backslash \varvec{u} \backslash \varvec{x}}\): ignores both user and item features;

  • crowdGPPL\({\backslash \varvec{u} \backslash \varvec{t}}\): excludes the consensus function \(\varvec{t}\) from the model as well as the user features.

For methods with \(\backslash \mathbf{u}\), the user covariance matrix, \(\mathbf{L}\), is replaced by the identity matrix, and for crowdGPPL\(\backslash \mathbf{u} \backslash \mathbf{x}\), \(\mathbf{K}\) is also replaced by the identity matrix. As the user features do not contain detailed personal information (only region, age group, gender, etc.), they are not expected to be sufficiently informative to predict personal preferences on their own. Therefore, for crowdGPPL and crowdGPPL\(\backslash\)inducing, we compute \(\mathbf{L}\) for 10 latent components using the Matérn 3/2 kernel function and use the identity matrix for the remaining 10. CollabGP is also tested with and without user features. We set hyperparameters \(C=20\), \(\epsilon=1\), \(\rho=0.9\), \(P_i=200\) for Sushi-A-small and Sushi-A, and \(P_i=2000\) for Sushi-B, without optimisation. For the gamma hyperparameters, a grid search over \(\{10^{-1},\ldots,10^3\}\) on withheld user data from Sushi-A resulted in \(\alpha_0=1, \beta_0=100\) for GPPL variants, and \(\alpha_0^{(t)}=1,\beta_0^{(t)}=100\), \(\alpha_0^{(v)}=1,\beta_0^{(v)}=10\) and \(\alpha_0^{(w)}=1,\beta_0^{(w)}=10\) for crowdGPPL variants. The complete process of subsampling, training, and testing was repeated 25 times for each dataset.
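As a concrete illustration of this per-component covariance choice, the sketch below builds \(\mathbf{L}\) with a Matérn 3/2 kernel over the user features for the first 10 components and the identity for the rest; the feature shapes and length-scale are assumptions.

```python
# Per-component user covariances: Matérn 3/2 over user features for the
# first C_featured components, identity matrices for the remainder.
import numpy as np
from scipy.spatial.distance import cdist

def matern32(feats, ls=1.0):
    d = cdist(feats / ls, feats / ls)
    return (1.0 + np.sqrt(3) * d) * np.exp(-np.sqrt(3) * d)

def user_covariances(user_feats, C=20, C_featured=10, ls=1.0):
    n = len(user_feats)
    K_u = matern32(user_feats, ls)
    return [K_u if c < C_featured else np.eye(n) for c in range(C)]
```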

The results in Table 2 illustrate the benefit of personalised models over single-user GPPL. The inducing point approximation does not appear to harm the performance of crowdGPPL, but including the user features tends to decrease its performance compared to crowdGPPL\(\backslash \mathbf{u}\) and crowdGPPL\(\backslash \mathbf{u} \backslash \mathbf{x}\), except on Sushi-A-small, where they may help given the small amount of training data. Comparing crowdGPPL\(\backslash \mathbf{u}\) with crowdGPPL\(\backslash \mathbf{u} \backslash \mathbf{t}\), including the consensus function improves performance modestly. The strong performance of GPPL-per-user suggests that even 10 pairs per person were enough to learn a reasonable model for Sushi-B. As expected, the more memory-intensive collabGP performs comparably to crowdGPPL on accuracy and CEE but does not provide a ranking function for computing Kendall’s \(\tau\). GPVU does not perform as well as the other personalised methods on Sushi-A and Sushi-B, potentially due to its maximum likelihood inference steps. These results show that crowdGPPL is competitive despite its approximate SVI method, so in the next experiment, we test the approach on a larger crowdsourced dataset where low memory consumption is required.

5.3 Argument convincingness

We evaluate consensus learning, personal preference learning and scalability on an NLP task, namely, ranking arguments by convincingness. The task involves learning from crowdsourced data, but is not simply an aggregation task, as it requires a predictor for test documents that were not compared by the crowd. The dataset, UKPConvArgCrowdSample, was subsampled by Simpson and Gurevych (2018) from raw data provided by Habernal and Gurevych (2016), and contains arguments written by users of online debating forums, with crowdsourced judgements of pairs of arguments indicating the most convincing argument. The data is divided into 32 folds (16 topics, each with 2 opposing stances). For each fold, we train on 31 folds and test on the remaining fold. We extend the task to predicting both the consensus and the personal preferences of individual crowd workers. GPPL previously outperformed SVM and Bi-LSTM methods at consensus prediction for UKPConvArgCrowdSample (Simpson and Gurevych 2018). We hypothesise that a worker’s view of convincingness depends on their personal view of the subject discussed, so crowdGPPL may outperform GPPL and crowdBT-GP on both consensus and personal preference prediction.

The dataset contains 32,310 linguistic and embedding features for each document (we use mean GloVe embeddings for the words in each document; see Simpson and Gurevych (2018)). The high dimensionality of the input feature vectors requires us to modify the length-scale heuristic for all GP methods, as the distance between items grows with the number of dimensions, which causes the covariance to shrink to very small values. We therefore use \(l_{d,\mathrm{scaledMH}} = 20\sqrt{D} \times l_{d,\mathrm{MH}}\), where D is the dimensionality of the input feature vectors and the scale was chosen by comparing training set accuracy with scales in \(\{\sqrt{D}, 10\sqrt{D}, 20\sqrt{D}, 100\sqrt{D}\}\). The hyperparameters are the same as in Sect. 5.1, except that GPPL uses \(\alpha_0 = 2\), \(\beta_0 = 200\) and crowdGPPL uses \(\alpha^{(t)}_0=\alpha^{(v)}_0=2\), \(\beta^{(t)}_0=\beta^{(v)}_0=200\), \(\alpha^{(w)}_0=1\), \(\beta^{(w)}_0=10\). We do not optimise \(\alpha_0\), but choose \(\beta_0\) by comparing training set accuracy for GPPL with \(\beta_0 \in \left\{ 2,200,20000\right\}\). The best value of \(\beta_0\) is also used for \(\beta^{(t)}_0\) and \(\beta^{(v)}_0\); the training set accuracy of crowdGPPL is then used to select \(\beta^{(w)}_0 \in \left\{ 1, 10, 100 \right\}\). We set \(C=50\), \(M=500\), \(P_i=200\), \(\epsilon=10\), and \(\rho=0.9\) without optimisation.
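The scaled heuristic is simple to compute; the sketch below assumes \(l_{d,\mathrm{MH}}\) is the per-dimension median of absolute differences between feature values (the usual median heuristic), estimated from a random subsample of pairs for efficiency.

```python
# Scaled median heuristic: l_d = 20 * sqrt(D) * median(|x_id - x_jd|).
import numpy as np

def scaled_median_lengthscales(X, scale=20.0, max_pairs=1000, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    i = rng.integers(0, N, max_pairs)
    j = rng.integers(0, N, max_pairs)
    med = np.median(np.abs(X[i] - X[j]), axis=0)   # per-dimension l_{d,MH}
    return scale * np.sqrt(D) * np.maximum(med, 1e-10)
```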

Table 3 UKPConvArgCrowdSample: predicting consensus, personal preferences for all workers, and personal preferences for workers with >50 pairs in the training set

Table 3 shows that crowdGPPL outperforms both GPPL and crowdBT-GP at predicting both the consensus and personal preferences (significant for Kendall’s \(\tau\) with \(p<0.05\), Wilcoxon signed-rank test), suggesting that there is a benefit to modelling individual workers in subjective, crowdsourced tasks. We also compare against crowdGPPL without the consensus (crowdGPPL\(\backslash \mathbf{t}\)) and find that including \(\mathbf{t}\) in the model improves personalised predictions. This is likely because many workers have few training pairs, so the consensus helps to identify arguments that are commonly considered very poor or very convincing. Table 3 also shows that for workers with more than 50 pairs in the training set, accuracy and CEE improve for all methods but \(\tau\) decreases, suggesting that some items are ranked further from their correct positions for these workers. It is possible that workers who were willing to complete more annotations (on average 31 per fold) deviate further from the consensus, and crowdGPPL does not fully capture their preferences given the data available.
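For reference, the significance test reported above amounts to a paired comparison of per-fold scores, e.g.:

```python
# Wilcoxon signed-rank test over the 32 per-fold Kendall's tau values of
# two methods (fold-level scores are paired by fold).
from scipy.stats import wilcoxon

def significantly_better(taus_a, taus_b, alpha=0.05):
    stat, p = wilcoxon(taus_a, taus_b)
    return p < alpha, p
```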

Fig. 2 Wall-clock times for training+prediction of consensus utilities for arguments in the training folds of UKPConvArgCrowdSample. CrowdGPPL was run with \(C=5\). In b, c and d, \(M=100\). Lines show means over 32 runs; bands indicate one standard deviation (mostly very little variation between folds)

We examine the scalability of our SVI method by evaluating GPPL and crowdGPPL with different numbers of inducing points, M, and different mini-batch sizes, \(P_i\). Figure 2a shows the trade-off between runtime and training set accuracy as an effect of choosing M. Accuracy levels off as M increases, while runtime continues to grow polynomially. Using inducing points can therefore give a large improvement in runtime at a fairly small cost in performance. Figure 2b demonstrates that smaller batch sizes do not negatively affect accuracy, although they increase runtimes because more iterations are required for convergence. The runtimes flatten out as \(P_i\) increases, so we recommend choosing \(P_i\ge 200\) but small enough to complete an iteration rapidly with the computational resources available. Figure 2c, d show runtimes as a function of the number of items in the training set, N, and the number of pairwise training labels, P, respectively (all other settings remain as in Fig. 2a). In both cases, the increase in runtime is small, despite the growing dataset size.
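A sketch of the sweep behind Fig. 2a is given below; the model constructor and scoring call are hypothetical placeholders, not the actual API.

```python
# Sweep the number of inducing points M, recording wall-clock training
# time and training-set accuracy for each setting.
import time

def sweep_inducing_points(make_model, train_data, Ms=(50, 100, 200, 500)):
    results = []
    for M in Ms:
        model = make_model(n_inducing=M)       # hypothetical constructor
        start = time.perf_counter()
        model.fit(train_data)                  # hypothetical training call
        elapsed = time.perf_counter() - start
        results.append((M, elapsed, model.score(train_data)))
    return results
```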

Fig. 3 Latent component variances, \(1/\left( s^{(v)}_c s^{(w)}_c\right)\), in crowdGPPL; means over all runs

5.4 Posterior variance of item components

We investigate how many latent components were actively used by crowdGPPL on the UKPConvArgCrowdSample and Sushi-A datasets. Figure 3 plots the posterior expectations of the inferred component variances, \(1/\left( s^{(v)}_c s^{(w)}_c\right)\), for the latent item components. The plots show that many components have relatively small variances and therefore contribute little to the model’s predictions. This indicates that our Bayesian approach makes use only of components that are supported by the data, even if C is larger than required.
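Assuming the variational posteriors over the scales are Gamma distributions with mean \(a/b\) (an assumption about the inference details, which use gamma hyperpriors), the plotted quantity and a count of active components can be computed as follows:

```python
# Per-component variances 1/(E[s_v] E[s_w]); components whose variance
# falls below a threshold are treated as effectively unused.
import numpy as np

def component_variances(a_v, b_v, a_w, b_w, threshold=0.05):
    s_v = np.asarray(a_v) / np.asarray(b_v)    # E[s_v] per component
    s_w = np.asarray(a_w) / np.asarray(b_w)    # E[s_w] per component
    var = 1.0 / (s_v * s_w)
    return var, int((var > threshold).sum())   # variances, active count
```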

6 Conclusions

We proposed a novel Bayesian preference learning approach for modelling both the preferences of individuals and the overall consensus of a crowd. Our model learns the latent utilities of items from pairwise comparisons using a combination of Gaussian processes and Bayesian matrix factorisation to capture differences of opinion. We introduced a stochastic variational inference (SVI) method that, unlike previous work, can scale to arbitrarily large datasets, since its time and memory complexity do not grow with the dataset size. Our experiments confirm the method’s scalability and show that jointly modelling the consensus and personal preferences improves predictions of both. Our approach performs competitively against less scalable alternatives and improves on the previous state of the art for predicting argument convincingness from crowdsourced data (Simpson and Gurevych 2018).

Future work will investigate learning inducing point locations and optimising length-scale hyperparameters by maximising the variational lower bound, \({\mathcal {L}}\), as part of the variational inference method. Another important direction will be to generalise the likelihood from pairwise comparisons to comparisons involving more than two items (Pan et al. 2018) or best–worst scaling (Kiritchenko and Mohammad 2017) to provide scalable Bayesian methods for other forms of comparative preference data.