1 Introduction

Globally, in 2017 approximately 11 million deaths and a loss of 255 million disability-adjusted life-years were attributable to poor diet [1]. In the UK, 63% of the adult population are overweight and 27% are obese (Footnote 1). Given that nutrition is a key determinant of health and wellbeing, understanding food choices is of paramount importance. The process that leads to food choices is complex and has been studied from several points of view, ranging from the biological [2, 3] to the demographic [4–6] and socioeconomic [7], focusing predominantly on developed Western countries. However, only recently has the cultural aspect started to garner attention [8–10]. Culture can be seen as a sort of collective memory that influences individual behaviors [11]. Food choices can be reinforced via the existence of “lifestyle enclaves” [12], where small preferences get amplified by elective affinities. For instance, occupation might influence food intake through work-related social circles [13, 14].

In this work, we take inspiration from the theory of the French sociologist Pierre Bourdieu, which links health and lifestyle to social class identity [15]. This identity is then differentiated through “taste” in music, art, and, in our context, culinary preferences. The theory connects taste to the pursuit of cultural capital, a non-material resource that accumulates throughout the life course [16]. Bourdieu introduced three forms of cultural capital [17]: incorporated (e.g., skills, competencies, personal effort, and time investment); objectified (e.g., possession of books, dictionaries, and instruments related to artistic expression); and institutionalized (e.g., educational attainment at the level of the individual or family).

We focus on the third form of cultural capital, operationalized as the level of educational attainment. In particular, we show how education can explain inequities in food choices that go beyond socioeconomic determination and the consequent barriers to food access due to cost. In this respect, there is general agreement among researchers [4, 18, 19] that education and income are distinct concepts that are likely to make separate and unique contributions to health outcomes [20]. Education has been widely studied in connection to health and food choices for several reasons. First, it may provide the tools to access and comprehend dietary information and its impact on health. Second, social diffusion theory suggests that highly educated people generally take up innovations sooner than less-educated people [21], which might affect the success of health-related interventions. Third, it can affect time use and thus the opportunities to allocate time to food acquisition and preparation.

Existing studies around cultural capital and nutrition use small-scale, traditional methodologies [10, 22–25] such as food diaries, questionnaires [26, 27], and surveys [21, 28]. As such, existing datasets have not allowed the modeling of spatial clustering, which is an important confounder for both culture and health, as has been shown in other fields [29]. The presence of this confounder may invalidate the associations discovered by conventional methods such as OLS regression, and produce estimates that are biased and inconsistent. This confounder occurs when individual data points are not independent and identically distributed but are instead spatially correlated. Observed variation in the dependent variable may thus arise from latent influences related to culture, infrastructure, recreational amenities, and a host of other factors not present in the data. Modeling these latent influences is important especially because many health interventions work at a community level, such as “farm to school” programs [30], food literacy initiatives [31], and community gardens [32].

In this work, we use a unique fine-grained dataset that allows us to reveal the role of space and geography in modeling the relationship between food consumption and culture. Unlike traditional approaches, we use a large-scale log of the food-related purchases of 1.6M customers of a major retail chain in London across the time span of a year [33]. This resource allows us to observe, in an indirect way, the daily food choices at an unprecedented granularity, with the assumption that supermarket purchases largely represent the dietary intake of a household (Footnote 2). This granularity enables the exploration of two complementary dimensions that capture multiple facets of food consumption: (1) diet composition, along the dimensions of macronutrients and product categories, and (2) diet diversity.

Through the use of spatially-aware regression models, which include multiple environmental features of an areal unit, we illustrate the effects of spatial clustering in the emergence of localized communities with homogeneous behavior, and the advantages over standard regression approaches on fitting performance and biases in the estimates. We further show the presence of spillover effects within neighboring communities, all of which need to be considered when designing policies and interventions, or informing decision support systems for public health.

2 Related work

A large body of literature explores the interplay between socioeconomic status and food consumption, showing how inequities in the access to resources, privilege, or power, play a role in shaping people’s dietary habits. The dimension of wealth, often modeled with the average household income or employment rate, has been connected to the ability to afford certain categories of food products. Cost has been mentioned as one of the main obstacles to a widespread consumption of fruits, vegetables [34, 35], or lean meats [36], and connected to the lack of a healthy diet [37]. The individual and household disadvantages are compounded by residential segregation. For example, disadvantaged communities often face spatio-structural barriers to access to food [38], as well as to physical activities, fitness clubs, and weight loss programs [36], all of which have effects on health outcomes. As a result, disadvantaged communities have a higher incidence of obesity than wealthy ones [39]. In this direction, a vast body of literature studies the relationship between food environment and diet to capture the degree of food access using both respondent-based perceived measures and quantitative approaches that leverage Geographic Information System (GIS) technology [40]. The latter commonly use store density (using buffer distances), or proximity to the nearest food store to operationalize food access [41]. Another common method involves store audits, in which researchers estimate the shelf-space occupied by certain foods in a store, or assess product variety and food prices within stores using measures such as the Nutrition Environment Measure Survey (NEMS) [42]. Several food environment conceptualizations have been proposed, mainly divided into community food environment and consumer food environment [43] that draw attention to the distribution of food sources within a community or within a local retailer, respectively. 
It has been suggested that a largely accepted food access conceptualization involves 5 dimensions relevant in the healthcare setting: availability, accessibility, affordability, acceptability, and accommodation [41]. Refer to [44] for a literature review. In addition, spatial modeling and GIS have been adopted to characterize additional dimensions of the food ecosystem. For example, Khushi et al. [45] investigated spatial inequality in food consumption, nutrient consumption, and production-consumption gaps at the sub-national levels in the Punjab province of Pakistan. Dohyeong et al. [46] characterized the spatial patterns of unhealthy food consumption in South Korea and modeled the presence of areas with constrained access to fresh and nutritious foods, providing guidelines for targeted nutrition and public health programs. Moreover, methods for mapping provincial spatial food consumption data by accounting for spatial variability in population structure (age and gender) have been proposed in [47] with the intent to inform policy makers interested in promoting the consumption of locally produced food, as well as assessing localized nutritional demand. However, in contrast with our analysis, all these studies are based on coarse geographical units such as administrative districts or regions, making it hard to address the variability in consumption within localized communities, e.g., neighborhoods, that often show a pronounced diversity, especially in multi-cultural and multi-ethnic megacities.

Along with the socioeconomic dimensions, it is broadly acknowledged that the cultural group to which one belongs is of great importance when it comes to food preferences [48]. In the 1980s, the French sociologist Bourdieu [15] proposed a theory on the relationship between material and non-material capital to explain social inequalities, stratification, and the distribution of power. Bourdieu connected taste, a multidimensional concept involving attributes such as musical, artistic, and culinary preferences, to the pursuit of cultural capital, a non-material resource that accumulates throughout one’s life course [15].

Even though several studies attempted to quantitatively characterize the different forms of cultural capital and their relation to food choices [10, 22–25], they were mainly based on interviews and questionnaires on small samples of the population and potentially affected by common biases [49] related to the way a question is designed or administered. Kamphuis et al. [16] performed a systematic review of cultural capital indicators; they identify several indicators of family institutionalized (e.g. parents’ education completed), objectified (e.g. possession of books, art), and incorporated cultural capital (e.g. cultural participation, skills). After designing a questionnaire to capture these dimensions along with food habits of the participants, they found evidence of a connection between cultural capital and healthy food choices. The link between healthy diet and cultural capital has been observed in several studies. For instance, in a study of a cohort of adolescents in Norway, Fismen et al. [50] identified cultural capital as a stronger predictor than material capital of disparities in consumption of fruit and vegetables (positively correlated), and it was the only significant predictor of consumption of sweets and sugared soft drinks (negatively correlated).

Institutionalized cultural capital in the form of educational attainment has been widely adopted by the studies that focused on modeling cultural inequities and food choices, since level of education arguably affects what type of social milieu people inhabit, and consequently it affects what type of food one is exposed to [15]. Moreover, the availability of aggregated data at a fine-grained geographical scale, usually from the census, is another driving reason for this choice, one that we also embrace in this work. A low educational level has been connected to diets higher in fat density [51–53], ultra-processed and ready-made foods [54], sugar-rich products [55], meat products (especially red meat) [21], and to a lower food group variety [8, 21]. In contrast, highly educated people tend to consume more fruits and vegetables [55] and fish [56], and to follow a more diverse diet. Finally, social diffusion theory suggests that highly educated people generally take up innovations sooner than less-educated people. For example, in the UK, foods and diets low in saturated fat were adopted by the tertiary-educated before others [57]. Social epidemiologists also suggest that education enables people to rise up the social class hierarchy, thus allowing them greater power over outcomes in their lives, for example through higher incomes. In this study, we aim to re-examine some of these trends using a new high-granularity, large dataset of nutritional behavior.

3 Methods

3.1 Data

In this work, we characterize food consumption by using the Tesco Grocery 1.0 dataset [33], which contains an anonymized record of 420M food items purchased by 1.6M loyalty card owners who shopped in one of the 411 Tesco stores in Greater London during 2015. Tesco is the largest food retailer in the UK, with around 30% market share and a solid geographical coverage in the area of study. The dataset contains aggregated and privacy-preserving data views that combine individual purchases at different spatial granularities by using the home location field from the loyalty card application as the way to geolocate customers. The fine-grained geographical information included in Tesco Grocery 1.0 is the key to link food consumption data to any attribute that can be measured at the level of statistical census areas, e.g., demographic, socioeconomic, and health determinants. In [33], the authors provide an analysis of the representativeness of the Tesco customer base by comparing the number of unique customers to the general population, and report a solid match. Moreover, they support the ecological validity of the dataset by comparing the grocery purchases with metabolic syndrome conditions that are strongly linked to food consumption habits. An in-depth discussion on sample bias is provided in the Limitations section of the current work.

To model food consumption, we focus on three groups of variables of interest: macronutrients and product categories, which together serve as a proxy for diet composition, and diet variety. The first captures the nutritional properties of a food product and is connected to the concept of energy intake, which we measure in calories. In fact, a food item contains different types of nutrients in different proportions, which are transformed by the human body into energy and structural material for its growth and maintenance. We consider the following nutrients that have been connected to diet and culture in the literature: fats, carbohydrates, proteins, and fibers. A few studies distinguish between different types of fats, e.g., saturated fats, which are fat molecules that have no double bonds between carbon atoms because they are saturated with hydrogen atoms. However, in our work, we model the lipids intake within a single macro category. This is also justified by the high degree of correlation observed between the variables fats and saturated fats in the Tesco dataset (\(\rho _{\mathrm{fats}}^{\mathrm{saturated}\ \mathrm{fats}}=0.8\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)). A similar observation holds for sugar, a term that loosely refers to a number of carbohydrates, such as monosaccharides, disaccharides, or oligosaccharides, and whose excessive consumption has been implicated in the onset of obesity, diabetes, cardiovascular diseases, dementia, and tooth decay [58–60]. Accordingly, we do not consider sugar separately in the experimental setting due to the strong correlation with the macronutrient carbohydrates (\(\rho _{\mathrm{carbohydrates}}^{\mathrm{sugar}}=0.85\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)).

The second group of variables is related to a classification of the food products into categories. Even though several food taxonomies have been proposed in the literature [61], there is no consensus on how to group foods [62]. In this work, we adopt the classification in [33], which focuses on the following categories: oils, fish, produce, red meats, readymade, and sweets. Among the food categories, we observe two pairs of highly correlated variables, poultry and red meats (\(\rho _{\mathrm{poultry}}^{\mathrm{red}\ \mathrm{meats}}=0.77\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)), and sweets and grains (\(\rho _{\mathrm{sweets}}^{\mathrm{grains}}=0.79\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)). As such, in the rest of the experiments, we use red meats to characterize food products containing animal flesh, and sweets to include baked products using grain flours and, for example, candies or chocolate. The produce category includes fresh vegetables and fruits, while readymade contains pre-cooked meals that are usually available in a specific area of the store and that need to be opened, often warmed up, and eaten. Finally, to test the hypothesis that eating a wide variety of foods improves dietary adequacy, we adopt the normalized entropy of the distributions of nutrients (h_nutrients) and food groups (h_products) as proxies for variety. Moreover, we consider the weight and the calorie intake (energy) of the average product sold in an area as a measure of quantity. At this stage, we have for each spatial unit a characterization of the diet quality and variety in that area that could be potentially linked to census variables. It is worth noting that we represent the nutritional features of the hypothetical average product consumed in an area, since we cannot characterize individual or group behaviors.
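The variety proxies can be illustrated with a short sketch. The function below computes a normalized Shannon entropy of a vector of shares, assuming (the text does not spell this out) that normalization divides by the logarithm of the number of categories, so that the value lies between 0 and 1. The function name and the toy shares are hypothetical and are not taken from the dataset.

```python
import math


def normalized_entropy(shares):
    """Shannon entropy of a share vector divided by log(number of categories).

    Assumption: normalization by log(K), with K the number of categories.
    Categories with a zero share contribute nothing to the sum.
    """
    total = sum(shares)
    if total <= 0 or len(shares) < 2:
        raise ValueError("need at least two categories and a positive total")
    probs = [s / total for s in shares]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(shares))


# Hypothetical shares of fats, carbohydrates, proteins, and fibers in two areas
balanced = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # evenly spread diet
skewed = normalized_entropy([0.85, 0.05, 0.05, 0.05])    # concentrated diet
```

Under this definition, an area whose purchases are spread evenly across categories scores close to 1, while an area dominated by a single category scores close to 0.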

We characterize the educational attainment variables by using the 2011 Census data and, in particular, the table Highest level of qualification by sex (Footnote 3) in the Local Characteristics series. The highest level of qualification is derived from the question asking people to indicate the types of qualifications held. The following levels are available: no qualifications, level 1, level 2, apprenticeship, level 3, level 4 and above, and other qualifications. In this work, we restrict our analysis to a representative class for the low-educated (level 1) and high-educated (level 4+) population strata. This design choice implies a linear approximation of the effect of education, which might hide more complex effects of the distribution of education levels. However, we think that the interpretability of the model, and its applicability, offer a good tradeoff for this simplifying assumption. To account for confounding variables that could influence both education and food choices, we explore the dimensions of gender, age, and income, which have been extensively linked to dietary habits in previous work. Gender and age, in the form of average age, are extracted from the census, while economic status is quantified via a model-based estimate of the equivalized net income per household after the deduction of housing costs (Footnote 4). All the variables are standardized to zero mean and unit variance. A summary of all the variables used in this study can be found in Table 1, and a characterization of the spatial distribution and cross-correlation is presented in Additional file 1.

Table 1 Description and source of the variables used for characterizing food consumption, educational attainment, and socioeconomic confounding factors

Data is aggregated at the geographical level of administrative areas in the UK, thus implementing a privacy-preserving methodology. In particular, we adopt as a reference the spatial units of Middle Super Output Areas (MSOA), which have an average population of 7200 [63] and an average surface of 1.6 square kilometers within the Greater London region. In contrast with previous work, MSOAs provide a considerably finer granularity that enables community-level observations. We refer to the variable representativeness (\(\mathrm{mean}=0.37\), \(\sigma =0.16\)) defined in [8] as the min-max normalized ratio between the number of unique customers and the number of residents to characterize how much the user base captures the census statistics. We limit our analysis to the 774 out of 983 geographical units that have a \(\mathtt{representativeness} \geq 0.15\) (see Fig. 1 for more details on the spatial coverage). The hierarchical structure and the shapefiles of the various census units are provided by the Open Geography Portal (Footnote 5) of the Office for National Statistics (ONS) (Footnote 6).
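The filtering step can be sketched as follows. This is a minimal illustration of a min-max normalized customer-to-resident ratio with the 0.15 threshold quoted above; the area identifiers and counts are invented for the example and do not come from the dataset.

```python
def representativeness(customers, residents):
    """Min-max normalized ratio of unique customers to residents, per area."""
    ratios = [c / r for c, r in zip(customers, residents)]
    lo, hi = min(ratios), max(ratios)
    return [(x - lo) / (hi - lo) for x in ratios]


# Hypothetical areas: unique customers and census residents
areas = ["A", "B", "C", "D"]
customers = [50, 200, 400, 900]
residents = [7000, 7200, 7400, 7500]

scores = representativeness(customers, residents)

# Keep only the areas at or above the threshold used in the text
THRESHOLD = 0.15
kept = [a for a, s in zip(areas, scores) if s >= THRESHOLD]
```

Because the ratio is min-max normalized, the score of an area depends on the least and most covered areas in the sample, so the threshold is relative rather than an absolute coverage rate.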

Figure 1 Representativeness of the Tesco customers base. Gray areas are filtered out from the experimental setting due to low significance

3.2 Spatial modeling

Since a large share of socioeconomic and cultural phenomena are driven by spatially-aware data generation processes, and the modeling of food consumption has rarely been addressed in space, we approach our problem with spatial econometrics tools. To formalize the neighboring relation between areal units we refer to spatial weights [64], and we adopt a contiguity approach based on the binary queen criterion, where \(w_{h,k}=1\) if the areas h and k share at least one vertex, and 0 otherwise. To test the sensitivity of the results to the choice of different spatial arrangements, we explore alternative weights structures based on distance, in particular the k-nearest neighbors (knn), where each area has a fixed number of k closest neighbors, and approaches based on kernel functions with adaptive bandwidth. To diagnose the presence of spatial dependence in the outcome variables or in the residuals of the regressions, we rely on the global measure of spatial autocorrelation Moran’s I [65], which tests whether the spatial distribution of a variable differs significantly from the expected value under the null hypothesis of spatial randomness.
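The two families of weights described above can be sketched in a few lines. The code below builds binary queen contiguity weights from polygon vertices and binary k-nearest-neighbor weights from centroids. It assumes that neighboring polygons share exactly identical vertex coordinates, which real shapefiles do not always guarantee; the four toy squares, their identifiers, and the function names are hypothetical.

```python
import math


def queen_weights(polygons):
    """w[h][k] = 1 if polygons h and k share at least one vertex (else absent)."""
    ids = list(polygons)
    vertex_sets = {i: set(polygons[i]) for i in ids}
    return {
        h: {k: 1 for k in ids if k != h and vertex_sets[h] & vertex_sets[k]}
        for h in ids
    }


def knn_weights(centroids, k):
    """w[h][i] = 1 for the k centroids closest to h (ties broken by input order)."""
    ids = list(centroids)
    weights = {}
    for h in ids:
        others = sorted(
            (i for i in ids if i != h),
            key=lambda i: math.dist(centroids[h], centroids[i]),
        )
        weights[h] = {i: 1 for i in others[:k]}
    return weights


# Toy layout: three unit squares in a row (A, B, C) and one square (D) on top of B
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 0), (3, 0), (3, 1), (2, 1)],
    "D": [(1, 1), (2, 1), (2, 2), (1, 2)],
}
centroids = {"A": (0.5, 0.5), "B": (1.5, 0.5), "C": (2.5, 0.5), "D": (1.5, 1.5)}

w_queen = queen_weights(polygons)
w_knn = knn_weights(centroids, k=1)
```

In the toy layout, A and D touch only at the corner (1, 1), so they are neighbors under the queen criterion although they share no edge. Note also that, unlike contiguity weights, k-nearest-neighbor weights are not necessarily symmetric.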

In this work, we start from the following vector notation of the general linear spatial econometric model for cross-sectional data:

$$ Y = \rho WY + \alpha \iota _{N} + X\beta + WX\theta + u, \quad \text{where } u=\lambda Wu + \varepsilon . \tag{1} $$

Y denotes an \(N\times 1\) vector with the observations of the dependent variable for every spatial unit in the sample (\(i=1,\ldots, N\)), \(\iota _{N}\) is an \(N\times 1\) vector of ones associated with the constant parameter α, X is an \(N\times K\) matrix of exogenous explanatory variables, with the associated parameters β represented in a \(K\times 1\) vector, and \(\varepsilon =(\varepsilon _{1},\ldots ,\varepsilon _{N})^{T}\) is a vector of disturbance terms, where the \(\varepsilon _{i}\) are independently and identically distributed error terms with zero mean and variance \(\delta ^{2}\). W denotes an \(N\times N\) nonnegative matrix describing the spatial arrangement of the units in the sample. The model specifies three main terms: (1) an endogenous interaction effect \(\rho WY\), (2) an exogenous interaction effect \(WX\theta \), and (3) an interaction effect amongst error terms \(\lambda Wu\). They model, respectively, the interplay between the value of the dependent, independent, and error terms in a spatial unit i and the values of the other spatial units. From the general nested model, the configuration of the parameters ρ, θ, λ leads to different spatial model specifications. For example, \(\lambda =0\), which removes the lagged errors, leads to the definition of the Spatial Durbin Model [66, 67] (SDM), which reduces to a Spatial Lag Model [68] (SLM) when \(\theta =0\). In a similar way, nullifying the lagged dependent variable component (\(\rho =0\)) defines the Spatial Durbin Error Model [69] (SDEM) and the simpler Spatial Error Model [68] (SEM) when also \(\theta =0\) (the spatial dependence is modeled via the error term alone). When all the ρ, θ, λ parameters are null, the specification traces back to a standard linear regression.
In this setting, the interpretation of results involves two main components: a direct impact that links the characteristics of a spatial unit with the value of the dependent variable in the same unit, and an indirect impact that models the spillover effects. The spillover effects can be further categorized as local or global. In local spillovers (\(\rho =0\)) the indirect effects are measurable only in the neighboring units, i.e., areas where \(W_{ij}\neq 0\). This leads to the adoption of the SDEM or SEM specification. In contrast, in global spillover effects (\(\rho \neq 0\)) the indirect influence of a spatial unit falls on the entire set of locations, producing high-order effects observable even in spatial units that are not directly connected. This is compatible with the SDM or the SLM specifications.
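For reference, the restrictions discussed above can be written out explicitly. The display below only restates Eq. (1) under each parameter restriction, using the same symbols; the last case is the Spatial Durbin Error Model (SDEM) with \(\rho =0\), for which, as a standard result in the spatial econometrics literature, the direct impact of the k-th covariate is \(\beta _{k}\) and its indirect (local spillover) impact is \(\theta _{k}\).

$$
\begin{aligned}
\text{SDM } (\lambda =0):\quad & Y = \rho WY + \alpha \iota _{N} + X\beta + WX\theta + \varepsilon \\
\text{SLM } (\lambda =0,\ \theta =0):\quad & Y = \rho WY + \alpha \iota _{N} + X\beta + \varepsilon \\
\text{SEM } (\rho =0,\ \theta =0):\quad & Y = \alpha \iota _{N} + X\beta + u, \quad u=\lambda Wu + \varepsilon \\
\text{OLS } (\rho =0,\ \theta =0,\ \lambda =0):\quad & Y = \alpha \iota _{N} + X\beta + \varepsilon \\
\text{SDEM } (\rho =0):\quad & Y = \alpha \iota _{N} + X\beta + WX\theta + u, \quad u=\lambda Wu + \varepsilon
\end{aligned}
$$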

In the literature, different approaches have been adopted to find the most appropriate spatial econometric model specification given an empirical use case. In this work, following the discussion in [70], we use a theoretical rather than data-driven approach to guide the decision. In the case of food choices, it is difficult to form a reasonable argument to include endogenous interaction effects even though they are found statistically. Including endogenous interaction effects would imply that the consumption of a particular food in an area has effects on the entire city, which is difficult to justify. The scenario points to a local spillover specification and, accordingly, we focus our analysis on the SDEM and SEM models. For completeness, we also run the Lagrange Multiplier [71] (LM) diagnostics based on the OLS residuals in their standard and robust forms that are at the core of the methodology described in [72].

For the experiments we use the functions errorsarlm and lagsarlm of the spatialreg package (Footnote 7) in R.

4 Results

To assess the presence of spatial dependence in the outcome, we run a permutation test for the Moran’s I statistic in the case of the educational attainment variables level1 and level4 with the number of random permutations \(n=10\text{,}000\). We observe a strong positive autocorrelation in both cases (\(I_{\mathrm{level}1}=0.7202\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\) and \(I_{\mathrm{level}4}=0.684\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)). The presence of clusters of areas with similar behavior is confirmed by the visual inspection of the spatial distribution of values shown in Fig. 2 and by the Moran scatter plot in Fig. 3. It is worth noting how the two variables show complementary spatial patterns: the areas with a high prevalence of a low-educated population (dark areas in the left map) correspond to the areas with a low presence of high-educated population (yellow areas in the right map) and vice versa. Accordingly, we mainly focus on modeling the high education outcome (level4), and we show how the complementary variable for low education (level1) performs when space allows. To test whether the observed spatial autocorrelation could be explained by the spatial structure of the covariates alone, we first run a linear regression analysis and check for the presence of autocorrelation in the residuals [72]. Fitting performance is evaluated with the Akaike Information Criterion [73] (AIC), the Bayesian Information Criterion [74] (BIC), and Nagelkerke’s pseudo \(R^{2}\) [75] when appropriate.
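A permutation test of this kind can be sketched as follows. The code computes the global Moran's I for a set of values and binary weights, then estimates a one-sided pseudo p-value by randomly reassigning the values to the spatial units. The one-sided form, the default number of permutations, the seed, and the six-unit toy chain are assumptions made for illustration and are not the configuration used in the experiments.

```python
import random


def morans_i(values, weights):
    """Global Moran's I for values {id: x} and weights {id: {neighbor: w}}."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {i: x - mean for i, x in values.items()}
    s0 = sum(w for nbrs in weights.values() for w in nbrs.values())
    cross = sum(
        w * dev[i] * dev[j] for i, nbrs in weights.items() for j, w in nbrs.items()
    )
    denom = sum(d * d for d in dev.values())
    return (n / s0) * cross / denom


def permutation_test(values, weights, n_perm=999, seed=0):
    """Observed Moran's I and a one-sided pseudo p-value (positive autocorrelation)."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    ids = list(values)
    pool = [values[i] for i in ids]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pool)  # break the link between values and locations
        if morans_i(dict(zip(ids, pool)), weights) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Toy chain of six units, each adjacent to its immediate neighbors,
# with values that increase smoothly along the chain
weights = {i: {j: 1 for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
values = {i: float(i + 1) for i in range(6)}

observed, p_value = permutation_test(values, weights)  # observed is about 0.6
```

The pseudo p-value counts the observed arrangement among the permutations, so it can never be exactly zero; with 10,000 permutations the smallest attainable value is roughly 0.0001.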

Figure 2 Spatial distribution of the educational attainment variables

Figure 3 Moran scatter plot: the slope of the linear fit equals the Moran’s I statistic. It shows how neighboring areas behave similarly

We organize the rest of this section in two main parts that follow the same methodological pipeline and that capture the complementary dimensions of food consumption: (1) diet composition, along the dimensions of macronutrients and product categories, and (2) diet diversity.

4.1 Diet composition

Macronutrients

In this section, we explore the interplay between educational attainment and the consumption of different macronutrients. We first fit a linear regression model and observe significant spatial autocorrelation in the residuals (\(I_{\mathrm{level}4}=0.4409\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)) (refer to Additional file 2 for more details). To account for local spillover effects, we estimate the SEM and SDEM models. The spatial specification choice is coherent with the results of the robust Lagrange Multiplier tests, \(\mathrm{LM}^{\mathrm{error}}_{\mathrm{level}4}=154.32\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\) and \(\mathrm{LM}^{\mathrm{lag}}_{\mathrm{level}4}=39.549\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\), which suggest the adoption of a lagged error specification due to the higher value of the corresponding statistic. Figure 4 summarizes the fitting performance of the alternatives tested, identifying SDEM as the best performing model across measures (\(\mathrm{AIC}=490\)). A likelihood ratio test using the function LR.sarlm in the spatialreg package confirms the goodness of the choice of SDEM over SEM (\(\mathrm{LR}=61.077\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\)).
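The fit measures used in this comparison follow directly from the maximized log-likelihood of each model. The sketch below shows the standard definitions of AIC, BIC, and the likelihood ratio statistic for two nested models. The log-likelihoods and parameter counts are hypothetical placeholders, not values from the experiments; only the number of observations (774 areas) is taken from the text.

```python
import math


def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


def likelihood_ratio(log_lik_restricted, log_lik_full):
    """LR statistic for nested models: 2 (ln L_full - ln L_restricted)."""
    return 2 * (log_lik_full - log_lik_restricted)


# Hypothetical fits: a restricted model (e.g. SEM) nested in a fuller one (e.g. SDEM)
N_OBS = 774
ll_restricted, k_restricted = -150.0, 8
ll_full, k_full = -130.0, 14

aic_restricted = aic(ll_restricted, k_restricted)
aic_full = aic(ll_full, k_full)
bic_full = bic(ll_full, k_full, N_OBS)
lr_stat = likelihood_ratio(ll_restricted, ll_full)
```

Under the null hypothesis that the restricted model is adequate, the LR statistic is compared against a chi-square distribution whose degrees of freedom equal the number of restricted parameters, here the difference between the two parameter counts.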

Figure 4 Summary of the fitting performance of the different model specifications in the case of low (level1) and high (level4) educational levels (the lower, the better)

In Fig. 5(a) we present the SDEM regression results for the target variables level1 and level4. In both cases, the parameter λ is highly significant (\(\lambda _{\mathrm{level}1}=0.67\) and \(\lambda _{\mathrm{level}4}=0.71\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)), thus confirming the presence of a strong spatial error lag in the empirical data. The group of variables on the left side defines the direct impact, while the variables on the right side estimate the indirect effect of the neighboring spatial units as defined in Sect. 3. Moreover, we test for the presence of spatial autocorrelation in the residuals, observing respectively a Moran’s \(I_{\mathrm{level}1}=-0.02\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.8\) and \(I_{\mathrm{level}4}=-0.024\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.83\), which show how the SDEM, unlike a standard linear regression, produces residuals without spatial structure, in accordance with the hypothesis of independence. Finally, we observe the presence of heteroscedasticity via a studentized Breusch–Pagan [76] test (\(\mathrm{BP}_{\mathrm{level}4}=69.6\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)). While heteroscedasticity does not affect the estimation of the coefficients, it biases the estimation of their significance; however, since the observed p-values are, in most cases, below \(10^{-3}\), the net effect is not substantial.

Figure 5 Model parameters (variable weights) of the SDEM for the nutrients, categories, and diversity scenarios

Product categories

We explore the dimension of the food categories following the same pipeline described in the previous section. The residuals of a linear regression show a significant spatial autocorrelation (\(I_{\mathrm{level}4}=0.41\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)), and the LM tests are coherent with the current choice of spatial specification (\(\mathrm{LM}^{\mathrm{error}}_{\mathrm{level}4}=112.5\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\) and \(\mathrm{LM}^{\mathrm{lag}}_{\mathrm{level}4}=60.7\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)). SDEM shows the best fitting performance also for the case of food categories (see Fig. 4), in accordance with a likelihood ratio test (\(\mathrm{LR}=59.7\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\)). Figure 5(b) summarizes the direct and indirect impacts for the fitted model, confirming significant spatial effects in the error terms (\(\lambda _{\mathrm{level}1}=0.69\) and \(\lambda _{\mathrm{level}4}=0.7\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)). The application of SDEM produces residuals free from spatial autocorrelation (Moran’s \(I_{\mathrm{level}1}=-0.015\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.71\) and \(I_{\mathrm{level}4}=-0.02\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.77\)). Finally, we observe heteroscedasticity (\(\mathrm{BP}_{\mathrm{level}4}=58\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)), similarly to the macronutrients case.

4.2 Diet variety

Spatial patterns are observed in the residuals of a linear regression (Moran’s \(I_{\mathrm{level}4}=0.53\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)), and the LM tests are coherent with the current choice of spatial specification (\(\mathrm{LM}^{\mathrm{error}}_{\mathrm{level}4}=197.4\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\) and \(\mathrm{LM}^{\mathrm{lag}}_{\mathrm{level}4}=42.3\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\)). Figure 5(c) summarizes the direct and indirect effects in the case of the best performing model, SDEM (see Fig. 4), in accordance with the likelihood ratio test (\(\mathrm{LR}=52.4\), \(\mathtt{p}\mbox{-}\mathtt{value} <0.0001\)). The parameters \(\lambda _{\mathrm{level}1}=0.78\) and \(\lambda _{\mathrm{level}4}=0.78\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.0001\) confirm the presence of significant spatial patterns. We acknowledge the presence of heteroscedasticity (\(\mathrm{BP}_{\mathrm{level}4}=30.05\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.008\)) and the absence of spatial autocorrelation in the residuals of SDEM (Moran’s \(I_{\mathrm{level}1}=-0.03\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.88\) and \(I_{\mathrm{level}4}=-0.03\), \(\mathtt{p}\mbox{-}\mathtt{value}=0.93\)).

4.3 Sensitivity analysis

In this section, we explore the extent to which the observed results are sensitive to changes in the experimental design. First, we focus on a comparative analysis between the baseline model with only the socioeconomic confounds and the complete model including the food choice variables. The interplay between educational attainment and socioeconomic determinants has been extensively studied in previous work and, consistently, we observe that they play a primary role in the predictive framework. However, as shown in Fig. 4, adding the food consumption dimensions does provide a significant improvement in the fitting performance. We measure a 30%, 21%, and 18% reduction of the AIC for the nutrients, food categories, and diet variety cases (level1) and, correspondingly, a 48%, 35%, and 28% reduction in the case of the high education outcome (level4). A consistent behavior is observed in relation to alternative performance metrics, e.g., BIC and Nagelkerke \(R^{2}\), as summarized in Additional file 2. Moreover, it is worth noting that the spatial-aware models consistently outperform the standard linear regression framework by a large margin (on average, we observe an improvement in AIC greater than or equal to 75%), underscoring the benefits of taking into account the geographical structure of the determinants.
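To make the comparison metrics concrete, the sketch below shows how an AIC value, a relative AIC reduction between two models, and a likelihood ratio statistic can be computed from fitted log-likelihoods. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not the log-likelihoods of the models fitted in this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def relative_reduction(aic_baseline, aic_full):
    """Fraction by which the full model lowers the AIC of the baseline model."""
    return (aic_baseline - aic_full) / abs(aic_baseline)


def likelihood_ratio(loglik_restricted, loglik_unrestricted):
    """LR statistic for nested models: twice the gain in log-likelihood."""
    return 2 * (loglik_unrestricted - loglik_restricted)


# Hypothetical log-likelihoods for a baseline model and a model with extra covariates.
baseline_aic = aic(log_likelihood=-195.0, n_params=5)
full_aic = aic(log_likelihood=-141.0, n_params=9)
print(baseline_aic, full_aic, relative_reduction(baseline_aic, full_aic))
print(likelihood_ratio(-195.0, -141.0))
```

Note that a relative reduction is only meaningful when both AIC values are computed on the same data and share the same sign, which is assumed here.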

Second, we focus on the choice of the weighting scheme, which has a central role in a spatial econometric framework [64]. We extend the initial experimental setting, based on a contiguity approach, with different distance-based spatial arrangement methods: the k-nearest neighbors (k-nn), in which each spatial unit is connected to a fixed number k of closest neighbors, and a class of kernel functions with adaptive bandwidth (gaussian, quadratic, triangular, quartic, and uniform). For simplicity, we present the nutrients and level4 case; similar results apply to the other scenarios. In the case of the nearest neighbors approach, we explore the range \(k \in [3,15]\), obtaining the best performing model with \(k=6\) (see Fig. 6(A)) and an overall performance that is slightly lower than the contiguity case (\(\mathrm{AIC}=278\)). We present the full results for the best performing model with \(k=6\) in Additional file 2, showing how the learned relations are stable and change only partially in strength. Switching to the kernel weights approach, we study the behavior of different classes of kernel functions, exploring a bandwidth size within the same range as the k-nn scenario. Figure 6(B)–(F) summarizes the observed performance curves, showing similar results across methods. An extensive comparison between kernel functions is out of the scope of the paper; however, the best performing kernel setting is the triangular function with a bandwidth size equal to 9 (\(\mathrm{AIC}=257\)), which reaches a very similar output to the contiguity case (\(\mathrm{AIC}=260\)). These results confirm the stability of the learned relations across a wide range of spatial arrangements.
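The two families of distance-based weights described above can be sketched as follows. This is a minimal illustration under stated assumptions: spatial units are represented by hypothetical centroid coordinates, the k-nn weights are row-standardized, and the adaptive bandwidth of each unit is taken to be the distance to its k-th nearest neighbor. The implementation used for the reported experiments may define these details differently.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix between centroids."""
    return [[math.dist(p, q) for q in points] for p in points]


def knn_weights(points, k):
    """Row-standardized weights linking each unit to its k closest neighbors."""
    dist = pairwise_distances(points)
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights


def triangular_kernel_weights(points, k):
    """Triangular kernel with adaptive bandwidth (assumes distinct centroids).

    The bandwidth of unit i is the distance to its k-th nearest neighbor, so the
    weight decays linearly from 1 at distance 0 to 0 at the bandwidth.
    """
    dist = pairwise_distances(points)
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        bandwidth = sorted(dist[i][j] for j in range(n) if j != i)[k - 1]
        for j in range(n):
            if j != i and dist[i][j] <= bandwidth:
                weights[i][j] = 1.0 - dist[i][j] / bandwidth
    return weights


# Hypothetical centroids of four areas lying on a line.
centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
print(knn_weights(centroids, k=2)[0])
print(triangular_kernel_weights(centroids, k=2)[0])
```

With this definition of the bandwidth, the k-th neighbor itself receives a weight of zero under the triangular kernel; some implementations inflate the bandwidth slightly to avoid this.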

Figure 6. AIC fitting performance under different weighting schemes

5 Discussion

The first dimension of food choices that we explore is related to nutrient consumption. First, we focus on the direct impacts that model the effect of the intrinsic characteristics of a spatial unit on the educational attainment variable. As shown in Fig. 5(a), we observe that a low educational level is connected to diets high in carbohydrates [55], including sugar. Conversely, areas with a predominance of highly educated residents show a higher consumption of fibers, which provide a range of important health benefits, particularly in preventing heart and cardiovascular diseases, stroke, hypertension, diabetes, obesity, and some gastrointestinal pathologies [77–79]. Diets higher in fat density have been associated with lower education in several studies [51–53], which is consistent with the empirical measure of rank correlation observed in our scenario (\(\rho _{\mathrm{level}4}^{\mathrm{fats}}=-0.36\) and \(\rho _{\mathrm{level}1}^{\mathrm{fats}}=0.37\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)). However, in a multivariate setting and discounting for the presence of the other predictors, the fat variable, due to its strong interplay with the carbohydrates variable (\(\rho _{\mathrm{fats}}^{\mathrm{carbs}}=0.57\)), appears to have a weak positive effect on the high education outcome that could be due to the presence of multicollinearity or misspecification of the model [80]. Finally, we observe a direct association with protein consumption that is not statistically significant; this could be related to the diverse set of protein sources, which have been associated in the literature with different health outcomes, socioeconomic factors, and environmental impacts. In fact, protein supply comes from both plant sources, e.g., legumes, soya, nuts and seeds, and animal sources, e.g., fish and seafood, poultry, pork and beef, and derivatives of milk such as dairy products.
These products have been associated with low and high educational levels depending on their relation with a healthy diet; in this scenario, a definitive association is hard to pinpoint. When focusing on the spatial spillover effects, we do not observe significant neighboring effects for fiber and fat, while carbohydrates again show the strongest effect, which underlines how the neighboring units affect the model’s decision in the same direction as the observed direct effects. For the case of protein, the results show an indirect negative impact on the level4 variable. We remark that the spatial autoregressive model does not necessarily capture a causal relationship, i.e., we do not assume that food choices have a causal effect on educational attainment. Therefore, the interpretation of the indirect effects needs some care, as they represent a spatial spillover of the covariates and the dependent variable, rather than an actual influence of the neighboring units. The direction of these spillover effects is in almost all cases in accordance with the direct effects, which is a confirmation of the robustness of the results. These spillovers may be caused by several modeling factors, including the granularity of the spatial discretization, the specific choice of neighborhood function, and the arbitrariness of the spatial borders.
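For reference, the distinction between direct and indirect impacts can be read off a spatial Durbin error model written in the notation that is standard in the spatial econometrics literature. The symbols \(y\), \(X\), \(W\), \(\beta\), \(\theta\), \(u\), and \(\varepsilon\) are assumptions introduced here for illustration and may differ from the notation adopted elsewhere in the paper; only \(\lambda\) appears in the reported results.

\[
y = X\beta + WX\theta + u, \qquad u = \lambda W u + \varepsilon
\]

Under this specification, the direct impact of the \(k\)-th covariate is \(\beta _{k}\), its indirect (spillover) impact is \(\theta _{k}\), and \(\lambda\) governs only the spatial dependence of the error term.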

Switching now to food categories, there is a vast literature discussing the interplay between food choices and socioeconomic factors and, to a lesser extent, cultural capital. A high intake of fruit and vegetables is one of the cornerstones of a healthy diet, and has been recommended to the general public to reduce the risk of cardiovascular disease, coronary heart disease, and stroke [81]. Consistent with previous work, we observe a positive association between high educational attainment and consumption of vegetables and fruits [50, 55, 82, 83]. The prevalence of sweets and, in general, products high in density of sugar, is evident in communities with a lower educational level [50], an effect that, similarly to the case of carbohydrates, is the strongest in intensity in our experimental scenario. Focusing on animal-based products, it is interesting to note the different behavior between the variables fish and red meats. While highly educated people are less likely to be regular consumers of several meat products [21] (\(\beta =-0.114\)), in particular for the case of processed meats, they tend to consume more fish [56] (\(\beta =0.076\)), including seafood. Moreover, fish [84] and seafood [85] are an excellent source of protein and provide a range of benefits for major health outcomes among adults, even when taking into account that the presence of contaminants such as mercury in polluted natural environments could pose potential risks [84]. We observe non-significant relations for the readymade and oils categories. In the first case, consumption of readymade and ultra-processed food products has been associated with the lower educational strata [54]; however, even if the pairwise correlation follows this tendency (\(\rho _{\mathrm{level}4}^{\mathrm{readymade}}=-0.31\) and \(\rho _{\mathrm{level}1}^{\mathrm{readymade}}=0.45\), \(\mathtt{p}\mbox{-}\mathtt{value}<0.001\)), when adjusting for confounders the relation becomes not significant.
Moreover, the tendency to create healthy versions of readymade food products to target singles, small households, or professionals, who tend to have a higher level of education, is a factor to take into consideration when exploring this dimension to its full extent. Consistent with the difficulty of characterizing fat consumption in the case of nutrients, the heterogeneity of the oils category does not allow us to draw a significant picture. In the case of level4, the contribution of the spatial spillover effects is relevant for the dimensions of sweets, red meats, and produce, thus indicating the importance of considering the influence of neighboring areas and the community effects. Indirect impacts are moderately significant for the readymade category in the expected direction.

Education is a valuable variable to consider as a proxy for consumers’ dietary knowledge and ability to process nutritional information. As a consequence, a solid body of research associates more educated subjects with an awareness of the importance of a balanced diet [21, 86, 87]. Consistently, we observe that the variety of nutrients h_nutrients is positively (\(\beta _{\mathrm{level}4}=0.2\)) associated with a higher level of educational attainment. It is worth noting that the variety of products h_items, which reflects the number of unique purchased products, follows the opposite trend (\(\beta _{\mathrm{level}4}=-0.13\)). This result means that, even though low-educated customers seem to select amongst a wider range of products, their nutritional variety is limited. Moreover, highly educated people tend to have a smaller caloric footprint (\(\beta _{\mathrm{level}4}=-0.08\)) and to consume a smaller quantity of food products in weight (\(\beta _{\mathrm{level}4}=-0.07\)), which is consistent with previous studies linking this observation to a lower incidence of obesity and a lower average Body Mass Index (BMI) [88]. The variables energy and weight show significant or moderately significant spatial spillovers.
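The contrast between product variety and nutritional variety can be illustrated with a small sketch. It assumes that variety is measured as a normalized Shannon entropy over shares, which the h_ prefix of the variables suggests but which is an assumption here; the function name and the shares are hypothetical and do not come from the dataset.

```python
import math


def shannon_entropy(shares, normalize=True):
    """Shannon entropy (base 2) of non-negative shares, optionally scaled to [0, 1]."""
    total = sum(shares)
    probs = [s / total for s in shares if s > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    if normalize and len(shares) > 1:
        entropy /= math.log2(len(shares))  # maximum entropy for this many classes
    return entropy


# Hypothetical area: purchases spread evenly over many product types ...
item_shares = [1, 1, 1, 1, 1, 1, 1, 1]
# ... but the resulting nutrient mix is dominated by a single nutrient.
nutrient_shares = [0.70, 0.15, 0.10, 0.05]

print(shannon_entropy(item_shares))      # high product variety
print(shannon_entropy(nutrient_shares))  # lower nutritional variety
```

The example shows that a basket can be evenly spread over many products while remaining concentrated in a few nutrients, which is the pattern described above.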

Further exploring the causal pathways of the relationships will help in designing effective interventions for improving health outcomes and dietary behavior in particular. For instance, Chandola et al. [89] examined six hypothetical pathways linking education and health via structural equation modeling. Since they found a combination of mechanisms to be involved, they conclude that “improvements in population educational attainment may not automatically lead to improvements in population health”. Such pathways may also involve access to outside resources including nutritional and health services. When examining the connection between education and health outcomes of an older Japanese cohort, Oshio [90] finds regular health check-ups to be one of the primary mediators. Further, previous studies have shown [91, 92] that health literacy in particular may be an important factor mediating the relationship between educational attainment and many health behaviors and outcomes, such as physical inactivity, diet choices, and obesity. Ongoing policy efforts are already attempting to incorporate health education and related services into educational environments, such as the “Whole School, Whole Community, Whole Child” (WSCC) approach developed by the U.S. Centers for Disease Control and Prevention (CDC) [93] and the “Skilled for Health” initiative led by the U.K.’s National Health Service (NHS) [94]. The long-term impact of such interventions within communities will be made clearer by using data-driven, anonymous monitoring of nutritional behaviors of large cohorts, such as the one presented in this work.

Limitations

There are a few limitations and open points that should be mentioned:

  • Our study is based upon the dataset provided in [33], which aggregates the purchasing history of customers of a specific retailer who have opted for a loyalty card. Even though the authors provide a representativeness score and some empirical evaluation of the biases introduced in their study, the user base is not exempt from sample bias. The average customer might be more likely to represent a specific level of educational attainment and, by reflection, specific age and income profiles. For example, a student who is in the process of obtaining a degree might be less willing to sign up for a loyalty card and, therefore, less likely to be accounted for in the study. This could lead to a biased estimation of the interplay between education and food choices; however, the large-scale nature of the dataset and the extensive adoption of the Tesco loyalty program [95] by its customers might reduce this effect. Moreover, food choices are aggregated at the level of administrative units, which does not enable the characterization of the dietary habits of individuals and models the average behavior in a geographical area instead.

  • There is no consensus in food and nutrition research on how to group food products in coherent categories [62], leaving the choice to the specific use case. However, the heterogeneity of food products in a group can be very high, smoothing out the intrinsic differences in health outcomes, affordability, or sociodemographic adoption determinants such as education or gender. For example, proteins are not all the same: there is a wide variety of protein-providing foods that differ greatly in source, organoleptic properties, connection with health outcomes, impact on the environment and sustainability, and ethical concerns. Not being able to control the aggregation schema limits the applicability of a hypothesis-driven approach where the group formation is guided by the research question under study.

  • Educational attainment is only one aspect of the broader concept of cultural capital (institutionalized cultural capital) and, in general, of the cultural substrate that has arguably been identified as crucial when it comes to modeling food choices. To better capture demographic variations, other measures should be considered. For instance, Rohit et al. [96] show that, for older African Americans, reading level may be a better predictor of baseline neurocognitive status than the years of schooling (possibly due to different quality of schooling available to students of different races). Taking into consideration other measures of cultural capital in general, and cognitive development in particular, may reduce observed racial disparities [97].

  • The tight interplay between food variables in characterizing dietary habits and the complex construct of socioeconomic determinants give rise to multicollinearity effects in the explanatory variables. Even though some researchers seem to assume that different socioeconomic indicators reflect the same underlying information and can therefore be used interchangeably [18] (typically, correlations between education, occupation, and income are weak to moderate with magnitude in the range 0.3–0.6 in developed countries [98]), we embrace the evidence from several previous studies that shows the unique contribution of each indicator. Moreover, the relevance of a specific indicator might differ between subgroups of the population, such as between adults and adolescents.

  • As pointed out in previous work, the process that underlies food choices is multifaceted and involves a broad range of dimensions that are not fully captured in our study. These omitted variables are left out partially due to the unavailability of location-aware data with compatible spatial and temporal scales. Moreover, we aim at testing a specific set of hypotheses rather than exploring a wider spectrum of determinants. It is worth noting that these unobserved confounding factors could potentially explain away the link between education and diet; however, we rely on the extensive literature that explores the extent of this relation to corroborate our findings. We speculate that ethnicity, religion, or the dimensions that are part of the Index of Multiple Deprivation (IMD; see Footnote 8) could play a role in this direction.

6 Conclusion

The interplay between educational attainment and food choices has been the subject of a wide body of literature, especially for its important ramifications in the public health domain. In this work, we explored the interplay between institutionalized cultural capital, a form of cultural capital theorized by the sociologist Pierre Bourdieu, and dietary choices, showing how education plays an important role beyond socioeconomic determination. To this end, we adopted an anonymized large-scale record of food purchases in a major grocery store chain in the Greater London area to quantitatively model food consumption across the three complementary dimensions of macronutrients, food categories, and diet variety. Purchases were geographically aggregated at the level of fine-grained administrative areas (MSOA) and, unlike most previous work, we explored this relation in space with the adoption of spatial autoregressive models that aim at capturing the direct and indirect impacts that the spatial dependence induces. We observed that highly educated areas tend to follow a healthier and more diverse diet, characterized by a higher consumption of fibers, fruits, vegetables, and fish products, along with a more balanced and diversified nutritional intake. On the contrary, a low educational attainment is generally connected to diets high in carbohydrates, sweets, and red meats, as well as to a higher caloric intake and average portion size. These relations are consistent with the findings emerging from the literature, and they make it possible to map the behavior of localized communities with an unprecedented spatial granularity, enabling the design of health policies and interventions that better adhere to the social, economic, and cultural contexts of a place.