1 Introduction

Unemployment is an issue currently faced by the vast majority of economies, and it is a central topic in studies carried out by economists and forecasters. Analyses typically offer explanations, consequences and possible solutions to the problem by means of models that simplify a complex reality.

Large numbers of jobless workers face constraints that generate problems of a macroeconomic nature, such as a decrease in consumption and investment which, eventually, affects GDP. Moreover, unemployment is also related to welfare problems such as inequality and social exclusion. For these reasons, at least, it is of utmost importance to correctly predict and evaluate unemployment in order to monitor its evolution, anticipate trend shifts, and design pro-employment policies.

Spain is a country with a high unemployment level compared with its peers, peaking at 5 million registered unemployed workers during the 2013 recession. For the purpose of this study, we use the official figures provided by the Spanish Public Employment Service (SEPE).Footnote 1 Unemployment data are typically released with a certain delay, which means that leading, or coincident, indicators are useful to anticipate its evolution and to improve its forecasts (see, e.g., Stock and Watson 1993, for details on leading indicators).

With this in mind, the aim of this work is to propose some simple alternatives to univariate models for predicting Spanish unemployment. We search for models that include additional up-to-date information, freely available to everyone, and we look for this information in Internet search engines. These applications contain a large amount of information, available almost instantaneously, and reveal many aspects of individuals’ preferences through their search histories. In this paper, without loss of generality, we focus on searches in Google. More specifically, we use one of its tools, known as Google Trends (GT). Our hypothesis is that, by using updated search indices obtained from GT, there is a large margin to improve the predictions of Spanish unemployment provided by a suitable univariate model.

However, any forecaster will soon discover that GT is not a panacea. As we discuss in the next sections, some non-trivial decisions must be made when trying to optimize the information gathered from GT. This issue is treated here in an application to Spanish unemployment forecasting, although the procedures suggested could be applied in other contexts.

By means of a recursive forecasting exercise, we find that a SARIMA model augmented with GT queries, applied to the Spanish unemployment series and compared with a univariate benchmark model, yields a statistically significant improvement in forecasting accuracy that ranges from 10 to 25%. This gain depends on the way the GT information is treated, with Principal Component Analysis (PCA) or Forward Stepwise Selection (FSS), and is robust to the choices that affect the results of the forecasting exercise. In our application, FSS outperforms PCA.

The paper is organized as follows. Section 2 reviews the literature on the use of GT data as explanatory variables, focusing on unemployment applications. Section 3 details the data employed in the analysis, paying particular attention to the GT queries and how they are generated and obtained. Section 4 presents the benchmark model, the proposed alternatives and their relation to other common methods in the literature. The latter are based on data reduction methods, which are introduced in Sect. 5. Section 6 compares the forecasting results of the proposed models relative to the benchmark, and Sect. 7 analyzes the robustness of the previous results. The last section highlights the main findings of the paper.

2 Background and literature

This line of research began in 2004 and has been gaining popularity since then, boosted by the increasing use of the Internet worldwide. Johnson et al. (2004) were the first researchers to exploit this information source. The authors analyze the relationship between accesses to health-related pages and searches for flu symptoms, on the one hand, and the cases reported by the U.S. Centers for Disease Control and Prevention, on the other. Also working on Google searches related to the flu, Eysenbach (2006) pioneered the inclusion of Google search data to improve forecasts. Similarly, Ginsberg et al. (2009) studied the benefits of using Google searches to estimate outbreaks of influenza in the USA. The result was a tool for estimating and forecasting illnesses, known as Google Flu Trends. A major contribution of all these studies is the shift from benchmark models, based on seriously delayed data, to models based on immediately available Google query results.

The first researchers to examine the economic variables that can be related to these Internet searches were Choi and Varian (2009, 2012). Their hypothesis is that Internet searches can be related to certain user preferences since, before making a decision (such as buying a car or looking for a job), many consumers carry out a prior Internet search. In their 2012 work, they use different GT categories related to unemployment to build an indicator for estimating the level of unemployment in real time, avoiding the delay incurred by the official figures. Likewise, Askitas and Zimmermann (2009), building on Ginsberg et al. (2009), innovate on the search for GT terms to obtain an indicator to predict unemployment. Around the same time, Francesco D’Amuri worked intensely in this field. D’Amuri (2009) analyzes how Google searches forecast unemployment in Italy, paying special attention to the potential selection bias in favor of young job seekers, who are the greatest consumers of this tool. D’Amuri and Marcucci (2009) show the improvement in unemployment forecasts for the USA when using an index generated by GT searches. Finally, D’Amuri and Marcucci (2017) revisit the theory of the previous work, incorporate the effects of the 2008 financial recession and disaggregate the GT searches at the state level. To sum up, all these works highlight the importance of including GT data for estimating unemployment levels. Two very recent works for the USA with similar conclusions are Nagao et al. (2019) and Borup and Schütte (2020). The latter deserves more attention as it is likely the paper closest to ours. Contrary to most of the literature, the authors work with a large dataset of GT queries and use dimension reduction techniques (particularly soft-thresholding) to estimate employment models with random forest methods. Our paper differs from theirs in the queries, the samples, the dimension reduction methods applied (PCA and a suggested FSS), the endogenous variable, the benchmark model and the inclusion of a thorough robustness exercise.

On the other hand, the papers by Fondeur and Karamé (2013) and Naccarato et al. (2018) also analyze unemployment by means of GT queries, but they focus, in particular, on youth unemployment in France and Italy, respectively. As far as we know, only Vicente et al. (2015) deal with Spanish unemployment using the GT approach. However, their paper models and predicts unemployment with only two GT queries plus a confidence indicator. As a result, they do not face the dimension reduction problem. Additionally, their forecasting horizon spans only 12 periods, and they do not vary the sample, which could make their conclusions sample-dependent.

Moreover, the use of GT queries and Internet searches in general as tools for modeling and forecasting has spread to other economic fields, such as tourism (Pavlicek and Kristoufek 2015; Siliverstovs and Wochner 2018), inflation and GDP (Woo and Owen 2019; Niesert et al. 2020; Poza and Monge 2020), and even oil consumption (Yu et al. 2019).

Recently, two opposing approaches have emerged regarding how this source of information should be used. While most authors advocate the use of a few queries to reduce the noise in the analysis, see D’Amuri (2009), Fondeur and Karamé (2013), Vozlyublennaia (2014), D’Amuri and Marcucci (2017), Naccarato et al. (2018) or Yu et al. (2019), others favor the use of many queries, see Pan et al. (2012), Li et al. (2017) and Borup and Schütte (2020). From our viewpoint, the use of GT information to improve models and their forecasts currently faces two problems: (1) which queries are the most suitable to extract the most informative series; and (2) how to compress and filter this (sometimes huge) amount of information. Although both issues are related, our paper attempts to shed some light on the second one by applying two data reduction methods to a large number of GT query results.

3 Data

This section details both the unemployment data used as the endogenous variable and the GT queries employed as potential explanatory variables.

3.1 Unemployment data

The unemployment series used in the paper is provided by the Spanish State Employment Service (SEPE 2019). It is released monthly during the first week of the following month and represents the number of people registered as looking for a job at a public employment office. The sample extends from January 2004 to September 2018, so that it covers business cycle expansions and recessions, with a total of 177 monthly observations.Footnote 2

3.2 Google Trends (GT)

Google is the most used search engine on the planet. According to NetMarketShare (2019), the Google search engine had, in December 2018, a 77.1% share on desktop computers and an 85.8% share on mobile devices. For this reason, GT represents a reliable estimate of the searches made on the Internet.

GT is a search trends feature that shows how frequently a given search term is entered into Google’s search engine, relative to the site’s total search volume, over a given period of time. Google launched this tool in May 2006 and released an extension called Google Insights for Search in August 2008. In 2012, both tools were merged to create the current version of GT, which is the one employed in this paper (Google 2020b).

Mathematically, let \(n(q,l,t)\) be the number of searches for the query q made in location l during period t. The relative popularity (RP) of the query is then expressed as:

$$\begin{aligned} \hbox {RP}_{(q,l,t)}=\dfrac{n(q,l,t)}{\sum _{q' \in Q(l,t)}n(q',l,t)} \times \Pi _{\big (n(q,l,t)>\tau \big )}, \end{aligned}$$
(1)

where \(Q(l,t)\) is the set of all queries made from l during t and \(\Pi _{\big (n(q,l,t)>\tau \big )}\) is a dummy variable whose value is 1 when the query is sufficiently popular (the absolute number of searches \(n(q,l,t)\) exceeds \(\tau \)) and 0 otherwise. The resulting numbers are then scaled to a 0–100 range according to the proportion of the topic with respect to all search topics. The GT index is thus defined as:

$$\begin{aligned} \hbox {IGT}_{(q,l,t)}=\frac{\hbox {RP}_{(q,l,t)}}{\max _{t' \in \{1,2,\ldots ,T\}}\hbox {RP}_{(q,l,t')}}\times 100. \end{aligned}$$
(2)

These indices can be obtained from January 1st, 2004, up to 36 h prior to the request. GT excludes searches conducted by very few users and shows only the topics of popular searches, assigning a zero to terms with a low search volume. In addition, searches performed repeatedly from the same machine within a short time period are removed. Finally, queries containing apostrophes and other special characters are filtered out.
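To make Eqs. (1)–(2) concrete, the following minimal sketch reproduces the index construction for a fixed location, assuming a hypothetical DataFrame `counts` of absolute search counts \(n(q,l,t)\) with one column per query and one row per period; the threshold `TAU` is an assumption, since Google does not disclose its value, and in practice Google normalizes by the total search volume rather than by the queries at hand, so this is purely illustrative.

```python
import pandas as pd

TAU = 50  # assumed popularity threshold; Google does not publish tau

def gt_index(counts: pd.DataFrame, tau: float = TAU) -> pd.DataFrame:
    """Illustrative reconstruction of Eqs. (1)-(2) for a fixed location l."""
    # Eq. (1): relative popularity, set to zero when the query is not popular enough
    rp = counts.div(counts.sum(axis=1), axis=0)
    rp = rp.where(counts > tau, 0.0)
    # Eq. (2): rescale each query by its maximum over t = 1, ..., T, to a 0-100 range
    return 100 * rp / rp.max(axis=0)
```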

We have collected 200 job-related queries between January 2004 and September 2018. The method used to choose these terms deserves some explanation. We divide the search terms into four groups: (1) queries related to leading job search applications (e.g., Infojobs, Jobday, LinkedIn); (2) searches related to Spanish unemployment centers, whether online, physical, public or private (e.g., Employment office, SEPE, Randstad); (3) queries related to standard job searching terms (e.g., Job offers, How to Find a Job, How to Find Work); and, finally, (4) searches directly related to the companies that generate most employment in Spain (e.g., work in Inditex, Carrefour work, Santander job). To complement these queries we also use the GT tool called ‘related searches’ (see Google 2020a), which allows us to download the queries made by users in connection with the previous terms.

From the 200 queries initially proposed, we finally obtained data for 163 of them, as certain searches do not meet the conditions required by the GT index.Footnote 3

4 Benchmark model and proposed alternatives

We follow the Box and Jenkins (1976) ARIMA methodology to obtain our benchmark model. The univariate monthly time series model considered is:

$$\begin{aligned} \Phi _P(B^{s})\phi _p(B)\nabla ^d\nabla ^D_{s}u_{t}=\mu + \Theta _Q(B^{s})\theta _q(B)a_{t}, \end{aligned}$$
(3)

where \(\phi _p (B) = 1-\phi _1 B -\cdots - \phi _p B^p\) and \(\theta _q (B) = 1 - \theta _1 B -\cdots - \theta _q B^q\) are polynomials in B of degrees p and q, respectively, while \(\Phi _P (B^s) = 1-\Phi _1 B^s -\cdots - \Phi _P B^{sP}\) and \(\Theta _Q (B^s) = 1 - \Theta _1 B^s -\cdots - \Theta _Q B^{sQ}\) are polynomials in \(B^s\) of degrees P and Q, respectively, and s is the seasonal frequency (\(s=12\) in our case). Moreover, \(\mu \) is a constant, B is the lag operator such that \(Bu_t=u_{t-1}\), \(\nabla = (1-B)\) is the difference operator and \({a_t}\) is a sequence of uncorrelated Gaussian variates with mean zero and variance \(\sigma ^2_a\). To meet the traditional Box and Jenkins modelling requirements of stationarity and invertibility, we assume that all the zeros of the polynomials in B and \(B^s\) lie outside the unit circle and that the polynomials have no common factors. This is often called the Seasonal AutoRegressive Integrated Moving Average (SARIMA) form of the stochastic process \(u_t\).

The identification using common tools (graphs, autocorrelation and partial autocorrelation functions, and unit root tests) leads us to a SARIMA\((2,1,1)\times (0,1,1)_{12}\) model. However, the residuals do not seem to represent a Gaussian white noise process due to an influential outlier in 2008. This is not surprising, as this date corresponds to the beginning of the global financial crisis, which hit Spanish unemployment hard.Footnote 4 In order to model this outlier we include a step dummy variable defined as \(\xi ^{08/03}=1\) when \(t<2008/03\) and \(\xi ^{08/03}=0\) otherwise.

The final model is presented in Eqs. (4a)–(4b); its residuals do not show any sign of misspecification and are now compatible with the statistical assumptions on \({a_t}\):

$$\begin{aligned}&u_{t}= \omega _{0}\xi ^{08/03} + \eta _{t}; \end{aligned}$$
(4a)
$$\begin{aligned}&(1- \phi _1B -\phi _2B^{2})\nabla \nabla _{12}\eta _{t}=(1-\Theta _1B^{12})a_{t}. \end{aligned}$$
(4b)

We will use this model as the benchmark in the forecasting exercises of Sects. 6 and 7.Footnote 5 Although it is not the purpose of the analysis, some theoretical implications can be drawn from the empirical identification of this model. As the nonstationarity tests do not reject the unit root hypothesis, the hysteresis theory (see Blanchard and Summers 1987), which states that shocks to unemployment persist because of the rigidity of the labor market, cannot be rejected either. This result is in line with the analyses, also for Spain, by Romero-Avila and Usabiaga (2007) and Cheng et al. (2014).
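As an illustration, the benchmark model (4a)–(4b) can be estimated as a regression with SARIMA errors. The following minimal sketch assumes `u` is a pandas Series of monthly registered unemployment indexed by date; the variable names are ours, not the authors’ code.

```python
import statsmodels.api as sm

# step dummy xi^{08/03}: 1 before 2008/03 and 0 afterwards, as in Eq. (4a)
xi = (u.index < "2008-03-01").astype(float)

# SARIMA(2,1,0)x(0,1,1)_12 noise, Eq. (4b), with the step dummy as exogenous regressor
bench = sm.tsa.SARIMAX(u, exog=xi, order=(2, 1, 0), seasonal_order=(0, 1, 1, 12))
bench_fit = bench.fit(disp=False)
print(bench_fit.summary())   # omega_0, phi_1, phi_2, Theta_1 and residual diagnostics
```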

The alternative models are built on top of the benchmark. We propose to include additional explanatory series in Eq. (4a) and keep the ARMA noise structure of Eq. (4b), as long as the statistical diagnostics do not reveal any sign of misspecification. Therefore, the proposed alternative models can be represented by the transfer function:

$$\begin{aligned}&u_{t}=\omega _{0}\xi ^{08/03} + \sum _{i=1}^{I}\beta _{i}x_{it} + \eta _{t}; \end{aligned}$$
(5a)
$$\begin{aligned}&(1- \phi _1B -\phi _2B^{2})\nabla \nabla _{12}\eta _{t}=(1-\Theta _1B^{12})a_{t}, \end{aligned}$$
(5b)

where the exogenous variables \(x_{it}\), \(i=1, 2, \ldots , I\), depend on the two methods proposed to summarize the large amount of information downloaded from GT. These two alternatives are detailed in the next section. The estimates for the benchmark model can be found in Table 2, for \(I = 0\). As expected, the value of \({\hat{\omega }}_{0}\) is negative and highly significant, which implies that the financial crisis yielded a permanent increase in the Spanish unemployment level of 79,770 people. The estimates of the ARMA parameters are also presented in Table 2, along with those of the alternative models.Footnote 6
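A sketch of the transfer-function form (5a)–(5b), continuing the previous block: the GT regressors (principal components or selected queries, gathered here in a hypothetical DataFrame `gt`) simply enter the exogenous part, so the \(\beta \)s are estimated jointly with the SARIMA parameters.

```python
import numpy as np
import statsmodels.api as sm

X = np.column_stack([xi] + [gt[c].to_numpy() for c in gt.columns])  # [xi, x_1t, ..., x_It]
alt = sm.tsa.SARIMAX(u, exog=X, order=(2, 1, 0), seasonal_order=(0, 1, 1, 12))
alt_fit = alt.fit(disp=False)
print(alt_fit.aic, alt_fit.params)
```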

5 Data reduction

There are basically two groups of methods to overcome the curse of dimensionality arising from the use of many GT query results. The first exploits the redundant information in the data and creates a smaller set of new variables, each a combination of the original ones, which retains most of the information originally contained. These techniques are usually known as dimensionality reduction methods; see Van Der Maaten et al. (2009) for a complete survey. The second encompasses procedures that drop the less relevant variables from the original dataset and keep the most explanatory ones. This is often called variable (or feature) selection (see, e.g., Guyon and Elisseeff 2003).

This section presents the two methods (one from each of the previous groups) used to compare the forecasting performance for Spanish unemployment after reducing the amount of information obtained via GT. First, we briefly describe Principal Component Analysis (PCA), one of the most widely used dimensionality reduction methods. Second, we propose a Forward Stepwise Selection (FSS) algorithm adapted to our problem.

5.1 Principal component analysis

PCA is one of the most popular algorithms for dimensionality reduction. The reader unfamiliar with this procedure may consult Jolliffe (2002).

Broadly speaking, given the set of GT query results (which is 163-dimensional), PCA is the standard technique for finding the best subspace of a lower dimension, I, in a least-squares sense. The first principal component is the direction that minimizes the distance between the data and their projection onto it (equivalently, the direction of maximum variance). The second principal component is chosen in the same way, but must be uncorrelated with (i.e., orthogonal to) the first one, and so on.

In our case we compute the first 10 principal components, which accumulate around 70% of the total variance of the GT series. Interestingly, the first two components explain close to 50% of the total variance. We stop at component 10 in an attempt to capture more information, even though from the third component onward the marginal contribution to total variance is quite low; see Fig. 1.
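A minimal PCA sketch, assuming the 163 GT series are collected in a DataFrame `gt_all` aligned with the unemployment sample; standardizing the series before extracting the components is our (assumed) preprocessing choice.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Z = StandardScaler().fit_transform(gt_all)        # standardize the 163 GT series
pca = PCA(n_components=10)
pcs = pca.fit_transform(Z)                        # first 10 principal components
print(pca.explained_variance_ratio_.cumsum())     # ~0.5 after 2 and ~0.7 after 10 components
```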

Fig. 1 PCA analysis of the 163 GT queries

As is often the case in applications with many variables (i.e., of high dimension), the principal components obtained are difficult to interpret, as they are linear combinations of many variables. However, we will try to give some insight into the first three principal components by looking at their correlations with the original queries. The first principal component is positively and highly (linearly) related to job-search apps (e.g., LinkedIn, Indeed, Milanuncios), unemployment centers (e.g., Randstad, SEPE) and some, but not many, queries related to companies (such as work for Decathlon or Lidl). The second principal component is mostly related to queries seeking jobs in particular firms (such as hiring in Carrefour, Eulen, CaixaBank, Mercadona, etc). Finally, the third principal component mostly represents queries related to standard job searching terms (e.g., employment, public employment, work, job vacancies, etc).Footnote 7

The first alternative to the benchmark model consists of including the previous principal components as the explanatory variables \(x_{it}\) in Eq. (5a). This means that \(x_{it}\) is the ith principal component, \(i=1, 2, \ldots , I\) with \(I=1,2, \ldots ,10\), calculated from the set of variables obtained from GT (\(N=163\)). As some readers will notice, this method is similar to Principal Component Regression (PCR). In PCR, the principal components of the explanatory variables are used as regressors. In particular, as we do here, one often uses only a few principal components, making PCR a shrinkage procedure. The main advantage of our proposal with respect to the original PCR is that we additionally incorporate a model for the noise. Certainly, Eq. (5a) can be considered a PCR when \(x_{it}\), \(i=1,\ldots ,I\), is a subset of the previously calculated principal components. However, our proposal also includes Eq. (5b) in the model, so that \(\omega _0\) and the \(\beta \)s can be estimated jointly with \(\phi _1,\phi _2\) and \(\Theta _1\), which capture the remaining autocorrelation of the residuals.
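In practice, the PCA-based alternative amounts to plugging the first I components obtained above into the exogenous part of the transfer-function sketch of Sect. 4; a hypothetical illustration for \(I=3\):

```python
import pandas as pd

I = 3                                             # e.g., three principal components
gt = pd.DataFrame(pcs[:, :I], index=u.index,
                  columns=[f"PC{j + 1}" for j in range(I)])
# `gt` now plays the role of the GT regressors in the earlier SARIMAX sketch, so the
# betas of Eq. (5a) and the SARIMA parameters of Eq. (5b) are estimated jointly,
# unlike a plain two-step PCR.
```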

5.2 Forward stepwise selection

We now propose an alternative model based on an FSS method. As before, we start with the original set of 163 queries. The process consists of estimating Model (5a)–(5b) with one potential explanatory variable, without lags, in Eq. (5a). We do this for each variable in our set of 163 series, so a model is estimated for each variable. Once the estimation loop is finished, we sort the models by the AIC criterion.Footnote 8 This allows us to choose the best model among all the estimated ones, under this criterion. Next, we compute the one-step-ahead out-of-sample forecasts over the evaluation sample (2015/12 to 2018/09 in our case) based on the estimates of the selected model. We save these forecasts and compute the corresponding Root Mean Squared Error (RMSE).Footnote 9 If this RMSE is lower than the one obtained with the benchmark model, we repeat the process, adding a new explanatory variable to the previous model: we rerun the model selection loop and choose the next variable whose model minimizes the information criterion. The process stops when the inclusion of the variable whose model yields the lowest information criterion does not provide a lower RMSE than that obtained with the benchmark model. Notice that the RMSE is only used to make the algorithm stop. Figure 2 depicts a diagram that illustrates the procedure.Footnote 10
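The following sketch outlines the FSS algorithm under the same assumed objects as in the previous blocks (`u`, `xi`, `gt_all`); the helper names and the evaluation split are illustrative, not the authors’ code, and the loop is computationally demanding since every candidate requires a full SARIMAX fit.

```python
import numpy as np
import statsmodels.api as sm

EVAL_START = "2015-12-01"                     # start of the evaluation sample (assumed)

def _exog(cols, mask):
    parts = [np.asarray(xi)[mask]] + [gt_all.loc[mask, c].to_numpy() for c in cols]
    return np.column_stack(parts)

def fit(cols, mask):
    return sm.tsa.SARIMAX(u[mask], exog=_exog(cols, mask), order=(2, 1, 0),
                          seasonal_order=(0, 1, 1, 12)).fit(disp=False)

def recursive_rmse(cols):
    """Expanding-window one-step-ahead RMSE over the evaluation sample."""
    errors = []
    for t in u.index[u.index >= EVAL_START]:
        res = fit(cols, u.index < t)                        # re-estimate with data up to t-1
        fcst = res.forecast(steps=1, exog=_exog(cols, u.index == t))
        errors.append(u.loc[t] - float(fcst.iloc[0]))
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_benchmark = recursive_rmse([])                         # I = 0, benchmark model
selected, remaining = [], list(gt_all.columns)
while remaining:
    # rank all candidate models (current set plus one extra query) by AIC
    aic = {c: fit(selected + [c], u.index < EVAL_START).aic for c in remaining}
    best = min(aic, key=aic.get)
    if recursive_rmse(selected + [best]) >= rmse_benchmark:
        break                                               # stop rule: no gain over the benchmark
    selected.append(best)
    remaining.remove(best)
print(selected)                                             # e.g., LinkedIn, Carrefour job, ...
```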

The resulting models, to be compared against the PCA-based method and the benchmark, can also be defined by the transfer function (5a)–(5b), but in this case \(x_{it}\) is the ith variable chosen by the proposed feature selection method, with \(i=1, 2, \ldots , I\) and \(I=0, 1, 2,\dots \) until the algorithm stops.

Fig. 2 Feature selection algorithm. \(I=0\) corresponds to the benchmark univariate model

The first repetition of the loop defined in Fig. 2 provides a ranking, sorted by increasing AIC, of the explanatory variables obtained from the GT queries (see “Appendix”, Table 5). The variable that provides the lowest AIC is the query for the term LinkedIn. This professional social network had three million users in Spain in 2012 (Jiménez 2012). The inclusion of this variable considerably improves the model in terms of different information criteria and residual statistics. When repeating the exercise keeping LinkedIn in the model as \(x_{1t}\), the procedure leads to the selection of the query for the term Carrefour job, denoted by \(x_{2t}\).Footnote 11 Carrefour is a distribution company with 1,088 stores in Spain in 2019 (Osorio 2019). The rest of the explanatory variables chosen and their order of selection are presented in the next section, Table 2.

5.3 Alternative procedures

Other tools might be used to select the explanatory variables and forecasting models. Some related alternatives in the literature are the Lasso regression (Tibshirani 1996), the Model Confidence Set (MCS, Hansen et al. 2011) and Bayesian Model Averaging (BMA, Hoeting et al. 1999), although others can also be found. Each of these procedures has its pros and cons with respect to our methods, which were chosen mainly because of their conceptual simplicity (SARIMAX models, PCA and the AIC criterion are well known to most forecasters) at an affordable computational cost. In this sense, Lasso regression could be a good alternative, as it works well with a large number of potential explanatory variables and its computational cost is low. However, it has two drawbacks in this application: (i) as far as we know, there is currently no procedure to estimate the Lasso regression in a transfer function model like (5a)–(5b) that includes MA terms,Footnote 12 and (ii) as our sample size is not large (\(T=177\)) and the number of potential explanatory variables is large (\(N=163\)), we cannot directly perform a Lasso regression, since we would not have enough degrees of freedom to estimate it. Therefore, some kind of dimension reduction technique should be applied beforehand anyway. Regarding the MCS, although it is a powerful tool for model comparison, it does not fit our problem as well as our procedures do, because of the subtle difference between model selection and feature selection. MCS is a model selection procedure, and in this application there would be a huge number of models to compare (even restricting model (5a)–(5b) to only two GT queries yields \(\left( {\begin{array}{c}163\\ 2\end{array}}\right) =13{,}203\) models to compare). For that reason, it seems logical to first deal with the feature selection problem and then choose the best model. Obviously, the MCS (or BMA) could be used to select the best model among those given in Table 2, but for the sake of simplicity we rely on the most common out-of-sample RMSE comparison for this purpose. In turn, in order to apply BMA, a prior distribution over the considered models must be specified, which is usually non-trivial. As with the MCS, in our application the number of models under consideration is huge and the computational cost of BMA would become enormous.

6 Prediction evaluation

This section investigates the accuracy of the methods described above when forecasting Spanish unemployment in an out-of-sample validation of 34 periods. To this end, we use a recursive (expanding) forecasting scheme. In the exercise, all the estimations converge adequately and no model shows evidence of poor specification.

Table 1 presents the most common residual statistics for Model (5a)–(5b) when cumulatively and sequentially including: (i) the principal components described in Sect. 5.1, and (ii) the specific GT queries chosen by the feature selection algorithm of Sect. 5.2. The main statistics shown are tests of normality (Jarque–Bera), absence of autocorrelation (Ljung–Box) and homoskedasticity (Goldfeld–Quandt). The residuals show no evidence of non-normality or autocorrelation, although a few of them (particularly when adding the principal components as explanatory variables) may be heteroskedastic. For the PCA-based models, the p values of the coefficients show poor explanatory power from the second principal component onward (except perhaps the 6th one). Conversely, all the feature selection-based models have significant estimated coefficients (see Table 1, parameter \({\hat{\beta }}_{I}\)).
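For completeness, diagnostics of this kind can be reproduced along the following lines, applied to any fitted model from the earlier sketches (here `alt_fit`); the number of start-up residuals dropped and the Ljung–Box lags are assumptions on our part.

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import acorr_ljungbox, het_goldfeldquandt

resid = np.asarray(alt_fit.resid)[13:]            # drop start-up residuals (d + D*s = 13)
jb_stat, jb_pval, _, _ = jarque_bera(resid)                                # normality
lb = acorr_ljungbox(resid, lags=[12, 24], return_df=True)                  # autocorrelation
gq_stat, gq_pval, _ = het_goldfeldquandt(resid, np.ones((resid.size, 1)))  # heteroskedasticity
print(jb_pval, lb["lb_pvalue"].to_list(), gq_pval)
```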

Table 1 Estimates of the \({\hat{\beta }}_{I}\) coefficients and common residual tests for model I

Table 2 presents the estimates of the SARIMA parameters and the step-dummy variable, the AIC and the RMSE, both absolute and relative to the benchmark’s. The coefficient \({\hat{\omega }}_{0}\), measuring the effect of the 2008 financial crisis on unemployment, shows a stable, negative and significant value in all the models. Looking at the autoregressive polynomial coefficients (\({\hat{\phi }}_{1}\) and \({\hat{\phi }}_{2}\)), the AR(1) term always has a significant and positive coefficient, while the AR(2) term is only significant for the models that include just one explanatory variable, either the first principal component or the LinkedIn query. In turn, the estimated seasonal moving average (\({\hat{\Theta }}_1\)) is always highly significant and negative. All these values show the stability and robustness of the models, whose coefficients and statistics do not vary significantly when additional explanatory variables are sequentially incorporated.

Table 2 Estimates of the coefficients in Eq. (5b)

Akaike’s criterion is considerably lower for the feature selection-based models (relative to the PCA-based and benchmark models) and decreases with each additional explanatory GT query. This was expected, given the design of the feature selection algorithm.

Regarding forecasting accuracy, the RMSE of each model is evaluated over the out-of-sample forecast period 2015/12–2018/09; in other words, the comparison of this error measure is made over a total of 33 one-step-ahead forecasts. Table 2 and Fig. 3 show the RMSE improvement of the compared methodologies over the benchmark.

Fig. 3 Forecasting accuracy of the models: RMSEs relative to the benchmark model forecasts

The largest advantage for the PCA-based models appears when \(I = 3\), a gain in predictive accuracy close to 9% relative to the benchmark’s. This result is consistent with the fact that, from the third principal component onward, the relative explained variance of each additional component is marginal (see Fig. 1). Regarding the feature selection-based model, the best improvement occurs with \(I = 4\), i.e., when the model incorporates the GT queries for the terms LinkedIn, Carrefour job, Ikea employment and How to Find a Job (HFJ). In this case, the gain in RMSE relative to the benchmark’s is around 25%. Interestingly, the largest leap in forecast accuracy comes with the introduction of the GT search LinkedIn, which, on its own, represents an improvement in predictive accuracy of 22.3%. The remaining variables add relatively minor gains.Footnote 13 Furthermore, from the inclusion of the fifth variable onward, the forecasting precision begins to decrease almost linearly, and when \(I=9\) it becomes even worse than the benchmark’s. That is why our algorithm (see Fig. 2) stops at \(I = 9\), as \(\hbox {RMSE}_{0}<\hbox {RMSE}_ {9}\). We include \(I=10\) only for comparison purposes.

A small sample size (reaching 176 observations for the last forecast) could imply that an important part of the RMSE is noise due to estimation error. Moreover, the out-of-sample size is not very large either (33 forecasts), which adds further uncertainty to the RMSE. To incorporate the latter source of uncertainty into our forecast evaluation, the last column of Table 2 reports the p value of the Diebold and Mariano (1995) test. The null hypothesis is that the two sets of predictions, those obtained from the benchmark and from the corresponding alternative model, have the same accuracy. Accordingly, a small p value indicates that the suggested model predicts better than the benchmark at a given significance level. Thus, the PCA-based model including the first three components outperforms the benchmark at the 3.6% significance level, while the FSS-based model with the first three (or four) variables beats the benchmark at the 0.9% significance level.

Notice that the Diebold and Mariano test is appropriate in this application even though it does not account for parameter estimation error (see, e.g., West 2006; Escanciano and Olmo 2010). Although we apply this test to forecasts produced by estimated (not known) models, and they are therefore subject to parameter uncertainty, in all the cases treated the out-of-sample size is small relative to the in-sample size. This makes the extra term related to parameter estimation error, which is not accounted for by the limiting variance derived by Diebold and Mariano (1995), vanish asymptotically (see West 2006). Thus, assuming that there are no estimation effects is expected to be a good approximation in our forecast evaluation exercise.
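A minimal sketch of the Diebold and Mariano (1995) statistic under squared-error loss for one-step-ahead forecasts (h = 1), where no autocorrelation correction of the loss differential is needed; whether a one- or two-sided p value is reported is a presentation choice, and the forecast-error names below are hypothetical.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e_bench, e_alt):
    """DM test with squared-error loss for h = 1 forecast errors."""
    d = np.asarray(e_bench) ** 2 - np.asarray(e_alt) ** 2   # loss differential
    dm = d.mean() / np.sqrt(d.var(ddof=1) / d.size)         # studentized mean of d
    p_two_sided = 2 * (1 - stats.norm.cdf(abs(dm)))
    return dm, p_two_sided

# e.g., diebold_mariano(u_eval - bench_forecasts, u_eval - alt_forecasts)
```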

7 Robustness analysis

As the analysis in the previous section shows a much better forecasting performance for the feature selection-based model, we carry out a robustness analysis only for this methodology. We do so by varying all the factors that may influence the result of the forecasting evaluation: (i) the specification sample, (ii) the forecasting sample, (iii) the number of forecasting periods, and (iv) the date of the data extraction (as explained in Sect. 3.2, the GT index may differ across download dates). With a few exceptions, the results shown in Table 3 are fairly unambiguous: the use of GT queries along with the proposed feature selection-based model improves the forecasting accuracy in terms of RMSE relative to the benchmark’s. The best RMSE implies a gain of 31.3%, we find better forecasting results in 11 out of 14 models, and the average gain (over the 14 models) is close to 15%. In terms of the Diebold and Mariano test, 7 models beat the benchmark at the 5% significance level. Besides this main finding, some additional interesting facts can be drawn from this robustness check: (1) LinkedIn is definitely the key explanatory variable (when this term is not the best variable there is no predictive improvement); (2) the best RMSEs are usually obtained when adding extra explanatory variables to LinkedIn; (3) more explanatory variables (and better forecasting results) are found with the data downloaded in 2018/09 than with the series extracted in 2019/09; and (4) the lower the number of forecasting periods, the higher the forecasting accuracy.

Table 3 Robustness analysis

While points (1) and (2) above relate to the high impact of the LinkedIn GT search on the forecasting of Spanish unemployment, points (3) and (4) are likely related to the design of the exercise. Regarding the latter, in our paper the models are specified with the information given in the specification sample (see Table 3) and, although they are re-estimated with the observations added in each period (this is, indeed, a recursive forecasting scheme), they are not re-specified. Thus, when the forecasting sample increases, the probability of finding a different model that better fits the new sample (i.e., a better specification) increases. For instance, our FSS algorithm in Fig. 2 chooses LinkedIn and Ikea as the two best queries to be included in the model. The recursive forecasting scheme then adds each new observation, re-estimates and produces a new forecast with that model; we do not rerun the FSS algorithm with each update. Our hypothesis is that including a re-specification step when adding a new observation (i.e., rerunning the FSS algorithm to search for the best model with each update) would yield even better forecasting results. This would obviously come in exchange for a non-negligible increase in computational cost, and it remains an open question for future research.

8 Final remarks

This paper studies whether additional information, collected in the form of time series from queries submitted to GT, improves to some extent the forecast accuracy of Spanish unemployment obtained with a univariate model. When conducting this analysis, two questions arise: (1) which are the best queries one can introduce in GT, and (2) how to deal with the huge amount of information one can download from it. The first question is beyond the scope of this work but could be a subject of future research. In contrast, we compare two different ways to deal with the close to 200 downloaded series: (i) the use of the standard PCA technique, and (ii) a proposed FSS algorithm. The gains in RMSE relative to the benchmark are around 10% for the PCA-based model and 25% for the FSS-based model. The improvement of the FSS-based model is confirmed in a robustness analysis. Compared to the literature, our gain is greater than the 15% obtained by Vicente et al. (2015) for the same endogenous variable (but a different period) and greater than the common 10–19% range found by, e.g., D’Amuri and Marcucci (2017) and Fondeur and Karamé (2013). The reason for this could be the larger amount of GT data used and the application of dimension reduction techniques.

Besides the gain in predictive accuracy found when forecasting Spanish unemployment, the paper also sheds some light on the discussion in the literature about using more or fewer explanatory variables. Our robustness results show that it seems better to introduce only a few GT explanatory variables in the model. In our case, the best RMSE is obtained with between 0 and 5 exogenous variables, depending on the sample and other parameters of the exercise. It certainly depends on the endogenous variable to be analyzed as well.

Finally, in our application, the variable LinkedIn clearly emerges as the best leading indicator among close to 200 series, and our FSS method demonstrates its potential to find it. As for the discussion about fewer or more queries, we show that the larger the number of GT queries, the higher the probability of finding one or more excellent indicators, at least when no prior information on the most informative queries is available.