1 Introduction

During the years 2008–2013, consumption expenditures of Italian households was harshly hit by the Great Recession (ISTAT 2014) with a remarkable reduction of purchasing power (− 10.4% between 2007 and 2013). In that period, Italian households showed a reduction in tourism expenditure and a change in travel behaviour as well. The expenditure share devoted to accommodation facilities passed from 2.8% in 2010 to 2.3% in 2013 and the annual decrease in the number of trips by resident was nearly − 12% in 2010, − 19% in 2013. Only in 2015, for the first time after 7 years, there has been a remarkable increase (+ 13.5%).

Objective of our study is tourism behaviour of the Italian residents and, in particular, we analyse Italians’ participation in tourism in the period covering the recent economic recession. Using data from the survey on Trips of Italian Residents in Italy and Abroad, carried out quarterly by the Italian National Institute of Statistics (ISTAT), we investigate whether the propensity in tourism participation (i.e., the probability of having at least one holiday trip with at least an overnight stay in a quarter) and the intensity of participation (i.e., the sum of the length of stay of all holiday trips taken in a quarter) have changed over the period of analysis.

The theoretical framework used for the joint analysis of these two aspects is the hurdle model (Mullahy 1986), a modified count data model which allows to consider the response as result of a two-stage decision process: at first a person decides whether to take a holiday and then, conditionally to a positive decision, he decides the length of the holiday. In a general hurdle model, a binary model is used to represent the binary outcome of whether a count variable has a zero or a positive realisation and then the positive realisations are modelled by a truncated-at-zero count data model. Various specifications can be adopted for the truncated-at-zero model depending on the distribution of the positive realisations. Given their flexibility, the hurdle models have been widely used in several contexts of health and economic studies, and a few applications of this method can be found in tourism analysis as well (Hellström 2006; Bernini and Cracolici 2015, 2016; Boto-García et al. 2019).

A noteworthy concern in the analysis of the number of overnight stays is the presence of multiple spikes in its distribution. That occurrence is due to the propensity to take a holiday in typical day blocks (e.g. week-end, 1, 2 weeks, etc.), which in turn produces a concentration of the total number of overnight stays on certain values, known as inflated values. Some authors have treated this problem by re-defining the response variable into two or more classes and then applying a logit or a multinomial model (Alegre and Pou 2006; Nicolau and Más 2004); others adopted a latent class approach (Alegre et al. 2011) or employed a quantile regression model (Salmasi et al. 2012).

We take a novel approach, not yet adopted in the context of tourism analysis: the truncated-at-zero model for the positive responses is specified as a multiple inflated truncated negative binomial model, that is a finite mixture of a zero-truncated negative binomial and a set of degenerate distributions on the inflated values, with the mixture probabilities modelled through a multinomial logit model. Even considering a wider literature, at the best of our knowledge the actual specification of a hurdle model with a multiple inflated distribution for the positive responses, even if theoretically feasible, has not been presented before.

Results show that the economic recession impacted negatively on both components of the decision process and that, by controlling for the inflated nature of the response variable distribution, the proposed formulation provides a better representation of the Italians’ tourism behaviour in comparison with non-inflated hurdle models. In particular, by using a multiple inflated hurdle model we are able not only to identify the determinants of the phenomenon under study, but also to correctly fit the distribution of the total number of overnight stays, even in presence of extremely inflated values which are usually under-predicted by standard models. Given this characteristic, we believe that the use of multiple inflated hurdle models can produce results which could be useful for policy makers in evaluate how the Italians would react to the implementation of targeted tourism policies.

The paper is organized as follows. Section 2 describes the database and discusses the characteristics of the response variable. Section 3 presents the theoretical model used for the analysis and discusses its properties and usefulness for the aim of the study. Results of the empirical analysis are presented in Sect. 4, and the last section concludes with the discussion of the main findings.

2 Data

The analysis employs a pooled time series cross-section database of Italian residents in the period 2004–2013, which covers the last economic recession that has seriously affected Italian households. Data comes from the household survey on Trips and Holidays of Italian Residents in Italy and Abroad, which is the main statistical source of demand-side tourism data available in Italy. It is currently carried out by ISTAT for responding at the EU Reg.692/2011, and it collects information about domestic and outbound travels of the Italian residents.

From 1997 to 2013, it has been conducted quarterly on a national annual sample of about 14,000 households (about 3500 per quarter), comprising an annual total of about 32,000 individuals. Each year, data are collected for the periods January–March, April–June, July–September and October–December. In each quarter and for each individual, information on travels with at least one overnight stay concluded during the quarter, made for any main purpose, are recorded. Tourism trips are classified into business and holiday trips.

In addition, socio-demographic characteristics of all household components are recorded: age, gender, region of residence, education level, marital status, occupational status and professional position. It should be noted that this information is collected for all individuals, regardless of their being traveller or not. Therefore the survey data allows to identify the share and characteristics of both tourism participants and non-participants. Unfortunately, these characteristics do not include any information about the individuals’ economic status.

For the participants the survey offers also an in-depth insight about their tourism behaviour in terms of number of trips, nights spent and characteristics of the trip, but provides no information about tourism expenditure (although surveyed for the Tourism Satellite Account it is not provided for research purpose).

From 2014, the Trips and Holidays survey has become a focus included in the Households Budget Survey and deep changes have been introduced in every stage of the survey process. Therefore, since the two sources cannot be appropriately linked together, we stop our analysis at 2013. Moreover, given the adoption of the Euro currency occurred in 2002, we start the analysis from 2004.

As we are interested in studying the factors that may influence individual tourism behaviour, our unit of study is the individual. We limit the analysis to holiday trips and, since children’s tourism choices are not individually made, we consider only persons at least 15 years old.

We define our study variable as the total number of overnight stays in a quarter, obtained by summing the length of stay of each holiday trip made by an individual in that quarter (the variable is set at zero for an individual who has not travelled). The variable’s descriptive statistics presented in Table 1 show a prevalence of zero values: almost 80% of the sampled units are tourism non-participants. If we expand the data to the whole Italian population by using the expansion factor provided by ISTAT, we can estimate that only about the 24% of Italians (at least 15 years old) made on average at least a holiday trip in a quarter. Even in the summer quarter (July–September), this percentage reaches only 42%.

Table 1 Descriptive statistics of response variable

Considering the positive number of overnight stays, from Table 1 we can see that the variable is highly skewed and overdispersed (the variance is almost 14 times the mean). This is confirmed by Fig. 1: when the number of overnight stays increases its frequency quickly decreases, but the distribution has a long tail of low-occurrence values. From Fig. 1 we can observe multiple spikes in its distribution: the observed positive values of overnight stays are concentrated on specific values, like 2, 6, 7, 14, 15, 20 and 30 nights.

Fig. 1
figure 1

Distribution of the positive number of overnight stays in a quarter

As expected, seasonality plays an important role in determining tourism behaviour. The proportion of Italians who made on average at least a holiday trip in a quarter and the average number of overnight stays are significantly higher in the third quarter than in the other quarters. More important, as we can see in Table 2 and Fig. 2, the distribution of the positive number of overnight stays in the third quarter is completely different from that of the other quarters: it is more variable, with a longer tail and with a larger number of inflated values. In addition, there is a higher concentration on the inflated values.

Fig. 2
figure 2

Distribution of the positive number of overnight stays by quarter

Table 2 Descriptive statistics of the positive number of overnight stays by quarter

Finally, tourism behaviour is characterised by spatial heterogeneity as well. Table 3 and Fig. 3 show the distribution of the positive number of overnight stays divided by the individuals’ macro-regions of residence. We observe differences both in the proportion of tourists and in the average number of nights spent on holiday. Moreover, variance and overdispersion are heterogeneous across regions, whereas the same inflated values seems to be present in each distribution.

Table 3 Descriptive statistics of the positive number of overnight stays by macro-regions of residence
Fig. 3
figure 3

Distribution of the positive number of overnight stays by macro-regions of residence

3 Methodology

Since the response variable is discrete and non-negative, we refer to count data models. The most common models in literature are the Poisson and the negative binomial regression models.

One of the basic assumptions of these models is that both zero and positive values of the response variable come from the same data generating process. However, as it frequently occurs when analysing socio-economic phenomena, our data do not adhere to this assumption. In fact, it makes sense to assume that a person firstly decides whether or not to take a holiday (i.e. whether or not to participate in tourism), and then, conditionally to a positive decision, he decides the number of overnight stays. In such a situation it seems opportune to firstly separate participants from non-participants, zeroes from non-zeroes, through a binary model and then to model the positive responses using a truncated-at-zero count data model.

This assumption is typical of the hurdle model (Mullahy 1986), in which the two processes generating zeros and positive values are not constrained to be the same. Firstly, a binomial probability governs the binary outcome of whether a count variate has a zero or a positive realisation and then, if the hurdle is crossed (i.e. the realisation is positive), the conditional distribution of the positives is governed by a truncated-at-zero count data model. Such a conditional setting enables the interpretation of covariate effects through event incidence and frequency in the respective logistic and truncated distribution components (Cameron and Trivedi 2013).

Formally, let y be a discrete non-negative response variable and let \(\mathbf{X}\) and \(\mathbf{Z}\) be two covariates matrices (that could coincide, at least partially, or be completely different), then a generic hurdle model for each individual i can be defined as:

$$\Pr (y_{i} = j|{\mathbf{x}}_{i} ,{\mathbf{z}}_{i} ) = \left\{ {\begin{array}{*{20}l} {f_{1} (0|{\mathbf{x}}_{i} )} \hfill & {{\text{for}}\;{\text{ }}j = 0} \hfill \\ {f_{2} (j|{\mathbf{z}}_{i} )\left( {1 - f_{1} (0|{\mathbf{x}}_{i} )} \right)} \hfill & {{\text{for}}\;{\text{ }}j > 0} \hfill \\ \end{array} } \right.$$
(1)

where \(f_1(0|{\mathbf {x}}_i) = \Pr (y_i = 0|{\mathbf {x}}_i)\) is the probability of observing a count of 0, usually estimated from a logit or probit model, and \(f_2(j|{{\mathbf {z}}}_i)\) is a truncated-at-zero count data density

$$\begin{aligned} f_2(j|{{\mathbf {z}}}_i) = \Pr (y_i = j | y_i > 0,{{\mathbf {z}}}_i) = \frac{\Pr (y_i = j |{\mathbf {z}}_i)}{[1 - \Pr (y_i = 0 |{\mathbf {z}}_i)]} \end{aligned}$$
(2)

The choice of the model specification for \(f_2\) is usually driven by data characteristics. In particular, one should take into account if the data are overdispersed (i.e., the variance exceeds the mean) and if there is an abnormally large number of observations concentrated on one or more values (i.e., the distribution is inflated). As discussed in the previous section, the positive number of overnight stays is both overdispersed and inflated, therefore the specification for \(f_2\) will need to reflect both these characteristics.

Models that handle a single inflated value, typically at zero, have been proposed since the early 1990s, starting with the zero-inflated Poisson (ZIP) regression model first presented by Lambert (1992). However, studies on the generalisation of single-inflated models to the situation of multiple inflated distributions are relatively recent and still in progress.

Here we move from the generalisation of the ZIP model to a multiple inflated Poisson model (MIP) suggested by Giles (2007) to allow for count-inflation at multiple values. In dealing with multiple inflated count data the MIP model assumes a finite mixture model of a Poisson distribution and a set of degenerate distributions, one for each inflated value. In doing so, the MIP model assumes that overdispersion of data can only arise from splitting the data in more regimes. When this assumption does not hold and the overdispersion derives also from an heterogeneity component, it is opportune to generalise the MIP model into a multiple inflated negative binomial model (MINB), in analogy with what it is usually done when replacing a Poisson model with a negative binomial model. Finally, since we want to model truncated-at-zero count data, the negative binomial distribution will be replaced by its truncated counterpart, obtaining a multiple inflated truncated negative binomial model (MITNB).

Assuming that the positive count response has \(M-1\) inflated values, the MITNB distribution can be specified as:

$$y_{i} \sim \left\{ {\begin{array}{*{20}l} j \hfill & {{\text{with}}\;{\text{probability}}\;p_{{ij}} \;{\text{for}}\;j = 1, \ldots ,(M - 1)} \hfill \\ {TNB(\lambda _{i} ,\theta _{i} )} \hfill & {{\text{with}}\;{\text{probability}}\;p_{{iM}} } \hfill \\ \end{array} } \right.$$
(3)

where \(\sum _{j=1}^{M}p_{ij}=1\) and TNB(.) is the truncated negative binomial distribution. Note that the inflated values are not required to be consecutive in the model, even if they are denoted as \(1,\ldots ,(M-1)\) for notational convenience.

Under this specification, \(f_2\) becomes

$$\Pr (y_{i} = j|y_{i} > 0,{\mathbf{z}}_{i} ) = \left\{ {\begin{array}{*{20}l} {p_{{ij}} + p_{{iM}} \dfrac{{f_{{NB}} (j|{\mathbf{z}}_{i} )}}{{[1 - f_{{NB}} (0|{\mathbf{z}}_{i} ))]}}} \hfill & {{\text{for}}\;j = 1, \ldots ,(M - 1)} \hfill \\ {p_{{iM}} \dfrac{{f_{{NB}} (j|{\mathbf{z}}_{i} ))}}{{[1 - f_{{NB}} (0|{\mathbf{z}}_{i} ))]}}} \hfill & {{\text{for}}\;j \ge M} \hfill \\ \end{array} } \right.$$
(4)

in which

$$\begin{aligned} f _{NB}(j) = \Pr (y_i = j | {\mathbf {z}}_i) = \dfrac{\varGamma (\lambda _i+y_i)}{\varGamma (\lambda _i)\varGamma (y_i+1)} \left( \dfrac{\theta _i}{1+\theta _i} \right) ^{y_i} \left( \dfrac{1}{1+\theta _i}\right) ^{\lambda _i} \end{aligned}$$
(5)

indicates the probability mass distribution of the negative binomial model with variance function \(\lambda _i(1+\lambda _i/\theta _i)\), denoted as NB2 model in Cameron and Trivedi (2013, [p. 74]); where \(\varGamma\) is the gamma function, \(\lambda _i\) is the location parameter and \(\theta _i\) is the scale parameter (i.e. the inverse of the dispersion parameter). Then a smaller \(\theta _i\) corresponds to a larger overdispersion.

Both \(\lambda _i\) and \(\theta _i\) depend on covariates by the regression functions

$$\begin{aligned} \ln (\lambda _i)={\mathbf {z}}'_{1i} \varvec{\beta }_1 \end{aligned}$$
(6)
$$\begin{aligned} \ln (\theta _i)={\mathbf {z}}'_{2i}\varvec{\beta }_2 \end{aligned}$$
(7)

where matrices \({\mathbf {Z}}_1\) and \({\mathbf {Z}}_2\) are subsets of the covariate matrix \({\mathbf {Z}}\) and \(\varvec{\beta }_1\) and \(\varvec{\beta }_2\) are the vectors of the corresponding regression parameters. It is worth to note that traditionally the negative binomial model would assume that \(\lambda _i\) depends on covariates whereas \(\theta _i\) is constant, and therefore it could handle data with heteroscedasticity proportional to their mean. As stated by Greene (2007), the formulation of \(\theta _i\) given in (7) is the logical extension of the negative binomial model to allow for a heterogeneous scale parameter, and is usually known as heterogeneous negative binomial model (HNB)Footnote 1.

The mixing probabilities \(p_{ij}=\Pr (y_i = j)\) are modelled with a multinomial logit regression model (with reference category M)

$$\begin{aligned} \ln \left[ \dfrac{\Pr (y_i = j)}{\Pr (y_i = M)} \mid {\mathbf {z}}_{3i} \right] = {\mathbf {z}}'_{3i}\varvec{\gamma }_{j} \end{aligned}$$
(8)

therefore

$$\begin{aligned} \Pr \left( y_i = j| {\mathbf {z}}_{3i}\right) = \dfrac{ \exp \left( {\mathbf {z}}'_{3i}\varvec{\gamma }_{j} \right) }{ 1 + \sum _{m=1}^{M-1} \exp \left( {\mathbf {z}}'_{3i}\varvec{\gamma }_{m} \right) } \end{aligned}$$
(9)

for \(j=1,\ldots ,(M-1)\), where matrix \({\mathbf {Z}}_3\) is a subset of the covariate matrix \({\mathbf {Z}}\), \(\varvec{\gamma }_{j}\) is a vector of regression parameters specific to each \(M-1\) value.

The multinomial logit model is an extremely flexible formulation, but requires the estimation of several parameters. If necessary one can replace it with other more parsimonious models, but these usually require additional assumptions on the parametric model formulation (for example, Su et al. (2013) use a cumulative logit model which relies on the parallel regression assumption).

Assuming, as usually done, that the error terms of the binary and the truncated model are independent, the likelihood function can be separated in two parts (one for each model component) and the two components \(f_1\) and \(f_2\) can be fitted separately through maximum likelihood estimation. Due to the model complexity, likelihood maximisation needs to be carried out by numeric optimisation techniques.

Once the MITNB hurdle model as been estimated it is possible to calculate the predicted values for the two components of the model. In particular, the predicted number of positive overnight stays can be computed as

$$\begin{aligned} \begin{aligned}&\hat{E} \left( y_i | y> 0, {\mathbf {z}}_i \right) = \sum _{j=1}^{M-1} { \left( \hat{p}_{ij} v_j \right) } + \hat{p}_{iM} \hat{E}_{TNB} \left( y_i | y > 0, {\mathbf {z}}_i \right) \\&= \sum _{j=1}^{M-1} { \left( \hat{p}_{ij} v_j \right) } + \hat{p}_{iM} \frac{\hat{\lambda }_i}{1 - \left( 1 + \frac{\hat{\lambda }_i}{\hat{\theta }_i} \right) ^ {- \hat{\theta }_i} } \end{aligned} \end{aligned}$$
(10)

where \(v_j\) is the jth inflated value and \(\hat{p}_{ij}\), \(\hat{\lambda }_i\), \(\hat{\theta }_i\) are respectively the mixing probabilities, the location parameter and the scale parameter predicted for a generic observation i with covariates \({\mathbf {z}}_i\). The corresponding standard errors can then be computed by delta method.

Equation (10) highlights that, in order to accurately predict the positive values of y, it is essential to model not only the underline generating process (the TNB model) but also the inflated values.

4 Results

This section presents the results of the empirical analysis, divided in two parts. The first part focuses on the model specification and assessment, whereas the second part discusses the determinants of the two components (propensity and intensity) of tourism participation.

4.1 Model specification and assessment

The model includes a set of auxiliary variables that are collected by the Trips and Holidays survey for all sampled individuals (regardless of their being traveller or not). These are used as predictors of the total number of overnight stays in a quarter in the period 2004–2013. In detail, these variables can be classified into three categories:

  • Socio-demographic characteristics gender, age (scaled; both in level and in a quadratic form), education level (academic degree vs. other levels), number of household members, presence of children up to 10 years old in the household.

  • Economic characteristics the economic-related variables available in the dataset are occupational status, professional position, and number of income recipients in the household. This last variable has been transformed into relative terms and codified as a categorical variable: we computed the proportion of employed or retired household members (at least 16 years old), and we divided it in three categories (“no members”, “at most \(50\%\) of members”, “more than \(50\%\) of members”). The individual occupational status and the professional position have been combined into a single variable that distinguishes among “housewife/househusband”, “student”, “retired”, “disabled”, “managerial staff”, “office worker”, “manual worker”, “self employed” and “professional”. We are aware that the economic condition of the individuals may not be described entirely by these variables, but unfortunately this is the only information collected by the survey.

  • Temporal and spatial variables quarter, year (with 2004 as \(t=0\); as polynomial function of order three), and geographical area where the individual lives (divided in the five Italian NUTS1 regions: “North-West”, “North-East”, “Centre”, “South”, “Islands”).

We refer to Table 10 in “Appendix” for the descriptive statistics of all the variables, their acronym, and definition.

Both components of the hurdle model contain the above mentioned variables as main effects. That is, in our analysis the covariates matrix \(\mathbf {X}\) of the logit model (which governs the binary outcome participation/not participation in tourism) and the covariates matrix \({\mathbf {Z}}\) of the truncated model (which governs the positive values of overnight stays) coincide. The regression model for the location parameter \(\lambda\) includes all the variables of matrix \({\mathbf {Z}}\) as well.

About the scale parameter \(\theta\), from Table 2 we have observed that the variability of the positive response is much higher in the third quarter. Analogously, Table 3 showed that the overdispersion is heterogeneous across geographical areas. Therefore to control for the heterogeneity we model \(\theta\) as function of the third quarter (third vs. the others) and the NUTS1 regions.

Analogously, from Figs. 2 and 3 it is evident that the inflated values have different relevance depending on the quarter, but not on the geographical area. Therefore, the mixing probabilities are estimated via a multinomial logit function of the third quarter.

The last setting required to complete the model specification is the identification of the inflated values, that is the values of overnight stays for which we observe an abnormal large frequency. In fact, the multiple inflated model (4) described in the previous section assumes that the number of inflated values (\(M-1\)) is known, together with their values, therefore they need to be chosen before estimating the model. To this end, in order to identify the best model specification, we applied a two-step approach. First, we selected a list of plausible inflated values through visual inspection of the distribution of the observed responses (Fig. 1). Then, we compared several hurdle model specifications, each characterised by a different set of inflated values, using goodness-of-fit criteria, like the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [as suggested by Cai et al. (2018)]. Table 4 describes some of the considered specifications.Footnote 2 In particular, the first model (Model 0) is the traditional negative binomial hurdle model without any inflated values, which has been used as benchmark model. Models 1, 2 and 3 are all multiple inflated negative binomial hurdle models: Model 1 includes only the most evident inflated values (see Fig. 1), whereas Model 2 and Model 3 each adds more values to the previous model specification. Model 4 is equivalent to Model 3, but the mixing probabilities are estimated as function of all quarters, therefore adding 32 additional parameters to the complete model.Footnote 3

Table 4 Mixture settings in alternative specifications of the hurdle model

Table 5 shows the corresponding model fit statistics.Footnote 4 It is clear that even by considering only few extremely evident inflations, as in Model 1, we obtain a much better fit to our data comparative to the benchmark hurdle model; and the fit improves again with the addition of other inflated values. Comparing Model 3 and Model 4 we can see that the inclusion of all quarters in the multinomial specification generates some additional gain (AIC is lower), but not as much as to be worthy of the additional complexity (BIC is higher). This confirm our choice to include only the third quarter in the multinomial model. Therefore, the final specification for the analysis is Model 3, which considers 16 inflated values: 2, 3, 4, 6, 7, 10, 14, 15, 20, 28, 29, 30, 40, 45, 50, 60 nights.

Table 5 Model fit statistics for alternative specifications of the hurdle model

In Figs. 4 and 5 we present the hanging rootogram plots for Model 0 and Model 3 respectively. As described by Kleiber and Zeileis (2016), the hanging rootogram is a graphic tool particularly useful for diagnosing issues such as overdispersion and multiple inflation in count data modelling. It displays predicted and observed distribution of the variable under study, showing how the model fits the data. Discrepancies are seen by comparison with the horizontal axis: if a bar doesn’t reach the zero line then the model over-predicts a particular value, and if the bar exceeds the zero line it under-predicts it. The vertical axis is scaled to the square-root of the frequencies to draw more attention to differences in the tails of the distribution. The comparison between the two plots confirms that the proposed multiple inflated approach provides a better adaptation to the data since the model corrects most of the under-prediction of the inflated values that is displayed in Fig. 4.

Fig. 4
figure 4

Comparison of predicted distribution (red line) and observed distribution (hung bars), Model 0. if a bar doesn’t reach the zero line indicates over-prediction, if it exceeds the zero line indicates under-prediction (color figure online)

Fig. 5
figure 5

Comparison of predicted distribution (red line) and observed distribution (hung bars), Model 3. if a bar doesn’t reach the zero line indicates over-prediction, if it exceeds the zero line indicates under-prediction (color figure online)

Finally, we want to briefly address the fact that our model specification assumes that all respondents are uncorrelated. We are aware that this assumption may not hold for individuals of the same family. However, since we are estimating a complex mixture of non linear models, relaxing the uncorrelation assumption requires the modification of the likelihood function in order to impose a clustered variance/covariance matrix on the data. This would add even more complexity to the already complex mixture structure of our model, which relies on maximum likelihood estimation carried out by numeric optimisation, generating an additional non-negligible computational effort. Given the large number of families and individuals in the dataset, the small average family size, and the evidence that a good part of tourism participants self-determines its tourism behaviourFootnote 5, we believe the simplifying assumption of uncorrelated observation can be maintained in our analysis.

4.2 Determinants of tourism behaviour

Maximum likelihood estimates of the multiple inflated hurdle model parameters are presented in the following tables: estimates for the logit regression are in Table 6, for the truncated negative binomial model are in Table 7, and for the multinomial regression are in  Table 8.

Table 6 ML estimates of logit model coefficients
Table 7 ML estimates of the truncated negative binomial model coefficients
Table 8 ML estimates of the multinomial logit model coefficients

Results show the importance of the socio-demographic variables as determinants of both the propensity and the intensity of tourism participation. Age has a discordant effect: older people tend to participate less in tourism, but when they do participate they tend to have longer holidays. Conversely, the effect of family composition is concordant in both components of the hurdle model: a larger family has a lower propension to travel and tend to spend fewer days on holiday, but having at least one young child increases both the odds of travelling and the number of overnight stays.

Consistently with the hypothesis that economic conditions matter, the proportion of household’s income recipients has a positive and highly significant effect on the decision to go on holiday. Moreover, estimates for the occupational status tell us that manual workers and unemployed persons are less likely to go on holiday, contrary to professionals, managerial staff and office workers who have a higher propensity to travel. But when on holiday, the occupational status acts primarily as a constraint on trips’ duration: working individuals have less time to spend on holidays than students and retired people. The same can be said for the proportion of income recipients, since a higher proportion is associated with fewer days of holiday.

The model confirms a remarkable North-South divide in tourism participation: assuming that the other covariates remain constant, the odds for residents of insular and southern regions are about 50% lower than that of north-western residents. And the same North–South dualism can be observed in the number of nights spent on holiday. In this respect, one should also consider that northern regions have a more efficient transportation system and a more favourable location as they are closer to foreign destinations that produce additional attractions for those Italian residents. On the other hand, since southern and insular regions have plenty of “in-house” leisure destinations, there could be a larger part of residents of these areas who prefer same-day trips, which are not registered in the dataset.

The estimated multinomial logit model for the mixing probabilities confirms a strong connection of the third quarter with the presence of inflated values in the distribution of the total number of overnight stays, as observed in Fig. 2. To understand the contribution of each inflated value to the mixture with the truncated negative binomial model, it is useful to calculate the mixing probabilities from the coefficients of Table 8. These probabilities, specific for years and quarters, are presented in Table 9.

Table 9 Estimated mixture probabilities \(\Pr \left( y = j \right)\) by Quarter

Overall, the mixing weights show that inflations are a non-negligible component of the intensity of tourism participation, especially in the third quarter in which less than 70% of the positive values seems to derive from the truncated negative binomial distribution. Even during the other quarters, the percentage of positive values explained by the TNB is about 73%. What is interesting to note, however, is the different composition of the mixture in the third quarter than in the rest of the year: the smaller inflated values have larger weights in the off-peak seasons, whereas in the third quarter the most frequent values are 6, 7, 14, 15 and 30 days. These values derive from the summation of one or more trips, which are commonly taken in weekend-, week-, or month-long blocks.

To understand the effect of the economic crisis, we compute the predictive margins of year and quarter for the two aspects of tourism participation, that is: (1) the predicted probability of having at least one trip and (2) the expected number of positive overnight stays, as functions of year and quarter. Predictive margins are computed as the average of the predicted values for all observations at each fixed value of year and quarter (leaving other covariates at their observed value). Since year and quarter influence each part of the MITNB hurdle model, by calculating the predictive margin we are able to see the overall role of the two covariates.

Predictive margins are plotted in Figs. 6 and 7 and reported in Table 11 in “Appendix”. Graphs show that the reaction to the Great Recession starts in 2009 when, after a period of growth in participation (comparatively to 2004), the predicted probability of travelling begins to decrease; and the decline spikes in 2011 probably due to the heavy fiscal restrictive measures adopted by the Italian government in that year. In addition, from 2011 the tourism participation drops below the 2004’s level and, after a year of stability, further decrease in 2013. This might indicate the presence of an inertia in reacting to the 2008 crisis: at first, households reacted to an increase in taxes by reducing savings to defend their living standards; after, considering the persistence of the crisis, households had to reduce consumption. Moreover, results reflect the general trend of a reduction in the average length of stay per trip which has been observed at the macro level ISTAT (2014) [Chp. 18]. In fact the decrease seams to have started even before 2008, but the highest reduction can be identified in 2013 indicating that the 2011 downturn, more than that of 2008, strongly affected tourism behaviour about intensity.

Fig. 6
figure 6

Predictive margins of Year and Quarter on tourism participation

Fig. 7
figure 7

Predictive margins of Year, Quarter and Short trip on positive values of the quarterly number of overnight stays

5 Final remarks

Estimation results for the proposed multiple inflated hurdle regression model show that, in Italy, the Great Recession had a negative impact on both the propensity and the intensity of tourism participation. Moreover, estimates confirm common knowledge that seasonality is a universal factor in tourism and that socio-demographic and economic characteristics are relevant in determining individuals’ tourism behaviour.

In assessing the effects of the Great Recession on tourism, we have to consider that nowadays tourism has become a “normal thing”, a part of the lifestyle, quality of life and well being of an increasing number of people (Bargeman and van der Poel 2006; Dolnicar et al. 2012; Cracolici et al. 2013). Therefore, we should likely observe an inertia in tourism behaviour and a higher probability of a “slicing strategy” (e.g., cheaper holiday) rather than a pure “cutback strategy” (e.g., fewer trips, reduced length of stay) (Bronner and De Hoog 2012). The dataset used in our analysis doesn’t include information about tourism expenditures, therefore we were not able to investigate whether Italians employs a “slicing strategy” in response to the economic crisis. Conversely, through the formulation of a hurdle model we studied which level of “cutback strategy” has been mostly implemented by the Italian citizens: (a) giving up holidays completely or (b) reducing in the number of overnight stays. Evidence shows that both cutback strategies has been applied in the period of the Great Recession, but the more prominent reaction to the crisis seems to be the complete renounce to leisure trips: the probability of participation diminished between 2007 and 2013 by more that 30% in the off-peak seasons and by 25% in the summer.

It is worth to note that, if data on tourism expenditures were available, we could think to investigate both the “slicing” and the “cutback” strategies together. This however would require the formulation of a multivariate model, since the two variables could be seen as two aspects of the same tourism behaviour.

One of the goals of our paper is to present a model that is able to handle multi-inflated distributions and that can be used by policy makers for predictive purposes. However, should we focusing more on the understanding of the phenomenon and less on the forecasting, we could think to include in the model some of the trip-related characteristics collected for tourism participants within the same survey, which could be seen as indirect proxies of trip costs (choices about accommodation, transportation, destination, etc... are usually connected to the tourist’s budget). These variables are specific to each single trip taken by an individual, whereas our response variable is, by its definition, aggregated over all holiday trips taken in a quarter. Therefore, in order to refer all trip-related characteristics to the individuals, they would need to be summarised or aggregated at the individual level for each quarter.

Motivated by the analysis of the impact of the Great Recession on tourism behaviour, the paper proposes a general, novel approach for dealing with count variables whose distribution is inflated in multiple values. This feature can not be represented through the probability distribution models commonly used for count data, but needs to be properly addressed (alongside other data characteristics like zero-inflation and overdispersion) in order to avoid possible estimation biases and incorrect inference about the model parameters (Cai et al. 2018). Moreover, failing to control for the inflated nature of the distribution can limit the model’s ability to produce reliable model based predictions.

We propose the use of a multiple inflated hurdle negative binomial model, with mixing probabilities modelled through a multinomial logit model, in comparison with the use of the well known hurdle negative binomial model. We show that, by controlling for the inflated nature of the response variable distribution, the proposed formulation provides a better representation of the Italians’ tourism behaviour in comparison with non-inflated hurdle models. In particular, by using a multiple inflated hurdle model we are able not only to identify the determinants of the phenomenon under study, but also to correctly fit the distribution of the total number of overnight stays, even in presence of extremely inflated values which are usually under-predicted by standard models. Given this characteristic, we believe that multiple inflated hurdle models can be useful tools for decision makers who are trying to forecast future events or the consequences of some new targeted policies.

The proposed methodology assumes that the inflated values are known, or are exogenously selected by a double procedure of visual inspection and model comparison. Optimal selection of the mixture components is a controversial issue when using any mixture model, and further research should be devoted to investigate the possibility of including the identification of the inflated values directly in the model estimation process.

Finally, that current model specification assumes that all respondents are uncorrelated. We are aware that this assumption may not hold for individuals of the same family, and a possible way for addressing the issue could be to include a family-random effect in our models. Given the large number of families in the dataset, the small average family size, and the evidence that a good part of tourism participants self-determines its tourism behaviour, we believe the simplifying assumption of uncorrelated observation can be maintained in our analysis. Nonetheless, further research on a multi-inflated random effect model for count data should definitely be considered.